I currently have this Python code (I'm using Apache Spark, but I'm fairly sure that doesn't matter for this question):
import numpy as np
import pandas as pd
from sklearn import feature_extraction
from sklearn import tree
from pyspark import SparkConf, SparkContext

## Module Constants
APP_NAME = "My Spark Application"

df = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

def train_tree():
    # Do more stuff with the data, call other functions
    pass

def main(sc):
    cat_columns = ["Sex", "Pclass"]
    # PROBLEM IS HERE
    cat_dict = df[cat_columns].to_dict(orient='records')
    vec = feature_extraction.DictVectorizer()
    cat_vector = vec.fit_transform(cat_dict).toarray()
    df_vector = pd.DataFrame(cat_vector)
    vector_columns = vec.get_feature_names()
    df_vector.columns = vector_columns
    df_vector.index = df.index
    # train data
    df = df.drop(cat_columns, axis=1)
    df = df.join(df_vector)
    train_tree()

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # Execute Main functionality
    main(sc)
When I run it, I get this error:

    cat_dict = df[cat_columns].to_dict(orient='records')
UnboundLocalError: local variable 'df' referenced before assignment
I find this puzzling because I define the variable df at the top of the file, outside the scope of main. Why does using that variable inside the function trigger this error? I have also tried putting the df definition inside the if __name__ == "__main__": block (before main is called), and the error still occurs.
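In case it helps, here is a minimal sketch, stripped of all the Spark/pandas machinery (the variable name x is just for illustration), that seems to reproduce the same behavior:

x = 10

def f():
    print(x)  # UnboundLocalError raised here: because x is assigned
              # below, Python appears to treat x as local throughout f
    x = 20

f()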
Now, obviously there are lots of ways I could solve this, but this is more about helping me understand Python better. So I want to ask:
a) Why does this error occur at all?
b) What is the best way to solve it, given that:
- I don't want to put the df definition inside the main function because I want to access it in other functions.
- I don't want to use a class
- I don't want to use a global variable
- I don't want to pass df around in function parameters