I have a MapReduce job defined in main.py, which imports the lib module from lib.py. I use Hadoop Streaming to submit this job to the Hadoop cluster as follows:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files lib.py,main.py \
    -mapper "./main.py map" -reducer "./main.py reduce" \
    -input input -output output
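For context, main.py is structured roughly like this (a simplified sketch: the actual map/reduce logic and the lib.parse helper are placeholders):

#!/usr/bin/env python
# main.py -- simplified sketch; the real job logic and lib.parse are placeholders
import sys
import lib  # this is the import that fails on the cluster

def do_map():
    for line in sys.stdin:
        key, value = lib.parse(line)  # hypothetical helper defined in lib.py
        sys.stdout.write('%s\t%s\n' % (key, value))

def do_reduce():
    for line in sys.stdin:
        sys.stdout.write(line)  # aggregation elided

if __name__ == '__main__':
    if sys.argv[1] == 'map':
        do_map()
    else:
        do_reduce()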
My understanding is that this should put both main.py and lib.py into the distributed cache directory on every compute node and thus make the lib module available to main. But that's not what happens: the log shows that the files are indeed copied to the same directory, yet main cannot import lib and raises ImportError.
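For example, a check along these lines in the mapper dumps the task's working directory into the task log, and that is how I can tell both files are present:

# debugging sketch: write the working directory contents to stderr (visible in the task log)
import os
import sys
sys.stderr.write('cwd: %s\n' % os.getcwd())
sys.stderr.write('contents: %s\n' % os.listdir('.'))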
Why does this happen and how can I fix it?
UPD. Adding the script's directory to sys.path didn't work:

import os
import sys

sys.path.append(os.path.dirname(os.path.realpath(__file__)))
import lib
# ImportError
Loading the module manually, however, did the trick:
import imp
lib = imp.load_source('lib', 'lib.py')
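(For completeness, on Python 3, where imp is deprecated, the equivalent manual load would be roughly:)

import importlib.util

spec = importlib.util.spec_from_file_location('lib', 'lib.py')
lib = importlib.util.module_from_spec(spec)
spec.loader.exec_module(lib)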
But that's not what I want. So why can the Python interpreter see other .py files in the same directory, yet fail to import them? Note that I have already tried adding an empty __init__.py file to the same directory, without effect.