When I type
hadoop fs -text /foo/bar/baz.bz2 2>err 1>out
I get two non-empty files: err with
2015-05-26 15:33:49,786 INFO [main] bzip2.Bzip2Factory (Bzip2Factory.java:isNativeBzip2Loaded(70)) - Successfully loaded & initialized native-bzip2 library system-native
2015-05-26 15:33:49,789 INFO [main] compress.CodecPool (CodecPool.java:getDecompressor(179)) - Got brand-new decompressor [.bz2]
and out with the content of the file (as expected).
When I call the same command from Python (2.6):
from subprocess import Popen
with open("out","w") as out:
    with open("err","w") as err:
        p = Popen(['hadoop','fs','-text',"/foo/bar/baz.bz2"],
                  stdin=None, stdout=out, stderr=err)
print p.wait()
I get the exact same (correct) behavior.
However, when I run the same code under PySpark (or via spark-submit), err comes out empty and out starts with the log messages shown above, followed by the actual data.
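I have not yet found what differs between the two environments. A quick sanity check I can run both in plain Python and in PySpark (just a diagnostic sketch, nothing here is specific to my setup) is to dump the inherited environment variables that could plausibly affect hadoop's logging:

import os

# Diagnostic only: list inherited variables that might influence where
# hadoop sends its log output (which one matters, if any, is the open question).
for key in sorted(os.environ):
    if any(s in key for s in ('HADOOP', 'SPARK', 'LOG')):
        print key, '=', os.environ[key]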
NB: the intent of the Python code is to pipe the output of hadoop fs -text into another program (i.e., to pass stdout=PIPE to Popen), so please do not suggest hadoop fs -get. Thanks.
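For context, the eventual pipeline looks roughly like this (the downstream command name is just a placeholder); this is why I need hadoop's stderr kept out of its stdout:

from subprocess import Popen, PIPE

# 'my_consumer' is a placeholder for the program that should receive the file contents.
producer = Popen(['hadoop','fs','-text',"/foo/bar/baz.bz2"], stdout=PIPE)
consumer = Popen(['my_consumer'], stdin=producer.stdout)
producer.stdout.close()  # so the producer sees SIGPIPE if the consumer exits early
print consumer.wait()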
PS. When I run hadoop under time:
from subprocess import Popen
with open("out","w") as out:
    with open("err","w") as err:
        p = Popen(['/usr/bin/time','hadoop','fs','-text',"/foo/bar/baz.bz2"],
                  stdin=None, stdout=out, stderr=err)
print p.wait()
the output of time correctly goes to err, but the hadoop logs still incorrectly go to out.
In other words, hadoop merges its stderr into its stdout when it runs under Spark.
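One way I can think of to narrow this down (a sketch; the marker strings are arbitrary) is to run a child that writes a distinct line to each stream. If the markers land in the right files under PySpark, the redirection on the Python/Spark side is fine and the problem is in how the hadoop launcher decides where to send its logs:

from subprocess import Popen

# Probe child: one marker line to stdout, one to stderr.
with open("out","w") as out:
    with open("err","w") as err:
        p = Popen(['bash','-c','echo STDOUT_MARKER; echo STDERR_MARKER 1>&2'],
                  stdout=out, stderr=err)
print p.wait()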