None of the answers here addressed all of my needs. 
- No threads for stdout (no Queues, etc, either)
- Non-blocking as I need to check for other things going on
- Use PIPE as I needed to do multiple things, e.g. stream output, write to a log file and return a string copy of the output.
A little background: I am using a ThreadPoolExecutor to manage a pool of threads, each launching a subprocess and running them concurrency. (In Python2.7, but this should work in newer 3.x as well). I don't want to use threads just for output gathering as I want as many available as possible for other things (a pool of 20 processes would be using 40 threads just to run; 1 for the process thread and 1 for stdout...and more if you want stderr I guess)
I'm stripping back a lot of exception and such here so this is based on code that works in production. Hopefully I didn't ruin it in the copy and paste. Also, feedback very much welcome!
import time
import fcntl
import subprocess
import time
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
# Make stdout non-blocking when using read/readline
proc_stdout = proc.stdout
fl = fcntl.fcntl(proc_stdout, fcntl.F_GETFL)
fcntl.fcntl(proc_stdout, fcntl.F_SETFL, fl | os.O_NONBLOCK)
def handle_stdout(proc_stream, my_buffer, echo_streams=True, log_file=None):
    """A little inline function to handle the stdout business. """
    # fcntl makes readline non-blocking so it raises an IOError when empty
    try:
        for s in iter(proc_stream.readline, ''):   # replace '' with b'' for Python 3
            my_buffer.append(s)
            if echo_streams:
                sys.stdout.write(s)
            if log_file:
                log_file.write(s)
    except IOError:
        pass
# The main loop while subprocess is running
stdout_parts = []
while proc.poll() is None:
    handle_stdout(proc_stdout, stdout_parts)
    # ...Check for other things here...
    # For example, check a multiprocessor.Value('b') to proc.kill()
    time.sleep(0.01)
# Not sure if this is needed, but run it again just to be sure we got it all?
handle_stdout(proc_stdout, stdout_parts)
stdout_str = "".join(stdout_parts)  # Just to demo
I'm sure there is overhead being added here but it is not a concern in my case. Functionally it does what I need. The only thing I haven't solved is why this works perfectly for log messages but I see some print messages show up later and all at once.