pad 6 0 prepends 6 and appends 0 seconds of silence, so I assume you only want to prepend silence.
You can do this with a while loop, for example in bash:
cat <<EOF > infile
audio1.wav 0
audio2.wav 2
audio3.wav 2
audio4.wav 4
EOF
# Note: every pass writes output.wav, so only the last file's mix is kept
while read -r fname len; do
  sox "$fname" -p pad "$len" 0 | sox -m -p long.wav output.wav
done < infile
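If what you actually want is a single output containing long.wav plus all of the padded clips, one option is to collect every clip as a "|command" input and call sox -m once. This is only a sketch under that assumption: mix_all is a hypothetical helper name, and the list file is assumed to hold "<file> <seconds>" per line as in infile above.

```shell
# Sketch: mix a base track with every padded clip in a single sox call.
# mix_all is a hypothetical helper, not part of sox.
# Usage: mix_all LISTFILE BASE.wav OUT.wav
mix_all() {
    list=$1
    base=$2
    out=$3
    # Reuse the positional parameters as a portable argument list.
    set -- "$base"
    while read -r fname len; do
        [ -n "$fname" ] || continue   # skip blank lines
        # An input starting with "|" makes sox read from that command's output.
        set -- "$@" "|sox \"$fname\" -p pad $len 0"
    done < "$list"
    sox -m "$@" "$out"
}
```

With the example files above this would be called as mix_all infile long.wav output.wav. File names containing double quotes would break the inner command, so treat it as a starting point only.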
I would suggest keeping the files uncompressed until you're done processing.
Be careful of clipping when mixing. One way to avoid it is to apply a -6 dB gain to both signals and normalize afterwards, e.g.:
sox "$fname" -p pad "$len" 0 gain -6 | sox -m -p "| sox long.wav -p gain -6" output.wav gain -n
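As a quick check on why -6 dB is a sensible choice for two signals: a gain in dB maps to a linear amplitude factor of 10^(dB/20), so -6 dB roughly halves the amplitude, and two signals attenuated this way sum to about full scale at worst. A small sketch of that arithmetic (db_to_factor is a hypothetical helper name, not a sox feature):

```shell
# Convert a gain in dB to a linear amplitude factor: factor = 10^(dB/20).
# db_to_factor is a hypothetical helper for illustration only.
db_to_factor() {
    LC_ALL=C awk -v db="$1" 'BEGIN { printf "%.3f\n", 10 ^ (db / 20) }'
}

db_to_factor -6   # prints 0.501
```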