6

I have a huge data source that I'm filtering using some greps.

Here's basically what I'm doing right now:

#!/bin/bash
param1='something'
param2='another'
param3='yep'
echo $(avro-read /log/huge_data | grep $param1 | grep "$param2-" | grep $param3 | wc -l) / $(avro-read /log/ap/huge_data | grep $param1 | grep -v "$param2-" | grep $param3 | wc -l) | bc -l

Notice how I'm doing mostly the same filtering twice (with a single difference the second time), taking the count of each, and dividing one count by the other. This is definitely hacky, but I'd like to speed it up a bit by performing the initial filtering only once, without using a temp file.

I tried using a fifo, but I'm not sure it's possible to have two processes in one script read from it, or to have a third process "wait" until both are done before computing the final result. I also looked into tee, but again I wasn't sure how to synchronize the resulting subprocesses.
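For concreteness, the kind of tee split I was looking at, with seq standing in for the real avro-read stream and a throwaway pattern (all hypothetical):

```shell
#!/bin/bash
# One pass over the data, two counting subshells writing to files.
# The problem: the main script doesn't wait for the >(...) subshells,
# so the final echo can run before the count files are written.
seq 1 10 | tee \
    >(grep -c '1' > with_count) \
    >(grep -vc '1' > without_count) \
    > /dev/null
echo "$(cat with_count) / $(cat without_count)" | bc -l
```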

EDIT: Solved this myself using https://superuser.com/a/561248/43649, but marked another suggestion as the answer.

Andrew
  • 351

3 Answers

3

If you just want to avoid creating temporary files (or storing the output of grep in a variable), you can feed it to a for loop like this:

#!/bin/bash

IFS=$'\n'
yay=0
nay=0

for line in $(avro-read /log/huge_data | grep $param1 | grep $param3); do
    [[ $line =~ $param2- ]] && yay=$((yay + 1)) || nay=$((nay + 1))
done

echo $yay / $nay \* 100 | bc -l

unset IFS
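To sanity-check the loop, here's the same counting logic run against a synthetic seq stream instead of avro-read (the digit 3 is just a stand-in for your real filters):

```shell
#!/bin/bash
# Same yay/nay counting loop, fed with numbers 1..50; "yay" lines
# are those containing the digit 3 (3, 13, 23, 30-39, 43).
IFS=$'\n'
yay=0
nay=0
for line in $(seq 1 50); do
    [[ $line =~ 3 ]] && yay=$((yay + 1)) || nay=$((nay + 1))
done
unset IFS
echo "$yay / $nay * 100" | bc -l
```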

I've created a modified version of the approach in your self-answer that won't require temporary files:

#!/bin/bash

(avro-read /log/huge_data | grep $param1 | grep $param3 | tee \
     >(echo yay=$(grep -c "$param2-")) \
     >(echo nay=$(grep -vc "$param2-")) \
     >/dev/null | cat ; echo 'echo $yay / $nay \* 100 | bc -l') | sh

The outputs of the individual grep -c commands and of the final echo command are printed as

yay=123
nay=456
echo $yay / $nay \* 100 | bc -l

to avoid race conditions.¹ Piping to sh executes the printed commands.

¹ Whichever grep -c command finishes first will print the first line of output.

Dennis
  • 50,701

2

I ended up solving this like so:

#!/bin/bash
param1='something'
param2='another'
param3='yep'

avro-read /log/huge_data | grep $param1 | grep $param3 \
| tee \
>(grep "$param2-" | wc -l | tr -d '\n' > has_count) \
>(grep -v "$param2-" | wc -l | tr -d '\n' > not_count) \
> /dev/null

echo $(cat has_count | tr -d '\n') '/' $(cat not_count | tr -d '\n') '* 100' | bc -l

So rather than relying on a fifo, I used tee to split the stream into two separate processes that each just output a count. This way I don't need to synchronize the two processes myself before dividing the counts.
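As a sanity check, here's the same split against a synthetic seq stream instead of avro-read (the digit 1 is a stand-in pattern). The one wrinkle is that bash doesn't wait for the >(...) subshells, so for this check I've added a trailing | cat, which only reaches EOF once both subshells have finished writing their count files:

```shell
#!/bin/bash
# Split one pass of the stream into two counters, one file each.
# Of 1..20, eleven lines contain a '1' (1, 10-19), nine don't.
seq 1 20 | tee \
    >(grep '1' | wc -l | tr -d '\n' > has_count) \
    >(grep -v '1' | wc -l | tr -d '\n' > not_count) \
    > /dev/null | cat
echo "$(cat has_count) / $(cat not_count) * 100" | bc -l
```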

Andrew
  • 351
0

Hm, zsh has a feature called MULTIOS, which makes it possible to connect one process to two fifos. If that's an option, here's a small demo:

#!/bin/zsh -f

setopt multios

mkfifo f1 f2 2> /dev/null

param1='something'
param2='another'
param3='yep'

{ avro-read /log/huge_data | grep $param1 | grep $param3 } > f1 > f2 &

( cat f1 | grep $param2 | wc -l > value1 ) &!
value2=$(cat f2 | grep -v $param2 | wc -l)

print $(( 1. * $( cat value1 ) / $value2 ))

rm value1

However, I could not figure out a way to get around creating the temporary file value1, which should probably be avoided, as Dennis pointed out. But perhaps you'll like this solution nevertheless.

mpy
  • 28,816