I'm currently parsing Apache logs with this command:
tail -f /opt/apache/logs/access/gvh-access_log.1365638400 |
grep specific.stuff. | awk '{print $12}' | cut -d/ -f3 > ~/logs
The output is a list of domains:
www.domain1.com
www.domain1.com
www.domain2.com
www.domain3.com
www.domain1.com
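(As an aside, I believe the grep, awk and cut stages could be collapsed into a single awk invocation, something along these lines, though that isn't really the point of the question:)

tail -f /opt/apache/logs/access/gvh-access_log.1365638400 |
awk '/specific.stuff./ { split($12, parts, "/"); print parts[3] }' > ~/logs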
In another terminal I then run this command:
watch -n 10 'cat ~/logs | sort | uniq -c | sort -n | tail -50'
The output is:
1023 www.domain2.com
2001 www.domain3.com
12393 www.domain1.com
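(The cat there isn't strictly needed; letting sort read the file directly should behave the same, I assume:)

watch -n 10 'sort ~/logs | uniq -c | sort -n | tail -50'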
I use this to monitor Apache stats in quasi real time. The trouble is that the logs get very big very fast, and I don't need the log file for anything other than the uniq -c count.
My question is: is there any way to avoid the temporary file? I don't want to hand-roll my own counter in my language of choice; I'd like to use some awk magic if possible.
Note that since I need to use sort, I assumed I had to use a temp file somewhere in the process, because sorting an open-ended stream is meaningless (whereas running uniq on a stream isn't).
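To illustrate what I'm imagining, here is a rough, untested sketch that keeps the counts in an awk array and redraws them periodically, so nothing ever touches disk. It assumes GNU awk (for PROCINFO["sorted_in"]) and reuses my field number and pattern from above, so treat the details as placeholders:

tail -f /opt/apache/logs/access/gvh-access_log.1365638400 |
awk '/specific.stuff./ {
        split($12, parts, "/")                   # same extraction as cut -d/ -f3
        count[parts[3]]++                        # tally per domain, in memory
    }
    NR % 100 == 0 {                              # every 100 input lines, redraw
        printf "\033[2J\033[H"                   # clear the screen, roughly like watch
        PROCINFO["sorted_in"] = "@val_num_asc"   # gawk: iterate by ascending count
        for (d in count) printf "%8d %s\n", count[d], d
        fflush()
    }'

The obvious catch is that it refreshes per lines read rather than every 10 seconds, but it shows the shape of what I'm after.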