Searching across multiple documents for common words

Question

I have the lyrics to a song (as a .txt file).

I also have lyrics to 50 other songs.

I'm looking for a way to analyse/search those 50 song lyrics with the lyrics to the first song, and find which one of the 50 is most similar to the first (based on shared words/vocabulary).

I'm sorry for layman's speak - this isn't my area of knowledge(!)

Any help or pointers would be much appreciated.

Jack · Answer 1 · 2015-10-07T08:51:36.863

Here's my solution. I assumed that you only care about how many distinct words match, rather than how many times each one occurs (i.e. 'Baby' appearing 5 times in both songs is not worth 5x as many 'points').

First:

tr '\n' ' ' < songname.txt | tr -cd '[:alnum:] ' | tr -s ' ' '\n' | sort -f | uniq -i > songnamewords.txt

This turns all newlines into spaces, removes all non-alphanumeric characters (such as commas), removes any double spaces, puts every word on a separate line, sorts them and removes duplicate lines, ignoring case.
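Since this has to be repeated for every song, it may help to wrap the step in a function. The following is only a sketch: the names make_wordlist and build_wordlists are made up for illustration, the source and destination directories are whatever you pass in, and it lowercases every word (an extra assumption) so that 'Baby' and 'baby' always end up as the same entry.

```shell
# make_wordlist FILE: print the unique words of FILE, one per line, lowercased.
# (Hypothetical helper name; lowercasing is an added assumption.)
make_wordlist() {
    tr '\n' ' ' < "$1" |              # newlines -> spaces
        tr -cd '[:alnum:] ' |         # keep only letters, digits and spaces
        tr '[:upper:]' '[:lower:]' |  # fold case
        tr -s ' ' '\n' |              # one word per line, squeezing repeats
        sed '/^$/d' |                 # drop an empty first line, if any
        sort -u                       # sort and remove duplicates
}

# build_wordlists SRC DST: write SRC/name.txt -> DST/namewords.txt for each song.
build_wordlists() {
    mkdir -p "$2"
    for f in "$1"/*.txt; do
        [ -e "$f" ] || continue       # SRC contains no .txt files
        make_wordlist "$f" > "$2/$(basename "$f" .txt)words.txt"
    done
}
```

For example, build_wordlists lyrics wordlists would read every .txt file in a lyrics directory and write the word lists into a separate wordlists directory, so the raw lyrics and the word lists do not get mixed up.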

You need to do this to all the songs you want to compare, then secondly:

cat songname1words.txt songname2words.txt | sort -f | uniq -di | wc -l

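This prints the number of words the two songs share: each word list contains a word at most once, so any line that uniq -d reports as duplicated must appear in both files.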
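To see which words matched rather than only how many, the same idea can print the words themselves. This is a rough sketch, not part of the original commands: shared_words is a made-up name, and it lowercases and de-duplicates each list again as a precaution in case the lists were built differently.

```shell
# shared_words A B: print the words that occur in both word-list files.
# (Hypothetical helper; each list is lowercased and de-duplicated first,
# so a word can only appear twice in the merged stream if both files have it.)
shared_words() {
    {
        tr '[:upper:]' '[:lower:]' < "$1" | sort -u
        tr '[:upper:]' '[:lower:]' < "$2" | sort -u
    } | sort | uniq -d
}
```

Running shared_words songname1words.txt songname2words.txt | wc -l should then give the same kind of count as the one-liner above, and dropping the wc -l shows the actual overlap.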

I tried a few examples:

Maroon 5's Animals and Justin Bieber's Baby share 29 words.

Maroon 5's Animals and Opeth's Grand Conjuration share 10 words.

These are the kind of results you'd expect.

Also, here's how you would compare it against all other lyrics files:

a="songname1words.txt"; for f in *; do if [[ "$f" != "$a" ]]; then printf '%d - %s\n' "$(cat "$a" "$f" | sort -f | uniq -di | wc -l)" "$f"; fi; done | sort -rn

Where 'songname1words.txt' is the filename you want to compare them all against.

This compares every other file in the directory against this one, skipping the comparison with itself, and then sorts the results by score so that the best match is at the top.

It gives output like this:

29 - bieberwords.txt

10 - opethwords.txt
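If you would rather not keep intermediate word-list files around, both steps can be combined into one script that works directly on the raw lyrics. Treat this as a sketch under a few assumptions: unique_words and rank_songs are invented names, all songs are .txt files in one directory, the song you are comparing against sits in that directory too (it is skipped by file name), and words are lowercased before comparing.

```shell
#!/bin/sh
# Unique lowercased words of one raw lyrics file, one per line.
# (Hypothetical helper name; lowercasing is an added assumption.)
unique_words() {
    tr '\n' ' ' < "$1" | tr -cd '[:alnum:] ' | tr '[:upper:]' '[:lower:]' |
        tr -s ' ' '\n' | sed '/^$/d' | sort -u
}

# rank_songs TARGET DIR: score every .txt file in DIR against TARGET by the
# number of distinct shared words, and print the best match first.
rank_songs() {
    target_words=$(mktemp) || return 1
    unique_words "$1" > "$target_words"
    for f in "$2"/*.txt; do
        [ -e "$f" ] || continue                               # no .txt files
        [ "$(basename "$f")" = "$(basename "$1")" ] && continue  # skip target
        # Merge both de-duplicated lists; duplicated lines are shared words.
        n=$(unique_words "$f" | sort - "$target_words" | uniq -d | wc -l)
        printf '%d - %s\n' "$n" "$(basename "$f")"
    done | sort -rn
    rm -f "$target_words"
}
```

It would be called as rank_songs lyrics/songname1.txt lyrics, and prints one "score - filename" line per song in the same format as the output above.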
