Music Recognition and Signal Processing

Question

I want to build something similar to Tunatic or Midomi (try them out if you're not sure what they do) and I'm wondering what algorithms I'd have to use. My idea of how such applications work is something like this:

1. Have a big database with several songs.
2. For each song in 1, reduce the quality / bit-rate (to 64 kbps, for instance) and calculate the sound "hash".
3. Have the sound / excerpt of the music you want to identify.
4. For the song in 3, reduce the quality / bit-rate (again to 64 kbps) and calculate the sound "hash".
5. If the sound hash from 4 is in any of the sound hashes from 2, return the matched music.

I thought of reducing the quality / bit-rate to cope with environmental noise and encoding differences.

Am I on the right track here? Can anyone provide me with any specific documentation or examples? Midomi even seems to recognize hums, which is pretty impressive! How do they do that?

Do sound hashes exist or is it something I just made up? If they do, how can I calculate them? And more importantly, how can I check if child-hash is in father-hash?

How would I go about building a similar system with Python (maybe a built-in module) or PHP?

Some examples (preferably in Python or PHP) will be greatly appreciated. Thanks in advance!

Related questions: [Sound sample recognition library/code](http://stackoverflow.com/questions/2817663/sound-sample-recognition-library-code), [Acoustic fingerprint opensource?](http://stackoverflow.com/questions/2462410/acoustic-fingerprint-opensource). Good [wiki overview from MusicBrainz](http://wiki.musicbrainz.org/Fingerprinting). — Albert, Jan 15 '13 at 10:00
Here's an excellent option for Python: https://github.com/worldveil/dejavu — lollercoaster, Jul 21 '14 at 04:40

score 11 · Answer 1 · answered Jan 15 '10 at 15:20

I do research in music information retrieval (MIR). The seminal paper on music fingerprinting is the one by Haitsma and Kalker around 2002-03. A Google search should turn it up.

I read an early (really early; before 2000) white paper about Shazam's method. At that point, they just basically detected spectrotemporal peaks, and then hashed the peaks. I'm sure that procedure has evolved.

Both of these methods address music similarity at the signal level, i.e., they are robust to environmental distortions. I don't think they work well for query-by-humming (QBH). However, that is a different (yet related) problem with different (yet related) solutions, so you can find solutions in the literature. (Too many to name here.)

The ISMIR proceedings are freely available online. You can find valuable stuff there: http://www.ismir.net/

I agree with using an existing library like Marsyas. It depends on what you want. Numpy/Scipy is indispensable here, I think. Simple stuff can be written in Python on your own. Heck, if you need stuff like STFT or MFCC, I can email you code.

I knew I forgot something: similarity at the note level is often accomplished by computing a chromagram. For each time frame, you compute how much C, C#, D, D#, ..., A#, B is in the signal. Therefore, the chromagram has dimension 12-by-time. Many similarity methods use this. — Steve Tjoa, Jan 15 '10 at 15:23
Thanks for your input. Could you email me the code? I'm having a hard time trying to figure out this, some example code would be useful. — Alix Axel, Jan 21 '10 at 21:36
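
The chromagram described in the comment above could be sketched roughly as follows with NumPy. This is a minimal illustration, not the commenter's code: the function name `chromagram`, the frame size, the hop, and the 55 Hz to 5 kHz frequency range are all assumptions chosen for the example.

```python
import numpy as np


def chromagram(signal, sample_rate, frame_size=4096, hop=2048):
    """Return a 12-by-time array of energy per pitch class (0 = C, ..., 11 = B).

    Hypothetical helper: the signal must be at least frame_size samples long.
    """
    window = np.hanning(frame_size)
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)

    # Keep only bins in a musically useful range (assumed: 55 Hz to 5 kHz).
    valid = (freqs >= 55.0) & (freqs <= 5000.0)

    # Map each kept bin to the nearest MIDI note, then fold to a pitch class.
    midi = 69 + 12 * np.log2(freqs[valid] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12

    n_frames = 1 + (len(signal) - frame_size) // hop
    chroma = np.zeros((12, n_frames))
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_size] * window
        power = np.abs(np.fft.rfft(frame))[valid] ** 2
        # Sum the power of all bins that belong to the same pitch class.
        chroma[:, i] = np.bincount(pitch_class, weights=power, minlength=12)
    return chroma


# Usage: a pure 440 Hz tone (the note A) should light up pitch class 9.
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
chroma_a = chromagram(np.sin(2 * np.pi * 440 * t), sample_rate)
print(chroma_a.shape, chroma_a.argmax(axis=0))
```

Real systems usually add tuning estimation and per-frame normalization, which are left out here.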

score 10 · Accepted Answer · answered Jan 14 '10 at 23:27

I worked on the periphery of a cool framework that implements several Music Information Retrieval techniques. I'm hardly an expert (edit: actually I'm nowhere close to an expert, just to clarify), but I can tell you that the Fast Fourier Transform is used all over the place with this stuff. Fourier analysis is wacky, but its application is pretty straightforward. Basically, you can get a lot of information about audio when you analyze it in the frequency domain rather than the time domain. This is what Fourier analysis gives you.

That may be a bit off topic from what you want to do. In any case, there are some cool tools in the project to play with, and you can view the source code for the core library itself: http://marsyas.sness.net
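
As a small illustration of what the frequency domain gives you, here is a hedged NumPy sketch that finds the strongest frequency in a signal. The helper name `dominant_frequency` and the synthetic test tone are assumptions made for the example; they do not come from Marsyas.

```python
import numpy as np


def dominant_frequency(signal, sample_rate):
    """Return the frequency in Hz of the strongest spectral component."""
    # A Hann window reduces leakage from the edges of the analysed block.
    windowed = signal * np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]


# Usage: one second of a loud 440 Hz tone mixed with a quieter 1000 Hz tone.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1000 * t)
print(dominant_frequency(tone, sample_rate))
```

In the time domain the two tones are tangled together in one waveform; in the frequency domain they show up as two separate peaks.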

How can I detect if someone whistles into the iPhone microphone? — pankaj asudani, Aug 30 '15 at 14:31

score 5 · Answer 3 · answered Jun 06 '14 at 18:09

I recently ported my audio landmark-based fingerprinting system to Python:

https://github.com/dpwe/audfprint

It can recognize small (5-10 sec) excerpts from a reference database of 10s of thousands of tracks, and is quite robust to noise and channel distortions. It uses combinations of local spectral peaks, similar to the Shazam system.

This can only match the exact same track, since it relies on fine details of frequencies and time differences - it wouldn't even match different takes, certainly not cover versions or hums. As far as I understand, Midomi/SoundHound works by matching hums to each other (e.g. via dynamic time warping), then has a set of human-curated links between sets of hums and the intended music track.

Matching a hum directly to a music track ("Query by humming") is an ongoing research problem in music information retrieval, but is still pretty difficult. You can see abstracts for a set of systems evaluated last year at the MIREX 2013 QBSH Results.

StackUnderflow · Answer 4 · 2010-01-14T23:47:10.407

MFCCs extracted from the music are very useful for finding the timbral similarity between songs. This is most often used to find similar songs. As pointed out by darren, Marsyas is a tool that can be used to extract MFCCs and find similar songs by converting the MFCCs into a single vector representation.

Other than MFCCs, rhythm is also used to find song similarity. There are a few papers presented at MIREX 2009 that will give you a good overview of the different algorithms and features that are most helpful in detecting music similarity.

Albert · Answer 5 · 2015-10-10T23:38:19.547

The MusicBrainz project maintains such a database. You can make queries to it based on a fingerprint.

The project has existed for a while and has used different fingerprints in the past. See here for a list.

The latest fingerprint they are using is AcoustID. There is the Chromaprint library (also with Python bindings) with which you can create such fingerprints. You must feed it raw PCM data.

I have recently written a library in Python which does the decoding (using FFmpeg) and provides functions to generate the AcoustID fingerprint (using Chromaprint), among other things (such as playing the stream via PortAudio). See here.

score 1 · Answer 6 · answered Jan 14 '10 at 23:30

It's been a while since I last did signal processing, but rather than downsampling you should look at frequency-domain representations (e.g. FFT or DCT). Then you could make a hash of sorts and search for the database song that contains that sequence.

The tricky part is making this search fast (maybe some papers on gene search might be of interest). I suspect that iTunes also does some detection of instruments to narrow down the search.

score 1 · Answer 7 · answered Jan 14 '10 at 23:53

I did read a paper about the method that a certain music information retrieval service (no names mentioned) uses: it calculates the Short-Time Fourier Transform over the sample of audio. The algorithm then picks out 'peaks' in the frequency domain, i.e. time positions and frequencies with particularly high amplitude, and uses the time and frequency of these peaks to generate a hash. It turns out the hash has surprisingly few collisions between different samples, and also stands up to a loss of approximately 50% of the peak information.
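
The peak-and-hash idea above can be sketched as follows. This is a simplified illustration under stated assumptions, not the service's actual algorithm: all function names, the frame size, `peaks_per_frame`, `fan_out`, `max_dt`, and the synthetic "songs" are made up for the example. Peaks are paired with a few later peaks, each pair is hashed from (frequency 1, frequency 2, time difference), and a query is matched by voting for the song and time offset that most hashes agree on.

```python
import hashlib
from collections import Counter, defaultdict

import numpy as np


def spectrogram(signal, frame_size=1024, hop=512):
    """Magnitude STFT, shape (frames, bins); needs len(signal) >= frame_size."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.array([signal[i * hop:i * hop + frame_size] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


def find_peaks(spec, peaks_per_frame=3):
    """Return (frame, bin) for the strongest local maxima of each frame."""
    peaks = []
    for t, frame in enumerate(spec):
        is_max = (frame[1:-1] > frame[:-2]) & (frame[1:-1] > frame[2:])
        candidates = np.flatnonzero(is_max) + 1
        strongest = candidates[np.argsort(frame[candidates])[-peaks_per_frame:]]
        peaks.extend((t, int(f)) for f in sorted(strongest))
    return peaks


def fingerprint(signal, fan_out=5, max_dt=20):
    """Hash pairs of nearby peaks; return a list of (hash, anchor_frame)."""
    peaks = find_peaks(spectrogram(signal))
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for j in range(i + 1, len(peaks)):
            t2, f2 = peaks[j]
            dt = t2 - t1
            if dt == 0:
                continue  # only pair with peaks from later frames
            if dt > max_dt or paired == fan_out:
                break
            key = f"{f1}|{f2}|{dt}".encode()
            hashes.append((hashlib.sha1(key).hexdigest()[:16], t1))
            paired += 1
    return hashes


def build_index(songs):
    """Map every hash to the (song name, frame) positions where it occurs."""
    index = defaultdict(list)
    for name, signal in songs.items():
        for h, t in fingerprint(signal):
            index[h].append((name, t))
    return index


def match(index, query):
    """Vote for (song, time offset); the most consistent offset wins."""
    votes = Counter()
    for h, t_query in fingerprint(query):
        for name, t_song in index.get(h, []):
            votes[(name, t_song - t_query)] += 1
    if not votes:
        return None
    (name, _offset), _count = votes.most_common(1)[0]
    return name


def make_song(seed, sample_rate=8000, n_segments=20, segment_seconds=0.25):
    """Synthetic stand-in for a song: random two-tone segments."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sample_rate * segment_seconds)) / sample_rate
    tones = rng.uniform(200, 3000, size=(n_segments, 2))
    return np.concatenate([np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)
                           for f1, f2 in tones])


# Usage: index two synthetic songs, then look up an excerpt of the first one.
songs = {"song_a": make_song(1), "song_b": make_song(2)}
index = build_index(songs)
excerpt = songs["song_a"][512 * 20:512 * 60]
print(match(index, excerpt))
```

The excerpt here is clean and frame-aligned, which is the easy case. A recording made through a microphone would need more careful peak picking to stay robust.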

score 1 · Answer 8 · answered Jan 17 '10 at 13:56

Currently I'm developing a music search engine using ActionScript 3. The idea is analyzing the chords first and marking the frames (it's limited to mp3 files at the moment) where the frequency changes drastically (melody changes and ignoring noises). After that I do the same thing to the input sound, and match the results with the inverted files. The matching one determines the matching song.

For Axel's method, I think you shouldn't worry about whether the query is singing or just humming, since you aren't implementing a speech recognition program. But I'm curious about your method, which uses hash functions. Could you explain that to me?

score 0 · Answer 9 · answered Sep 15 '15 at 04:02

A query-by-humming feature is more complicated than an audio fingerprinting solution. The difficulties come from:

How do you efficiently build the melody database in a real-world application? Many demo systems build it from MIDI, but the cost of a MIDI-based solution is not affordable for a company.
How do you deal with time variance? For example, a user may hum fast or slow. Use DTW? Yes, DTW is a very good solution for time series with time variance, but it costs too much CPU.
How do you index the time series?
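
To make the DTW point concrete, here is a minimal sketch of the textbook dynamic time warping recurrence applied to pitch contours. It assumes the contours are already extracted as sequences of MIDI note numbers in the same key; the function name and the example melodies are made up. The nested loop is also why DTW is expensive: the cost grows with the product of the two sequence lengths.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic DTW cost between two 1-D sequences, O(len(a) * len(b))."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i, j] = d + min(cost[i - 1, j],      # a advances, b repeats
                                 cost[i, j - 1],      # b advances, a repeats
                                 cost[i - 1, j - 1])  # both advance
    return cost[n, m]


# Usage: the same melody hummed at half speed still matches perfectly,
# while a different melody does not.
melody = np.array([60, 60, 62, 64, 64, 65, 67])
slow_hum = np.repeat(melody, 2)
other = np.array([67, 65, 64, 62, 60, 62, 64])
print(dtw_distance(melody, slow_hum), dtw_distance(melody, other))
```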

Here is a demo open-source query-by-humming project that could serve as a reference: https://github.com/EmilioMolina/QueryBySingingHumming
