What is a good approach for extracting portions of speech from an arbitrary audio file?

Question

I have a set of audio files that are uploaded by users, and there is no knowing what they contain.

I would like to take an arbitrary audio file, and extract each of the instances where someone is speaking into separate audio files. I don't want to detect the actual words, just the "started speaking", "stopped speaking" points and generate new files at these points.

(I'm targeting a Linux environment, and developing on a Mac)

I've found Sox, which looks promising, and it has a 'vad' mode (Voice Activity Detection). However this appears to find the first instance of speech and strips audio until that point, so it's close, but not quite right.

I've also looked at Python's 'wave' library, but then I'd need to write my own implementation of Sox's 'vad'.

Are there any command line tools that would do what I want off the shelf? If not, any good Python or Ruby approaches?

7 years later, I have the same question and am trying to solve the same problem. @stef . could you please share how to achieved this task. — DJ_Stuffy_K, Jan 31 '18 at 20:12
Ah gosh, sorry, we raised a ton of funding and threw people at it! — stef, Jan 31 '18 at 23:26
HaHa! glad to know that you had the necessary resources to see it through to completion! :) — DJ_Stuffy_K, Feb 01 '18 at 02:19

Andrea Spadaccini · Accepted Answer · 2011-03-31T15:16:35.613

EnergyDetector

For Voice Activity Detection, I have been using the EnergyDetector program of the MISTRAL (was LIA_RAL) speaker recognition toolkit, based on the ALIZE library.

It works with feature files, not with audio files, so you'll need to extract the energy of the signal. I usually extract cepstral features (MFCC) with the log-energy parameter, and I use this parameter for VAD. You can use sfbcep`, an utility part of the SPro signal processing toolkit in the following way:

sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm

It will extract 19 MFCC + log-energy coefficient + first and second order delta coefficients. The energy coefficient is the 19th, you will specify that in the EnergyDetector configuration file.

You will then run EnergyDetector in this way:

EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output

If you use the configuration file that you find at the end of the answer, you need to put output.prm in prm/, and you'll find the segmentation in lbl/.

As a reference, I attach my EnergyDetector configuration file:

*** EnergyDetector Config File
***

loadFeatureFileExtension        .prm
minLLK                          -200
maxLLK                          1000
bigEndian                       false
loadFeatureFileFormat           SPRO4
saveFeatureFileFormat           SPRO4
saveFeatureFileSPro3DataKind    FBCEPSTRA
featureServerBufferSize         ALL_FEATURES
featureServerMemAlloc           50000000
featureFilesPath                prm/
mixtureFilesPath                gmm/
lstPath                         lst/
labelOutputFrames               speech
labelSelectedFrames             all
addDefaultLabel                 true
defaultLabel                    all
saveLabelFileExtension          .lbl
labelFilesPath                  lbl/    
frameLength                     0.01
segmentalMode                   file
nbTrainIt                       8       
varianceFlooring                0.0001
varianceCeiling                 1.5     
alpha                           0.25
mixtureDistribCount             3
featureServerMask               19      
vectSize                        1
baggedFrameProbabilityInit      0.1
thresholdMode                   weight

CMU Sphinx

The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.

A very recent addition is the GStreamer support. This means that you can use its VAD in a GStreamer media pipeline. See Using PocketSphinx with GStreamer and Python -> The 'vader' element

Other VADs

I have also been using a modified version of the AMR1 Codec that outputs a file with speech/non speech classification, but I cannot find its sources online, sorry.

@stef you are welcome. I hope that you did find it useful and succeeded with your task, that is difficult! — Andrea Spadaccini, Apr 02 '11 at 20:40
Hi I tried your instructions but I had an issue. I used a file that reported it was "RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 16000 Hz" Proceeding Energy based silence detection for [../output] (SegTools) The label format is LIARAL [ InvalidDataException 0x10f19b0 ] message = "Wrong header" — stackoverflow128, Feb 21 '14 at 09:46

score 3 · Answer 2 · answered Apr 24 '16 at 17:02

webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code.

It comes with a file, example.py, that does exactly what you're looking for: Given a .wav file, it finds each instance of someone speaking and writes it out to a new, separate .wav file.

The webrtcvad API is extremely simple, in case example.py doesn't do quite what you want:

import webrtcvad

vad = webrtcvad.Vad()
# sample must be 16-bit PCM audio data, either 8KHz, 16KHz or 32Khz,
# and 10, 20, or 30 milliseconds long.
print vad.is_voiced(sample)

score 2 · Answer 3 · edited Sep 01 '21 at 17:53

2

Hi pyAudioAnalysis has a silence removal functionality.

In this library, silence removal can be as simple as that:

from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

[Fs, x] = aIO.readAudioFile("data/recording1.wav")
segments = aS.silenceRemoval(x, 
                             Fs, 
                             0.020, 
                             0.020, 
                             smoothWindow=1.0, 
                             Weight=0.3, 
                             plot=True)

silenceRemoval() implementation reference: https://github.com/tyiannak/pyAudioAnalysis/blob/944f1d777bc96717d2793f257c3b36b1acf1713a/pyAudioAnalysis/audioSegmentation.py#L670

Internally silence removal() follows a semi-supervised approach: first, an SVM model is trained to distinguish between high-energy and low-energy short-term frames. Towards this end, 10% of the highest energy frames along with 10% of the lowest ones are used. Then, the SVM is applied (with a probabilistic output) on the whole recording and dynamic thresholding is used to detect the active segments.

Reference Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144610

edited Sep 01 '21 at 17:53

strikersps

338
3
16

answered Apr 17 '15 at 23:31

Theodore Giannakopoulos

151
2
3

1

That's a pretty neat library! Would you also be able to post a sample code snippet to fill out this answer a bit? – SJoshi Apr 17 '15 at 23:44
As @SJoshi said, it'd be grand if you could add some code showing how you'd use pyAudioAnalysis. That would really help those who see your answer to make use of it. – Wai Ha Lee Apr 17 '15 at 23:57
what if there is no silence in the audio but only speech and music segments. does pyaudio analysis handle that condition? Also it would be nice if you could add some code sir. – kRazzy R Dec 18 '17 at 15:26
yes it can if you train a "segment classifier" and then apply fix-sized segmentation – Theodore Giannakopoulos Dec 11 '19 at 09:36

score 0 · Answer 4 · answered Jul 21 '14 at 18:13

0

SPro and HTK are the toolkits you neeed. You can also see there implementation using the documentation of Alize Toolkit.

http://alize.univ-avignon.fr/doc.html

answered Jul 21 '14 at 18:13

Ashutosh Sharma

1
1

2

Much as I wrote for Theodore, having an example in your answer would improve it immensely. That way we're not completely reliant on a link. – SJoshi Apr 17 '15 at 23:59
Mr. Ashutosh, please add a complete answer as per stackoverflow recommended guidelines – kRazzy R Dec 18 '17 at 15:30

What is a good approach for extracting portions of speech from an arbitrary audio file?

4 Answers4

EnergyDetector

CMU Sphinx

Other VADs