Windows 10 speaker sound (voice) to text?

Question

I understand that I can control Windows 10 by voice and that I can dictate ("voice to text"). Is there a way to simply display the speaker sound (in this case, my Spanish teacher speaking) as text?

It should work a bit like YouTube "auto captions", simply displaying everything that is said as (Spanish) text.

Dictate works on the MIC input; I need to use the speaker output as the source instead.
Dictate stops after a while; I need a permanent voice-to-text conversion.

Is there any way to configure Windows to do that? Or are there other solutions?

questionto42 · Answer 1 · 2021-04-21T13:24:59.247

It seems that no Windows built-in program can do that for now, although one can expect this in the future, given that the Windows assistant Cortana is already there and the Speech-To-Text app is already available on a smaller scale.

Yet, for now, the "other solutions" are needed:

You need to search for an ASR (Automatic Speech Recognition) model, also called an STT (Speech-To-Text) model.

A nice theoretical overview of ASR is at https://maelfabien.github.io/machinelearning/speech_reco/#.

As this question is about the practical side of it:

Either you buy a Speech-To-Text program. I once bought Dragon NaturallySpeaking from the market leader "Nuance", which was sold in combination with a Philips VoiceTracer. This is not meant as an advertisement; it is just how I got my first Speech-To-Text program. I have never tested it, although doing that is still on my list :).
Or you search for a pretrained model or train a model yourself.

I will just describe how I searched, which is the main answer, rather than give the exact links. StackExchange is not really about dropping products or links, which is considered rather off-topic. I have not tested anything, and I am not a professional user.

Searching for ASR models, I found three pretrained models at "Hugging Face", an AI community that seems to offer the most relevant choice of models, which is good if I only want a few relevant results at first: https://huggingface.co/models?pipeline_tag=automatic-speech-recognition. Then I looked at them in detail and found that they are based on projects that are publicly available on GitHub:

Two are based on ESPnet. Mind that ESPnet2 is coming soon. A demo is available at https://github.com/espnet/espnet#asr-demo.
The Facebook model is based on the wav2vec 2.0 model at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.

So everything starts and ends on GitHub, which should not be a surprise. On GitHub, you would search for ASR, STT, Automatic Speech Recognition, Speech-To-Text, and perhaps just "speech", as I did. Sorting the results by stars, I found "Mozilla DeepSpeech" to be the most promising project: https://github.com/mozilla/DeepSpeech#project-deepspeech.

For Chrome, there is SpeechTexter, which supports all of the various dialects of Spanish.

You should try the free version of Google Speech-to-Text.

Also, if you search with the right keywords and add your language, you will find models that are pretrained in the language you need, for example:

"speech spanish" leads to https://github.com/luchovelez/SpeechRecognition
"deepspeech spanish" shows six results with few to no stars (which does not mean that they will not work): https://github.com/search?q=deepspeech+spanish&type=Repositories

If you keep searching like this, you will find more projects. You will usually not need any programming skills; the demos are more of a copy-and-paste job. The only thing needed is to have the right programming framework at hand.
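
The keyword searches described above can be turned into links with a few lines of Python. This is only a minimal sketch: the function name `github_repo_search_url` is made up for illustration, and the `s=stars` / `o=desc` parameters for sorting by stars are my assumption about GitHub's web search URL, not something taken from the links above.

```python
from urllib.parse import urlencode


def github_repo_search_url(keywords, sort_by_stars=False):
    """Build a GitHub repository search link from a list of keywords.

    Hypothetical helper for illustration only.
    """
    # Same parameters as in the "deepspeech spanish" link above.
    params = {"q": " ".join(keywords), "type": "Repositories"}
    if sort_by_stars:
        # Assumption: the web search accepts s=stars and o=desc for sorting.
        params["s"] = "stars"
        params["o"] = "desc"
    # urlencode() turns spaces into "+" as in the original link.
    return "https://github.com/search?" + urlencode(params)


if __name__ == "__main__":
    for words in (["deepspeech", "spanish"], ["speech", "to", "text", "spanish"]):
        print(github_repo_search_url(words, sort_by_stars=True))
```

Swapping "spanish" for another language gives the equivalent search for that language.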

Mind that some models or programs need a specific sample rate as input, for example 16 kHz. You will sometimes need to convert your audio files or your audio input.
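
To give an idea of what such a conversion involves, here is a minimal sketch that resamples mono 16-bit PCM samples to 16 kHz by linear interpolation, using only the Python standard library. The function name `resample_linear` is made up for illustration, and the method is deliberately simple: it applies no anti-aliasing filter, so a dedicated audio tool will usually give better quality.

```python
from array import array


def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample mono 16-bit PCM samples by linear interpolation.

    Hypothetical helper for illustration; no anti-aliasing filter is applied.
    """
    if src_rate == dst_rate or len(samples) < 2:
        return array("h", samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # distance between output samples, in input samples
    last = len(samples) - 1
    out = array("h")
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        if left >= last:
            # Past the final input sample: repeat it.
            out.append(samples[last])
            continue
        frac = pos - left
        # Weighted average of the two neighbouring input samples.
        value = samples[left] * (1.0 - frac) + samples[left + 1] * frac
        out.append(int(round(value)))
    return out


if __name__ == "__main__":
    one_second_48k = array("h", [0] * 48000)
    print(len(resample_linear(one_second_48k, 48000)))
```

The samples could come, for example, from a WAV file read with the standard `wave` module; stereo input would first have to be reduced to one channel.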

Horst Walter · Answer 2 · 2021-02-27T19:35:42.933

Here is what I am currently using:

I use a piece of software (in my case VOICEMEETER) that lets me redirect my sound output to 2 devices. External software is needed because the Windows mixer is not an option in my case (Windows mixer "not mixing" with headset, but with another output device. Why?).
VOICEMEETER lets me route the output sound back to a (virtual) input device. So I now have a VIRTUAL input device that reads back the output sound.
Next, I set the microphone in Google Chrome to that VIRTUAL input device.
Hence I can use Google Translate to create a transcript. This works with any sound, so I could also play back music or video.

A little summary:

My use case is that I want to see the transcript of my Spanish teacher speaking
I can now achieve that simply by going to "Google Translate" and pressing the MIC button
It is even possible for me to see the Spanish AND English text at the same time
I need VOICEMEETER because I still need to hear my teacher (Zoom conference) while redirecting the output at the same time
Windows mixer did not work for me, see the linked post
I have tried other apps like Firefox or Word dictate. The problem here is that I cannot change the MIC (it uses the DEFAULT input device), and I need the MIC itself for talking to my teacher. See Change microphone for Word/Outlook Dictate only (Win10)?
I am not affiliated in any way with VOICEMEETER; anyway, kudos to those guys for a nice UI and tool.

Shortcomings:

Google Translate has a word/duration limit - in my case it is irrelevant, but it might matter for other use cases
The solution is browser based so far

Legal FOO:

Make sure you meet the legal requirements in your country; check whether it is legal to create a transcript of a conference/audio/video call
Also check the terms and conditions of Google etc. to verify that this approach is covered
