Does Azure's batch transcription support speaker diarization for more than 2 speakers?
I checked their Rest API documentation and didn't find anything relevant.
Are there other ways to do this using Azure cognitive services?
I believe that diarization is limited to two speakers. From the Microsoft documentation on speech-to-text batch transcription:
diarizationEnabled - Optional, false by default. Specifies that diarization analysis should be carried out on the input, which is expected to be a mono channel that contains two voices. Requires wordLevelTimestampsEnabled to be set to true. [emphasis added]
Source: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/batch-transcription
Cognitive Services now supports Speaker Recognition, which verifies the voice print of enrolled speakers and may work for conversations with more than two parties, but it only applies to speakers who have enrolled profiles.
To update Frank's answer: more than 2 speakers now appears to be supported without Speaker Recognition, as of API version 3.1, through the diarization property:
diarization - [..] You need to use this property when you expect three or more speakers. For two speakers setting diarizationEnabled property to true is enough. [..] The maximum number of speakers for diarization must be less than 36 and more or equal to the minSpeakers property.
As noted above, the older diarizationEnabled method is still available, but only for a maximum of 2 speakers.
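As a rough illustration of the quoted docs, here is a sketch of the JSON body you would POST to the v3.1 transcriptions endpoint to request diarization for up to 5 speakers. The audio URL, speaker counts, and display name are placeholders, and the exact property shape (`speakers` with `minCount`/`maxCount`) should be checked against the current v3.1 reference before use:

```python
import json

# Hypothetical request body for:
#   POST https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions
# (endpoint, URL, and counts are placeholders, not values from the question)
payload = {
    "displayName": "multi-speaker-diarization",
    "locale": "en-US",
    "contentUrls": ["https://example.com/audio.wav"],  # placeholder audio file
    "properties": {
        # Required by diarization per the older docs quoted above
        "wordLevelTimestampsEnabled": True,
        # Needed for 3+ speakers in v3.1; maxCount must be < 36
        # and >= minCount, per the quoted documentation
        "diarization": {
            "speakers": {
                "minCount": 1,
                "maxCount": 5,
            }
        },
    },
}

print(json.dumps(payload, indent=2))
```

For exactly two speakers, the quoted docs say setting `diarizationEnabled` to `true` (with word-level timestamps) remains sufficient, so the nested `diarization` block is only needed when you expect three or more.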