How to normalize audio with varying audio volume with FFmpeg?

Question

I have a .MOV file whose sound I would like to normalize. However, when I run the following command I see that the volume varies (mono to stereo and with varying max_volume levels) throughout the video:

ffmpeg -i video.avi -af "volumedetect" -f null /dev/null

I would like to normalize the whole file but am not sure how to do so. The output is as follows:

ffmpeg version N-50911-g9efcfbe Copyright (c) 2000-2013 the FFmpeg developers   built on Mar 13 2013
21:26:48 with gcc 4.7.2 (GCC)   configuration: --enable-gpl
--enable-version3 --disable-w32threads --enable-avisynth --enable-bzlib --enable-fontconfi g --enable-frei0r --enable-gnutls --enable-libass --enable-libbluray --enable-libcaca --enable-libfreetype --enable-libg sm --enable-libilbc --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --ena ble-libopus --enable-librtmp --enable-libschroedinger --enable-libsoxr --enable-libspeex --enable-libtheora --enable-lib twolame --enable-libvo-aacenc --enable-libvo-amrwbenc --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libxa vs --enable-libxvid --enable-zlib   libavutil      52. 19.100 / 52. 19.100   libavcodec     55.  0.100 / 55.  0.100   libavformat    55.  0.100 / 55.  0.100   libavdevice    54.  4.100 / 54.  4.100   libavfilter     3. 45.103 / 
3. 45.103   libswscale      2.  2.100 /  2.  2.100   libswresample   0. 17.102 /  0. 17.102   libpostproc    52.  2.100 / 52.  2.100 [mov,mp4,m4a,3gp,3g2,mj2 @ 020d9920] multiple edit list entries, a/v
desync might occur, patch welcome Input #0, mov,mp4,m4a,3gp,3g2,mj2,
from '.\IMG_2783.mov':   Metadata:
    major_brand     : qt
    minor_version   : 0
    compatible_brands: qt
    creation_time   : 2014-08-20 02:36:31   Duration: 00:06:06.43, start: 0.000000, bitrate: 10206 kb/s
    Stream #0:0(und): Audio: aac (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 78 kb/s
    Metadata:
      creation_time   : 2014-08-20 02:36:32
      handler_name    : Core Media Data Handler
    Stream #0:1(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 1280x720, 10116 kb/s, 30 fps, 30 tbr, 600 tbn, 12 00 tbc
    Metadata:
      creation_time   : 2014-08-20 02:36:32
      handler_name    : Core Media Data Handler Output #0, null, to '/dev/nullclear':   Metadata:
    major_brand     : qt
    minor_version   : 0
    compatible_brands: qt
    encoder         : Lavf55.0.100
    Stream #0:0(und): Video: rawvideo (I420 / 0x30323449), yuv420p, 1280x720, q=2-31, 200 kb/s, 90k tbn, 30 tbc
    Metadata:
      creation_time   : 2014-08-20 02:36:32
      handler_name    : Core Media Data Handler
    Stream #0:1(und): Audio: pcm_s16le, 44100 Hz, stereo, s16, 1411 kb/s
    Metadata:
      creation_time   : 2014-08-20 02:36:32
      handler_name    : Core Media Data Handler Stream mapping:   Stream #0:1 -> #0:0 (h264 -> rawvideo)   Stream #0:0 -> #0:1 (aac ->
pcm_s16le) Press [q] to stop, [?] for help [null @ 03e023a0] Encoder
did not produce proper pts, making some up. Input stream #0:0 frame
changed from rate:44100 fmt:fltp ch:2 chl:stereo to rate:44100
fmt:fltp ch:1 chl:mono [Parsed_volumedetect_0 @ 03e02ea0] n_samples:
3346432 [Parsed_volumedetect_0 @ 03e02ea0] mean_volume: -36.2 dB
[Parsed_volumedetect_0 @ 03e02ea0] max_volume: -0.0 dB
[Parsed_volumedetect_0 @ 03e02ea0] histogram_0db: 8
[Parsed_volumedetect_0 @ 03e02ea0] histogram_1db: 18
[Parsed_volumedetect_0 @ 03e02ea0] histogram_2db: 38
[Parsed_volumedetect_0 @ 03e02ea0] histogram_3db: 58
[Parsed_volumedetect_0 @ 03e02ea0] histogram_4db: 94
[Parsed_volumedetect_0 @ 03e02ea0] histogram_5db: 58
[Parsed_volumedetect_0 @ 03e02ea0] histogram_6db: 80
[Parsed_volumedetect_0 @ 03e02ea0] histogram_7db: 196
[Parsed_volumedetect_0 @ 03e02ea0] histogram_8db: 222
[Parsed_volumedetect_0 @ 03e02ea0] histogram_9db: 236
[Parsed_volumedetect_0 @ 03e02ea0] histogram_10db: 404
[Parsed_volumedetect_0 @ 03e02ea0] histogram_11db: 522
[Parsed_volumedetect_0 @ 03e02ea0] histogram_12db: 766
[Parsed_volumedetect_0 @ 03e02ea0] histogram_13db: 738 Input stream
#0:0 frame changed from rate:44100 fmt:fltp ch:1 chl:mono to rate:44100 fmt:fltp ch:2 chl:stereo [Parsed_volumedetect_0 @ 03834340]
n_samples: 626688 [Parsed_volumedetect_0 @ 03834340] mean_volume:
-39.6 dB [Parsed_volumedetect_0 @ 03834340] max_volume: -15.5 dB [Parsed_volumedetect_0 @ 03834340] histogram_15db: 17
[Parsed_volumedetect_0 @ 03834340] histogram_16db: 19
[Parsed_volumedetect_0 @ 03834340] histogram_17db: 25
[Parsed_volumedetect_0 @ 03834340] histogram_18db: 41
[Parsed_volumedetect_0 @ 03834340] histogram_19db: 88
[Parsed_volumedetect_0 @ 03834340] histogram_20db: 231
[Parsed_volumedetect_0 @ 03834340] histogram_21db: 439 Input stream
#0:0 frame changed from rate:44100 fmt:fltp ch:2 chl:stereo to rate:44100 fmt:fltp ch:1 chl:mono [Parsed_volumedetect_0 @ 08f34b00]
n_samples: 192512 [Parsed_volumedetect_0 @ 08f34b00] mean_volume:
-19.7 dB [Parsed_volumedetect_0 @ 08f34b00] max_volume: 0.0 dB [Parsed_volumedetect_0 @ 08f34b00] histogram_0db: 2048 Input stream
#0:0 frame changed from rate:44100 fmt:fltp ch:1 chl:mono to rate:44100 fmt:fltp ch:2 chl:stereo [Parsed_volumedetect_0 @ 08f34aa0]
n_samples: 769024 [Parsed_volumedetect_0 @ 08f34aa0] mean_volume:
-28.2 dB [Parsed_volumedetect_0 @ 08f34aa0] max_volume: -0.1 dB [Parsed_volumedetect_0 @ 08f34aa0] histogram_0db: 15
[Parsed_volumedetect_0 @ 08f34aa0] histogram_1db: 29
[Parsed_volumedetect_0 @ 08f34aa0] histogram_2db: 72
[Parsed_volumedetect_0 @ 08f34aa0] histogram_3db: 102
[Parsed_volumedetect_0 @ 08f34aa0] histogram_4db: 158
[Parsed_volumedetect_0 @ 08f34aa0] histogram_5db: 163
[Parsed_volumedetect_0 @ 08f34aa0] histogram_6db: 299 Input stream
#0:0 frame changed from rate:44100 fmt:fltp ch:2 chl:stereo to rate:44100 fmt:fltp ch:1 chl:mono [Parsed_volumedetect_0 @ 08f34800]
n_samples: 96256 [Parsed_volumedetect_0 @ 08f34800] mean_volume: -65.0
dB [Parsed_volumedetect_0 @ 08f34800] max_volume: -37.3 dB
[Parsed_volumedetect_0 @ 08f34800] histogram_37db: 2
[Parsed_volumedetect_0 @ 08f34800] histogram_38db: 0
[Parsed_volumedetect_0 @ 08f34800] histogram_39db: 2
[Parsed_volumedetect_0 @ 08f34800] histogram_40db: 10
[Parsed_volumedetect_0 @ 08f34800] histogram_41db: 8
[Parsed_volumedetect_0 @ 08f34800] histogram_42db: 4
[Parsed_volumedetect_0 @ 08f34800] histogram_43db: 14
[Parsed_volumedetect_0 @ 08f34800] histogram_44db: 12
[Parsed_volumedetect_0 @ 08f34800] histogram_45db: 12
[Parsed_volumedetect_0 @ 08f34800] histogram_46db: 20
[Parsed_volumedetect_0 @ 08f34800] histogram_47db: 16 Input stream
#0:0 frame changed from rate:44100 fmt:fltp ch:1 chl:mono to rate:44100 fmt:fltp ch:2 chl:stereo [Parsed_volumedetect_0 @ 08f34800]
n_samples: 533504 [Parsed_volumedetect_0 @ 08f34800] mean_volume:
-43.9 dB [Parsed_volumedetect_0 @ 08f34800] max_volume: -23.4 dB [Parsed_volumedetect_0 @ 08f34800] histogram_23db: 47
[Parsed_volumedetect_0 @ 08f34800] histogram_24db: 453
[Parsed_volumedetect_0 @ 08f34800] histogram_25db: 685 Input stream
#0:0 frame changed from rate:44100 fmt:fltp ch:2 chl:stereo to rate:44100 fmt:fltp ch:1 chl:mono [Parsed_volumedetect_0 @ 08f34980]
n_samples: 98304 [Parsed_volumedetect_0 @ 08f34980] mean_volume: -16.4
dB [Parsed_volumedetect_0 @ 08f34980] max_volume: -6.2 dB
[Parsed_volumedetect_0 @ 08f34980] histogram_6db: 28
[Parsed_volumedetect_0 @ 08f34980] histogram_7db: 290 frame=10993
fps=817 q=0.0 Lsize=N/A time=00:06:38.32 bitrate=N/A video:687kB
audio:68764kB subtitle:0 global headers:0kB muxing overhead
-100.000031% [Parsed_volumedetect_0 @ 08f34920] n_samples: 13807616 [Parsed_volumedetect_0 @ 08f34920] mean_volume: -28.3 dB
[Parsed_volumedetect_0 @ 08f34920] max_volume: 0.0 dB
[Parsed_volumedetect_0 @ 08f34920] histogram_0db: 188
[Parsed_volumedetect_0 @ 08f34920] histogram_1db: 532
[Parsed_volumedetect_0 @ 08f34920] histogram_2db: 1078
[Parsed_volumedetect_0 @ 08f34920] histogram_3db: 1416
[Parsed_volumedetect_0 @ 08f34920] histogram_4db: 1948
[Parsed_volumedetect_0 @ 08f34920] histogram_5db: 2805
[Parsed_volumedetect_0 @ 08f34920] histogram_6db: 4288
[Parsed_volumedetect_0 @ 08f34920] histogram_7db: 6068

score 0 · Answer 1 · edited Nov 24 '23 at 18:36

Coming as an audio guy who's done some reading:

You might want to approach this by splitting up these different sections with something like atrim, normalising (replaygain to find peak), and stitching back together with the other sections that have had the same processing applied.

If the dynamic range was much lower, you could have considered something like compand to reduce the dynamic level of the top end of dynamics, allowing the level of the whole file to be increased. Compand may well be something useful to do after normalisation anyway.

score 0 · Answer 2 · edited Nov 24 '23 at 18:40

0

I think your question has already been answered here.

It looks like your file has two stereo audio streams though... an AAC and a WAV stream. Pick the WAV stream as the source for your re-encode.

edited Nov 24 '23 at 18:40

Giacomo1968

58,727

answered Sep 08 '14 at 22:12

Tim

121

How to normalize audio with varying audio volume with FFmpeg?

2 Answers2