I work in video quality research, and it's hard to give a simple answer to the question "which video is better". Especially when you have two clips from the same source, but no access to the original source (e.g., two copies of a movie obtained somewhere “from the Internet”).
What you want, ideally, is a program that gives you a Mean Opinion Score (MOS) of a video, i.e. a number between 1 and 5, or between 0 and 100, which corresponds to the quality as perceived by a human being.
If you want practical tools, you can skip the next sections.
Intro: Why you cannot simply compare bitrate/resolution/etc.
Just comparing video resolution won't tell you anything about the quality. In fact, it may be completely misleading. A 1080p movie rip at 700 MB might look worse than a 720p rip at 700 MB, because for the former, the bitrate is too low, which introduces all kinds of compression artifacts.
The same goes for comparing bitrate at similar frame sizes, as different encoders can actually deliver better quality at less bitrate, or vice-versa. For example, a 720p 700MB rip produced with XviD will look worse than a 700MB rip produced with x264, because the latter is much more efficient.
It gets even more complicated when different resolutions are involved. Humans are more annoyed by low-bitrate artifacts than by lower resolution, so a 540p video can look better than a 720p video, depending on how each was encoded.
You would also have to define how a final "integral score" (the MOS) is composed of the individual quality factors. This heavily depends on several things, including but not limited to:
- the type of videos you are comparing (cartoons, movies, news, etc.)
- the original quality before they were encoded (you can encode a bad file with high bitrate and it won't make it better)
This is just speaking about pure compression quality. We're not even talking about how different people would perceive the videos. Let's assume you have a friend who watches movies because they enjoy crisp details and high motion resolution. They would be much more critical when seeing a low-quality rip than a friend who watches movies purely for their content, and who probably would not care about the quality so much, as long as the movie is funny or entertaining.
There are different types of video quality metrics!
There are several so-called video quality metrics, which can be classified according to which kind of information is used to determine the quality. In principle and very simply speaking, you distinguish between the following:
No-reference (NR) metrics – You have one video as input and want a quality score. When comparing two clips without access to the original, this is what you are looking for, because you cannot access the original video. Such a metric takes one video and outputs one quality score, and typically detects problems like blurring or blockiness. For various reasons, there are no truly accurate NR models that provide a single overall quality score, and NR metrics cannot easily distinguish between problems in the original source (e.g., a blurry phone camera) and problems introduced by encoding (e.g., downscaling).
Full-reference (FR) metrics – You have an original video and an encoded video and want a quality score. For example, you could take a Blu-ray movie, create two rips from it, and use a full-reference metric to estimate the quality loss of your rips. This will usually take a bit longer to compute, but it's much more accurate than NR metrics. Note that for FR metrics, both the source and the degraded video must have the same resolution; if not, you have to upscale the degraded video first so that its resolution matches the source.
Note that the above metrics look at video encoding quality, but there are also metrics that incorporate problems like initial loading times and stalling events when streaming video (e.g. ITU-T P.1203).
What tools can I use?
Ready-to-use tools for testing video quality metrics (such as ffmpeg, VQMT, and the MSU tool) are mentioned alongside each metric in the sections below.
Now what metrics are there?
PSNR, PSNR-HVS and PSNR-HVS-M
For starters, PSNR (Peak Signal-to-Noise Ratio) is a very simple but somewhat crude method of assessing video quality. It is useful for quick diagnostics, but it does not give a good estimate of how humans perceive quality.
In other words, only use PSNR when you have no other method available: it correlates poorly with subjective scores.
PSNR is calculated frame by frame; you would then, for example, average the per-frame PSNR over the whole sequence to get the final score. Higher PSNR is better. ffmpeg can be used to calculate PSNR.
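To make the per-frame computation concrete, here is a minimal pure-Python sketch (the function name and toy frames are my own, not from any library):

```python
import math

def psnr(reference, distorted, max_value=255):
    """PSNR between two equally sized 8-bit grayscale frames,
    given as flat lists of pixel values (a toy sketch, not a library API)."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return float("inf")  # identical frames: no noise at all
    return 10 * math.log10(max_value ** 2 / mse)

# Two tiny "frames": every pixel of the distorted one is off by 10.
ref = [100, 110, 120, 130]
dist = [110, 120, 130, 140]
print(round(psnr(ref, dist), 2))  # MSE = 100 -> ~28.13 dB
```

With ffmpeg, something like ffmpeg -i distorted.mp4 -i reference.mp4 -lavfi psnr -f null - computes this across a whole video (the first input is the distorted one, the second the reference) and prints an average at the end.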
PSNR-HVS and PSNR-HVS-M are extensions of PSNR that try to emulate human visual perception, so they should be more accurate. VQMT and MSU can calculate PSNR, PSNR-HVS and PSNR-HVS-M between two videos.
SSIM, MS-SSIM
Structural Similarity (SSIM) is as easy to calculate as PSNR and delivers more accurate results, but still on a frame-by-frame basis. ffmpeg can calculate SSIM as well (use the same filter approach as for PSNR, replacing psnr with ssim).
You can also use VQMT or MSU, or my Python tool. VQMT and MSU also include MS-SSIM, which gives better (i.e., more representative) results than SSIM, as well as a few other derivatives.
The results should be similar to PSNR, but SSIM is more accurate. Again, you need to compare a reference to a processed video for this to work, and both videos should be of the same resolution.
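For illustration, here is a single-window ("global") SSIM sketch in plain Python using the standard constants. Real SSIM implementations average this over small local windows, so treat it only as a sketch of the formula:

```python
def ssim_global(x, y, max_value=255):
    """Simplified single-window SSIM over two equally sized grayscale
    frames given as flat pixel lists. Real SSIM averages this over
    small local windows; this global variant is only an illustration."""
    c1 = (0.01 * max_value) ** 2  # standard stabilizing constants
    c2 = (0.03 * max_value) ** 2
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

ref = [100, 110, 120, 130]
dist = [110, 120, 130, 140]
print(round(ssim_global(ref, ref), 4))   # identical frames -> 1.0
print(round(ssim_global(ref, dist), 4))  # slightly below 1.0
```

Unlike PSNR, the score is bounded: 1.0 means identical frames, and lower values mean more structural distortion.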
VMAF
Video Multi-Method Assessment Fusion by Netflix is a set of tools to calculate video quality based on some existing metrics, which are then fused by machine learning methods into a final score between 0 and 100. Netflix have explained the whole thing as follows:
[VMAF] predicts subjective quality by combining multiple elementary quality metrics. The basic rationale is that each elementary metric may have its own strengths and weaknesses with respect to the source content characteristics, type of artifacts, and degree of distortion. By ‘fusing’ elementary metrics into a final metric using a machine-learning algorithm - in our case, a Support Vector Machine (SVM) regressor - which assigns weights to each elementary metric, the final metric could preserve all the strengths of the individual metrics, and deliver a more accurate final score.
VMAF has become the de-facto standard in the video industry to gauge a video's quality. It performs much better than PSNR and SSIM, and it also takes into account video-specific properties like motion. Again, the videos that you compare have to have the same resolution; if not, upscale the lower-resolution one.
You can also use ffmpeg to calculate VMAF scores, or my own tool.
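To illustrate the "fusion" idea (and only the idea), here is a toy linear combination standing in for the trained SVM regressor. The feature names, values, and weights below are invented for illustration and have nothing to do with VMAF's actual model:

```python
def fused_score(features, weights, bias=0.0):
    """Combine elementary metric scores into one number and clamp it to
    the 0-100 range VMAF uses. A toy stand-in for the trained SVM
    regressor; the weights here are invented, not VMAF's."""
    raw = bias + sum(w * f for w, f in zip(weights, features))
    return max(0.0, min(100.0, raw))

# Hypothetical per-frame features, each normalized to 0..1 (assumption):
# detail preservation, absence of added artifacts, motion quality.
features = [0.7, 0.9, 0.4]
weights = [40.0, 45.0, 15.0]  # made-up weights for illustration
print(round(fused_score(features, weights), 2))  # ~74.5
```

The real thing is available via ffmpeg's libvmaf filter (e.g. ffmpeg -i distorted.mp4 -i reference.mp4 -lavfi libvmaf -f null -; check your build's documentation for the expected input order).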
VQM
The Video Quality Metric was validated in the Video Quality Experts Group (VQEG) and is a good full-reference algorithm. You can download VQM for free or use the implementation from MSU.
When you register and download, you want to use the NTIA General Model or the Video Quality Model with Variable Frame Delay.
The model works well, but it has been trained at the beginning of the 21st century, so VMAF would obviously be the better choice.
AVQT
Apple have developed their own video quality tool (Advanced Video Quality Tool, AVQT), which works in a full-reference manner, so it requires an input video and a degraded version. You can only use this on Apple machines with Apple Silicon processors.
So far, only a few details are known about how well this tool works. In contrast to other tools like VMAF, it can handle much larger display resolutions and even HDR. However, its accuracy has not yet been independently validated.
ITU P.1204.3
This is an ITU-T standard for bitstream-based evaluation of video quality. It is a short-term video quality prediction model that uses full bitstream data to estimate quality scores on a segment level (for segments of about 10 seconds).
A reference implementation can be found on GitHub.
It has been shown to perform well, achieving performance similar to VMAF, even though it does not require the source video.
There are other models based on P.1204.3 which can be used without a bitstream, i.e. if you only have codec, bitrate, framerate and resolution information. This is more useful for analytics purposes where you don't have a video file at hand.
Other Metrics
- PEVQ is a standardized full-reference metric under ITU-T J.246. It aims at multimedia signals, but not HD video.
- VQuad-HD is another full-reference metric, standardized as ITU-T J.341. Since it's newer, it's better suited for HD video.
Both of them are commercial solutions, and you won't find software to download for them.
There are also some ITU standards on no-reference metrics, such as ITU-T P.1201 and ITU-T P.1202, which work with parameters from the bitstream for IPTV streaming. Those are, however, quite outdated and probably should not be used anymore.
ITU-T P.1203 can be used for adaptive streaming cases; it factors in initial loading delay, stalling, and quality switches over time, to generate an overall quality score. It is, however, not as accurate when you just want to compare two video clips. For that you should use a full reference model like VMAF or a more accurate bitstream-based model like P.1204.3.
Summary
If you just seek to compare simple objectively measurable criteria like:
- Frame size
- Bit rate
- Frames per second
- Video resolution
… a simple call to ffprobe input.mp4 should give you all the details you need at the beginning. You could then summarize this in a spreadsheet. Note that when you encode videos, x264, for example, can log values like PSNR straight to a file, so you can use them later.
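If you want to automate this, ffprobe can also print JSON (ffprobe -v quiet -print_format json -show_format -show_streams input.mp4). The snippet below parses a trimmed sample of that output; the field names match ffprobe's JSON layout, but the values themselves are made up:

```python
import json

# Trimmed, made-up sample of ffprobe's JSON output.
ffprobe_output = """
{
  "streams": [
    {"codec_type": "video", "codec_name": "h264",
     "width": 1280, "height": 720, "avg_frame_rate": "24000/1001"}
  ],
  "format": {"duration": "5400.0", "bit_rate": "1450000"}
}
"""

info = json.loads(ffprobe_output)
# Pick the first video stream (files may also contain audio/subtitles).
video = next(s for s in info["streams"] if s["codec_type"] == "video")
num, den = (int(x) for x in video["avg_frame_rate"].split("/"))
fps = round(num / den, 3)
kbps = round(int(info["format"]["bit_rate"]) / 1000)
print(f"{video['codec_name']} {video['width']}x{video['height']} "
      f"{fps} fps, {kbps} kbit/s")  # -> h264 1280x720 23.976 fps, 1450 kbit/s
```

A small script like this makes it easy to dump the same columns for many files into a spreadsheet.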
As for how to weigh these criteria, you should probably emphasize the bitrate – but only if you know that the codec/resolution is the same. You could generally say that when you have two encodes of the same original, and both videos use x264, the one with higher bitrate will be better. When the bitrate is quite low, you might see that lower resolution videos look better than the higher-resolution ones, since the degradation due to upscaling is not as bad as the degradation due to low bitrate.
Comparing different codecs by their bitrate alone is not possible unless you know more about the content and the individual encoding settings. Frame rate is also perceived quite subjectively and should be factored into your measurements if it is well below 25 Hz. For such inter-codec comparisons, metrics like VMAF or P.1204.3 are suitable.
If you have no metrics available, use your eyes!
I am a big fan of VIVICT, a tool that allows you to 1:1 compare two clips, side by side, to see which one looks better. There are two versions, one for the browser, and one that works offline. Simply load two videos, and you get a slider with which you can choose the part of the video to compare.
A similar tool is video-compare. Here, videos are not required to be the same resolution, color format, container format, codec or duration etc. It also accepts FFmpeg commands, for example: video-compare -d -l "setpts=8.0*PTS" -r "setpts=8.0*PTS" videoA.mp4 videoB.mp4 will slow both videos down, so you can check for judder/motion blur with frame rate conversions etc.
To summarize, if you have no software tools at hand, heavily emphasize the bitrate if it's the only thing you have. Don't forget to use your eyes, too :)