
I have some photos, and some of them are duplicates, but no matter how I sort them the duplicates end up scattered because they differ in resolution and have irregular names.

I tried gm compare but can't figure out which metric to use or which values would indicate a match.

Here's an example of two images that look exactly the same, except that the second one has 2x the resolution (better quality):

gm compare -metric MAE "7920068.jpg" "7920034.jpg"
gm compare -metric MSE "7920068.jpg" "7920034.jpg"
gm compare -metric PAE "7920068.jpg" "7920034.jpg"
gm compare -metric PSNR "7920068.jpg" "7920034.jpg"
gm compare -metric RMSE "7920068.jpg" "7920034.jpg"

Image Difference (MeanAbsoluteError):
           Normalized    Absolute
          ============  ==========
     Red: 0.1751787015    11480.3
   Green: 0.1168407563     7657.2
    Blue: 0.0029600541      194.0
   Total: 0.0983265040     6443.8

Image Difference (MeanSquaredError):
           Normalized    Absolute
          ============  ==========
     Red: 0.0910979679     5970.1
   Green: 0.0274231091     1797.2
    Blue: 0.0000203617        1.3
   Total: 0.0395138129     2589.5

Image Difference (PeakAbsoluteError):
           Normalized    Absolute
          ============  ==========
     Red: 1.0000000000    65535.0
   Green: 0.7803921569    51143.0
    Blue: 0.0784313725     5140.0
   Total: 1.0000000000    65535.0

Image Difference (PeakSignalToNoiseRatio):
           PSNR
          ======
     Red: 10.40
   Green: 15.62
    Blue: 46.91
   Total: 14.03

Image Difference (RootMeanSquaredError):
           Normalized    Absolute
          ============  ==========
     Red: 0.3018243991    19780.1
   Green: 0.1655992426    10852.5
    Blue: 0.0045123979      295.7
   Total: 0.1987808163    13027.1

With gm identify I found these values:

          |image a        |image a @2x    |image b
Red:
  Minimum:|  0.00 (0.0000)|  0.00 (0.0000)|  0.00 (0.0000)
  Maximum:|255.00 (1.0000)|255.00 (1.0000)|255.00 (1.0000)
  Mean:   |175.81 (0.6894)|176.00 (0.6902)|117.79 (0.4619)
  Std Dev:| 65.59 (0.2572)| 65.73 (0.2577)| 61.55 (0.2414)
Green:
  Minimum:|  0.00 (0.0000)|  0.00 (0.0000)|  0.00 (0.0000)
  Maximum:|255.00 (1.0000)|255.00 (1.0000)|255.00 (1.0000)
  Mean:   |161.58 (0.6336)|162.47 (0.6371)| 99.07 (0.3885)
  Std Dev:| 71.14 (0.2790)| 71.26 (0.2794)| 64.94 (0.2547)
Blue:
  Minimum:|  0.00 (0.0000)|  0.00 (0.0000)|  0.00 (0.0000)
  Maximum:|255.00 (1.0000)|255.00 (1.0000)|255.00 (1.0000)
  Mean:   |153.59 (0.6023)|153.27 (0.6010)|104.50 (0.4098)
  Std Dev:| 71.65 (0.2810)| 71.67 (0.2811)| 60.09 (0.2357)

It looks like I can use these values to compare: the two image a files have very similar values to each other, and both differ clearly from image b. I just need a good threshold to indicate what might be a match.
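A rough sketch of that idea, assuming the `Mean:` lines look like the ones in `gm identify -verbose` output and that the images have the same channels. The function names `channel_means` and `means_close` and the tolerance are made up for illustration, and matching means can only suggest a candidate pair, not prove a duplicate:

```shell
#!/bin/sh
# Print the per-channel means found in `gm identify -verbose` output (stdin).
# Assumes lines of the form "Mean:   89.97 (0.3528)".
channel_means() {
    awk '$1 == "Mean:" { printf "%s ", $2 } END { print "" }'
}

# Succeed (exit 0) if every pair of corresponding means differs by at most
# the tolerance. $1 and $2 are space-separated lists of means; $3 is the
# tolerance on the 0-255 scale (a hypothetical value you would have to tune).
means_close() {
    echo "$1 $2 $3" | awk '{
        if ((NF - 1) % 2 != 0) exit 1      # lists of different length
        n = (NF - 1) / 2; tol = $NF
        for (i = 1; i <= n; i++) {
            d = $i - $(i + n); if (d < 0) d = -d
            if (d > tol) exit 1
        }
        exit 0
    }'
}

# Usage (file names are placeholders):
#   a=$(gm identify -verbose first.jpg  | channel_means)
#   b=$(gm identify -verbose second.jpg | channel_means)
#   means_close "$a" "$b" 2 && echo "possible match"
```

Since very different photos can share similar channel means, this is probably best used as a cheap first pass before a pixel-level comparison.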

I'll use these images as an example:

  1. different image BOSS8
  2. subject image BOSS1
  3. subject image at half size BOSS12

and here's their output:

$ gm identify -verbose BOSS-1.jpg
Image: BOSS-1.jpg
  Format: JPEG (Joint Photographic Experts Group JFIF format)
  Geometry: 591x1049
  Class: DirectClass
  Type: true color
  Depth: 8 bits-per-pixel component
  Channel Depths:
    Red:      8 bits
    Green:    8 bits
    Blue:     8 bits
  Channel Statistics:
    Red:
      Minimum:                     7.00 (0.0275)
      Maximum:                   255.00 (1.0000)
      Mean:                       89.97 (0.3528)
      Standard Deviation:         79.68 (0.3125)
    Green:
      Minimum:                    11.00 (0.0431)
      Maximum:                   255.00 (1.0000)
      Mean:                      108.55 (0.4257)
      Standard Deviation:         70.34 (0.2758)
    Blue:
      Minimum:                     8.00 (0.0314)
      Maximum:                   255.00 (1.0000)
      Mean:                      126.50 (0.4961)
      Standard Deviation:         68.28 (0.2678)
  Resolution: 72x72 pixels
  Filesize: 129.6Ki
  Interlace: No
  Orientation: Unknown
  Background Color: white
  Border Color: #DFDFDF
  Matte Color: #BDBDBD
  Page geometry: 591x1049+0+0
  Compose: Over
  Dispose: Undefined
  Iterations: 0
  Compression: JPEG
  JPEG-Quality: 93
  JPEG-Colorspace: 2
  JPEG-Colorspace-Name: RGB
  JPEG-Sampling-factors: 2x2,1x1,1x1
  Signature: 06a764225a290be783b0b3b90c72356f71b0032af8f58e88857c33d6e59b8ccc
  Profile-EXIF: 74 bytes
    Exif Offset: 26
    Color Space: 1
    Exif Image Width: 591
    Exif Image Length: 1049
  Tainted: False
  Elapsed Time: 0m:0.011805s
  Pixels Per Second: 50.1Mi

$ gm identify -verbose BOSS-1-50.jpg
Image: BOSS-1-50.jpg
  Format: JPEG (Joint Photographic Experts Group JFIF format)
  Geometry: 296x525
  Class: DirectClass
  Type: true color
  Depth: 8 bits-per-pixel component
  Channel Depths:
    Red:      8 bits
    Green:    8 bits
    Blue:     8 bits
  Channel Statistics:
    Red:
      Minimum:                     7.00 (0.0275)
      Maximum:                   255.00 (1.0000)
      Mean:                       89.34 (0.3504)
      Standard Deviation:         78.83 (0.3091)
    Green:
      Minimum:                    12.00 (0.0471)
      Maximum:                   255.00 (1.0000)
      Mean:                      107.87 (0.4230)
      Standard Deviation:         70.29 (0.2756)
    Blue:
      Minimum:                    14.00 (0.0549)
      Maximum:                   255.00 (1.0000)
      Mean:                      125.77 (0.4932)
      Standard Deviation:         68.19 (0.2674)
  Resolution: 72x72 pixels
  Filesize: 44.2Ki
  Interlace: No
  Orientation: Unknown
  Background Color: white
  Border Color: #DFDFDF
  Matte Color: #BDBDBD
  Page geometry: 296x525+0+0
  Compose: Over
  Dispose: Undefined
  Iterations: 0
  Compression: JPEG
  JPEG-Quality: 93
  JPEG-Colorspace: 2
  JPEG-Colorspace-Name: RGB
  JPEG-Sampling-factors: 2x2,1x1,1x1
  Signature: 2c12437d162d8bf92ad49497e2644ca3a5edd9d3c8947d44445a5923565123cc
  Profile-EXIF: 74 bytes
    Exif Offset: 26
    Color Space: 1
    Exif Image Width: 296
    Exif Image Length: 525
  Tainted: False
  Elapsed Time: 0m:0.002051s
  Pixels Per Second: 72.3Mi

$ gm identify -verbose BOSS-8.jpg
Image: BOSS-8.jpg
  Format: JPEG (Joint Photographic Experts Group JFIF format)
  Geometry: 584x1050
  Class: DirectClass
  Type: true color
  Depth: 8 bits-per-pixel component
  Channel Depths:
    Red:      8 bits
    Green:    8 bits
    Blue:     8 bits
  Channel Statistics:
    Red:
      Minimum:                     0.00 (0.0000)
      Maximum:                   255.00 (1.0000)
      Mean:                       91.51 (0.3589)
      Standard Deviation:         85.21 (0.3341)
    Green:
      Minimum:                     0.00 (0.0000)
      Maximum:                   255.00 (1.0000)
      Mean:                      110.18 (0.4321)
      Standard Deviation:         83.58 (0.3278)
    Blue:
      Minimum:                     0.00 (0.0000)
      Maximum:                   255.00 (1.0000)
      Mean:                      132.97 (0.5214)
      Standard Deviation:         87.69 (0.3439)
  Resolution: 72x72 pixels
  Filesize: 180.5Ki
  Interlace: No
  Orientation: Unknown
  Background Color: white
  Border Color: #DFDFDF
  Matte Color: #BDBDBD
  Page geometry: 584x1050+0+0
  Compose: Over
  Dispose: Undefined
  Iterations: 0
  Compression: JPEG
  JPEG-Quality: 93
  JPEG-Colorspace: 2
  JPEG-Colorspace-Name: RGB
  JPEG-Sampling-factors: 2x2,1x1,1x1
  Signature: 9d12ad4d93d1c8d219d41ef9755984bcb151a8de502c70279aea4b69202c99d1
  Profile-EXIF: 74 bytes
    Exif Offset: 26
    Color Space: 1
    Exif Image Width: 584
    Exif Image Length: 1050
  Tainted: False
  Elapsed Time: 0m:0.016498s
  Pixels Per Second: 35.4Mi

yarns
  • 145

2 Answers

2

You can try normalizing the images by resizing them all to the same fixed square size, ignoring the aspect ratio. Comparing the normalized images gives quite low absolute MSE values (~100) when the two are the same picture, versus roughly 5000 when they are different:

$ gm convert -geometry 1000x1000! same-big.jpg norm-same-big.jpg
$ gm convert -geometry 1000x1000! same-small.jpg norm-same-small.jpg
$ gm convert -geometry 1000x1000! different.jpg norm-different.jpg

$ gm compare -metric mse norm-same-big.jpg norm-same-small.jpg
Image Difference (MeanSquaredError):
           Normalized    Absolute
          ============  ==========
     Red: 0.0015487693      101.5
   Green: 0.0009830381       64.4
    Blue: 0.0015041910       98.6
   Total: 0.0013453328       88.2

$ gm compare -metric mse norm-same-big.jpg norm-different.jpg
Image Difference (MeanSquaredError):
           Normalized    Absolute
          ============  ==========
     Red: 0.0829284628     5434.7
   Green: 0.0682458298     4472.5
    Blue: 0.0753763994     4939.8
   Total: 0.0755168974     4949.0

You could easily turn this into a script that takes two filenames, normalizes them, compares the normalized images, and then reports back the original filenames if the difference is close enough.
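Something along these lines might work as a starting point. It is only a sketch: the function names, the default threshold of 500 (picked between the ~88 and ~4949 totals above) and the use of MIFF temporary files to avoid a second round of JPEG compression are my own assumptions, not anything GraphicsMagick prescribes:

```shell
#!/bin/sh
# Extract the absolute "Total" value from `gm compare -metric mse` output
# (stdin). Assumes a line of the form "Total: <normalized> <absolute>".
total_mse() {
    awk '$1 == "Total:" { print $3 }'
}

# Succeed if value $1 is at or below threshold $2 (floating-point compare).
below_threshold() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 <= t + 0) }'
}

# Normalize two images to 1000x1000 and print both names if they look alike.
# $1, $2: image files; $3: optional MSE threshold (hypothetical default 500).
report_if_similar() {
    threshold=${3:-500}
    mse=""
    tmp=$(mktemp -d) || return 1
    gm convert -geometry '1000x1000!' "$1" "$tmp/a.miff" &&
    gm convert -geometry '1000x1000!' "$2" "$tmp/b.miff" &&
    # 2>&1 in case the metric report is written to stderr on some builds.
    mse=$(gm compare -metric mse "$tmp/a.miff" "$tmp/b.miff" 2>&1 | total_mse)
    rm -rf "$tmp"
    [ -n "$mse" ] && below_threshold "$mse" "$threshold" &&
        echo "$1 $2 (MSE $mse)"
}

# Usage:
#   report_if_similar same-big.jpg same-small.jpg
#   report_if_similar same-big.jpg different.jpg 300
```

Comparing every pair this way is quadratic in the number of files, so for a large collection you would probably want to normalize each image once and keep the results around.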

1

Not an answer, but other bases for comparison:

  • Use GraphicsMagick to get the image size, and compare the ratio of width to height. There could be a small delta due to rounding errors, but the ratio should be the same for an image rescaled to a different size. See also this similar question.

  • Use ImageMagick to extract similar information, which may be more amenable to comparison.

  • Use ExifTool to extract the EXIF data. If one image was made from another with the EXIF data retained, and the original had that data, it should be substantially the same in both.
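For the aspect-ratio idea in the first bullet, a minimal sketch could look like this. The function name `same_aspect` and the default tolerance of 0.005 are assumptions for illustration; the geometries are expected as `WIDTHxHEIGHT`, which `gm identify -format '%wx%h'` should produce:

```shell
#!/bin/sh
# Succeed if two "WIDTHxHEIGHT" geometries have nearly the same aspect ratio.
# $3 is the allowed absolute difference in width/height (default 0.005,
# a hypothetical value chosen to absorb rounding from rescaling).
same_aspect() {
    echo "$1 $2 ${3:-0.005}" | awk '{
        split($1, a, "x"); split($2, b, "x")
        if (a[2] + 0 <= 0 || b[2] + 0 <= 0) exit 1   # malformed geometry
        d = a[1] / a[2] - b[1] / b[2]; if (d < 0) d = -d
        exit !(d <= $3 + 0)
    }'
}

# Usage (file names are placeholders):
#   g1=$(gm identify -format '%wx%h' first.jpg)
#   g2=$(gm identify -format '%wx%h' second.jpg)
#   same_aspect "$g1" "$g2" && echo "same aspect ratio"
```

With the geometries from the question, 591x1049 and 296x525 differ by about 0.0004 and pass, while 591x1049 and 584x1050 differ by about 0.007 and do not. A matching ratio only narrows the candidates, since many unrelated photos share the same shape.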
