The task is to write a benchmark that measures the accuracy of imagehash.py's phash under different option combinations. The purpose is to see which options matter most for good accuracy when writing a version that can easily be implemented across programming languages and platforms.
There are multiple variables that differ between pHash implementations, for example (a parameterized sketch of these options follows the list):
1.) Order of grayscaling and resizing (i.e. resize first or grayscale first)
2.) Resampling filter used when scaling the image
3.) Whether the DCT transform uses orthogonal normalization or not (i.e. OpenCV compatibility)
4.) Whether the median or the mean is used as the cut value for deciding 0 or 1 bits
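The sketch below shows how these four choices could be exposed as parameters of a single hash function. It follows the general structure of imagehash's phash but is only an illustrative sketch, not the reference implementation; `phash_variant` is a hypothetical name, and Pillow, NumPy and SciPy are assumed to be available.

```python
import numpy as np
import scipy.fftpack
from PIL import Image


def phash_variant(image, hash_size=8, highfreq_factor=4,
                  grayscale_first=True, resample=Image.LANCZOS,
                  dct_norm=None, use_median=True):
    """64-bit pHash with the four implementation choices exposed as parameters."""
    img_size = hash_size * highfreq_factor
    # Option 1: order of grayscaling and resizing
    # Option 2: resampling filter (ANTIALIAS is the same filter as LANCZOS)
    if grayscale_first:
        image = image.convert('L').resize((img_size, img_size), resample)
    else:
        image = image.resize((img_size, img_size), resample).convert('L')
    pixels = np.asarray(image, dtype=np.float64)
    # Option 3: 2-D DCT-II with norm=None or norm='ortho' (OpenCV-style)
    dct = scipy.fftpack.dct(scipy.fftpack.dct(pixels, axis=0, norm=dct_norm),
                            axis=1, norm=dct_norm)
    lowfreq = dct[:hash_size, :hash_size]
    # Option 4: median or mean as the cut value for the 0/1 decision
    cut = np.median(lowfreq) if use_median else np.mean(lowfreq)
    bits = (lowfreq > cut).flatten()
    return sum(1 << i for i, b in enumerate(bits) if b)  # hash packed into an int
```

Returning the hash as a plain integer keeps the exact-match comparisons in the benchmark trivial.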
Reference code
Benchmark rules for the first test
1.) Select 1000 high-resolution JPG photos and 4 scaled versions of each (640, 800, 1024, 2048) (https://petscan.wmcloud.org/?psid=29099516&format=json)
2.) Generate pHashes for all images so that there are two sets covering every photo (see the generation sketch after this list), and record which hashes belong to the original images.
- Resizing the image first:
image = image.resize((img_size, img_size), ANTIALIAS).convert('L')
- Grayscaling the image first:
image = image.convert('L').resize((img_size, img_size), ANTIALIAS)
3.) Calculate the results of the tests for each option (see the scoring sketch after this list):
- count how many direct correct matches there are in each set
- count how many matches there are in each set where a scaled image matches some other photo
- count how many of the original images' hashes collide with each other (i.e. hash collisions) in each set
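For step 2, a minimal generation sketch, assuming a hypothetical on-disk layout of `images/<photo_id>/{orig,640,800,1024,2048}.jpg` and reusing the `phash_variant` helper sketched above:

```python
from pathlib import Path
from PIL import Image


def build_hash_sets(image_root):
    """Return two hash sets (resize-first and grayscale-first), keyed by (photo_id, variant)."""
    resize_first, grayscale_first = {}, {}
    for path in sorted(Path(image_root).glob('*/*.jpg')):
        photo_id, variant = path.parent.name, path.stem  # variant is 'orig' or a width
        with Image.open(path) as img:
            resize_first[(photo_id, variant)] = phash_variant(img, grayscale_first=False)
            grayscale_first[(photo_id, variant)] = phash_variant(img, grayscale_first=True)
    return resize_first, grayscale_first
```

Keying every hash by `(photo_id, variant)` preserves the information about which hashes belong to the original images.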
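For step 3, a minimal scoring sketch, assuming each hash set is a dict keyed by `(photo_id, variant)` as produced above, and treating a "match" as exact hash equality (Hamming distance 0):

```python
from itertools import combinations


def score_hash_set(hashes):
    """Count correct matches, wrong matches and original-vs-original collisions for one hash set."""
    originals = {pid: h for (pid, variant), h in hashes.items() if variant == 'orig'}
    correct = wrong = 0
    for (pid, variant), h in hashes.items():
        if variant == 'orig':
            continue
        if h == originals[pid]:
            correct += 1  # scaled image matches its own original
        wrong += sum(1 for other_pid, other_h in originals.items()
                     if other_pid != pid and other_h == h)  # matches some other photo
    collisions = sum(1 for a, b in combinations(originals.values(), 2) if a == b)
    return {'correct_matches': correct, 'wrong_matches': wrong,
            'original_collisions': collisions}
```

The same function can then be run on both hash sets (resize-first and grayscale-first) and on each option combination to compare the results.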
GitHub project for the code