Review how different options affect Imagehash accuracy
Open, Needs Triage, Public

Description

The task is to write a benchmark that measures the accuracy of imagehash.py's pHash under different options. The goal is to find out which options matter most for accuracy, so that a version that is easy to implement in various programming languages and platforms can be written.

Several variables differ between pHash implementations.

For example:
1.) Order of grayscaling and resizing (i.e. resizing first or grayscaling first)
2.) Resampling filter used when scaling the image
3.) Whether orthogonal normalization is used in the DCT (i.e. OpenCV compatibility)
4.) Whether the median or the average is used as the cut-off value for deciding between 0 and 1
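To make options 3 and 4 concrete, here is a minimal pure-Python sketch of a pHash-style bit computation on an already grayscaled and resized image. It is loosely modeled on the usual DCT-based approach and is not the imagehash.py code. The names `dct_1d`, `dct_2d` and `phash_bits`, the 32x32 input size and the synthetic test image are all assumptions made for illustration.

```python
import math
from statistics import mean, median


def dct_1d(values, ortho=False):
    """Type-II DCT of a sequence, with optional orthonormal scaling."""
    n = len(values)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i, v in enumerate(values))
        if ortho:
            # Orthonormal scaling: the k = 0 term is weighted differently.
            s *= math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        else:
            # Unnormalized convention with a constant factor of 2 (assumed).
            s *= 2
        out.append(s)
    return out


def dct_2d(matrix, ortho=False):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(row, ortho) for row in matrix]
    cols = [dct_1d(col, ortho) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def phash_bits(gray, hash_size=8, ortho=False, use_median=True):
    """Return hash_size * hash_size bits from the low-frequency DCT corner."""
    coeffs = dct_2d(gray, ortho)
    low = [coeffs[y][x] for y in range(hash_size) for x in range(hash_size)]
    # Option 4: the cut-off value is either the median or the average.
    cut = median(low) if use_median else mean(low)
    return [1 if c > cut else 0 for c in low]


# Hypothetical 32x32 grayscale image, used only to exercise the options.
gray = [[(x * 7 + y * 13) % 256 for x in range(32)] for y in range(32)]
for ortho in (False, True):
    for use_median in (True, False):
        bits = phash_bits(gray, ortho=ortho, use_median=use_median)
        print(ortho, use_median, "".join(map(str, bits)))
```

With orthonormal scaling, the first row and column of coefficients are weighted differently from the rest, so the cut-off value and some bits can change. This is presumably why the option is worth benchmarking separately.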

Reference code

Benchmark rules for the first test
1.) Select 1000 high-resolution JPEG photos and 4 scaled versions of each (640, 800, 1024, 2048) (https://petscan.wmcloud.org/?psid=29099516&format=json)

2.) Generate pHashes for all of them so that there are two sets for all photos. Record which hashes belong to the original images.

  • Resizing image first

image = image.resize((img_size, img_size), ANTIALIAS).convert('L')

  • Grayscaling image first.

image = image.convert('L').resize((img_size, img_size), ANTIALIAS)
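The two orderings would be identical if both steps were exact linear operations. They can differ because each step rounds back to 8-bit values. The toy model below illustrates this on a single block of pixels. It uses a plain box average as a stand-in for the resampling filter and simple rounding of the ITU-R 601-2 luma weights (299, 587, 114), so it is a simplified assumption and not what Pillow does internally. All function names are hypothetical.

```python
def to_gray(pixel):
    """Luma of an (R, G, B) pixel, rounded to an integer."""
    r, g, b = pixel
    return round((299 * r + 587 * g + 114 * b) / 1000)


def box_average(values):
    """Average a block of values and round back to an integer."""
    return round(sum(values) / len(values))


def resize_then_gray(block):
    """Shrink the block to one RGB pixel first, then grayscale it."""
    averaged = tuple(box_average(channel) for channel in zip(*block))
    return to_gray(averaged)


def gray_then_resize(block):
    """Grayscale every pixel first, then shrink the block to one value."""
    return box_average([to_gray(pixel) for pixel in block])


# A 2x2 block of dark blue pixels, flattened into a list.
block = [(0, 0, 4), (0, 0, 4), (0, 0, 4), (0, 0, 8)]
print("resize first:", resize_then_gray(block))
print("grayscale first:", gray_then_resize(block))
```

For this block the two orders end up one gray level apart. Differences of this size only matter for a hash when a DCT coefficient sits very close to the cut-off value.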

3.) Calculate the test results for each option

  • count how many direct correct matches there are in each set
  • count how many matches there are in each set where scaled images match other images
  • count how many matches there are in each set where the hashes of original images collide with each other (i.e. hash collisions)
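The three counts above can be sketched as follows. The data layout (a dict of original hashes keyed by image id, and a list of `(image_id, hash)` pairs for scaled images) and the function name `evaluate` are assumptions for illustration. Counting collisions as unordered pairs of originals is also an assumption, since the task does not say how collisions are tallied.

```python
def evaluate(original_hashes, scaled_hashes):
    """Count direct matches, cross matches and original-vs-original collisions.

    original_hashes: dict mapping image id -> hash of the original image
    scaled_hashes:   iterable of (image id, hash of one scaled version)
    """
    # Index the originals by hash so each lookup is a dict access.
    owners_by_hash = {}
    for image_id, h in original_hashes.items():
        owners_by_hash.setdefault(h, set()).add(image_id)

    direct = 0  # scaled hash equals the hash of its own original
    cross = 0   # scaled hash equals the hash of some other original
    for image_id, h in scaled_hashes:
        owners = owners_by_hash.get(h, set())
        if image_id in owners:
            direct += 1
        if owners - {image_id}:
            cross += 1

    # Each group of n originals sharing a hash contributes n*(n-1)/2 pairs.
    collisions = sum(len(ids) * (len(ids) - 1) // 2
                     for ids in owners_by_hash.values())
    return {"direct": direct, "cross": cross, "collisions": collisions}


# Made-up hashes, only to show the call.
originals = {"a": "ff00", "b": "ff00", "c": "1234"}
scaled = [("a", "ff00"), ("b", "0000"), ("c", "1234"), ("c", "ff00")]
print(evaluate(originals, scaled))
```

Exact equality is used here because the task counts direct matches. A benchmark that tolerates a Hamming distance greater than zero would need a different lookup.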

GitHub project for the code

Event Timeline

Zache updated the task description.

Some stats

Summary:
Original images processed: 619
Algorithm: phash
  Original image hashes processed: 619
  Scaled image hashes processed: 3095
  Direct matches: 2581
  Hash collisions (original vs. original): 0
  Hash collisions (scaled vs. scaled): 0
Algorithm: phash_simple
  Original image hashes processed: 619
  Scaled image hashes processed: 3095
  Direct matches: 2248
  Hash collisions (original vs. original): 6
  Hash collisions (scaled vs. scaled): 142
Algorithm: phash_resize_first
  Original image hashes processed: 619
  Scaled image hashes processed: 3095
  Direct matches: 2562
  Hash collisions (original vs. original): 0
  Hash collisions (scaled vs. scaled): 0
Algorithm: phash_resample_linear
  Original image hashes processed: 619
  Scaled image hashes processed: 3095
  Direct matches: 633
  Hash collisions (original vs. original): 0
  Hash collisions (scaled vs. scaled): 0
Algorithm: phash_dct_ortho
  Original image hashes processed: 619
  Scaled image hashes processed: 3095
  Direct matches: 2574
  Hash collisions (original vs. original): 0
  Hash collisions (scaled vs. scaled): 0
Algorithm: ahash
  Original image hashes processed: 619
  Scaled image hashes processed: 3095
  Direct matches: 2657
  Hash collisions (original vs. original): 2
  Hash collisions (scaled vs. scaled): 50
Algorithm: dhash
  Original image hashes processed: 619
  Scaled image hashes processed: 3095
  Direct matches: 2257
  Hash collisions (original vs. original): 0
  Hash collisions (scaled vs. scaled): 0
Algorithm: whash
  Original image hashes processed: 619
  Scaled image hashes processed: 3095
  Direct matches: 2818
  Hash collisions (original vs. original): 12
  Hash collisions (scaled vs. scaled): 300
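To compare the algorithms more easily, the direct-match counts above can be converted to rates over the 3095 scaled hashes. The snippet below is a small sketch that uses only the figures from the summary. The variable and function names are made up for illustration.

```python
SCALED_TOTAL = 3095  # scaled image hashes processed per algorithm

# Direct matches copied from the summary above.
direct_matches = {
    "phash": 2581,
    "phash_simple": 2248,
    "phash_resize_first": 2562,
    "phash_resample_linear": 633,
    "phash_dct_ortho": 2574,
    "ahash": 2657,
    "dhash": 2257,
    "whash": 2818,
}


def match_rates(counts, total):
    """Return the fraction of scaled hashes that matched their original."""
    return {name: count / total for name, count in counts.items()}


rates = match_rates(direct_matches, SCALED_TOTAL)
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24}{rate:.1%}")
```

The rate alone does not tell the whole story. In the summary, whash and ahash have the most direct matches but also report hash collisions, while phash, phash_resize_first and phash_dct_ortho report none.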