The Curious Case of Java String HashCode
Introduction
I have been programming in Java for the past 1.5 years. Recently, I was
experimenting with performance analysis of Java data structures. To get
my hands dirty, I decided to play around with my favourite data structure,
i.e. HashSet. HashSet provides O(1) lookup and insertion time on average. I
measured and compared the time taken to look up random strings of
different lengths in a HashSet.
package performance;

import com.google.common.base.Stopwatch;

import java.util.*;

public class HashCodePerformance {
    public static void main(String[] args) {
        Set<String> stringHashSet = new HashSet<>();
        stringHashSet.add("London");
        stringHashSet.add("Mumbai");
        stringHashSet.add("NewYork");
        List<String> stringsToSearch = Arrays.asList("f5a5a608", "48abre7a6 i8a5r507",
                "7e50bc488 pl43fvf1p 65", "e843r6f1p vfvdfv vdvdg vgbgd ", "38aeaf9a6");
        for (String string : stringsToSearch) {
            // Time 10 million lookups of the same string instance.
            Stopwatch timer = Stopwatch.createStarted();
            for (int index = 0; index < 10000000; ++index) {
                stringHashSet.contains(string);
            }
            System.out.println("Search String \"" + string + "\" time taken " + timer.stop());
        }
    }
}
Let’s have a look at the lookup performance for the random strings.
We come across an interesting observation. The lookups of the first and
the last string (highlighted in red) take almost 3x to 4x as long as those
of the middle three strings. Even though the middle three strings are
longer, their lookups are far more efficient. This implies that HashSet
lookup time is not determined by the string’s length.
HashCode Design
In Java, every object has a hashCode() method. HashSet invokes this
method to determine the bucket index of the object. Let’s revisit the
example where we were analysing the performance of string lookup and see
the hashCode values for the random strings.
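The values can be reproduced with a few lines of code. The snippet below is a minimal sketch; the class name PrintHashCodes is only illustrative, and the strings are the same five used in the benchmark.

```java
import java.util.Arrays;
import java.util.List;

public class PrintHashCodes {
    // The same five search strings as in the benchmark.
    public static final List<String> STRINGS = Arrays.asList(
            "f5a5a608", "48abre7a6 i8a5r507",
            "7e50bc488 pl43fvf1p 65", "e843r6f1p vfvdfv vdvdg vgbgd ", "38aeaf9a6");

    public static void main(String[] args) {
        for (String s : STRINGS) {
            // String.hashCode() returns a 32-bit int, so overflow wraps around.
            System.out.println("\"" + s + "\" -> hashCode " + s.hashCode());
        }
    }
}
```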
We can see that the outlier strings have a hashCode of 0. Now, it’s time to
dig into some code and glance at the implementation.
In the older JDK versions 1.0 and 1.1, the hashCode function for
strings sampled every nth character. The downside of this approach was
that many strings mapped to the same hash, which resulted in collisions.
In Java 1.2, the algorithm was changed to combine every character of the
string: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]. This is a bit slower
but helps in avoiding collisions.
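The caching behaviour discussed next can be sketched as follows. This is not the JDK source, only a simplified stand-in under stated assumptions: the class CachedHashString and its computations counter are hypothetical and exist purely to make the number of hash computations visible.

```java
public class CachedHashString {
    private final char[] value;
    // Cached hash. The default 0 doubles as "not computed yet".
    private int hash;
    // Hypothetical counter, for illustration only: how often the loop ran.
    public int computations = 0;

    public CachedHashString(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            computations++;
            // Polynomial hash with base 31; int overflow wraps around.
            for (char c : value) {
                h = 31 * h + c;
            }
            hash = h;
        }
        return h;
    }

    public static void main(String[] args) {
        CachedHashString normal = new CachedHashString("London");
        CachedHashString zero = new CachedHashString("f5a5a608");
        for (int i = 0; i < 3; i++) {
            normal.hashCode();
            zero.hashCode();
        }
        System.out.println(normal.computations); // prints 1
        System.out.println(zero.computations);   // prints 3
    }
}
```

A string with a non-zero hash runs the loop once, while a string whose hash is 0 runs it on every call, because the cached 0 is indistinguishable from "not computed yet".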
In this implementation, the computed hash is cached in a field named hash.
When hashCode() is called for the first time, hash still has its default
value of 0, so the hash is computed and stored. Subsequent calls to
hashCode() skip the computation if the cached hash is non-zero.
If the hash of a string is 0, then the hash computation will be done every
time the function is called. Now, it is clear why the lookup of a few strings
takes more time than others.
Any CS101 student would recommend using a boolean flag that will be
set after the first computation and would skip computations in the
subsequent calls.
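Such a flag-based variant might look like the sketch below. The class FlaggedHashString, the hashComputed flag and the computations counter are all hypothetical names chosen for illustration, and the sketch ignores thread safety, which a real immutable string class would have to handle.

```java
public class FlaggedHashString {
    private final char[] value;
    private int hash;
    // Hypothetical flag: true once the hash has been computed, even if it is 0.
    private boolean hashComputed;
    // Hypothetical counter, for illustration only: how often the loop ran.
    public int computations = 0;

    public FlaggedHashString(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        if (!hashComputed) {
            computations++;
            int h = 0;
            for (char c : value) {
                h = 31 * h + c;
            }
            hash = h;
            // Not thread-safe as written; this is a single-threaded sketch.
            hashComputed = true;
        }
        return hash;
    }

    public static void main(String[] args) {
        FlaggedHashString zero = new FlaggedHashString("f5a5a608");
        for (int i = 0; i < 3; i++) {
            zero.hashCode();
        }
        System.out.println(zero.computations); // prints 1
    }
}
```

The price of the flag is an extra field on every string instance, which is the trade-off weighed in the rest of this section.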
You might now be wondering why Java developers didn’t think of this
optimization in the first place, or why it wasn’t patched in later
releases of Java.
We can think of the below cases where a string can hash to zero:
2. Empty String
Currently, only strings that hash to 0 are impacted. Let’s say we fix
hashCode() by adding a boolean. Overall, we won’t see any huge impact on
the performance of a real-world system. It might result in a 0.000010%
improvement in speed. It’s analogous to saying we optimized a task that
could be finished in 1 hr to finish in 59 min 59 sec 7 milliseconds.
So, we are better off not introducing a new variable into hashCode(),
as there is no significant speed gain.
References:
https://stackoverflow.com/questions/2310498/why-doesnt-strings-hashcode-cache-0
https://www.geeksforgeeks.org/equals-hashcode-methods-java/
Written by Animesh Gaitonde