Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
47 views

The Curious Case of Java String HashCode

The document discusses the performance of lookups in a Java HashSet for strings of different lengths. It finds that lookups for some strings are much slower than others. This is because Java's default String hashCode implementation caches the hashCode value, but recomputes it from the characters if the initial value is 0. Strings that hash to 0 therefore have slower lookups. To optimize this, a flag could be used to skip recomputation after the first call to hashCode.

Uploaded by

NikolaMilosevic
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
47 views

The Curious Case of Java String HashCode

The document discusses the performance of lookups in a Java HashSet for strings of different lengths. It finds that lookups for some strings are much slower than others. This is because Java's default String hashCode implementation caches the hashCode value, but recomputes it from the characters if the initial value is 0. Strings that hash to 0 therefore have slower lookups. To optimize this, a flag could be used to skip recomputation after the first call to hashCode.

Uploaded by

NikolaMilosevic
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 1

Programming Become a member Sign in Get started

You have 2 free stories left this month. Sign up and get an extra one for free.

The curious case of Java String


HashCode
Animesh Gaitonde Follow
Oct 11, 2019 · 5 min read

Introduction
I have been programming in Java since the past 1.5 years. Recently, I was
experimenting with performance analysis of Java Data structures. To get
my hands dirty, I decided to play around with my favourite data structure
i.e HashSet. HashSet provides an O(1) lookup and inserts time. I
measured & compared the time taken to lookup random strings having
different sizes in a HashSet.

Following is the code snippet that was written by me:-

1 package performance;
2
3 import com.google.common.base.Stopwatch;
4
5 import java.util.*;
6
7 public class HashCodePerformance {
8 public static void main(String[] args) {
9 Set<String> stringHashSet = new HashSet<>();
10 stringHashSet.add("London");
11 stringHashSet.add("Mumbai");
12 stringHashSet.add("NewYork");
13 List<String> stringsToSearch = Arrays.asList("f5a5a608", "48abre7a6 i8a5r507",
14 "7e50bc488 pl43fvf1p 65", "e843r6f1p vfvdfv vdvdg vgbgd ", "38aeaf9a6");
15 for (String string : stringsToSearch) {
16 Stopwatch timer = Stopwatch.createStarted();
17 for (int index=0; index < 10000000; ++index) {
18 stringHashSet.contains(string);
19 }
20 System.out.println("Search String \"" + string + "\" time taken " + timer.st
21 }
22 }
23 }

JavaPerformance.java hosted with ❤ by GitHub view raw

Let’s have a look at the performance of the lookup done for random
strings:-

Performance of string lookup

We come across an interesting observation. The first & the last string
lookup (highlighted in red) takes almost 3x to 4x time than the middle
three strings. Even though the length of the middle three strings is more,
still the lookup is far more efficient. This implies HashSet lookup is
independent of the string’s length.

To understand this unusual behaviour of HashSet, let’s get back to the


basics and understand the fundamentals.

HashSet- Internal Working


Java HashSet internally uses an array of lists to perform an O(1)
insertion, lookup and deletion. HashSet first calculates the hash of the
object to determine the array index where the object will be stored.
Later, the object is stored at the calculated index. The same principle is
applied during lookup & deletion.

How HashSet/HashMap stores objects

Accessing an array element is O(1) operation, so the only overhead is


calculating the hash of the object. Hence, the hash function needs to be
optimal to avoid any performance impact.

Furthermore, the output of the Hash function should have a uniform


distribution. In case of collisions, the length of lists at a given index will
keep growing and worst-case complexity will become O(n).

HashSet/HashMap collisions as a result of non-uniform hashing

HashCode Design
In Java, every object has a hashCode() function. HashSet invokes this
function to determine the object index. Let’s revisit the example where
were analysing the performance of string lookup and see the value of
hashCodes for the random strings.

HashCode of random strings in the example

We can see that the outlier strings have hashCode as 0. Now, its time to
dig into some code & glance at the implementation.

In the older versions of JDK 1.0+ and 1.1+, hashCode function for
strings sampled every nth character. The downside of this approach was
many strings mapped to the same hash and resulted in collisions.

In Java 1.2, the below algorithm is used. This is a bit slow but helps in
avoiding collisions.

Calculation of String’s HashCode

1 public int hashCode() {


2 int h = hash;
3 if (h == 0 && value.length > 0) {
4 char val[] = value;
5
6 for (int i = 0; i < value.length; i++) {
7 h = 31 * h + val[i];
8 }
9 hash = h;
10 }
11 return h;
12 }

StringHashCode.java hosted with ❤ by GitHub view raw

It can be seen from the above code that when hashCode is called for the
first time, the default value of the variable hash will be 0 and line 3–9 will
be executed. Subsequent calls to hashCode() will not execute line 3–9 if
the hash is non-zero.

It can be inferred that the hashCode() function uses a caching approach


where the hash is calculated only the first time its called and later calls
will get the same calculated value.

If the hash of a string is 0, then the hash computation will be done every
time the function is called. Now, it is clear why the lookup of a few strings
takes more time than others.

Overcoming the performance penalty


The above HashCode computation performs poorly for strings that hash
to 0. How can we optimize this?

Any CS101 student would recommend using a boolean flag that will be
set after the first computation and would skip computations in the
subsequent calls.

1 public int hashCode() {


2 int h = hash;
3 if (!computed && value.length > 0) {
4 char val[] = value;
5
6 for (int i = 0; i < value.length; i++) {
7 h = 31 * h + val[i];
8 }
9 hash = h;
10 computed = true;
11 }
12 return h;
13 }

HashCodeOptimized.java hosted with ❤ by GitHub view raw

You might now be wondering why didn’t Java developers think of this
optimization in the first place or why wasn’t this patched in the later
releases of Java?

Why not fix HashCode?


As per the implementation, the following is the formula of the hash of
any String having ’n’ characters.

hash = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

here s[n] is the nth character in the string

This hash function provides uniform distribution of hash across the


range of integers. It implies that the probability of a string hashing to 0 is
1 in ²³² strings.

We can think of the below cases where the string can hash to zero:-

1. String only contains 0 (Unicode character)

2. Empty String

3. Hash code is 0 as a result of integer overflow

In real-world applications, the probability of getting such kind of strings


is minuscule.

Fixing the hashCode implementation is trivial & Java developers might


have thought about it. However, there is no added advantage in fixing it.

Currently, only strings that hash to 0 are impacted. Let’s say we fix the
hashCode by adding a boolean. Overall, we won’t see any huge impact on
the performance of a real-world system. It might result in 0.000010 %
improvement in speed. It’s analogous to saying we optimized a task that
could be finished in 1hr to 59 min 59 sec 7 milliseconds.

So, we are better off not making any change in the hashCode() by
introducing a new variable as there is no significant speed gain.

English Strings that hash to 0


I took a list of 20k English Dictionary words and tried combining the
words to check if they hash to zero. When I considered single meaningful
English words, none of them hashed to zero. Combination of two or
more words resulted in a zero hash.

Few are some examples of sentences (meaningful words)that hash to


zero:-

carcinomas motorists high

zipped daydreams thunderflashes

where inattentive agronomy

drumwood boulderhead

HashSet or HashMaps showcase the best behaviour in getting, setting or


deleting the value. However, there are cases where the performance can
get impacted due to unusual behaviour.

References:-

https://stackoverflow.com/questions/2310498/why-doesnt-strings-
hashcode-cache-0

https://www.geeksforgeeks.org/equals-hashcode-methods-java/

Data Structures Programming Java Algorithms Computer Science

280 claps 2 responses

WRIT T EN BY
Animesh Gaitonde Follow

Software Engineer. Interested in Distributed Computing,


Data Science, Cryptocurrencies & Stock Markets.

More From Medium

Creating the T-Rex Updating Your Unity Four shortcuts for Handling Errors With
game with Flutter and Project to Watson SDK nding bugs in a large or Swift Result Type
Flame for Unity 3.1.0 (and Core unfamiliar codebase Navdeep Singh in Better
Renan C. Araujo SDK 0.2.0) Adam Dudley in T he Startup Programming
Amara Graham

SQL for Data Scientists, First steps to SQL! Guix Packaging by Angular Flex-Layout:
in 6 Minutes or Less tugce satir in Analytics Vidhya Example Flexbox and Grid Layout
Andre Ye in Analytics Vidhya Jethro Cao in T he Startup for Angular Component
Suguru Inatomi in Angular In
Depth

Discover Medium Make Medium yours Become a member


Welcome to a place where words matter. On Medium, Follow all the topics you care about, and we’ll deliver the Get unlimited access to the best stories on Medium — and
smart voices and original ideas take center stage - with no best stories for you to your homepage and inbox. Explore support writers while you’re at it. Just $5/month. Upgrade
ads in sight. Watch

About Help Legal

You might also like