Warning
This experiment is at the bleeding edge of my understanding of HotSpot's JIT compiler and also the jmh benchmarking tool. It could contain code errors, statistical errors, and basic misunderstandings so if you spot any mistakes or bad assumptions please let me know :)
Background
About a year ago I was reading The Well Grounded Java Developer and got interested in learning about bytecode and the JIT compiler so I started writing a tool to make it easier to understand the wieldy XML of hotspot.log. That tool is now an open source project called JITWatch and has been accepted as an AdoptOpenJDK project.
The latest feature I am working on is a suggestion tool whose purpose is to identify the low-hanging fruit when it comes to optimising your Java code so that the JIT compiler can do the best possible job.
One of the HotSpot C2 (Server) JIT compiler's optimisation tricks is to inline methods to save the invocation cost and it will always try to inline any method consisting of 35 bytecodes or less (use VM switch -XX:MaxInlineSize=n to alter this).
Another optimisation is to look at the invocation count of a method to determine its "hotness" (TODO look up the threshold for hotness) and if a hot method is 320 bytescodes or less (VM-dependent, use VM switch -XX:FreqInlineSize=n to alter) then it will also be inlined even though it is more than 35 bytecodes long.
The log error message when HotSpot cannot inline a hot method is:
<inline_fail reason='hot method too big'/>
which appears after the <call> tag for a <method> in the <phase name="parse"> section of a <task> tag.
This looked like a good place to start making suggestions - highlighting where hot methods failed to be inlined due to their size (specified in the "bytes" attribute of the <method> tag (bytes here means the number of bytecode instructions).
Using the biggest hotspot.log I could generate (from the behemoth that is Eclipse, naturally) I started searching for 'hot method too big' failures and was surprised to see them in many core Java classes. From here I decided to focus on performance critical core classes such as sorting code and this led me to look at java.util.ComparableTimSort
The next step was to write a simple sorting example that I could benchmark using OpenJDK's jmh tool (by Aleksey Shipilev) which I had wanted to try out for some time.
The design for the test method is to populate a 1,000,000 element list with random Integer values and then measuring the time taken to sort the list. I used a seeded java.util.Random to ensure the same elements were inserted into the list each time.
Here is my jmh test class:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.GenerateMicroBenchmark;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
public class PerfTestSort
{
@GenerateMicroBenchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void testSort() {
List<Integer> list = new ArrayList<Integer>();
int count = 1000000;
Random seededRandom = new Random(12345678);
for (int i = 0; i < count; i++) {
list.add(seededRandom.nextInt());
}
Collections.sort(list);
}
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(".*" + PerfTestSort.class.getSimpleName() + ".*")
.forks(1)
.build();
new Runner(opt).run();
}
}
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.GenerateMicroBenchmark;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
public class PerfTestSort
{
@GenerateMicroBenchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void testSort() {
List<Integer> list = new ArrayList<Integer>();
int count = 1000000;
Random seededRandom = new Random(12345678);
for (int i = 0; i < count; i++) {
list.add(seededRandom.nextInt());
}
Collections.sort(list);
}
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(".*" + PerfTestSort.class.getSimpleName() + ".*")
.forks(1)
.build();
new Runner(opt).run();
}
}
Running the testSort() method once outside of jmh with hotspot.log output enabled I was able to see that in the compile for the method java.util.ComparableTimSort.sort(Object[], int, int) the call to method java.util.ComparableTimSort.mergeAt(int) failed to be inlined for reason 'hot method too big'.
The relevant part of the hotspot.log file is:
<task compile_id='41' compile_kind='osr' method='java/util/ComparableTimSort sort ([Ljava/lang/Object;II)V' bytes='179' count='1' backedge_count='14563' iicount='1' osr_bci='58' stamp='11.106'>
<phase name='parse' nodes='3' live='3' stamp='11.106'>
<type id='636' name='void'/>
<klass id='730' name='[Ljava/lang/Object;' flags='1041'/>
<type id='634' name='int'/>
<klass id='729' name='java/util/ComparableTimSort' flags='0'/>
<method id='731' holder='729' name='sort' return='636' arguments='730 634 634' flags='8' bytes='179' iicount='1'/>
...
<method id='747' holder='729' name='mergeAt' return='636' arguments='634' flags='2' bytes='356' compile_id='39' compiler='C2' iicount='11887'/>
<call method='747' count='1826' prof_factor='1' inline='1'/>
<uncommon_trap method='747' bci='312' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='248' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='145' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='123' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='66' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='32' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='14' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<inline_fail reason='hot method too big'/>
<phase name='parse' nodes='3' live='3' stamp='11.106'>
<type id='636' name='void'/>
<klass id='730' name='[Ljava/lang/Object;' flags='1041'/>
<type id='634' name='int'/>
<klass id='729' name='java/util/ComparableTimSort' flags='0'/>
<method id='731' holder='729' name='sort' return='636' arguments='730 634 634' flags='8' bytes='179' iicount='1'/>
...
<method id='747' holder='729' name='mergeAt' return='636' arguments='634' flags='2' bytes='356' compile_id='39' compiler='C2' iicount='11887'/>
<call method='747' count='1826' prof_factor='1' inline='1'/>
<uncommon_trap method='747' bci='312' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='248' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='145' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='123' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='66' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='32' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<uncommon_trap method='747' bci='14' reason='unloaded' action='reinterpret' index='78' klass='740'/>
<inline_fail reason='hot method too big'/>
I could see that at 356 bytecodes it was only slightly too long for the 320 bytecode limit for a hot method.
I had the idea that perhaps by splitting mergeAt() into 2 smaller methods then this hot code could be inlined.
The next steps were to
1) Unzip the JDK's src.zip
2) Modify the method java.util.ComparableTimSort.mergeAt(int) to produce less than 320 bytecodes
3) Compile the modified ComparableTimSort class
4) Unjar the JRE's lib/rt.jar
5) Replace java/util/ComparableTimSort class with my new version
6) Create a new rt.jar containing the modified class
7) Benchmark my sorting test method using jmh and compare the original rt.jar to the one containing the split mergeAt method
8) Also compare whether the same result could be obtained just by specifying higher values for -XX:MaxInlineSize and -XX:FreqInlineSize
Here is my modified java.util.ComparableTimSort.mergeAt(int)
/**
* Merges the two runs at stack indices i and i+1. Run i must be
* the penultimate or antepenultimate run on the stack. In other words,
* i must be equal to stackSize-2 or stackSize-3.
*
* @param i stack index of the first of the two runs to merge
*/
@SuppressWarnings("unchecked")
private void mergeAt(int i) {
assert stackSize >= 2;
assert i >= 0;
assert i == stackSize - 2 || i == stackSize - 3;
int base1 = runBase[i];
int len1 = runLen[i];
int base2 = runBase[i + 1];
int len2 = runLen[i + 1];
assert len1 > 0 && len2 > 0;
assert base1 + len1 == base2;
/*
* Record the length of the combined runs; if i is the 3rd-last
* run now, also slide over the last run (which isn't involved
* in this merge). The current run (i+1) goes away in any case.
*/
runLen[i] = len1 + len2;
if (i == stackSize - 3) {
runBase[i + 1] = runBase[i + 2];
runLen[i + 1] = runLen[i + 2];
}
stackSize--;
/*
* Find where the first element of run2 goes in run1. Prior elements
* in run1 can be ignored (because they're already in place).
*/
int k = gallopRight((Comparable<Object>) a[base2], a, base1, len1, 0);
assert k >= 0;
base1 += k;
len1 -= k;
if (len1 == 0)
return;
// move remainder to a separate method
mergeAt2(base1, base2, len1, len2, a);
}
@SuppressWarnings("unchecked")
private void mergeAt2(int base1, int base2, int len1, int len2, Object[] a) {
/*
* Find where the last element of run1 goes in run2. Subsequent elements
* in run2 can be ignored (because they're already in place).
*/
len2 = gallopLeft((Comparable<Object>) a[base1 + len1 - 1], a,
base2, len2, len2 - 1);
assert len2 >= 0;
if (len2 == 0)
return;
// Merge remaining runs, using tmp array with min(len1, len2) elements
if (len1 <= len2)
mergeLo(base1, len1, base2, len2);
else
mergeHi(base1, len1, base2, len2);
}
* Merges the two runs at stack indices i and i+1. Run i must be
* the penultimate or antepenultimate run on the stack. In other words,
* i must be equal to stackSize-2 or stackSize-3.
*
* @param i stack index of the first of the two runs to merge
*/
@SuppressWarnings("unchecked")
private void mergeAt(int i) {
assert stackSize >= 2;
assert i >= 0;
assert i == stackSize - 2 || i == stackSize - 3;
int base1 = runBase[i];
int len1 = runLen[i];
int base2 = runBase[i + 1];
int len2 = runLen[i + 1];
assert len1 > 0 && len2 > 0;
assert base1 + len1 == base2;
/*
* Record the length of the combined runs; if i is the 3rd-last
* run now, also slide over the last run (which isn't involved
* in this merge). The current run (i+1) goes away in any case.
*/
runLen[i] = len1 + len2;
if (i == stackSize - 3) {
runBase[i + 1] = runBase[i + 2];
runLen[i + 1] = runLen[i + 2];
}
stackSize--;
/*
* Find where the first element of run2 goes in run1. Prior elements
* in run1 can be ignored (because they're already in place).
*/
int k = gallopRight((Comparable<Object>) a[base2], a, base1, len1, 0);
assert k >= 0;
base1 += k;
len1 -= k;
if (len1 == 0)
return;
// move remainder to a separate method
mergeAt2(base1, base2, len1, len2, a);
}
@SuppressWarnings("unchecked")
private void mergeAt2(int base1, int base2, int len1, int len2, Object[] a) {
/*
* Find where the last element of run1 goes in run2. Subsequent elements
* in run2 can be ignored (because they're already in place).
*/
len2 = gallopLeft((Comparable<Object>) a[base1 + len1 - 1], a,
base2, len2, len2 - 1);
assert len2 >= 0;
if (len2 == 0)
return;
// Merge remaining runs, using tmp array with min(len1, len2) elements
if (len1 <= len2)
mergeLo(base1, len1, base2, len2);
else
mergeHi(base1, len1, base2, len2);
}
Here are the jmh results using 100 warmup loops and 100 iterations per test result.
1) Original rt.jar and standard VM options
Result : 367027.592 ±(99.9%) 1027.549 us/op
Statistics: (min, avg, max) = (363215.462, 367027.592, 374771.395), stdev = 3029.753
Confidence interval (99.9%): [366000.042, 368055.141]
Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 367027.592 1027.549 us/op
Statistics: (min, avg, max) = (363215.462, 367027.592, 374771.395), stdev = 3029.753
Confidence interval (99.9%): [366000.042, 368055.141]
Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 367027.592 1027.549 us/op
2) Original rt.jar and -XX:MaxInlineSize=1024 -XX:FreqInlineSize=1024
Result : 362610.675 ±(99.9%) 994.842 us/op
Statistics: (min, avg, max) = (359110.426, 362610.675, 372784.590), stdev = 2933.314
Confidence interval (99.9%): [361615.834, 363605.517]
Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 362610.675 994.842 us/op
Statistics: (min, avg, max) = (359110.426, 362610.675, 372784.590), stdev = 2933.314
Confidence interval (99.9%): [361615.834, 363605.517]
Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 362610.675 994.842 us/op
3) Modified rt.jar and standard VM options
Result : 358166.104 ±(99.9%) 882.070 us/op
Statistics: (min, avg, max) = (354702.777, 358166.104, 367035.542), stdev = 2600.803
Confidence interval (99.9%): [357284.034, 359048.174]
Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 358166.104 882.070 us/op
Statistics: (min, avg, max) = (354702.777, 358166.104, 367035.542), stdev = 2600.803
Confidence interval (99.9%): [357284.034, 359048.174]
Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 358166.104 882.070 us/op
4) Modified rt.jar and -XX:MaxInlineSize=1024 -XX:FreqInlineSize=1024
Result : 364510.037 ±(99.9%) 1086.795 us/op
Statistics: (min, avg, max) = (361420.264, 364510.037, 373596.578), stdev = 3204.439
Confidence interval (99.9%): [363423.243, 365596.832]
Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 364510.037 1086.795 us/op
Statistics: (min, avg, max) = (361420.264, 364510.037, 373596.578), stdev = 3204.439
Confidence interval (99.9%): [363423.243, 365596.832]
Benchmark Mode Samples Mean Mean error Units
o.o.j.s.PerfTestSort.testSort avgt 100 364510.037 1086.795 us/op
Interpretation of results
The best performance was obtained using the split mergeAt() method with standard VM options (358166 us/op) 2.41% faster than unmodified code.
With the modified code, altering the inlining parameters so that all methods of ComparableTimSort could be inlined led to worse performance (364510 us/op) only 0.68% faster than unmodified code.
With the unmodified rt.jar, increasing the inlining parameters led to better performance (362610 us/op) 1.20% faster than unmodified code on default inlining parameters.
Further work
If these results are correct, show the JITWatch proof that the split methods were inlined (they are) and see if performance can be improved further by splitting other large hot methods in ComparableTimSort (gallopLeft and gallopRight etc.).
Check whether these kind of optimisations are no longer relevant as JDK8 core libraries will be rewritten to use streams and lambdas.