Description of the project: When invoked for compilation, the clang compiler performs numerous passes and optimizations (where possible) on source files. These operations constitute the compilation time, which varies depending on the code. Ultimately, control over this time belongs to the programmer writing the code and to the compiler implementers designing the compilation pipeline. However, the programmer cannot foresee every interaction the compiler has with their code, and the implementer cannot foresee every possible program the compiler will encounter. A large dataset of code enables statistical analysis of clang and of how particular code influences compilation times. The LLVM ComPile dataset provides readily available LLVM IR bitcode files that can be compiled into object code. Recording the compile time of these files along with other relevant features, such as their text-segment size, can yield insight into the distribution of the data. Furthermore, analyzing the outliers in this data can pinpoint the specific parts of the compiler that drive the outlying compile times, and in turn which features of the given code files caused the compiler to exercise those parts.

Expected result: Provide LLVM compiler developers with more insight into which parts of the compiler contribute to compile time. Programmers can learn how to write code, particularly large codebases, that takes advantage of clang and its optimizations without inflating compile time.
Statistics are available for users to analyze the performance of the clang compiler on various codebases.

Skills: Programmatic data analysis (Python, R, etc.), using LLVM tools; knowledge of the compiler pipeline is a bonus
Project size: Medium (175-200 hours)
Mentors: Johannes Doerfert, Aiden Grossman


Thanks @andrewka for proposing this.

I know you started to look into this, could you describe what you’ve found so far, e.g., share some partial results?

Yes, here is a result of plotting the compile times for the C and C++ LLVM IR bitcode files. The corresponding text-segment sizes are taken from the object files after compilation.

[Figure: compile time vs. text-segment size, O3 optimization]

This shows that the compile times follow a roughly linear trend, but there are outliers that require further analysis to determine what influences their compile times.
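For anyone who wants to reproduce this, here is a minimal sketch of the measurement loop. It assumes `llc` and `llvm-size` are on PATH and that the bitcode files sit in a local `bitcodes/` directory; the directory name, and using `llc` rather than `clang` as the backend driver, are my assumptions, not necessarily what was done above.

```python
import subprocess
import time
from pathlib import Path

import matplotlib.pyplot as plt

def measure(bc_file: Path) -> tuple[int, float]:
    """Compile one bitcode file at -O3; return (text size, compile time)."""
    obj = bc_file.with_suffix(".o")
    start = time.perf_counter()
    subprocess.run(
        ["llc", "-O3", "-filetype=obj", str(bc_file), "-o", str(obj)],
        check=True,
    )
    elapsed = time.perf_counter() - start
    # Berkeley-format llvm-size output: a header row, then
    # "text data bss dec hex filename"; the first column is the text size.
    out = subprocess.run(
        ["llvm-size", str(obj)], check=True, capture_output=True, text=True
    ).stdout
    text_size = int(out.splitlines()[1].split()[0])
    return text_size, elapsed

sizes, times = zip(*(measure(p) for p in sorted(Path("bitcodes").glob("*.bc"))))
plt.scatter(sizes, times, s=4, alpha=0.4)
plt.xlabel("text-segment size (bytes)")
plt.ylabel("compile time (s)")
plt.show()
```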

Well… it looks linear when plotted on a log-log graph. Looks to me like there's an elbow at around 10,000 bytes.

Yes, you are correct. From these charts, I can say more specifically that the relationship is polynomial, of positive degree.
Note: O3 optimization is used.


[Figure: quadratic fit] This is a polynomial of degree 2.

[Figure: quadratic fit] This is a polynomial of degree 2 as well.
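For what it's worth, the degree-2 fit is a one-liner with NumPy, assuming the `sizes` and `times` sequences from the measurement sketch earlier in the thread:

```python
import numpy as np

x = np.asarray(sizes, dtype=float)
y = np.asarray(times, dtype=float)
# Least-squares fit of time ≈ c2*x**2 + c1*x + c0 (coefficients are
# returned in descending order of degree).
c2, c1, c0 = np.polyfit(x, y, deg=2)
```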


So for a text-segment size around 4,000,000 we can expect a negative compile time? :wink:

What about something affine:
0.3 + 3e-5 * x

i.e. roughly (startup time) + (linear in the size)
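A sketch of fitting that affine shape, reusing the `x` and `y` arrays from the quadratic fit above:

```python
import numpy as np

# time ≈ a + b * size; for deg=1, polyfit returns the slope first.
b, a = np.polyfit(x, y, deg=1)
print(f"startup ≈ {a:.3f} s, marginal cost ≈ {b:.1e} s/byte")
```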

Assuming text size is a measure of input size, I'd expect a function shape like:

a + b * x + c * x * log(x) + min(d, f(x))

To account for the linear and N log(N) algorithms, plus the more expensive ones that have cut-offs.
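One way to fit that shape is nonlinear least squares with SciPy; since f(x) was left unspecified, the quadratic e * x**2 below is only a placeholder I chose for the expensive, cut-off-guarded passes:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c, d, e):
    # a + b*x + c*x*log(x) + min(d, f(x)), with f(x) = e*x**2 standing in
    # for the more expensive analyses that have cut-offs.
    return a + b * x + c * x * np.log(x) + np.minimum(d, e * x**2)

p0 = [0.3, 3e-5, 1e-6, 1.0, 1e-9]  # rough initial guesses, per the thread
params, _ = curve_fit(model, x, y, p0=p0, maxfev=20000)
```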