Description of the project: When invoked for compilation, the clang compiler performs numerous passes and optimizations (where possible) on source files. These operations constitute the compilation time, which varies depending on the code. Ultimately, control over this time belongs to the programmer writing the code and to the compiler implementers designing the compilation pipeline. However, the programmer cannot foresee every interaction the compiler has with their code, and the implementer cannot foresee every possible program the compiler will encounter. A large dataset of code enables statistical analysis of clang and of how particular code influences compilation times. The LLVM ComPile dataset provides readily available LLVM IR bitcode files that can be compiled into object code. Recording the compile time of these files along with other relevant features, such as their text-segment size, can yield insight into the distribution of the data. Furthermore, analyzing the outliers in this data can pinpoint the specific parts of the compiler that drive the outlying compile times, and in turn which features of the given code files caused the compiler to exercise those parts.

Expected result: Provide LLVM compiler developers with more insight into which parts of the compiler contribute to compile time. Programmers can learn how to write code, particularly large codebases, that takes advantage of clang and its optimizations without inflating compile time.
Statistics are available for users to analyze the performance of the clang compiler on various codebases.

Skills: Programmatic data analysis (Python, R, etc.), using LLVM tools; knowledge of the compiler pipeline is a bonus
Project size: Medium (175-200 hours)
Mentors: Johannes Doerfert, Aiden Grossman


Thanks @andrewka for proposing this.

I know you started to look into this, could you describe what you’ve found so far, e.g., share some partial results?

Yes, here is a result of plotting the compile times for the C and C++ LLVM IR bitcode files. The corresponding text-segment sizes are taken from the object files after compilation.

[Figure: compile time vs. text-segment size, O3 optimization]

This shows that the compile times follow a roughly linear trend, but there are outliers that require further analysis to determine what influences their compile times.
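For anyone who wants to reproduce this, here is a minimal sketch of the measurement loop. It assumes `llc` and `llvm-size` are on PATH and that the bitcode files sit in a local `bitcodes/` directory; the directory name, and using `llc` rather than `clang` as the backend driver, are my assumptions, not necessarily what was done above.

```python
import subprocess
import time
from pathlib import Path

import matplotlib.pyplot as plt

def measure(bc_file: Path) -> tuple[int, float]:
    """Compile one bitcode file at -O3; return (text size, compile time)."""
    obj = bc_file.with_suffix(".o")
    start = time.perf_counter()
    subprocess.run(
        ["llc", "-O3", "-filetype=obj", str(bc_file), "-o", str(obj)],
        check=True,
    )
    elapsed = time.perf_counter() - start
    # Berkeley-format llvm-size output: a header row, then
    # "text data bss dec hex filename"; the first column is the text size.
    out = subprocess.run(
        ["llvm-size", str(obj)], check=True, capture_output=True, text=True
    ).stdout
    text_size = int(out.splitlines()[1].split()[0])
    return text_size, elapsed

sizes, times = zip(*(measure(p) for p in sorted(Path("bitcodes").glob("*.bc"))))
plt.scatter(sizes, times, s=4, alpha=0.4)
plt.xlabel("text-segment size (bytes)")
plt.ylabel("compile time (s)")
plt.show()
```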

Well… it looks linear when plotted on a log-log graph. Looks to me like there's an elbow at around 10,000 bytes.

Yes, you are correct. From these charts, I can say more specifically that the relationship is polynomial, of positive degree.
Note: O3 optimization is used.


[Figure: quadratic fit] This is a polynomial of degree 2.

[Figure: quadratic fit] This is a polynomial of degree 2 as well.
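For what it's worth, the degree-2 fit is a one-liner with NumPy, assuming the `sizes` and `times` sequences from the measurement sketch earlier in the thread:

```python
import numpy as np

x = np.asarray(sizes, dtype=float)
y = np.asarray(times, dtype=float)
# Least-squares fit of time ≈ c2*x**2 + c1*x + c0 (coefficients are
# returned in descending order of degree).
c2, c1, c0 = np.polyfit(x, y, deg=2)
```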


So for a text-segment size around 4,000,000 we can expect a negative compile time? :wink:

What about something affine:
0.3 + 3e-5 * x

i.e. roughly (startup time) + (linear in the size)
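A sketch of fitting that affine shape, reusing the `x` and `y` arrays from the quadratic fit above:

```python
import numpy as np

# time ≈ a + b * size; for deg=1, polyfit returns the slope first.
b, a = np.polyfit(x, y, deg=1)
print(f"startup ≈ {a:.3f} s, marginal cost ≈ {b:.1e} s/byte")
```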

Assuming text size is a measure of input size, I'd expect a function shape like:

a + b * x + c * x * log(x) + min(d, f(x))

To account for the linear and N log(N) algorithms, plus the more expensive ones that have cut-offs.
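One way to fit that shape is nonlinear least squares with SciPy; since f(x) was left unspecified, the quadratic e * x**2 below is only a placeholder I chose for the expensive, cut-off-guarded passes:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c, d, e):
    # a + b*x + c*x*log(x) + min(d, f(x)), with f(x) = e*x**2 standing in
    # for the more expensive analyses that have cut-offs.
    return a + b * x + c * x * np.log(x) + np.minimum(d, e * x**2)

p0 = [0.3, 3e-5, 1e-6, 1.0, 1e-9]  # rough initial guesses, per the thread
params, _ = curve_fit(model, x, y, p0=p0, maxfev=20000)
```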