Description of the project: The clang compiler runs numerous passes and, where applicable, optimizations on source files when it is invoked for compilation. Together these operations make up the compile time of a program, which can vary widely from one input to the next. Ultimately, control over this time rests with the programmer who writes the code and with the compiler implementers who design the compilation pipeline. However, the programmer cannot foresee every interaction the compiler has with the code, and the implementer cannot foresee every program the compiler will encounter. A large dataset of code makes it possible to analyze statistically how clang behaves and how particular code influences compile time. The LLVM ComPile dataset provides readily available LLVM IR bitcode files that can be compiled into object code. Recording the compile time of these files together with other relevant features, such as text segment size, can yield insight into how the files and their measurements are distributed. Furthermore, analyzing the outliers in this data can pinpoint the specific parts of the compiler responsible for the outlying compile times. This information can then be used to determine which features of the input files caused the compiler to exercise those parts.
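The measurement and outlier steps described above could be sketched roughly as follows. This is only an illustration under several assumptions: the bitcode modules have already been extracted to disk as `.bc` files, `clang` and `llvm-size` are on the `PATH`, `-O2` is the optimization level of interest, and compile time per text byte with Tukey fences is used as the outlier criterion. The function names (`compile_and_measure`, `parse_text_size`, `flag_outliers`) are hypothetical and not part of any LLVM tool or of the dataset.

```python
import statistics
import subprocess
import tempfile
import time
from pathlib import Path


def parse_text_size(size_output):
    """Extract the text segment size from Berkeley-format `size` output.

    The Berkeley format prints a header line followed by one line of values,
    with the text size in the first column.
    """
    values = size_output.strip().splitlines()[1].split()
    return int(values[0])


def compile_and_measure(bitcode_path, clang="clang", size_tool="llvm-size"):
    """Compile one .bc file and return (wall-clock seconds, text bytes).

    Assumption: the input has a .bc extension so clang treats it as LLVM IR.
    """
    with tempfile.TemporaryDirectory() as tmp:
        obj = Path(tmp) / "out.o"
        start = time.perf_counter()
        subprocess.run(
            [clang, "-O2", "-c", str(bitcode_path), "-o", str(obj)],
            check=True,
            capture_output=True,
        )
        elapsed = time.perf_counter() - start
        size_output = subprocess.run(
            [size_tool, str(obj)], check=True, capture_output=True, text=True
        ).stdout
    return elapsed, parse_text_size(size_output)


def flag_outliers(samples, k=1.5):
    """Return the names whose seconds-per-text-byte fall outside Tukey fences.

    `samples` maps a file name to a (seconds, text_bytes) pair. Files with an
    empty text segment are skipped to avoid dividing by zero.
    """
    ratios = {
        name: secs / size for name, (secs, size) in samples.items() if size > 0
    }
    q1, _, q3 = statistics.quantiles(ratios.values(), n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sorted(name for name, r in ratios.items() if r < low or r > high)
```

In use, one would call `compile_and_measure` on each bitcode file, collect the results in a dictionary keyed by file name, and pass that dictionary to `flag_outliers`. Wall-clock timing of a single run is noisy, so repeating each measurement and taking the median is likely worthwhile. The flagged files could then be recompiled with clang's `-ftime-report` option to see which passes dominate.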
Expected result: Provide LLVM compiler developers with more insight into which parts of the compiler contribute to compile time. Programmers can learn how to write code, particularly large programs, that takes advantage of clang and its optimizations without inflating compile time.
Statistics are available for users to analyze the performance of the clang compiler on a variety of code.
Skills: Programmatic data analysis (Python, R, etc.) and experience using LLVM tools; knowledge of the compiler pipeline is a bonus
Project size: Medium (175-200 hours)
Mentors: Johannes Doerfert, Aiden Grossman