
A DataComp and Bespoke Labs community effort to curate the best open reasoning datasets.
Our first goal is to curate a reasoning dataset for training state-of-the-art small reasoning models that surpass DeepSeek-R1-Distill-32B and DeepSeek-R1-Distill-7B on math and code reasoning benchmarks.
## Latest 32B Results

| Model | Open Data | AIME24 | AIME25 | AMC23 | MATH500 | GPQA-D | LCBv2 |
|---|---|---|---|---|---|---|---|
| OpenThinker2-32B | ✅ | 76.7 | 58.7 | 94.0 | 90.8 | 64.1 | 72.5 |
| OpenThinker-32B | ✅ | 68.0 | 49.3 | 95.5 | 90.6 | 63.5 | 68.6 |
| R1-Distill-32B | ❌ | 74.7 | 50.0 | 96.5 | 90.0 | 65.8 | 72.3 |
| Light-R1-32B | ✅ | 74.7 | 58.0 | 96.0 | 90.4 | 62.0 | 56.0 |
| S1.1-32B | ✅ | 59.3 | 42.7 | 91.5 | 87.4 | 62.0 | 58.7 |
## Latest 7B Results

| Model | Open Data | AIME24 | AIME25 | AMC23 | MATH500 | GPQA-D | LCBv2 |
|---|---|---|---|---|---|---|---|
| OpenThinker2-7B | ✅ | 50.0 | 33.3 | 89.5 | 88.4 | 49.3 | 55.6 |
| OpenThinker-7B | ✅ | 31.3 | 23.3 | 74.5 | 83.2 | 42.9 | 38.0 |
| R1-Distill-7B | ❌ | 57.3 | 33.3 | 92.0 | 89.6 | 47.3 | 48.4 |
| OlympicCoder-7B | ✅ | 20.7 | 15.3 | 63.0 | 74.8 | 25.3 | 55.4 |
| OpenR1-Math-7B | ✅ | 48.7 | 34.7 | 88.5 | 87.8 | 21.2 | 9.5 |
The numbers reported in the tables above are evaluated with our open-source tool Evalchemy. Our models are trained on OpenThoughts-114k and OpenThoughts2-1M. Data generation code is available on GitHub.
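For a rough single-number comparison between models, the per-benchmark scores reported in the tables can be averaged. Note that an unweighted mean is our own illustrative choice here, not an official leaderboard metric, and it treats all six benchmarks as equally informative:

```python
# Per-benchmark scores transcribed from the 32B results table above
# (columns: AIME24, AIME25, AMC23, MATH500, GPQA-D, LCBv2).
SCORES_32B = {
    "OpenThinker2-32B": [76.7, 58.7, 94.0, 90.8, 64.1, 72.5],
    "OpenThinker-32B":  [68.0, 49.3, 95.5, 90.6, 63.5, 68.6],
    "R1-Distill-32B":   [74.7, 50.0, 96.5, 90.0, 65.8, 72.3],
    "Light-R1-32B":     [74.7, 58.0, 96.0, 90.4, 62.0, 56.0],
    "S1.1-32B":         [59.3, 42.7, 91.5, 87.4, 62.0, 58.7],
}

def mean(xs):
    """Unweighted average over the six benchmarks."""
    return sum(xs) / len(xs)

# Rank models by mean score, highest first.
averages = {model: mean(s) for model, s in SCORES_32B.items()}
for model, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {avg:.1f}")
```

Under this simple aggregation, OpenThinker2-32B comes out ahead of R1-Distill-32B despite trailing it on AMC23 and GPQA-Diamond, driven mainly by its AIME25 margin.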
## About us
We are a team of researchers and engineers from Stanford, University of California Berkeley, University of Washington, Bespoke Labs, UT Austin, Juelich Supercomputing Center (JSC), LAION, UCLA, UNC Chapel Hill, and Toyota Research Institute, united around building the best datasets (and thus the best models). See our previous work at datacomp.ai and mlfoundations.
Open Thoughts is supported by Bespoke Labs, NSF IFML, the UT Austin Machine Learning Lab, the Juelich Supercomputing Center, the Toyota Research Institute, and Lambda Labs.