World’s Largest Dataset Of Olympiad-Level Maths Problems Created By MIT

Every year, the countries competing in the International Mathematical Olympiad arrive with a booklet of their best, most original problems.

Normally, these booklets get shared among delegations, then quietly disappear.

Creating The World’s Largest Collection Of Olympiad-Level Math Problems

Until now, no one had systematically collected these booklets, cleaned them, and made them available: not for AI researchers testing the limits of mathematical reasoning, and not for the students around the world training for these competitions largely on their own.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), King Abdullah University of Science and Technology (KAUST), and HUMAIN have now done exactly that.

The result, MathNet, is the largest high-quality dataset of proof-based math problems ever created, and it is openly available.

Comprising more than 30,000 expert-authored problems and solutions spanning 47 countries, 17 languages, and 143 competitions, it is five times larger than the next-biggest dataset of its kind.

The researchers plan to present this work at the International Conference on Learning Representations (ICLR 2026) in Brazil later this month.

MathNet – Making It Available To Everyone

What sets MathNet apart is not only its size but its breadth. Previous Olympiad-level datasets draw almost exclusively from competitions in the United States and China.

MathNet spans dozens of countries across six continents, covers 17 languages, includes both text and image-based problems and solutions, and spans four decades of competition mathematics.

The researchers' goal is to capture the full range of mathematical perspectives and problem-solving traditions that exist across the global math community, not just the most visible ones.

“Every country brings a booklet of its most novel and most creative problems,” said Shaden Alshammari, an MIT Ph.D. student and lead author on the paper. “They share the booklets with each other, but no one had made the effort to collect them, clean them, and upload them online.”

Building MathNet was no simple task. It required tracking down 1,595 PDF volumes totaling more than 25,000 pages, spanning digital documents and decades-old scans in more than a dozen languages.

A significant portion of that archive came from an unlikely source: Navid Safaei, a longtime IMO community figure and co-author who had been collecting and scanning those booklets by hand since 2006.

His personal archive formed much of the backbone of the dataset.

The sourcing matters as much as the scale. Whereas most existing math datasets pull problems from community forums like Art of Problem Solving (AoPS), MathNet draws exclusively from official national competition booklets.

The solutions in those booklets are expert-written and peer-reviewed, and they often run to multiple pages, with authors walking through several approaches to the same problem.

That depth gives AI models a far richer signal for learning mathematical reasoning than the shorter, informal solutions typical of community-sourced datasets.

The dataset is also genuinely useful for students, especially anyone preparing for the IMO or a national competition.

They now have access to a centralized, searchable collection of high-quality problems and worked solutions from traditions around the world.

“I remember so many students for whom it was an individual effort. No one in their country was training them for this kind of competition,” said Alshammari, who competed in the IMO as a student herself.

“We hope this gives them a centralized place with high-quality problems and solutions to learn from,” she added.
