
Codeforces Rating Predictor

Satvik Ramaprasad, IMT2016008, satviksunkam.ramaprasad@iiitb.org
Raghavan G V, IMT2016099, raghavan.gv@iiitb.org
Ananth Shreekumar, IMT2016129, ananth.shreekumar@iiitb.org

Abstract—Competitive programming is a mind sport in which programmers are expected to solve challenging problems in a given time frame. It is known for problem statements that require creative thinking on the part of the programmers. Competitive programmers are tested on months of hard work and persistence, alongside their ability to solve a problem and code up a solution. The sport has been gaining popularity among the students of India for the past half decade.
This paper reports on a project conducted to predict a user's rating on Codeforces, a competitive programming platform, using concepts from machine learning. We propose a model that predicts a user's rating based on the problems they have solved. We solve two problems: given a user's data, predict their maximum rating using regression approaches, and given a user's data, predict their rank using classification techniques. Rank is determined by rating.

I. INTRODUCTION

Programming is one of the most sought-after skills in the world today. However, a good programmer must also know algorithms to solve problems, and must be able to come up with their own algorithms for an efficient solution.

Codeforces is a website where people can sign up and hone their skills in programming and algorithms in a language of their choice. Submitted solutions are run through a set of test cases to check their correctness. Codeforces also conducts weekly contests, during which contestants are expected to code up solutions to the given problems. A programmer's rating is updated based on several criteria: the number of problems solved, the time taken to solve them, and the number of wrong submissions made for a problem during the contest.

In this paper, we also present a novel approach that uses information about the quantity and quality of the problems a user has practised to predict the user's current rating. This gives users a rough estimate of what their current rating should be, based on their practice.

II. DATASET

Since this is a new problem, there was no existing dataset that captured all the necessary features. We had to generate our own dataset using a combination of the APIs provided by Codeforces and scraping of some pages of their website. Given the time-consuming nature of crawling/scraping, we decided to first select data attributes from which we could extract features.

A. User Dataset

Using the Codeforces API, we obtained raw user data for all users on the platform. There were a total of 158K users from several countries. Given the time constraint, we decided to restrict ourselves to Indian users. Upon filtering, we obtained a total of 11.5K Indian users. We then grouped users by their organization; the total number of organizations turned out to be 959. We kept only the organizations with more than 20 users. After all the filters, the dataset was reduced to 83 organizations and 6.4K users.

Data attributes collected
• Country
• Organization
• Handle
• Timestamp of user creation
• Contributions
• Number of Friends
• Rating
• Max Rating
• Rank

B. User-Submissions Dataset

Using the Codeforces API, we obtained the full list of submissions made by each user. This was used to perform a join between each user and the problem submission data, and was crucial for aggregating and extracting features from the submissions.
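The user and submission collection steps above can be sketched against the public Codeforces REST API (methods user.ratedList and user.status). The helper functions and the exact filtering code below are illustrative, not the authors' implementation; only the country and organization-size thresholds mirror the ones reported in the text.

```python
# Sketch of the data collection in Sections II-A and II-B, using the
# public Codeforces API. Helper names are illustrative.
import json
import urllib.request

API = "https://codeforces.com/api"

def cf_get(method, **params):
    """Call a Codeforces API method and return its 'result' payload."""
    query = "&".join(f"{k}={v}" for k, v in params.items())
    url = f"{API}/{method}" + (f"?{query}" if query else "")
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    if payload["status"] != "OK":
        raise RuntimeError(payload.get("comment", "API error"))
    return payload["result"]

def filter_users(users, country="India", min_org_size=20):
    """Restrict to one country, then keep only organizations with more
    than `min_org_size` users, as described in Section II-A."""
    by_org = {}
    for u in users:
        if u.get("country") == country and u.get("organization"):
            by_org.setdefault(u["organization"], []).append(u)
    return {org: members for org, members in by_org.items()
            if len(members) > min_org_size}

def user_submissions(handle):
    """Full submission history of one user (Section II-B)."""
    return cf_get("user.status", handle=handle)

# Usage (requires network access):
#   users = cf_get("user.ratedList", activeOnly="false")
#   orgs  = filter_users(users)
#   subs  = user_submissions("some_handle")
```

The filtering is done client-side because user.ratedList returns the whole user list in one response; only the per-user submission histories need repeated requests.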
C. Problem Dataset
Using the Codeforces API, we obtained raw problem data for all problems on the platform. There were a total of 4637 problems of varying difficulty. Since the difficulty levels of the problems were not provided by the API, we had to scrape them from the website.

Data attributes collected


• contestId
• index
• name
• type
• tags
• Difficulty
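The problem-data collection can be sketched with the API's problemset.problems method, which returns each problem's contestId, index, name, type and tags. The flattening helper below is illustrative; the Difficulty value is left to be filled in from the scraped pages, as described above.

```python
# Sketch of the problem-data collection in Section II-C.
import json
import urllib.request

def fetch_problems():
    """All problems from the problemset.problems API method."""
    url = "https://codeforces.com/api/problemset.problems"
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    if payload["status"] != "OK":
        raise RuntimeError(payload.get("comment", "API error"))
    return payload["result"]["problems"]

def to_record(problem, difficulty=None):
    """Flatten one API problem object into the attribute set listed above.
    Difficulty comes from scraping the website, not from the API."""
    return {
        "contestId": problem.get("contestId"),
        "index": problem.get("index"),
        "name": problem.get("name"),
        "type": problem.get("type"),
        "tags": problem.get("tags", []),
        "Difficulty": difficulty,
    }
```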

III. FEATURE ENGINEERING


We performed a join on the user submission data and
problem data based on problem id. We then performed
statistical aggregation on the result.
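This join can be sketched in pandas; the key columns follow the (contestId, index) pair Codeforces uses to identify a problem, and the frame names are illustrative rather than the authors' exact schema.

```python
# Sketch of the submissions-problems join described above.
import pandas as pd

def join_submissions_problems(subs: pd.DataFrame,
                              probs: pd.DataFrame) -> pd.DataFrame:
    """Inner-join submission rows with problem metadata on the
    (contestId, index) pair that identifies a Codeforces problem."""
    return subs.merge(probs, on=["contestId", "index"], how="inner")
```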

We computed the mean, the median, and the mean of the top 20% of the difficulties of the problems solved by each user. The total number of successfully solved problems was also extracted as a feature.

We classified problems into buckets based on their difficulty score and counted the problems in each bucket. We further classified problems by their tags and performed a similar count.
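A hedged sketch of this aggregation in pandas: the column names (handle, verdict, difficulty, tags) and the bucket width of 400 rating points are assumptions about the joined table, not the authors' exact schema.

```python
# Sketch of the per-user statistical aggregation described above.
import pandas as pd

def top20_mean(s):
    """Mean difficulty of the hardest 20% of problems a user solved."""
    k = max(1, int(round(0.2 * len(s))))
    return s.nlargest(k).mean()

def user_features(df):
    """Aggregate the joined submissions-problems table per user."""
    solved = df[df["verdict"] == "OK"]
    feats = solved.groupby("handle")["difficulty"].agg(
        avg_difficulty="mean",
        median_difficulty="median",
        avg_difficulty20=top20_mean,
        problem_count="count",
    )
    # Count solved problems per difficulty bucket (bucket width assumed).
    buckets = (solved["difficulty"] // 400).rename("bucket")
    bucket_counts = (solved.groupby(["handle", buckets])
                           .size().unstack(fill_value=0))
    # Count solved problems per problem tag.
    tags = solved.explode("tags")
    tag_counts = (tags.groupby(["handle", "tags"])
                      .size().unstack(fill_value=0))
    return feats.join(bucket_counts).join(tag_counts)
```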

Data Features Extracted
• avg difficulty
• avg difficulty20
• median difficulty
• duration
• tags (set)
• difficulty buckets (set)
• problem count
• contest count

IV. DATA VISUALISATION

One of the major outcomes of this project was that we could show empirically that a person generally has a higher rating if they have solved more problems than their peers. To show this, we plotted rating alongside the number of problems solved for the users in the dataset.

V. MODELS AND TRAINING

We tried multiple models to get an idea of which performs better.

A. Rating Predictor

Rating prediction is a regression problem. We built Linear Regression, Random Forest and XGBoost regressors on the dataset.

B. Rank Predictor

Rank prediction is a classification problem. We built Logistic Regression, Random Forest and XGBoost classifiers on the dataset.

VI. RESULTS AND ANALYSIS

A. Rating Predictor

Since rating prediction is a regression problem, we used the R2 score and the RMS error as our metrics. XGBoost gave our best result, with an R2 score of 0.82 and an RMS error of 75.11. On average, Codeforces applies a rating change of around −100 to +100 to a contestant after every contest, based on relative performance. Since our RMS error is around 75 points, our prediction is typically only one or two contests behind the user's actual rating.

B. Rank Predictor

Rank prediction is a classification problem in which the rank of a given user is predicted. Predicting a user as 'expert' (category 4) when they are actually a 'candidate master' (category 5) is not a very bad result, since the two ranks are adjacent on the ranking scale. However, because the model treats these as completely independent categories, we were not able to obtain a good classification. The best score we could get was with logistic regression, which gave a mean score of 0.604.

VII. ISSUES FACED

Data collection turned out to be tricky and time consuming. Network issues slowed down our crawling, and Codeforces blocked our IP addresses for repeated requests; we resorted to using VPNs to get through the restriction. We also tried parallelizing our requests using multiple threads to speed up the process.

VIII. FUTURE SCOPE

1. All the countries in the world could be included to obtain a larger dataset on which to build a model.
2. We could try to predict a user's rating at some point in the future, assuming they continue at the same level of effort.

ACKNOWLEDGMENT

The authors would like to thank their professors for their continued guidance, support and encouragement. The authors would further like to thank the teaching assistants for their valuable advice and encouragement.
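For reference, the two evaluation metrics used for the rating predictor, the R2 score and the RMS error, can be written out in plain Python (scikit-learn provides equivalents; this stdlib-only sketch just makes the definitions explicit).

```python
# The evaluation metrics used for the rating predictor, written out
# explicitly. With an RMS error near 75 rating points against a typical
# per-contest swing of about -100 to +100, a prediction is roughly one
# contest away from the true rating.
import math

def rms_error(y_true, y_pred):
    """Root of the mean squared prediction error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```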
