Kaggle is one of the largest online communities for data scientists, best known for its competitions, in which participants solve data science challenges. It has a long history of competitions across domains such as medicine, finance, scientific research, and sports, covering many types of data and prediction problems, including tabular data, time series, NLP, and computer vision.
How to get into Kaggle? by Philipp Singer and Dmitry Gordeev
1. How to get into Kaggle?
Philipp Singer & Dmitry Gordeev
Vienna Data Science Meetup Vienna,
Dec 5th 2019
2. Who we are
● Philipp
○ Data scientist at UNIQA
○ PhD in CS at TU Graz
○ Profound experience in ML research and applications
○ Kaggle competition master currently ranked 36th
● Dmitry
○ Data scientist at UNIQA
○ Master’s degree in data mining
○ In-depth experience with ML applications in financial institutions
○ Kaggle competition grandmaster currently ranked 34th
● Competing successfully together on Kaggle for 1 year: The Zoo
3. What is Kaggle?
● “Your home for Data Science”
○ Online community of data scientists and machine learners
○ Founded in 2010
○ Acquired by Google in 2017
● Data science competitions
● Share notebooks, datasets, and discussions
● Courses and tutorials
● Free notebook infrastructure with CPUs and GPUs
4. How big is Kaggle?
● The most popular ML competition platform
● The largest ML community
125 000+ users
350 completed competitions
up to 10 000 users per competition
Usually a prize fund of $20,000 to $100,000
9. Competitions on Kaggle
● Usually hosted by companies or research institutes
● Main goal: prediction
● Wide range of different types of competitions
○ Different types of domains (e.g., financial, medical, sports, …)
○ Different types of data (e.g., tabular, NLP, images, videos, time series, …)
○ Different types of objectives (e.g., classification, regression, segmentation, …)
○ Different goals of competitions (featured, research, playground, in-class)
● Built-in progression system with medals and ranks
● Top spots usually receive prize money
13. The Zoo
● Started competing under the team name “The Zoo” exactly one year ago
● Little prior experience on Kaggle
● Participated in 7 competitions
● Strategy: diversify types of competitions for learning purposes
15. Quora
Develop models that identify
and flag insincere questions.
1 306 122 labelled
questions
6.2% insincere questions
4 037 teams
2 hours to fit and predict
16. Quora - sincere/insincere
How can I become a data scientist?
How come Trump is so stupid?
Is it possible for a vegan who does crossfit to go 10 minutes without telling
someone about it?
Everytime I slap myself in the face, it hurts. How can I prevent this?
23. LANL Earthquake Prediction
Predict the time remaining before
laboratory earthquakes occur
from real-time seismic data.
629 145 480 data points
4 200 training segments
4 540 teams
30 minutes to fit and predict
25. LANL - solution
● Derived a handful of features from the data capturing peaks
and volatility of the acoustic signal
● Combination (ensemble) of two state-of-the-art modeling approaches
○ Gradient Boosting Regression Trees
○ Neural Network (Deep Learning)
● Novel statistical data adjustment to account for different earthquake cycles
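The two ideas named on this slide, peak and volatility features plus an ensemble of two models, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual pipeline: the function names `segment_features` and `blend`, the specific statistics chosen, and the equal blending weight are all hypothetical, and the slide does not say which features or weights were used.

```python
import statistics

def segment_features(signal):
    """Summarise one acoustic segment with a few peak and volatility statistics.

    The choice of statistics is an assumption made for illustration.
    """
    abs_signal = [abs(x) for x in signal]
    # Differences between consecutive samples capture short-term volatility.
    diffs = [b - a for a, b in zip(signal, signal[1:])]
    return {
        "max_abs": max(abs_signal),                # largest peak in the segment
        "mean_abs": statistics.fmean(abs_signal),  # average amplitude
        "std": statistics.pstdev(signal),          # overall volatility
        "diff_std": statistics.pstdev(diffs),      # sample-to-sample volatility
    }

def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two models' predictions (a simple ensemble)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]

# Toy segment and toy predictions standing in for a tree model and a neural network.
features = segment_features([0.0, 2.0, -4.0, 2.0])
blended = blend([1.0, 3.0], [3.0, 5.0])
print(features)
print(blended)
```

In practice each training segment would be reduced to one such feature row, and the blending weight would be tuned on validation data rather than fixed at 0.5.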
27. APTOS Blindness Detection
Detect diabetic retinopathy to
stop blindness before it's too late!
3 662 retina images
0 - 4 retinopathy levels
2 943 teams
15 000 evaluation images
Diabetic retinopathy is the leading cause of blindness in
the working-age population of the developed world. It is
estimated to affect over 93 million people.
29. APTOS - solution
● Careful image pre-processing to remove any
kind of bias (e.g., device)
● Combination of several current best deep
neural networks
● Models are pre-trained on a large collection of
image data (ImageNet + extra retina images)
31. Quiz
● Did I have relevant experience to enter this competition?
Data: Atomic elements (H for hydrogen, C for carbon
etc.) and their X, Y, Z cartesian coordinates.
Task: Develop an algorithm that can predict the
magnetic interaction between two atoms in a
molecule.
32. Why should you start on Kaggle?
● Doing is the best way to learn
● Get in touch with data and use cases
outside your main domain
● Keep up-to-date with state-of-the-art methods
● Learn from others
● Measure yourself and know where you stand
● Hardware and software are provided by Kaggle
34. How can you start on Kaggle?
● Don’t be afraid! Just do it!
● Overcome self-handicapping behavior
● You gain points regardless of the result
● “Getting started” competitions
● Pick a competition that sounds exciting to you, don’t be afraid to pick one
where you have no prior experience
● Research similar previous competitions and read solutions
● Follow published notebooks and discussions
36. How to approach a competition?
● Choose a programming language (usually Python or R)
● Understand the problem setting, get a feeling for the data and the metric
● Exploratory Data Analysis (EDA)
● Implement a basic script / notebook from scratch that does training and prediction
OR just fork someone’s model ;-)
● Think hard about robust CV setup
● Keep up-to-date on discussions and developments in the competition
● Experiment a lot and iterate quickly
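As a rough illustration of the "robust CV setup" point above, the sketch below splits the data into k folds and scores a placeholder model on each held-out fold. Everything in it is an assumption made for illustration: the helper names `kfold_indices` and `cross_validate`, the mean-prediction baseline standing in for a real model, and mean absolute error as the metric. A real competition would use its own metric and often a grouped, stratified, or time-based split instead.

```python
import random
import statistics

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices once and split them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(targets, n_folds=5, seed=0):
    """Score a mean-prediction baseline with mean absolute error on each held-out fold."""
    folds = kfold_indices(len(targets), n_folds, seed)
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [t for i, t in enumerate(targets) if i not in held_out_set]
        # Placeholder "model": always predict the mean of the training targets.
        prediction = statistics.fmean(train)
        errors = [abs(targets[i] - prediction) for i in held_out]
        scores.append(statistics.fmean(errors))
    return scores

# Toy targets; a real setup would load the competition's training labels.
targets = [float(i % 7) for i in range(50)]
scores = cross_validate(targets)
print(f"MAE per fold: {[round(s, 3) for s in scores]}")
print(f"mean {statistics.fmean(scores):.3f}, spread {statistics.pstdev(scores):.3f}")
```

Looking at the spread across folds as well as the mean gives a sense of how much a score change is noise, which helps when deciding whether to trust local validation or the public leaderboard.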
38. Thanks!
Get in touch with us! We are open to any inquiries.
me@philippsinger.com
dott1718@gmail.com
@ph_singer @dott1718