How to get into Kaggle?
Philipp Singer & Dmitry Gordeev
Vienna Data Science Meetup
Vienna, Dec 5th 2019
Who we are
● Philipp
○ Data scientist at UNIQA
○ PhD in CS at TU Graz
○ Profound experience in ML research and applications
○ Kaggle competition master currently ranked 36th
● Dmitry
○ Data scientist at UNIQA
○ Master’s degree in data mining
○ In-depth experience with ML applications in financial institutions
○ Kaggle competition grandmaster currently ranked 34th
● Competing successfully together on Kaggle for 1 year: The Zoo
What is Kaggle?
● “Your home for Data Science”
○ Online community of data scientists and machine learners
○ Founded in 2010
○ Acquired by Google in 2017
● Data science competitions
● Share notebooks, datasets, and discussions
● Courses and tutorials
● Free notebook infrastructure with CPUs and GPUs
How big is Kaggle
● The most popular ML competition platform
● The largest ML community
125 000+ users
350 completed competitions
up to 10 000 users per competition
Usually $20,000 - $100,000 prize fund
Kaggle survey results
Competitions on Kaggle
● Usually hosted by companies or research institutes
● Main goal: prediction
● Wide range of different types of competitions
○ Different types of domains (e.g., financial, medical, sports, …)
○ Different types of data (e.g., tabular, text (NLP), images, videos, time series, …)
○ Different types of objectives (e.g., classification, regression, segmentation, …)
○ Different goals of competitions (featured, research, playground, in-class)
● Built-in progression system with medals and ranks
● Top spots usually receive prize money
Competition medals
User ranking + titles
How competitions usually work
https://mc.ai/pseudo-labeling/
The Zoo
● Started competing under the team name “The Zoo” exactly one year ago
● Little prior experience on Kaggle
● Participated in 7 competitions
● Strategy: diversify types of competitions for learning purposes
Our Journey
Quora
Develop models that identify and flag insincere questions.
1 306 122 labelled questions
6.2% insincere questions
4 037 teams
2 hours to fit and predict
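The two figures above imply a strongly imbalanced target. As a rough illustration (simple arithmetic on the slide's numbers, not part of the original talk), the sketch below shows how many insincere questions that is and why plain accuracy would be a misleading score here. The variable names are hypothetical.

```python
# Numbers taken from the slide above.
n_questions = 1_306_122
insincere_rate = 0.062

# Approximate number of insincere questions in the training data.
n_insincere = round(n_questions * insincere_rate)

# A trivial model that labels every question "sincere" is correct
# on all sincere questions, so its accuracy looks high ...
accuracy_all_sincere = 1.0 - insincere_rate

# ... although it finds none of the insincere questions.
recall_all_sincere = 0.0

print(f"insincere questions: about {n_insincere:,}")
print(f"accuracy of 'always sincere': {accuracy_all_sincere:.1%}")
print(f"recall of 'always sincere': {recall_all_sincere:.1%}")
```

This is why it pays to look at the competition metric first: with only about 6% positives, a score that rewards finding the rare class says far more than accuracy does.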
Quora - sincere/insincere
How can I become a data scientist?
How come Trump is so stupid?
Is it possible for a vegan who does crossfit to go 10 minutes without telling someone about it?
Everytime I slap myself in the face, it hurts. How can I prevent this?
Quora - solution
Quora - final standings
Santander
Identify which customers will make a specific transaction in the future
200 000 transactions
8 802 teams
2 months duration
Santander - the mysterious data
Santander - solution
Santander - final standings
LANL Earthquake Prediction
Predict the time remaining before laboratory earthquakes occur from real-time seismic data.
629 145 480 data points
4 200 training segments
4 540 teams
30 minutes to fit and predict
LANL - the physics
LANL - solution
● Derived a handful of features from the data capturing peaks and volatility of the acoustic signal
● Combination (ensemble) of two state-of-the-art modeling approaches
○ Gradient Boosting Regression Trees
○ Neural Network (Deep Learning)
● Novel statistical data adjustment to account for different earthquake cycles
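The first two bullets can be illustrated with a minimal sketch. It is not the team's actual pipeline: the specific features, the synthetic signal, the example predictions, and the names `segment_features` and `blend` are all assumptions chosen for illustration, and the statistical adjustment from the last bullet is not covered.

```python
import random
import statistics


def segment_features(signal):
    """Summarise one acoustic segment with a few peak and volatility statistics."""
    abs_vals = [abs(x) for x in signal]
    abs_diffs = [abs(b - a) for a, b in zip(signal, signal[1:])]
    return {
        "max_abs": max(abs_vals),                    # largest peak in the segment
        "q95_abs": sorted(abs_vals)[int(0.95 * (len(abs_vals) - 1))],  # high amplitude quantile
        "std": statistics.pstdev(signal),            # overall volatility
        "mean_abs_diff": statistics.fmean(abs_diffs),  # sample-to-sample roughness
    }


def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two models' predictions (a simple ensemble)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


# Synthetic stand-ins for two seismic segments; real segments are far longer.
rng = random.Random(0)
quiet = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = [rng.gauss(0.0, 5.0) for _ in range(1000)]

quiet_features = segment_features(quiet)
noisy_features = segment_features(noisy)

# Hypothetical time-to-failure predictions from a tree model and a neural network.
blended = blend([4.0, 2.0], [5.0, 1.0])
```

Each segment is reduced to one row of features, which a gradient boosting model or a neural network can then be trained on. Averaging the two models' outputs is the simplest form of ensembling; the weight would normally be chosen on validation data.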
LANL - final standings
APTOS Blindness Detection
Detect diabetic retinopathy to stop blindness before it's too late!
3 662 retina images
0 - 4 retinopathy levels
2 943 teams
15 000 evaluation images
Diabetic retinopathy is the leading cause of blindness in
the working-age population of the developed world. It is
estimated to affect over 93 million people.
APTOS
https://www.eyeops.com/contents/our-services/eye-diseases/diabetic-retinopathy; https://www.vequill.com/how-to-cure-temporary-blindness/
APTOS - solution
● Careful image pre-processing to remove any kind of bias (e.g., device)
● Combination of several current best deep neural networks
● Models are pre-trained on a large collection of image data (ImageNet + extra retina images)
APTOS - final standings
Quiz
● Did I have relevant experience to enter this competition?
Data: Atomic elements (H for hydrogen, C for carbon, etc.) and their X, Y, Z Cartesian coordinates.
Task: Develop an algorithm that can predict the magnetic interaction between two atoms in a molecule.
Why should you start on Kaggle?
● Doing is the best way to learn
● Get in touch with data and use cases outside your main domain
● Keep up-to-date with state-of-the-art methods
● Learn from others
● Measure yourself and know where you stand
● Hardware and software are provided by Kaggle
Easy start
How can you start on Kaggle?
● Don’t be afraid! Just do it!
● Overcome self-handicapping behavior
● You gain points regardless of the result
● “Getting started” competitions
● Pick a competition that sounds exciting to you; don’t be afraid to pick one where you have no prior experience
● Research similar previous competitions and read solutions
● Follow published notebooks and discussions
Learn from the community
How to approach a competition?
● Choose a programming language (usually Python or R)
● Understand the problem setting, get a feeling for the data and the metric
● Exploratory Data Analysis (EDA)
● Implement a basic script / notebook from scratch that does training and prediction, OR just fork someone’s model ;-)
● Think hard about a robust cross-validation (CV) setup
● Keep up-to-date on discussions and developments of the competition
● Experiment a lot and iterate quickly
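A CV setup along the lines of the list above could look like the following minimal sketch. It uses only the Python standard library, toy data, and a mean-predictor baseline; the helper names `kfold_indices` and `cross_validate` are hypothetical, and a real competition would usually call for a proper model and a split that mirrors how the test data was created.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=42):
    """Shuffle the indices once and deal them into n_folds validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(targets, n_folds=5):
    """Score a mean-predictor baseline with out-of-fold mean absolute error."""
    fold_scores = []
    for valid_idx in kfold_indices(len(targets), n_folds):
        valid_set = set(valid_idx)
        train = [t for i, t in enumerate(targets) if i not in valid_set]
        baseline = sum(train) / len(train)  # "fit" on the training folds only
        errors = [abs(targets[i] - baseline) for i in valid_idx]
        fold_scores.append(sum(errors) / len(errors))
    return fold_scores


targets = [float(i % 10) for i in range(100)]  # toy regression targets
scores = cross_validate(targets)
mean_score = sum(scores) / len(scores)
print("fold scores:", [round(s, 3) for s in scores])
print("mean CV score:", round(mean_score, 3))
```

Keeping the fold assignment fixed (here via the seed) makes scores from different experiments comparable, which is what allows quick iteration without relying on the public leaderboard alone.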
Try more, fail fast
[Figure labels: Baseline model, Final model]
Thanks!
Get in touch with us! We are open to any inquiries.
me@philippsinger.com
dott1718@gmail.com
@ph_singer @dott1718
