Data playground for improving machine learning skills using Kaggle datasets.
To create the mlp environment, run:
conda env create -f environment.yml
What if scientists could anticipate volcanic eruptions as they predict the weather? While determining rain or shine days in advance is more difficult, weather reports become more accurate on shorter time scales. A similar approach with volcanoes could make a big impact. Just one unforeseen eruption can result in tens of thousands of lives lost. If scientists could reliably predict when a volcano will next erupt, evacuations could be more timely and the damage mitigated.
Enter Italy's Istituto Nazionale di Geofisica e Vulcanologia (INGV), with its focus on geophysics and volcanology. The INGV's main objective is to contribute to the understanding of the Earth's system while mitigating the associated risks. Tasked with the 24-hour monitoring of seismicity and active volcano activity across the country, the INGV seeks to find the earliest detectable precursors that provide information about the timing of future volcanic eruptions.
The dataset is 31.25 GB in size and contains 8953 files.
Download the data zip file directly from Kaggle by running the following command within the data/ directory:
kaggle competitions download -c predict-volcanic-eruptions-ingv-oe
The data zip file can then be unzipped via:
unzip predict-volcanic-eruptions-ingv-oe.zip
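Because the download is large, it can be worth inspecting the archive before or after unzipping it. The following is a minimal sketch using only Python's standard library; the helper name `summarize_archive` is hypothetical and not part of this repository. It reports the number of files and their total uncompressed size, which can be compared loosely against the figures quoted above (whether those figures refer to compressed or uncompressed data is an assumption to verify).

```python
import zipfile
from pathlib import Path


def summarize_archive(zip_path):
    """Return (file_count, total_uncompressed_bytes) for a zip archive.

    Hypothetical helper for illustration; directory entries are skipped.
    """
    with zipfile.ZipFile(zip_path) as archive:
        files = [info for info in archive.infolist() if not info.is_dir()]
        return len(files), sum(info.file_size for info in files)


if __name__ == "__main__":
    # Archive name taken from the download command above; run from data/.
    archive_path = Path("predict-volcanic-eruptions-ingv-oe.zip")
    if archive_path.exists():
        count, size = summarize_archive(archive_path)
        print(f"{count} files, {size / 1024**3:.2f} GiB uncompressed")
    else:
        print(f"{archive_path} not found; download it first.")
```

Reading the archive's index this way does not extract anything, so it runs quickly even on a multi-gigabyte file.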
For the data zip file to download successfully, please ensure your ~/.kaggle folder contains a valid Kaggle API token kaggle.json.
If not, please create a new token from within your Kaggle account settings, then move the token from the Downloads folder to the ~/.kaggle folder.
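A quick way to confirm the token is in place before starting a long download is sketched below. It assumes the classic kaggle.json layout with `username` and `key` fields; the function name `check_kaggle_token` is hypothetical, and the check does not contact Kaggle, so it cannot tell whether the key itself is still valid.

```python
import json
import stat
from pathlib import Path


def check_kaggle_token(token_path):
    """Return a list of problems found with the token file (empty if none)."""
    token_path = Path(token_path)
    if not token_path.is_file():
        return [f"{token_path} does not exist"]
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return [f"{token_path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{token_path} does not contain a JSON object"]

    problems = []
    # Assumed layout: {"username": "...", "key": "..."}
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing '{field}' field")
    # The Kaggle CLI warns when the token is readable by other users.
    if token_path.stat().st_mode & (stat.S_IRGRP | stat.S_IROTH):
        problems.append("token is readable by other users; consider chmod 600")
    return problems


if __name__ == "__main__":
    issues = check_kaggle_token(Path.home() / ".kaggle" / "kaggle.json")
    print("Token looks usable." if not issues else "\n".join(issues))
```

An empty result only means the file is present and well-formed; an expired or revoked key will still fail when the download command runs.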
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each exercise includes the following:
data/: contains the dataset description, train and test sets, and a sample prediction CSV file
real_data/: contains the full dataset and/or the Kaggle leaderboard score distribution
model.ipynb: a sample ML workflow in a Jupyter Notebook
This is a fun competition aimed at helping you get started with machine learning. While the dataset is publicly available on the internet, looking up the answers defeats the entire purpose. So seriously, don't do that.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
score.py is a Python script for evaluating final model performance on the test set, callable within each exercise directory via:
python ../score.py -f [prediction csv filepath]