Exercises 3
Use hue for the "island" column. What can you see in the data
this way?
Now create another pair plot, and use hue for the "species"
column.
Is there a difference in the distributions when using hue
for the islands instead?
Bonus task: How much does the "sex" column affect the
result? (MALE / FEMALE)
Create a scatter plot for bill_length_mm and flipper_length_mm,
use species as hue (try also island as hue)
Which hue separates the points into more clearly distinct groups,
species or island?
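A possible sketch of the scatter plot, drawing both hue variants side by side so they are easy to compare. The data is again synthetic and made up; only the column names come from the exercise.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the penguin data (all values are made up)
rng = np.random.default_rng(0)
n = 30
df = pd.DataFrame({
    "species": ["Adelie"] * n + ["Gentoo"] * n,
    "island": rng.choice(["Biscoe", "Dream"], size=2 * n),
    "bill_length_mm": np.concatenate(
        [rng.normal(39, 2.5, n), rng.normal(47, 3.0, n)]
    ),
    "flipper_length_mm": np.concatenate(
        [rng.normal(190, 6, n), rng.normal(217, 6, n)]
    ),
})

# Two axes next to each other: same points, different hue column
fig, (ax_species, ax_island) = plt.subplots(1, 2, figsize=(11, 4))
sns.scatterplot(data=df, x="bill_length_mm", y="flipper_length_mm",
                hue="species", ax=ax_species)
sns.scatterplot(data=df, x="bill_length_mm", y="flipper_length_mm",
                hue="island", ax=ax_island)
ax_species.set_title("hue = species")
ax_island.set_title("hue = island")
```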
Note: since this is an actual dataset from the internet, the data is in quite a
rough format.
If you take a look at the data, you will notice that three columns have NaN
values. You can either remove these columns altogether, or you can fill
the missing values with the average value of that column. For example, for
the Fish column you could do something like this:
df['Fish'] = df['Fish'].fillna(df['Fish'].mean())
Do the same operation for the two other columns with missing values.
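Both options can be sketched on a tiny made-up table. "Fish" comes from the text above; the other column names ("Bread", "Milk", "Corn" as used here) and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up sample; only the "Fish" column name is taken from the exercise
df = pd.DataFrame({
    "Fish": [2.0, np.nan, 4.0],
    "Bread": [1.0, 1.5, np.nan],
    "Milk": [np.nan, 3.0, 5.0],
    "Corn": [2.5, 2.6, 2.7],
})

# Find the columns that contain at least one NaN
cols_with_nan = df.columns[df.isna().any()].tolist()

# Option 1: drop those columns altogether
dropped = df.drop(columns=cols_with_nan)

# Option 2: fill each gap with the mean of its own column
filled = df.copy()
for col in cols_with_nan:
    filled[col] = filled[col].fillna(filled[col].mean())

print(cols_with_nan)
print(filled)
```

Assigning the result back to the column avoids calling fillna with inplace=True on a selected column, which newer pandas versions warn about.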
In this data, the date column is a bit difficult to use, since it's not
completely in numerical format. Split the Month column so that you
have two different columns: Month and Year
o For month, use a numeric format 1-12
o For year, use the full year 1990-2020
o Check out the examples in Moodle. This one is a bit tricky, but
very neat to know how it's done!
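One way the split could look is sketched below. It assumes the Month column holds text such as "Jan-90" (abbreviated English month name, two-digit year); that format is a guess, so check the actual values and adjust the format string. This is not necessarily the approach used in the Moodle examples.

```python
import pandas as pd

# Made-up sample; the "Jan-90" style text format is an assumption
df = pd.DataFrame({
    "Month": ["Jan-90", "Feb-90", "Dec-20"],
    "Corn": [2.3, 2.4, 4.1],
})

# Parse the text into real dates: %b = abbreviated month name,
# %y = two-digit year (90 becomes 1990, 20 becomes 2020)
dates = pd.to_datetime(df["Month"], format="%b-%y")

# Replace the text column with numeric Month (1-12) and add a full Year
df["Year"] = dates.dt.year
df["Month"] = dates.dt.month

print(df)
```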
After the cleanups, create a correlation matrix of the data. Create a
heatmap of the correlations as well.
Which grocery stands out? (i.e. there seems to be one grocery item
whose price doesn't follow other groceries at all)
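A minimal sketch of the correlation matrix and heatmap, using made-up prices. The column names "Bread", "Milk" and "Eggs" are hypothetical; with the real data you would call corr() on your cleaned DataFrame.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up prices: two columns follow a shared trend, one does not
rng = np.random.default_rng(2)
trend = np.linspace(1.0, 3.0, 24)
df = pd.DataFrame({
    "Bread": trend + rng.normal(0, 0.05, 24),
    "Milk": 2 * trend + rng.normal(0, 0.05, 24),
    "Eggs": rng.normal(2.0, 0.3, 24),
})

# Pairwise correlations between the numeric columns
corr = df.corr(numeric_only=True)

# Fixed color range from -1 to 1 so the colors are comparable
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```

A row whose off-diagonal cells are all close to zero (or negative) points to an item that does not move together with the others.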
Bonus task:
Sometimes data can also reflect history! For example, "Corn" and some
other foods seem to have a notable peak in their prices in one of the years.
Find this year, and Google whether you can find a reason for the price peak
(for example, search: "us corn expensive XXXX"). Can you find any other food
in this dataset that is related to a real-world event in a similar way?
Tip: use a pivot table and a heatmap! Remember also that the first year in
the dataset might show as blank white, which means there's no data. You
should also split the date column into Month and Year first.
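The pivot table tip could be sketched like this. The prices are made up (with an artificial peak in 1991), and the sketch assumes numeric Year and Month columns already exist.

```python
import pandas as pd
import seaborn as sns

# Made-up monthly prices with an artificial peak in 1991
rows = []
for year in (1990, 1991, 1992):
    for month in range(1, 13):
        price = 2.0 + 0.01 * month + (1.5 if year == 1991 else 0.0)
        rows.append({"Year": year, "Month": month, "Corn": price})
df = pd.DataFrame(rows)

# Years as rows, months as columns, mean price in each cell
pivot = df.pivot_table(index="Year", columns="Month",
                       values="Corn", aggfunc="mean")

# Bright rows in the heatmap correspond to expensive years
ax = sns.heatmap(pivot, cmap="viridis")

# The same answer as a number: the year with the highest mean price
peak_year = pivot.mean(axis=1).idxmax()
print(peak_year)
```

Cells without data are NaN in the pivot table, and seaborn leaves them blank in the heatmap.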
4. csv-data, pandas and seaborn, mobile phone data, regression plots
Remember to also check the confidence interval: if the band around the
regression line is wide, there's usually a lot of fluctuation in the values.
A narrow band means the linear connection is quite evident.
Bonus task: Use regression plot for ScreenSize and Price, without and
with hue on Brand
o Compare the correlations with and without hue. Is there a
difference?
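A sketch of the regression plots with and without hue, on made-up phone data. The column names ScreenSize, Price and Brand follow the exercise; the brand labels and all values are invented.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up phone data: price grows with screen size, plus a brand offset
rng = np.random.default_rng(1)
n = 40
brand = np.repeat(["BrandA", "BrandB"], n)
screen = rng.uniform(4.5, 7.0, 2 * n)
price = (100 * screen + np.where(brand == "BrandB", 300, 0)
         + rng.normal(0, 40, 2 * n))
df = pd.DataFrame({"Brand": brand, "ScreenSize": screen, "Price": price})

# Without hue: one regression line, shaded band = 95 % confidence interval
fig, ax = plt.subplots()
sns.regplot(data=df, x="ScreenSize", y="Price", ci=95, ax=ax)

# With hue: lmplot fits a separate line (and band) for each brand
grid = sns.lmplot(data=df, x="ScreenSize", y="Price", hue="Brand")

# Correlation over all rows versus within each brand
overall_corr = df["ScreenSize"].corr(df["Price"])
per_brand_corr = {
    name: group["ScreenSize"].corr(group["Price"])
    for name, group in df.groupby("Brand")
}
print(overall_corr, per_brand_corr)
```

regplot itself has no hue parameter, which is why the hue version uses lmplot here.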
Advanced extra exercises:
How about the pickup and dropoff times, should they be modified? From a
taxi driver's point of view, are the weekday and the time of day (morning,
day, evening, night) more interesting than the actual dates?
These are just ideas, you're free to come up with your own ideas
regarding the data!
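One idea, sketched with a few made-up rows: derive the weekday and a time-of-day category from the pickup timestamp. The column name "pickup" and the hour limits for each category are assumptions; pick limits that make sense for your question.

```python
import pandas as pd

# Made-up pickup timestamps; "pickup" is an assumed column name
df = pd.DataFrame({
    "pickup": [
        "2019-03-04 07:30:00",
        "2019-03-04 13:05:00",
        "2019-03-08 19:45:00",
        "2019-03-09 02:10:00",
    ]
})
df["pickup"] = pd.to_datetime(df["pickup"])

# Weekday name, e.g. "Monday"
df["weekday"] = df["pickup"].dt.day_name()

# Time of day from the hour; the limits below are an arbitrary choice:
# night 0-5, morning 6-11, day 12-17, evening 18-23
df["time_of_day"] = pd.cut(
    df["pickup"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "day", "evening"],
    right=False,  # each bin includes its lower limit, excludes the upper
)

print(df)
```

The new columns can then be used as hue or as groupby keys, just like any other categorical column.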
4. Try out any of the previous examples and exercises by using one or more
of the following additional plotting libraries:
Plotly
https://plotly.com/python/getting-started/
Bokeh
https://docs.bokeh.org/en/latest/docs/user_guide.html
5. Try out any of the datasets below, or find yourself an interesting CSV
dataset from kaggle.com!
Use all your skills in NumPy, pandas and seaborn, and find out features
in the data.
Some interesting datasets, as examples (you can find your own too!):
https://www.kaggle.com/anamvillalpando/world-happiness-ranking
https://www.kaggle.com/sakshigoyal7/credit-card-customers
https://www.kaggle.com/lucabasa/dutch-energy
https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data_w_genres.csv
https://www.kaggle.com/kboghe/android-apps-metadata?select=Android+apps+csv.csv
https://www.kaggle.com/sudalairajkumar/cryptocurrencypricehistory?select=coin_Bitcoin.csv
Note: These datasets can be quite rough to handle at first. Feel free to
ask your instructor for tips if some dataset interests you!