
Exercises 3


Exercise set 3 – seaborn module and data exploration

Put all your exercises (Jupyter Notebook/Python files) in your course Git project.

Use either code comments or Jupyter Notebook markdown (text) to document which
exercise you are doing and what a certain code section does!

NOTE! Answer the questions within the exercises either as code comments
or markdown text.
1. In this exercise, use the 'penguins' dataset from seaborn!
(sns.load_dataset('penguins'))
 Create a pair plot of the data
 What correlations can you immediately see?
 Bonus extra task: check out the correlation matrix for this
dataset too

 Use hue for the "island" column. What can you see in the data
this way?

 Find the number of penguins on each island by using pandas
(value_counts()). Which island differs from the others?
 Small extra task: visualize the counts with a bar plot

 Now create another pair plot, and use hue for the "species"
column
 Is there a difference in the distributions compared to using
hue for the islands?
 Bonus task: How much does the "sex" column affect the
result? (MALE / FEMALE)
 Create a scatter plot for bill_length_mm and flipper_length_mm,
use species as hue (try also island as hue)
 Which gives the more distinct result, species or
island?

 Use box plots, violin plots or swarm plots:
 inspect the following information:
 flipper_length_mm
 bill_length_mm
 body_mass_g
 hue = island, x = species
 What interesting insights/findings can you see in the data
this way?
 Or in other words: "how do the variables above
behave in the data?"
2. In this exercise, use the 'mpg' dataset from seaborn!
(sns.load_dataset('mpg')). mpg stands for "miles per gallon", which is a
common way to represent fuel consumption in the USA
 Clean up the data
 Create a new column, "liters_per_100km", which converts
the mpg to liters per 100 km
 You can Google the conversion formula easily: "miles
per gallon to liters per 100km" or "How do you
convert MPG to l 100km?"

 Remove the original mpg column after this

 Remove the "name" column

 After this, create a correlation matrix. There are two
columns that do not correlate as much as the others;
remove these two from the dataset (also mention which
columns you decided to remove and why)
 Also create a heatmap of the correlation matrix
after you remove the two unneeded columns

 There are three different columns that are strongly
connected to the car's efficiency (both power and
consumption). Select one of them and remove the others
from the dataset
 Which column is the best selection to indicate the
car's efficiency, and why? (cylinders, horsepower or
displacement/engine size)
 Remember: look at the big picture, and also compare
how well the variables correlate to variables other
than just the consumption
 Finally, use the pair plot and hue (origin)
 Which origin tends to have higher fuel consumption
in its cars?
 Which is generally the origin with the lowest consumption?
(more specific plots might be a good idea here, for example:
box plot, scatter plot etc., pandas functions are helpful too!)
 What other features of the cars seem to result in
higher or lower consumption?
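
The unit conversion and column cleanup in exercise 2 could be sketched like this. The sketch assumes US gallons (3.785411784 liters) and statute miles (1.609344 km), which gives a factor of roughly 235.215. The three rows are made up for illustration; only the column names follow the mpg dataset.

```python
import pandas as pd

LITERS_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344
# liters per 100 km = FACTOR / mpg, with FACTOR about 235.215
FACTOR = 100 * LITERS_PER_US_GALLON / KM_PER_MILE

# Made-up stand-in rows (assumption); use sns.load_dataset("mpg") in the exercise
cars = pd.DataFrame({
    "mpg": [18.0, 30.0, 23.5],
    "weight": [3504, 2130, 2800],
    "name": ["car a", "car b", "car c"],
})

# Higher mpg means lower consumption, so the relationship is inverse
cars["liters_per_100km"] = FACTOR / cars["mpg"]

# Drop the columns that are no longer needed
cars = cars.drop(columns=["mpg", "name"])

# Keep only numeric columns so text columns such as "origin" do not break corr()
corr = cars.select_dtypes("number").corr()
print(corr)
```

Note that the conversion flips the direction of the scale: a car with a high mpg value gets a low liters_per_100km value.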
3. csv-data, pandas and seaborn, groceries data

This dataset has been downloaded from kaggle.com.


Download the "groceries.csv" from Moodle.

Load the data by using pandas. (read_csv() etc.)

Note: since this is an actual dataset from the internet, the data is in quite a
rough format.

 If you take a look at the data, you will notice three columns have NaN
values. You can either remove these columns altogether, or you can fill
the missing values with the average value of that column. For example, for
the Fish column you could do something like this:

df['Fish'] = df['Fish'].fillna(df['Fish'].mean())

Do this same operation for the two other columns with missing values.

 In this data, the date column is a bit difficult to use, since it's not
completely in numerical format. Split the Month column so that you
have two different columns: Month and Year
o For month, use a numeric format 1-12
o For year, use the full year 1990-2020
o Check out the examples in Moodle. This one is a bit tricky, but
it's very neat to know how it's done!
 After the cleanups, create a correlation matrix of the data. Create a
heatmap of the correlations as well.

Which grocery stands out? (i.e. there seems to be one grocery item
whose price doesn't follow other groceries at all)

 Which groceries seem to correlate to each other's prices? What do
they have in common?
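
The cleanup steps above could be sketched as follows. This is only an illustration on made-up rows: the real column names and the date format in groceries.csv may differ, and the "Jan-90" style used here is an assumption, so adjust the format string to whatever the Month column actually contains.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; leave this out inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for groceries.csv; the "Jan-90" date style is an assumption
df = pd.DataFrame({
    "Month": ["Jan-90", "Feb-90", "Mar-90", "Jan-20"],
    "Fish": [2.0, np.nan, 4.0, 6.0],
    "Corn": [1.0, 1.5, np.nan, 3.0],
})

# Fill every column that has missing values with that column's mean
for col in df.columns[df.isna().any()]:
    df[col] = df[col].fillna(df[col].mean())

# Parse the text date once, then derive numeric Month and Year columns
parsed = pd.to_datetime(df["Month"], format="%b-%y")
df["Month"] = parsed.dt.month  # 1-12
df["Year"] = parsed.dt.year    # full year, e.g. 1990

# Correlation matrix and heatmap of the cleaned data
corr = df.corr()
ax = sns.heatmap(corr, annot=True)
```

Assigning the result of fillna() back to the column avoids relying on inplace=True on a column selection, which newer pandas versions warn about.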

Bonus task:

Sometimes data can also reflect history! For example, "Corn" and some
other foods seem to have a notable peak in their price in one of the years.
Find this year, and Google if you can find a reason for the price peak (for
example, search: "us corn expensive XXXX"). Can you find any other food
in this dataset that relates to a real-world event in a similar way?

Tip: use pivot table and heatmap! Remember also, the first year in the
dataset might show as blank white, which means there's no data. You
should also split the date
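
The pivot table tip could look roughly like this. The numbers are made up and do not reflect the real dataset; the sketch also assumes the date has already been split into numeric Year and Month columns.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; leave this out inside a notebook
import pandas as pd
import seaborn as sns

# Made-up prices (assumption), with the date already split into Year and Month
prices = pd.DataFrame({
    "Year":  [2000, 2000, 2001, 2001, 2002, 2002],
    "Month": [1, 2, 1, 2, 1, 2],
    "Corn":  [2.0, 2.2, 4.0, 4.4, 2.1, 2.3],
})

# One row per year, one column per month, cells hold the (mean) price
table = prices.pivot_table(index="Year", columns="Month", values="Corn")

# Rows that stand out in the heatmap are candidate peak years
ax = sns.heatmap(table, annot=True)

# The same finding with pandas only: the year with the highest average price
peak_year = table.mean(axis=1).idxmax()
print(peak_year)
```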
4. csv-data, pandas and seaborn, mobile phone data, regression plots

This dataset has been downloaded from kaggle.com and modified.


Download the "mobilephones.csv" from Moodle. Load the data by using
pandas. (read_csv() etc.)

Regression plots (lmplot() in seaborn) are often extremely useful in finding
more insight and "hidden connections" in your data.

Remember to also check the confidence interval: if the band around the
regression line is wide, the values usually fluctuate a lot. A narrow band
means the linear connection is quite evident.

With the mobile phone data, do the following:

 Use a regression plot for RAM and Price, no hue
o What is the correlation based on the regression line?
 Use a regression plot for RAM and Price, hue on Brand
o How is the correlation different when compared to without
hue?
 Use a regression plot for BatteryCapacity and Price, no hue
o What is the correlation based on the regression line?
 Use a regression plot for BatteryCapacity and Price, hue on Brand
o How is the correlation different when compared to without
hue?

 Bonus task: Use a regression plot for ScreenSize and Price, without and
with hue on Brand
o Compare the correlations with and without hue. Is there a
difference?
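
A minimal sketch of the two kinds of regression plots asked for above, drawn from made-up phone data. The column names RAM, Price and Brand follow the exercise text; the values and the brand names "A" and "B" are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; leave this out inside a notebook
import pandas as pd
import seaborn as sns

# Made-up stand-in for mobilephones.csv (assumption)
phones = pd.DataFrame({
    "RAM": [2, 4, 6, 8, 4, 8, 12, 16],
    "Price": [150, 220, 310, 380, 400, 650, 900, 1200],
    "Brand": ["A"] * 4 + ["B"] * 4,
})

# One regression line for all phones
overall = sns.lmplot(data=phones, x="RAM", y="Price")

# One regression line per brand
per_brand = sns.lmplot(data=phones, x="RAM", y="Price", hue="Brand")

# Pearson correlation as a numeric companion to the plots
r_all = phones["RAM"].corr(phones["Price"])
r_by_brand = {
    brand: group["RAM"].corr(group["Price"])
    for brand, group in phones.groupby("Brand")
}
print(r_all, r_by_brand)
```

Comparing the slopes and confidence bands of the per-brand lines with the single overall line is what the "with hue" questions are after.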
Advanced extra exercises:

1. seaborn: Try out Boxenplot and/or Dendrogram in any of the previous
dataset exercises. Do these plot types provide some interesting info on
the data? (Google for examples on these plot types)
 You may also consider some other plots, for example:
displot, catplot, relplot

2. Use the "titanic" dataset from the seaborn datasets
(sns.load_dataset('titanic')). Find out the features of a typical person
that survived or did not survive the sinking of Titanic.
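
One possible way to start on the Titanic question is to compare survival rates between groups. The sketch below uses a few made-up passengers; the column names survived, sex and pclass follow the seaborn titanic dataset, the rows do not.

```python
import pandas as pd

# Made-up passengers (assumption); use sns.load_dataset("titanic") in the exercise
people = pd.DataFrame({
    "survived": [1, 0, 1, 0, 0, 1],
    "sex": ["female", "male", "female", "male", "male", "male"],
    "pclass": [1, 3, 2, 3, 2, 1],
})

# The mean of a 0/1 column is the share of survivors in each group
rate_by_sex = people.groupby("sex")["survived"].mean()
rate_by_class = people.groupby("pclass")["survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```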
3. Use the "taxis" dataset from the seaborn datasets
(sns.load_dataset('taxis')). Find out any correlations or interesting
behaviors based on any columns in the data (color, payment,
pickup_borough, dropoff_borough etc.)

Notes and ideas to try out:


Consider removing the pickup_zone and dropoff_zone, since there are
way too many alternatives. Borough is the larger area in question, which
can be helpful while grouping data (hue!)

How about pickup and dropoff times, should they be modified? From a
taxi's point of view, are the weekday and time of day (morning, day,
evening, night) more interesting than the actual dates?

These are just ideas, you're free to come up with your own ideas
regarding the data!
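
The weekday and time-of-day idea could be sketched like this. The timestamps are made up, and the hour boundaries for night, morning, day and evening are an assumption; pick whatever split makes sense to you.

```python
import pandas as pd

# Made-up pickup times (assumption); the taxis dataset has a "pickup" datetime column
trips = pd.DataFrame({
    "pickup": pd.to_datetime([
        "2019-03-23 07:15:00",
        "2019-03-24 13:40:00",
        "2019-03-25 19:05:00",
        "2019-03-26 02:30:00",
    ])
})

# Name of the weekday, e.g. "Saturday"
trips["weekday"] = trips["pickup"].dt.day_name()

# Bucket the hour into four parts of the day; boundaries are an assumption
trips["time_of_day"] = pd.cut(
    trips["pickup"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "day", "evening"],
    right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
)
print(trips)
```

Both new columns are categorical, so they work directly as hue or x in seaborn plots.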
4. Try out any of the previous examples and exercises by using any or many
of the following additional plotting libraries:

 Matplotlib (this is the most common in addition to seaborn,
especially regarding machine learning, recommended to learn)
https://matplotlib.org/stable/tutorials/index.html#introductory

 Plotly
https://plotly.com/python/getting-started/

 Bokeh
https://docs.bokeh.org/en/latest/docs/user_guide.html
5. Try out any of the datasets below, or find yourself an interesting
csv-dataset from kaggle.com!

Use all your skills in NumPy, pandas and seaborn, and find out features
in the data.

Was there anything surprising in the dataset? What interesting
correlations did you find?

Some interesting datasets, examples (you can find your own too!):

 https://www.kaggle.com/anamvillalpando/world-happiness-ranking
 https://www.kaggle.com/sakshigoyal7/credit-card-customers
 https://www.kaggle.com/lucabasa/dutch-energy
 https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data_w_genres.csv
 https://www.kaggle.com/kboghe/android-apps-metadata?select=Android+apps+csv.csv
 https://www.kaggle.com/sudalairajkumar/cryptocurrencypricehistory?select=coin_Bitcoin.csv

Note: These datasets can be quite rough to handle at first, feel free to
ask your instructor for tips if some dataset interests you!
