Reshaping Data With Python
movies_indexed = movies.set_index("title")
movies_indexed.reset_index()
music.explode("singles")
pd.json_normalize(music_exploded["singles"])
# Replace index, left joining new index to existing data with .reindex()
avengers_index = ["The Avengers", "Avengers: Age of Ultron", "Avengers: Infinity War",
                  "Avengers: Endgame"]
movies_indexed.reindex(avengers_index)
# Equivalent to pd.DataFrame(index=avengers_index) \
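The relationship between .reindex() and a left join can be checked on a small example. This is a minimal sketch, assuming a tiny stand-in for the movies dataset; the two-column frame below is hypothetical and only the title column name comes from the cheat sheet.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset (illustrative rows only).
movies = pd.DataFrame({
    "title": ["Avatar", "Avengers: Endgame", "The Avengers"],
    "release_year": [2009, 2019, 2012],
})
movies_indexed = movies.set_index("title")

avengers_index = ["The Avengers", "Avengers: Age of Ultron",
                  "Avengers: Infinity War", "Avengers: Endgame"]

# .reindex() returns rows in the order of the new labels;
# labels missing from the original index are filled with NaN.
reindexed = movies_indexed.reindex(avengers_index)

# The same result expressed as a left join: an empty frame that only
# carries the new index, joined to the existing data.
joined = pd.DataFrame(index=avengers_index).join(movies_indexed)

print(reindexed)
print(joined)
```

Both results keep all four requested titles, with missing values for the titles that are absent from the stand-in data.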
The majority of data analysis in Python is performed in pandas DataFrames. These are rectangular datasets consisting of rows and columns.
A variable is an attribute of the object, across all the observations. For example, the release dates for all the movies form a variable.
Tidy data provides a standard way to organize data. Having a consistent shape for datasets enables you to worry less about data structures and focus more on getting useful results. The principles of tidy data are that every column is a variable, every row is an observation, and every cell holds a single value.
# Move (multi-)indexes from a row index to a column index with .unstack()
# level argument starts with 0 for the outer index
pig_feed_stacked.unstack(level=1)
# Concatenate several columns into a single string column with .str.cat()
# Each column must be converted to string type before joining
movies["release_year"].astype(str) \
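A complete version of the .str.cat() pattern might look like the following. This is a sketch under stated assumptions: the release_month and release_day column names and the "-" separator are hypothetical, chosen only to illustrate joining several columns into one string column.

```python
import pandas as pd

# Hypothetical slice of the movies dataset; release_month and release_day
# are assumed column names used only for this illustration.
movies = pd.DataFrame({
    "title": ["Avatar", "Avengers: Endgame"],
    "release_year": [2009, 2019],
    "release_month": [12, 4],
    "release_day": [18, 26],
})

# .str.cat() works on strings only, so every column is cast with
# .astype(str) before joining.
release_date = (
    movies["release_year"].astype(str)
    .str.cat(movies[["release_month", "release_day"]].astype(str), sep="-")
)

print(release_date)
```

Passing a DataFrame as the first argument joins its columns left to right, row by row, onto the calling Series.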
> Datasets used throughout this cheat sheet
Throughout this cheat sheet we will use a dataset of the top grossing movies of all time, stored as movies.
# Combine several columns into a list column with .values.tolist()
.values.tolist()
# Convert series containing nested elements to JSON string with json.dumps()
import json
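The list-column and JSON-string conversions can be sketched together. The frame below is a hypothetical stand-in for movies, and the release_list and release_json column names are assumptions introduced for this example.

```python
import json

import pandas as pd

# Hypothetical slice of the movies dataset (assumed column names).
movies = pd.DataFrame({
    "title": ["Avatar", "Avengers: Endgame"],
    "release_year": [2009, 2019],
    "release_month": [12, 4],
    "release_day": [18, 26],
})

# .values.tolist() turns each row of the selected columns into a Python list,
# giving one list per row in the new column.
date_columns = ["release_year", "release_month", "release_day"]
movies["release_list"] = movies[date_columns].values.tolist()

# json.dumps() serializes each nested element to a JSON string.
movies["release_json"] = movies["release_list"].apply(json.dumps)

print(movies[["title", "release_list", "release_json"]])
```

The list column holds Python objects, while the JSON column holds plain strings that can be written to text formats such as CSV.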
The second dataset involves an experiment with the number of unpopped kernels in bags of popcorn, adapted from the Popcorn dataset in R's Stat2Data package.
The third dataset is JSON data about music containing nested elements. The JSON is parsed into nested lists using read_json() from the pandas package. Notice that each element in the singles column is a list of dictionaries.
artist singles
# Drop rows containing any missing values in the specified columns with .dropna()
people.dropna(subset="weight_kg")
> Melting and pivoting
# Move side-by-side columns to consecutive rows with .melt()
popcorn_indexed = popcorn.set_index("brand")
popcorn_indexed.melt(var_name="trial", value_name="n_unpopped", ignore_index=False)
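Melting and pivoting are inverse operations, which a small round trip makes visible. This is a minimal sketch: the brand labels, trial column names, and counts below are hypothetical values, not the actual popcorn data.

```python
import pandas as pd

# Hypothetical popcorn data: unpopped-kernel counts per brand and trial.
popcorn = pd.DataFrame({
    "brand": ["Brand A", "Brand B"],
    "trial_1": [26, 47],
    "trial_2": [35, 47],
})
popcorn_indexed = popcorn.set_index("brand")

# Wide to long: one row per (brand, trial) pair.
# ignore_index=False keeps brand as the index, so it survives the melt.
popcorn_long = (
    popcorn_indexed
    .melt(var_name="trial", value_name="n_unpopped", ignore_index=False)
    .reset_index()
)

# Long to wide: the trials return to side-by-side columns.
popcorn_wide = (
    popcorn_long
    .pivot(values="n_unpopped", index="brand", columns="trial")
    .reset_index()
)

print(popcorn_long)
print(popcorn_wide)
```

After the round trip, popcorn_wide has the same brand and trial columns as the starting frame.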
# Where there is a column multi-index, specify id_vars with a list of tuples
popcorn_long \
    .pivot(values="n_unpopped", index="brand", columns="trial") \
    .reset_index()
The fifth dataset, pig_feed, shows weight gain in pigs from additives to their feed. There is a multi-index on the columns.
Antibiotic  No        Yes
B12         No   Yes  No   Yes
            19   22   3    54
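The effect of the level argument of .unstack() can be illustrated with the four pig_feed values. This sketch assumes that pig_feed_stacked is a long-form Series with Antibiotic as the outer row-index level and B12 as the inner one; that layout and the weight_gain name are assumptions, not taken from the cheat sheet.

```python
import pandas as pd

# Assumed long form of pig_feed: a row multi-index with
# Antibiotic as level 0 (outer) and B12 as level 1 (inner).
index = pd.MultiIndex.from_product(
    [["No", "Yes"], ["No", "Yes"]], names=["Antibiotic", "B12"]
)
pig_feed_stacked = pd.Series([19, 22, 3, 54], index=index, name="weight_gain")

# level=1 moves the inner level (B12) from the rows to the columns.
by_b12 = pig_feed_stacked.unstack(level=1)

# level=0 moves the outer level (Antibiotic) to the columns instead.
by_antibiotic = pig_feed_stacked.unstack(level=0)

print(by_b12)
print(by_antibiotic)
```

In by_b12 the rows are Antibiotic and the columns are B12; by_antibiotic is its transpose. Calling .stack() on either result moves the column level back into the row index.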