The document provides information about data analysis and visualization using Pandas. It covers topics like loading and exporting data, data wrangling operations, merging and aggregating data, and creating basic visualizations.
Pandas cheat sheet

All of the following code examples refer to this table:

       col1  col2
    A     1     4
    B     2     5
    C     3     6

Getting started

Import pandas:
    import pandas as pd

Create a series:
    s = pd.Series([1, 2, 3], index=['A', 'B', 'C'], name='col1')

Create a data frame:
    data = [[1, 4], [2, 5], [3, 6]]
    index = ['A', 'B', 'C']
    df = pd.DataFrame(data, index=index, columns=['col1', 'col2'])

Load a data frame:
    df = pd.read_csv('filename.csv',
                     sep=',',
                     names=['col1', 'col2'],
                     index_col=0,
                     encoding='utf-8',
                     nrows=3)

Selecting rows and columns

Select single column:
    df['col1']

Select multiple columns:
    df[['col1', 'col2']]

Show first n rows:
    df.head(2)

Show last n rows:
    df.tail(2)

Select rows by index values:
    df.loc['A']
    df.loc[['A', 'B']]

Select rows by position:
    df.iloc[1]
    df.iloc[1:]

Data wrangling

Filter by value:
    df[df['col1'] > 1]

Sort by columns:
    df.sort_values(['col1', 'col2'], ascending=[False, True])

Identify duplicate rows:
    df.duplicated()

Identify unique values:
    df['col1'].unique()

Swap rows and columns:
    df = df.transpose()

Remove a column:
    del df['col2']

Clone a data frame:
    clone = df.copy()

Connect multiple data frames vertically:
    df2 = df + 10
    pd.concat([df, df2])

Merge multiple data frames horizontally:
    df3 = pd.DataFrame([[1, 7], [8, 9]],
                       index=['B', 'D'],
                       columns=['col1', 'col3'])

Only merge complete rows (INNER JOIN):
    df.merge(df3)

Keep all rows of the left frame (LEFT OUTER JOIN):
    df.merge(df3, how='left')

Keep all rows of the right frame (RIGHT OUTER JOIN):
    df.merge(df3, how='right')

Preserve all values (OUTER JOIN):
    df.merge(df3, how='outer')

Merge rows by index:
    df.merge(df3, left_index=True, right_index=True)

Fill NaN values:
    df.fillna(0.0)

Apply your own function:
    def func(x):
        return 2**x
    df.apply(func)

Arithmetics and statistics

Add to all values:
    df + 10

Sum over columns:
    df.sum()

Cumulative sum over columns:
    df.cumsum()

Mean over columns:
    df.mean()

Standard deviation over columns:
    df.std()

Count unique values:
    df['col1'].value_counts()

Summarize descriptive statistics:
    df.describe()

Hierarchical indexing

Create hierarchical index:
    stacked = df.stack()

Dissolve hierarchical index:
    stacked.unstack()

Aggregation

Create group object:
    g = df.groupby('col1')

Iterate over groups:
    for i, group in g:
        print(i, group)

Aggregate groups:
    g.sum()
    g.prod()
    g.mean()
    g.std()
    g.describe()

Select columns from groups:
    g['col2'].sum()
    g[['col2', 'col3']].sum()    # requires a grouped frame that also has col3

Transform values:
    import numpy as np
    g.transform(np.log)

Apply a list function on each group:
    def strsum(group):
        return ''.join([str(x) for x in group.values])
    g['col2'].apply(strsum)

Exporting data

Data as NumPy array:
    df.to_numpy()    # df.values also works

Save data as CSV file:
    df.to_csv('output.csv', sep=",")

Format a data frame as tabular string:
    df.to_string()

Convert a data frame to a dictionary:
    df.to_dict()

Save a data frame as an Excel table:
    df.to_excel('output.xlsx')
    (requires an Excel writer package such as openpyxl)

Visualization

Import matplotlib:
    import matplotlib.pyplot as plt

Start a new diagram:
    plt.figure()

Scatter plot:
    df.plot.scatter('col1', 'col2', c='red')

Bar plot:
    df.plot.bar(x='col1', y='col2', width=0.7)

Area plot:
    df.plot.area(stacked=True, alpha=1.0)

Box-and-whisker plot:
    df.plot.box()

Histogram over one column:
    df['col1'].plot.hist(bins=3)

Histogram over all columns:
    df.plot.hist(bins=3, alpha=0.5)

Set tick positions and labels:
    positions = [1.0, 2.0, 3.0, 4.0]
    # labels: a list with one tick label per position
    plt.xticks(positions, labels)
    plt.yticks(positions, labels)

Select area to plot:
    plt.axis([0.0, 2.5, 0.0, 10.0])    # [from x, to x, from y, to y]

Label diagram and axes:
    plt.title('Correlation')
    plt.xlabel('Nunstück')
    plt.ylabel('Slotermeyer')

Save most recent diagram:
    plt.savefig('plot.png')
    plt.savefig('plot.png', dpi=300)
    plt.savefig('plot.svg')

Text by Kristian Rother, Thomas Lotze (CC-BY-SA 4.0)
https://www.cusy.io/de/seminare
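
The join variants listed in the sheet are easier to compare when run side by side. The following is a minimal sketch that rebuilds the example table and df3 from the sheet and counts the rows each join keeps. The variable names inner, left, right, outer and by_index are chosen here for illustration only.

```python
import pandas as pd

# Example table from the sheet
df = pd.DataFrame([[1, 4], [2, 5], [3, 6]],
                  index=['A', 'B', 'C'],
                  columns=['col1', 'col2'])

# Second frame from the sheet; it shares the column 'col1' with df
df3 = pd.DataFrame([[1, 7], [8, 9]],
                   index=['B', 'D'],
                   columns=['col1', 'col3'])

# Without an explicit key, merge joins on the shared column 'col1'
inner = df.merge(df3)                # only col1 == 1 appears in both frames
left = df.merge(df3, how='left')     # all rows of df, NaN where df3 has no match
right = df.merge(df3, how='right')   # all rows of df3, NaN where df has no match
outer = df.merge(df3, how='outer')   # every col1 value from either frame

# Joining on the index instead: only the label 'B' exists in both frames
by_index = df.merge(df3, left_index=True, right_index=True)

for name, frame in [('inner', inner), ('left', left), ('right', right),
                    ('outer', outer), ('by_index', by_index)]:
    print(name, len(frame))
```

When joining on the index, both frames still contribute their own col1 column, so pandas disambiguates them with the default suffixes _x and _y.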
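
In the example table every value of col1 is unique, so each group holds a single row and the aggregation calls are hard to tell apart. The sketch below uses a hypothetical frame named demo with repeated keys (not part of the original sheet) to show what aggregation, transformation and a custom function return per group.

```python
import numpy as np
import pandas as pd

# Hypothetical data with repeated keys in 'col1' (an assumption for illustration)
demo = pd.DataFrame({
    'col1': ['x', 'x', 'y', 'y'],
    'col2': [1, 2, 3, 4],
    'col3': [10.0, 20.0, 30.0, 40.0],
})

g = demo.groupby('col1')

# Aggregation: one result row per group
sums = g[['col2', 'col3']].sum()

# Transformation: result has the same number of rows as the input
logs = g[['col2', 'col3']].transform(np.log)

# Custom function from the sheet: concatenate a group's values into a string
def strsum(group):
    return ''.join([str(x) for x in group.values])

joined = g['col2'].apply(strsum)

print(sums)
print(logs)
print(joined)
```

np.log is used rather than math.log because it works element-wise on a whole group, whereas math.log expects a single number.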