06 Data Loading Storage and File Formats
File Formats
Reference
• Wes McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd Edition, O'Reilly Media, 2018 — Chapter 6.
• Material: https://github.com/wesm/pydata-book
Outline

Parsing Functions in pandas
[Table of pandas parsing functions not reproduced here; read_csv is the most used.]
6.1 Reading and Writing Data in Text Format
• These parsing functions take optional arguments that fall into the following categories:
  • Indexing
  • Type inference and data conversion
  • Datetime parsing
  • Iterating
  • Unclean data issues
• There are many options, so refer to the online documentation for complex cases.
Comma-Separated (CSV) Text Files

ex1.csv:
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

df = pd.read_csv('ex1.csv')
df
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

• For files with no header row (ex2.csv), ask pandas to assign default column names, or specify the names yourself:
pd.read_csv('ex2.csv', header=None)
pd.read_csv('ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

Alternative:
pd.read_table('ex1.csv', sep=',')
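The calls above can be run end to end without files on disk; as a sketch, the hypothetical strings ex1 and ex2 below stand in for the ex1.csv and ex2.csv files from the slides, wrapped in io.StringIO so read_csv can treat them as files:

```python
import io
import pandas as pd

# In-memory stand-ins for the slide's ex1.csv / ex2.csv (hypothetical data).
ex1 = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
ex2 = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"  # same rows, no header

df = pd.read_csv(io.StringIO(ex1))           # header row becomes column names
print(list(df.columns))                      # ['a', 'b', 'c', 'd', 'message']

df2 = pd.read_csv(io.StringIO(ex2), header=None)            # default names 0..4
df3 = pd.read_csv(io.StringIO(ex2),
                  names=['a', 'b', 'c', 'd', 'message'])    # explicit names
print(list(df2.columns))                     # [0, 1, 2, 3, 4]
print(df3['message'].tolist())               # ['hello', 'world', 'foo']
```

Any file-like object works where a path is accepted, which makes small experiments like this easy to reproduce.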
Comma-Separated (CSV) Text Files

• You can use one of the file's columns as the index:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('ex2.csv', names=names, index_col='message')
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12

• How to handle fields separated by a variable amount of whitespace? Pass a regular expression meaning "one or more whitespace characters" as the separator:
ex3.txt:
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
pd.read_table('ex3.txt', sep='\s+')
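A minimal runnable sketch of both ideas, again using in-memory stand-ins (the ex2 and ex3 strings below are hypothetical, with shortened numbers for ex3):

```python
import io
import pandas as pd

# index_col: use the 'message' column as the row index.
ex2 = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ['a', 'b', 'c', 'd', 'message']
df = pd.read_csv(io.StringIO(ex2), names=names, index_col='message')
print(df.loc['world', 'b'])                  # 6

# Variable-width whitespace: pass the regex r'\s+' as the separator.
# The header row has one fewer field than the data rows, so pandas
# infers that the first column is the index.
ex3 = "A B C\naaa -0.26 -1.02 -0.62\nbbb  0.93  0.30 -0.03\n"
result = pd.read_table(io.StringIO(ex3), sep=r'\s+')
print(list(result.columns))                  # ['A', 'B', 'C']
print(result.index.tolist())                 # ['aaa', 'bbb']
```

Note the raw string r'\s+': it avoids Python treating the backslash as an escape character.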
Reading Text Files in Pieces

• If you want to read only a small number of rows, use nrows:
pd.read_csv('ex6.csv', nrows=5)

• To read a file in pieces, specify a chunksize in rows; read_csv then returns a parser object:
chunker = pd.read_csv('ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x7f6b1e2672e8>

• Iterate over the parser object to aggregate the value counts in the 'key' column:
tot = pd.Series([], dtype=float)
for c in chunker:
    tot = tot.add(c['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)

• There is also chunker.get_chunk(n).
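The chunked aggregation can be demonstrated end to end with a small hypothetical stand-in for ex6.csv (a 'key' column with known letter frequencies), so the result is easy to verify:

```python
import io
import pandas as pd

# Hypothetical stand-in for ex6.csv: 60 rows of a single 'key' column,
# with 'a' x20, 'b' x30, 'c' x10 overall.
csv = "key\n" + "\n".join(list("aabbbc" * 10))

# chunksize=20 yields the file in 3 pieces via a TextFileReader.
chunker = pd.read_csv(io.StringIO(csv), chunksize=20)

tot = pd.Series([], dtype=float)
for piece in chunker:
    # Add this chunk's value counts into the running total.
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
print(tot['b'])   # 30.0 — 'b' appears 3 times per 6-letter group, 10 groups
```

fill_value=0 matters: without it, a key missing from one chunk would turn its running total into NaN.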
Writing Data to Text Format

• We can write data out in comma-separated form:
data.to_csv('out.csv')

• Writing to sys.stdout prints the text result; the row index and header can be disabled:
data.to_csv(sys.stdout, index=False, header=False)

• You can also write only a subset of the columns, in an order of your choosing:
data.to_csv(sys.stdout, columns=['a', 'b', 'c'])
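As a sketch, writing to an io.StringIO buffer (instead of sys.stdout or a file path, both of which to_csv also accepts) makes the produced text easy to inspect; the small data frame here is hypothetical:

```python
import io
import pandas as pd

data = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Suppress both the row index and the header line.
buf = io.StringIO()
data.to_csv(buf, index=False, header=False)
print(buf.getvalue())              # rows: 1,3,5 and 2,4,6

# Write a subset of the columns, in a chosen order.
buf2 = io.StringIO()
data.to_csv(buf2, index=False, columns=['c', 'a'])
print(buf2.getvalue())             # header c,a then rows 5,1 and 6,2
```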
JSON Data

• JSON (short for JavaScript Object Notation) is a standard format for sending data.
• It is a free-form data format. Example (call it j_str):
{"name": "Wes",
 "places_lived": ["United States", "Spain"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30},
              {"name": "Katie", "age": 38}]
}
• This is nearly valid Python code. Exceptions: the null value is null, trailing commas are disallowed at the end of lists, and all keys in an object must be strings.
JSON Data

• json.loads converts a JSON string to Python objects; json.dumps converts a Python object back to JSON. There is also json.load(fp) for reading directly from a file object.
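Applied to the j_str example from the previous slide, the conversion looks like this:

```python
import json

# The JSON string from the slide.
j_str = """
{"name": "Wes",
 "places_lived": ["United States", "Spain"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30},
              {"name": "Katie", "age": 38}]
}
"""

result = json.loads(j_str)           # JSON text -> Python dict
print(result['pet'])                 # None — JSON null maps to Python None
print(result['siblings'][1]['age'])  # 38

as_json = json.dumps(result)         # Python object -> JSON text
print(type(as_json))                 # <class 'str'>
```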
JSON Data

• The default options for pandas.read_json assume that each object in the JSON array is a row in the table:
example.json:
[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]

data = pd.read_json('example.json')
data
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

• To export data from pandas to JSON, use to_json:
print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
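A self-contained round trip, with an in-memory string standing in for example.json:

```python
import io
import pandas as pd

# In-memory stand-in for the slide's example.json.
j = '[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}, {"a": 7, "b": 8, "c": 9}]'

data = pd.read_json(io.StringIO(j))   # each JSON object becomes a row
print(data.shape)                     # (3, 3)

records = data.to_json(orient='records')   # back to a JSON array of objects
print(records)
```

orient='records' is one of several layouts to_json supports; the default orient for a DataFrame is 'columns', which nests values under column names instead.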
XML and HTML: Web Scraping

• pandas.read_html needs an HTML-parsing library:
conda install lxml
pip install beautifulsoup4 html5lib

• Given the HTML document of the US FDIC list of bank failures, find the years with the most bank failures:
tables = pd.read_html('fdic_failed_bank_list.html')
failures = tables[0]
6.2 Binary Data Formats

• Python has a built-in pickle module for binary serialization:
import pickle

• All pandas objects have a to_pickle method:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)))
frame.to_pickle('frame_pickle')
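A quick sketch of the round trip without touching the disk: pickle.dumps/pickle.loads serialize through a bytes object, while to_pickle/pd.read_pickle do the same via a file path.

```python
import pickle
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)))

# Serialize the DataFrame to bytes and restore it.
blob = pickle.dumps(frame)
restored = pickle.loads(blob)
print(restored.equals(frame))     # True
```

Caveat from the book: pickle is only recommended as a short-term storage format, since the format is not guaranteed to be stable across library versions.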
Using HDF5 Format

• The Hierarchical Data Format (HDF5) is efficient and cross-platform.
• pandas has built-in support for HDF5.
• Use to_hdf and read_hdf to store and retrieve one or more pandas objects in an HDF5 file:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.to_hdf('data.h5', key='df1', mode='w')
s = pd.Series([1, 2, 3, 4])
s.to_hdf('data.h5', key='s1')

pd.read_hdf('data.h5', 'df1')
   A  B
0  1  4
1  2  5
2  3  6
Using HDF5 Format

• The HDFStore class works like a dict and handles the low-level details of writing and retrieving:
frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
frm2 = store['obj1']
store.close()
Reading Microsoft Excel Files

• pandas supports reading from Excel 2003 (and higher) files using either the ExcelFile class or the pandas.read_excel function. The packages xlrd and openpyxl are needed:
xlsx = pd.ExcelFile('ex1.xlsx')
pd.read_excel(xlsx, 'Sheet1')
   a  b   c   d message
0  1  2   3   4   hello
1  5  6   7   8   world
2  9 10  11  12     foo
# Alternatively, for one sheet:
frame = pd.read_excel('ex1.xlsx', 'Sheet1')

• Writing is supported with ExcelWriter:
writer = pd.ExcelWriter('ex2.xlsx')
frame.to_excel(writer, 'Sheet1')
writer.save()  # save and close
# Alternatively, for one sheet:
frame.to_excel('ex2.xlsx')
6.3 Interacting with Web APIs

• Many websites have public APIs providing data feeds via JSON, e.g., weather data.
• To find the last 30 GitHub issues for pandas, we can make a GET HTTP request using the add-on requests library:
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
data = resp.json()  # list of dicts
data[0]['title']
'Period does not round down for …'
issues = pd.DataFrame(data,
                      columns=['number', 'title', 'labels', 'state'])
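The DataFrame-shaping step can be tried offline; the sketch below substitutes a hard-coded sample payload (the values are made up) with the same shape as resp.json() from the GitHub issues API:

```python
import pandas as pd

# Hypothetical sample of what resp.json() returns: each issue is a dict
# with many fields, of which we keep only a few.
data = [
    {'number': 101, 'title': 'Example issue', 'labels': [], 'state': 'open',
     'comments': 2},
    {'number': 100, 'title': 'Another issue', 'labels': [{'name': 'Bug'}],
     'state': 'closed', 'comments': 0},
]

# DataFrame keeps only the requested columns, in the requested order.
issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])
print(issues.shape)               # (2, 4)
print(issues.loc[0, 'state'])     # open
```

Passing columns= when building a DataFrame from a list of dicts both selects and orders the fields, discarding everything else in the payload.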
6.4 Interacting with Databases

• The SQLAlchemy project is a popular Python SQL toolkit for interfacing with SQL databases.
• It supports SQLite, PostgreSQL, MySQL, Oracle, Microsoft SQL Server, Firebird, Sybase, and others:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')

pd.read_sql('select * from test', db)
         a        b     c  d
0  Atlanta  Georgia  1.25  6
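A self-contained sketch of the same query using the standard-library sqlite3 driver and an in-memory database (the table contents mirror the slide's example row); pd.read_sql accepts a plain DBAPI SQLite connection as well as a SQLAlchemy engine:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a 'test' table like the slide's.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a TEXT, b TEXT, c REAL, d INTEGER)')
con.execute("INSERT INTO test VALUES ('Atlanta', 'Georgia', 1.25, 6)")
con.commit()

# read_sql runs the query and returns the result set as a DataFrame.
df = pd.read_sql('select * from test', con)
print(df.loc[0, 'a'])             # Atlanta
con.close()
```

For databases other than SQLite, a SQLAlchemy engine (as on the slide) is the recommended connection object.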