
Data Loading, Storage and File Formats

1
Reference

• Chapter 6
• Wes McKinney, Python for Data Analysis: Data Wrangling
with Pandas, NumPy, and IPython, O’Reilly Media, 2nd
Edition, 2018.
• Material: https://github.com/wesm/pydata-book

2
Outline

1. Reading and Writing Data in Text Format


2. Binary Data Formats
3. Interacting with Web APIs
4. Interacting with Databases

3
Outline

1. Reading and Writing Data in Text Format
   • Parsing Functions
   • Reading Text Files in Pieces
   • Writing Data to Text Format
   • JSON Data
   • XML and HTML: Web Scraping
2. Binary Data Formats
3. Interacting with Web APIs
4. Interacting with Databases
4
Parsing Functions in pandas

(Table of text-parsing functions; pd.read_csv and pd.read_table are the most used.)
5
Parsing Functions in pandas – cont.

(Table continued.)
6
6.1 Reading and Writing Data in Text Format
• These parsing functions take optional arguments that fall
into the following categories:
• Indexing
• Type inference and data conversion
• Datetime parsing
• Iterating
• Unclean data issues
• Many options, so refer to the online documentation for
complex cases.
7
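A minimal sketch of a few of these option categories in one call, using an inline string in place of a file on disk (the data here is made up, not one of the book's example files):

```python
import io

import pandas as pd

# A small CSV with a date column and a missing value (hypothetical data).
raw = """date,value,label
2023-01-01,1.5,a
2023-01-02,,b
2023-01-03,3.0,a
"""

# Type inference (value becomes float), datetime parsing, and
# missing-data handling (the empty field becomes NaN) in one call.
df = pd.read_csv(io.StringIO(raw), parse_dates=['date'])
```

io.StringIO stands in for a file path; read_csv accepts any file-like object.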
Comma-Separated (CSV) Text Files

ex1.csv:
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

df = pd.read_csv('ex1.csv')
df
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

Alternative:
pd.read_table('ex1.csv', sep=',')

• For files with no header row (ex2.csv holds the same data without the header line), ask pandas to assign default column names, or specify names yourself:

pd.read_csv('ex2.csv', header=None)
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo

pd.read_csv('ex2.csv',
            names=['a', 'b', 'c', 'd', 'message'])
8
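The header-handling options above can be run directly, with an inline string standing in for ex2.csv:

```python
import io

import pandas as pd

# Same rows as ex2.csv: data with no header line.
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# Let pandas assign default integer column names:
df_default = pd.read_csv(io.StringIO(raw), header=None)

# Or supply the names explicitly:
df_named = pd.read_csv(io.StringIO(raw),
                       names=['a', 'b', 'c', 'd', 'message'])
```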
Comma-Separated (CSV) Text Files

• You can use one of the file's columns as the index:

names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('ex2.csv', names=names,
            index_col='message')
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12

• How do you handle fields separated by a variable amount of whitespace? Pass a regular expression matching one or more whitespace characters as the separator:

ex3.txt:
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399

pd.read_table('ex3.txt', sep='\s+')

• Why must the first column be the index? Because there is one fewer column name than data columns, read_table infers that the first column is the DataFrame's index.
9
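Both ideas can be sketched with inline strings (the whitespace-separated data here is made up and smaller than ex3.txt):

```python
import io

import pandas as pd

# One of the columns becomes the index:
raw = "1,2,3,4,hello\n5,6,7,8,world\n"
df = pd.read_csv(io.StringIO(raw),
                 names=['a', 'b', 'c', 'd', 'message'],
                 index_col='message')

# Variable whitespace: one header name fewer than data columns,
# so the first column is inferred to be the index.
ws = "A B C\naaa 1 2 3\nbbb 4 5 6\n"
df_ws = pd.read_table(io.StringIO(ws), sep=r'\s+')
```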
Comma-Separated (CSV) Text Files

• Missing data is usually either not present (empty string) or marked by some sentinel value:

ex5.csv:
something,a,b,c,d,msg
one,1,2,3,4,NA
two,5,6,,8,foo

pd.read_csv('ex5.csv')
  something  a  b    c  d  msg
0       one  1  2  3.0  4  NaN
1       two  5  6  NaN  8  foo

• You can specify the sentinel values per column:

sentinels = {'msg': ['foo', 'NA'],
             'something': ['two']}
pd.read_csv('ex5.csv',
            na_values=sentinels)
  something  a  b    c  d  msg
0       one  1  2  3.0  4  NaN
1       NaN  5  6  NaN  8  NaN
10
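The sentinel handling can be verified directly, with the ex5.csv contents inlined:

```python
import io

import pandas as pd

raw = "something,a,b,c,d,msg\none,1,2,3,4,NA\ntwo,5,6,,8,foo\n"

# Per-column sentinel values; pandas's default NA markers
# (empty string, 'NA', etc.) still apply on top of these.
sentinels = {'msg': ['foo', 'NA'], 'something': ['two']}
df = pd.read_csv(io.StringIO(raw), na_values=sentinels)
```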
Reading Text Files in Pieces

• If you want to read only a small number of rows, use nrows:

pd.read_csv('ex6.csv', nrows=5)

• To read a file in pieces, specify a chunksize in rows:

chunker = pd.read_csv('ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x7f6b1e2672e8>

• Iterate over the returned parser object to aggregate the value counts in the 'key' column:

tot = pd.Series([])
for c in chunker:
    tot = tot.add(c['key'].value_counts(),
                  fill_value=0)
tot = tot.sort_values(ascending=False)

• There is also chunker.get_chunk(n).
11
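The chunked aggregation loop above, run end to end on a tiny made-up file (six rows instead of ex6.csv's thousands):

```python
import io

import pandas as pd

# Hypothetical file with a repeated 'key' column: a,a,b,b,b,c.
raw = "key,val\n" + "".join(f"{k},1\n" for k in "aabbbc")

chunker = pd.read_csv(io.StringIO(raw), chunksize=2)

tot = pd.Series([], dtype='float64')
for chunk in chunker:
    # add with fill_value=0 accumulates counts across chunks.
    tot = tot.add(chunk['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
```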
Writing Data to Text Format

• We can write the data out to a comma-separated file:

data.to_csv('out.csv')

• Useful options: sep, na_rep, index, and header:

data.to_csv(sys.stdout, sep='|', na_rep='NULL')
|something|a|b|c|d|message
0|one|1|2|3.0|4|NULL
…

data.to_csv(sys.stdout, index=False, header=False)

• You can also write only a subset of the columns, and in an order of your choosing:

data.to_csv(sys.stdout, columns=['a', 'b', 'c'])
12
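These options can be combined and captured in a string buffer instead of sys.stdout (the DataFrame here is made up):

```python
import io

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, None], 'c': ['x', 'y']})

# Custom separator, NA sentinel, no index, and a column subset.
buf = io.StringIO()
df.to_csv(buf, sep='|', na_rep='NULL', index=False, columns=['a', 'b'])
text = buf.getvalue()
```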
JSON Data

• JSON (short for JavaScript Object Notation) is a standard format for sending data.
• It is a nearly free-form data format. Example (call it j_str):

{"name": "Wes",
 "places_lived": ["United States", "Spain"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30},
              {"name": "Katie", "age": 38}]
}

• This is nearly valid Python code. Exceptions: the null value is null (not None), trailing commas at the end of lists are disallowed, and all of the keys in an object must be strings.
13
JSON Data

• Python has built-in JSON support.
• To convert a JSON string to Python form, use json.loads (or json.load(fp) for a file):

result = json.loads(j_str)

• json.dumps converts a Python object back to JSON (or json.dump(obj, fp) for a file):

asjson = json.dumps(result)

• A list of dicts converts easily to a DataFrame:

siblings = pd.DataFrame(result['siblings'],
                        columns=['name', 'age'])
siblings
    name  age
0  Scott   30
1  Katie   38
14
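The loads/dumps round trip and the dict-to-DataFrame step above, runnable as one piece:

```python
import json

import pandas as pd

j_str = """{"name": "Wes",
 "places_lived": ["United States", "Spain"], "pet": null,
 "siblings": [{"name": "Scott", "age": 30},
              {"name": "Katie", "age": 38}]}"""

result = json.loads(j_str)    # JSON string -> Python dict (null -> None)
asjson = json.dumps(result)   # Python object -> JSON string

# A list of dicts becomes a DataFrame, selecting the columns we want.
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
```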
JSON Data

• The default options for pandas.read_json assume that each object in the JSON array is a row in the table:

example.json:
[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]

data = pd.read_json('example.json')
data
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

• To export data from pandas to JSON, use to_json:

print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
15
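A round trip through read_json and to_json, with the JSON inlined via a buffer instead of example.json:

```python
import io
import json

import pandas as pd

j = '[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]'

# Each object in the array becomes one row.
data = pd.read_json(io.StringIO(j))

# orient='records' writes the table back as a JSON array of row objects.
out = data.to_json(orient='records')
```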
XML and HTML: Web Scraping

• pandas has read_html, which returns a list of DataFrames. It requires extra packages:

conda install lxml
pip install beautifulsoup4 html5lib

• Given the HTML document of the US FDIC list of bank failures, find the years with the most bank failures:

tables = pd.read_html('fdic_failed_bank_list.html')
failures = tables[0]

• We need to extract the years from the 'Closing Date' column:

close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()
2010    157
2009    140
2011     92
…
16
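The scraping step needs lxml and the local HTML file, but the year-extraction step can be sketched on its own with a few made-up closing dates standing in for tables[0]:

```python
import pandas as pd

# Hypothetical stand-in for the first DataFrame from pd.read_html.
failures = pd.DataFrame({'Closing Date': ['June 1, 2010',
                                          'March 5, 2010',
                                          'July 9, 2009']})

# Parse the strings to timestamps, then count failures per year.
close_timestamps = pd.to_datetime(failures['Closing Date'])
counts = close_timestamps.dt.year.value_counts()
```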
Outline

1. Reading and Writing Data in Text Format
2. Binary Data Formats
   • Pickle
   • Using HDF5 Format
   • Reading Microsoft Excel Files
3. Interacting with Web APIs
4. Interacting with Databases
17
6.2 Binary Data Formats

• Python has built-in pickle serialization.
• Serialization is converting an object in memory to a byte stream that can be stored on disk or sent over a network.
• Python's pickle package has dump and load.
• Good for short-term storage.

import pickle

dogs_dict = {'Ozzy': 3, 'Filou': 8, 'Luna': 5,
             'Skippy': 10, 'Barco': 12,
             'Balou': 9, 'Laika': 16}

filename = 'dogs'
outfile = open(filename, 'wb')
pickle.dump(dogs_dict, outfile)
outfile.close()

infile = open(filename, 'rb')
new_dict = pickle.load(infile)
infile.close()
18
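The same round trip without touching disk, using dumps/loads, the in-memory counterparts of dump/load:

```python
import pickle

dogs_dict = {'Ozzy': 3, 'Filou': 8, 'Luna': 5}

data = pickle.dumps(dogs_dict)   # object -> byte stream
new_dict = pickle.loads(data)    # byte stream -> object
```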
6.2 Binary Data Formats

• All pandas objects have a to_pickle method.
• The reverse is pd.read_pickle:

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)))
frame.to_pickle('frame_pickle')

pd.read_pickle('frame_pickle')
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
19
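The to_pickle/read_pickle round trip above, using a temporary directory so the path is self-cleaning (the path name is arbitrary):

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)))

# Round-trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'frame_pickle')
frame.to_pickle(path)
restored = pd.read_pickle(path)
```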
Using HDF5 Format

• The hierarchical data format (HDF5) is efficient and cross-platform.
• pandas has built-in support for HDF5.
• Use to_hdf and read_hdf to access one or more pandas objects in an HDF5 file:

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})
df.to_hdf('data.h5', key='df1', mode='w')

s = pd.Series([1, 2, 3, 4])
s.to_hdf('data.h5', key='s1')

pd.read_hdf('data.h5', 'df1')
   A  B
0  1  4
1  2  5
2  3  6
20
Using HDF5 Format

• The HDFStore class works like a dict and handles the low-level details of writing and retrieving:

frame = pd.DataFrame(
    {'a': np.random.randn(100)})

store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']

frm2 = store['obj1']
store.close()
21
Reading Microsoft Excel Files

• pandas supports reading from Excel 2003 (and higher) files using either the ExcelFile class or the pandas.read_excel function.
• Requires the packages xlrd and openpyxl.
• Writing is supported with ExcelWriter:

xlsx = pd.ExcelFile('ex1.xlsx')
pd.read_excel(xlsx, 'Sheet1')
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

# Alternatively, for one sheet:
frame = pd.read_excel('ex1.xlsx', 'Sheet1')

writer = pd.ExcelWriter('ex2.xlsx')
frame.to_excel(writer, 'Sheet1')
writer.save()  # save and close

# Alternatively, for one sheet:
frame.to_excel('ex2.xlsx')
22
Outline

1. Reading and Writing Data in Text Format


2. Binary Data Formats
3. Interacting with Web APIs
4. Interacting with Databases

23
6.3 Interacting with Web APIs

• Many websites have public APIs providing data feeds via JSON, e.g., weather data.
• To find the last 30 GitHub issues for pandas, we can make an HTTP GET request using the add-on requests library:

import requests
url = ('https://api.github.com/repos/'
       'pandas-dev/pandas/issues')
resp = requests.get(url)
data = resp.json()  # list of dicts
data[0]['title']
'Period does not round down for …'

issues = pd.DataFrame(data,
                      columns=['number', 'title',
                               'labels', 'state'])
24
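The DataFrame-building step works on any list of dicts, so it can be sketched offline with a made-up stand-in for resp.json() (the issue numbers and titles below are invented):

```python
import pandas as pd

# Hypothetical stand-in for resp.json(): each issue is a dict with
# more fields than we need; the columns argument selects a subset.
data = [{'number': 101, 'title': 'Example issue', 'labels': [],
         'state': 'open', 'comments': 2},
        {'number': 100, 'title': 'Another issue', 'labels': [],
         'state': 'closed', 'comments': 0}]

issues = pd.DataFrame(data, columns=['number', 'title',
                                     'labels', 'state'])
```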
Outline

1. Reading and Writing Data in Text Format


2. Binary Data Formats
3. Interacting with Web APIs
4. Interacting with Databases

25
6.4 Interacting with Databases

• The SQLAlchemy project is a popular Python SQL toolkit for interfacing with SQL databases.
• It supports SQLite, PostgreSQL, MySQL, Oracle, MS SQL Server, Firebird, Sybase, and others.
• pandas has read_sql, which reads data easily from a SQLAlchemy connection:

import sqlalchemy as sqla
db = sqla.create_engine(
    'sqlite:///mydata.sqlite')

pd.read_sql('select * from test', db)
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
26
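A self-contained sketch of the same query, using the standard-library sqlite3 module with an in-memory database in place of a SQLAlchemy engine (pandas also accepts sqlite3 connections directly; the table and values are made up to mirror the slide):

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for mydata.sqlite.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a TEXT, b TEXT, c REAL, d INTEGER)')
con.executemany('INSERT INTO test VALUES (?, ?, ?, ?)',
                [('Atlanta', 'Georgia', 1.25, 6),
                 ('Tallahassee', 'Florida', 2.60, 3)])
con.commit()

df = pd.read_sql('SELECT * FROM test', con)
con.close()
```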
