06 Data Loading Storage and File Formats
File Formats
Reference
• Wes McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd Edition, O'Reilly Media, 2018 — Chapter 6.
• Material: https://github.com/wesm/pydata-book
Outline

Parsing Functions in pandas
[Table of pandas parsing functions not reproduced here; read_csv is the most used.]
6.1 Reading and Writing Data in Text Format
• These parsing functions take optional arguments that fall into the following categories:
  • Indexing
  • Type inference and data conversion
  • Datetime parsing
  • Iterating
  • Unclean data issues
• There are many options, so refer to the online documentation for complex cases.
Comma-Separated (CSV) Text Files

ex1.csv:
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

df = pd.read_csv('ex1.csv')
df
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

• For files with no header row (ex2.csv), ask pandas to assign default column names, or specify the names yourself:
pd.read_csv('ex2.csv', header=None)
pd.read_csv('ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

Alternative:
pd.read_table('ex1.csv', sep=',')
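The calls above can be run end to end without files on disk; as a sketch, the hypothetical strings ex1 and ex2 below stand in for the ex1.csv and ex2.csv files from the slides, wrapped in io.StringIO so read_csv can treat them as files:

```python
import io
import pandas as pd

# In-memory stand-ins for the slide's ex1.csv / ex2.csv (hypothetical data).
ex1 = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
ex2 = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"  # same rows, no header

df = pd.read_csv(io.StringIO(ex1))           # header row becomes column names
print(list(df.columns))                      # ['a', 'b', 'c', 'd', 'message']

df2 = pd.read_csv(io.StringIO(ex2), header=None)            # default names 0..4
df3 = pd.read_csv(io.StringIO(ex2),
                  names=['a', 'b', 'c', 'd', 'message'])    # explicit names
print(list(df2.columns))                     # [0, 1, 2, 3, 4]
print(df3['message'].tolist())               # ['hello', 'world', 'foo']
```

Any file-like object works where a path is accepted, which makes small experiments like this easy to reproduce.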
Comma-Separated (CSV) Text Files

• You can use one of the file's columns as the index:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('ex2.csv', names=names, index_col='message')
         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12

• How to handle fields separated by a variable amount of whitespace? Pass a regular expression meaning "one or more whitespace characters" as the separator:
ex3.txt:
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
pd.read_table('ex3.txt', sep='\s+')
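A minimal runnable sketch of both ideas, again using in-memory stand-ins (the ex2 and ex3 strings below are hypothetical, with shortened numbers for ex3):

```python
import io
import pandas as pd

# index_col: use the 'message' column as the row index.
ex2 = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ['a', 'b', 'c', 'd', 'message']
df = pd.read_csv(io.StringIO(ex2), names=names, index_col='message')
print(df.loc['world', 'b'])                  # 6

# Variable-width whitespace: pass the regex r'\s+' as the separator.
# The header row has one fewer field than the data rows, so pandas
# infers that the first column is the index.
ex3 = "A B C\naaa -0.26 -1.02 -0.62\nbbb  0.93  0.30 -0.03\n"
result = pd.read_table(io.StringIO(ex3), sep=r'\s+')
print(list(result.columns))                  # ['A', 'B', 'C']
print(result.index.tolist())                 # ['aaa', 'bbb']
```

Note the raw string r'\s+': it avoids Python treating the backslash as an escape character.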
Reading Text Files in Pieces

• If you want to read only a small number of rows, use nrows:
pd.read_csv('ex6.csv', nrows=5)

• To read a file in pieces, specify a chunksize in rows; read_csv then returns a parser object:
chunker = pd.read_csv('ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x7f6b1e2672e8>

• Iterate over the parser object to aggregate the value counts in the 'key' column:
tot = pd.Series([], dtype=float)
for c in chunker:
    tot = tot.add(c['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)

• There is also chunker.get_chunk(n).
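The chunked aggregation can be demonstrated end to end with a small hypothetical stand-in for ex6.csv (a 'key' column with known letter frequencies), so the result is easy to verify:

```python
import io
import pandas as pd

# Hypothetical stand-in for ex6.csv: 60 rows of a single 'key' column,
# with 'a' x20, 'b' x30, 'c' x10 overall.
csv = "key\n" + "\n".join(list("aabbbc" * 10))

# chunksize=20 yields the file in 3 pieces via a TextFileReader.
chunker = pd.read_csv(io.StringIO(csv), chunksize=20)

tot = pd.Series([], dtype=float)
for piece in chunker:
    # Add this chunk's value counts into the running total.
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
print(tot['b'])   # 30.0 — 'b' appears 3 times per 6-letter group, 10 groups
```

fill_value=0 matters: without it, a key missing from one chunk would turn its running total into NaN.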
Writing Data to Text Format

• We can write data out in comma-separated form:
data.to_csv('out.csv')

• Writing to sys.stdout prints the text result; the row index and header can be disabled:
data.to_csv(sys.stdout, index=False, header=False)

• You can also write only a subset of the columns, in an order of your choosing:
data.to_csv(sys.stdout, columns=['a', 'b', 'c'])
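As a sketch, writing to an io.StringIO buffer (instead of sys.stdout or a file path, both of which to_csv also accepts) makes the produced text easy to inspect; the small data frame here is hypothetical:

```python
import io
import pandas as pd

data = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Suppress both the row index and the header line.
buf = io.StringIO()
data.to_csv(buf, index=False, header=False)
print(buf.getvalue())              # rows: 1,3,5 and 2,4,6

# Write a subset of the columns, in a chosen order.
buf2 = io.StringIO()
data.to_csv(buf2, index=False, columns=['c', 'a'])
print(buf2.getvalue())             # header c,a then rows 5,1 and 6,2
```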
JSON Data

• JSON (short for JavaScript Object Notation) is a standard format for sending data.
• It is a free-form data format. Example (call it j_str):
{"name": "Wes",
 "places_lived": ["United States", "Spain"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30},
              {"name": "Katie", "age": 38}]
}
• This is nearly valid Python code. Exceptions: the null value is null, trailing commas are disallowed at the end of lists, and all keys in an object must be strings.
JSON Data

• json.loads converts a JSON string to Python objects; json.dumps converts a Python object back to JSON. There is also json.load(fp) for reading directly from a file object.
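Applied to the j_str example from the previous slide, the conversion looks like this:

```python
import json

# The JSON string from the slide.
j_str = """
{"name": "Wes",
 "places_lived": ["United States", "Spain"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30},
              {"name": "Katie", "age": 38}]
}
"""

result = json.loads(j_str)           # JSON text -> Python dict
print(result['pet'])                 # None — JSON null maps to Python None
print(result['siblings'][1]['age'])  # 38

as_json = json.dumps(result)         # Python object -> JSON text
print(type(as_json))                 # <class 'str'>
```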
JSON Data

• The default options for pandas.read_json assume that each object in the JSON array is a row in the table:
example.json:
[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]

data = pd.read_json('example.json')
data
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

• To export data from pandas to JSON, use to_json:
print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
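A self-contained round trip, with an in-memory string standing in for example.json:

```python
import io
import pandas as pd

# In-memory stand-in for the slide's example.json.
j = '[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}, {"a": 7, "b": 8, "c": 9}]'

data = pd.read_json(io.StringIO(j))   # each JSON object becomes a row
print(data.shape)                     # (3, 3)

records = data.to_json(orient='records')   # back to a JSON array of objects
print(records)
```

orient='records' is one of several layouts to_json supports; the default orient for a DataFrame is 'columns', which nests values under column names instead.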
XML and HTML: Web Scraping

• pandas.read_html needs an HTML-parsing library:
conda install lxml
pip install beautifulsoup4 html5lib

• Given the HTML document of the US FDIC list of bank failures, find the years with the most bank failures:
tables = pd.read_html('fdic_failed_bank_list.html')
failures = tables[0]
6.2 Binary Data Formats

• Python has a built-in pickle module for binary serialization:
import pickle

• All pandas objects have a to_pickle method:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)))
frame.to_pickle('frame_pickle')
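A quick sketch of the round trip without touching the disk: pickle.dumps/pickle.loads serialize through a bytes object, while to_pickle/pd.read_pickle do the same via a file path.

```python
import pickle
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)))

# Serialize the DataFrame to bytes and restore it.
blob = pickle.dumps(frame)
restored = pickle.loads(blob)
print(restored.equals(frame))     # True
```

Caveat from the book: pickle is only recommended as a short-term storage format, since the format is not guaranteed to be stable across library versions.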
Using HDF5 Format

• The Hierarchical Data Format (HDF5) is efficient and cross-platform.
• pandas has built-in support for HDF5.
• Use to_hdf and read_hdf to store and retrieve one or more pandas objects in an HDF5 file:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.to_hdf('data.h5', key='df1', mode='w')
s = pd.Series([1, 2, 3, 4])
s.to_hdf('data.h5', key='s1')

pd.read_hdf('data.h5', 'df1')
   A  B
0  1  4
1  2  5
2  3  6
Using HDF5 Format

• The HDFStore class works like a dict and handles the low-level details of writing and retrieving:
frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
frm2 = store['obj1']
store.close()
Reading Microsoft Excel Files

• pandas supports reading from Excel 2003 (and higher) files using either the ExcelFile class or the pandas.read_excel function. The packages xlrd and openpyxl are needed:
xlsx = pd.ExcelFile('ex1.xlsx')
pd.read_excel(xlsx, 'Sheet1')
   a  b   c   d message
0  1  2   3   4   hello
1  5  6   7   8   world
2  9 10  11  12     foo
# Alternatively, for one sheet:
frame = pd.read_excel('ex1.xlsx', 'Sheet1')

• Writing is supported with ExcelWriter:
writer = pd.ExcelWriter('ex2.xlsx')
frame.to_excel(writer, 'Sheet1')
writer.save()  # save and close
# Alternatively, for one sheet:
frame.to_excel('ex2.xlsx')
6.3 Interacting with Web APIs

• Many websites have public APIs providing data feeds via JSON, e.g., weather data.
• To find the last 30 GitHub issues for pandas, we can make a GET HTTP request using the add-on requests library:
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
data = resp.json()  # list of dicts
data[0]['title']
'Period does not round down for …'
issues = pd.DataFrame(data,
                      columns=['number', 'title', 'labels', 'state'])
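The DataFrame-shaping step can be tried offline; the sketch below substitutes a hard-coded sample payload (the values are made up) with the same shape as resp.json() from the GitHub issues API:

```python
import pandas as pd

# Hypothetical sample of what resp.json() returns: each issue is a dict
# with many fields, of which we keep only a few.
data = [
    {'number': 101, 'title': 'Example issue', 'labels': [], 'state': 'open',
     'comments': 2},
    {'number': 100, 'title': 'Another issue', 'labels': [{'name': 'Bug'}],
     'state': 'closed', 'comments': 0},
]

# DataFrame keeps only the requested columns, in the requested order.
issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])
print(issues.shape)               # (2, 4)
print(issues.loc[0, 'state'])     # open
```

Passing columns= when building a DataFrame from a list of dicts both selects and orders the fields, discarding everything else in the payload.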
6.4 Interacting with Databases

• The SQLAlchemy project is a popular Python SQL toolkit for interfacing with SQL databases.
• It supports SQLite, PostgreSQL, MySQL, Oracle, Microsoft SQL Server, Firebird, Sybase, and others:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')

pd.read_sql('select * from test', db)
         a        b     c  d
0  Atlanta  Georgia  1.25  6
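A self-contained sketch of the same query using the standard-library sqlite3 driver and an in-memory database (the table contents mirror the slide's example row); pd.read_sql accepts a plain DBAPI SQLite connection as well as a SQLAlchemy engine:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a 'test' table like the slide's.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a TEXT, b TEXT, c REAL, d INTEGER)')
con.execute("INSERT INTO test VALUES ('Atlanta', 'Georgia', 1.25, 6)")
con.commit()

# read_sql runs the query and returns the result set as a DataFrame.
df = pd.read_sql('select * from test', con)
print(df.loc[0, 'a'])             # Atlanta
con.close()
```

For databases other than SQLite, a SQLAlchemy engine (as on the slide) is the recommended connection object.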