Read CSV Files Using Pandas Library

The document discusses importing a dataset called 'dataset_1.data' using Pandas in Python. It provides the code to import necessary libraries, read the CSV file into a dataframe, and view the first few rows of the data. The dataset contains information about automobiles such as make, fuel type, engine details, and price.

Uploaded by

fi20
Copyright
© © All Rights Reserved

importing-datasets-pandas

September 24, 2023

1 Importing Dataset using Pandas


Data Acquisition
In our case, the Automobile Dataset comes from an online source and is in CSV (comma-separated values)
format.
Dataset name: dataset_1.data

2 Importing Libraries
[1]: import numpy as np
import pandas as pd

3 Importing datasets
We use the pandas.read_csv() function to read the CSV file. Inside the parentheses, we pass the file
path as a quoted string, so that pandas reads the file at that address into a dataframe. The
file path can be either a URL or a local file path.
The name of the dataset is “dataset_1.data”.
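As a small self-contained illustration of reading a header-less CSV, the sketch below uses an in-memory sample instead of dataset_1.data. The three rows and the names sample_csv, sample_columns and sample_df are assumptions made for the example, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for a CSV file without a header row.
sample_csv = io.StringIO(
    "3,alfa-romero,gas,13495\n"
    "2,audi,gas,13950\n"
    "-1,volvo,diesel,22470\n"
)

# Illustrative column names. Without `names`, pandas would treat the
# first data row as the header.
sample_columns = ["symboling", "make", "fuel-type", "price"]

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
sample_df = pd.read_csv(sample_csv, names=sample_columns)

print(sample_df.shape)
print(list(sample_df.columns))
```

Passing a path string or a URL in place of sample_csv should work the same way.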
[ ]: headers_data = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
                     "num-of-doors", "body-style", "drive-wheels", "engine-location",
                     "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
                     "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
                     "compression-ratio", "horsepower", "peak-rpm", "city-mpg",
                     "highway-mpg", "price"]
print("headers\n", type(headers_data))

headers
<class 'list'>

[ ]: # Read the dataset into the "df" variable; the file has no header row,
# so the column names are supplied through headers_data
df = pd.read_csv("dataset_1.data", names = headers_data)
df

[ ]: symboling normalized-losses make fuel-type aspiration \
0 3 ? alfa-romero gas std
1 3 ? alfa-romero gas std
2 1 ? alfa-romero gas std
3 2 164 audi gas std
4 2 164 audi gas std
.. … … … … …
200 -1 95 volvo gas std
201 -1 95 volvo gas turbo
202 -1 95 volvo gas std
203 -1 95 volvo diesel turbo
204 -1 95 volvo gas turbo

num-of-doors body-style drive-wheels engine-location wheel-base … \


0 two convertible rwd front 88.6 …
1 two convertible rwd front 88.6 …
2 two hatchback rwd front 94.5 …
3 four sedan fwd front 99.8 …
4 four sedan 4wd front 99.4 …
.. … … … … … …
200 four sedan rwd front 109.1 …
201 four sedan rwd front 109.1 …
202 four sedan rwd front 109.1 …
203 four sedan rwd front 109.1 …
204 four sedan rwd front 109.1 …

engine-size fuel-system bore stroke compression-ratio horsepower \


0 130 mpfi 3.47 2.68 9.0 111
1 130 mpfi 3.47 2.68 9.0 111
2 152 mpfi 2.68 3.47 9.0 154
3 109 mpfi 3.19 3.40 10.0 102
4 136 mpfi 3.19 3.40 8.0 115
.. … … … … … …
200 141 mpfi 3.78 3.15 9.5 114
201 141 mpfi 3.78 3.15 8.7 160
202 173 mpfi 3.58 2.87 8.8 134
203 145 idi 3.01 3.40 23.0 106
204 141 mpfi 3.78 3.15 9.5 114

peak-rpm city-mpg highway-mpg price


0 5000 21 27 13495
1 5000 21 27 16500
2 5000 19 26 16500
3 5500 24 30 13950
4 5500 18 22 17450
.. … … … …
200 5400 23 28 16845
201 5300 19 25 19045
202 5500 18 23 21485
203 4800 26 27 22470
204 5400 19 25 22625

[205 rows x 26 columns]

After reading the dataset, we can use the dataframe.head(n) method to check the top n rows of
the dataframe, where n is an integer. Conversely, dataframe.tail(n) shows the bottom n rows of
the dataframe.
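A minimal sketch of both methods on a toy dataframe; toy_df and its row_id column are assumptions made for the example. It also shows that n defaults to 5 when omitted.

```python
import pandas as pd

# Hypothetical 20-row frame; row_id simply mirrors the row position.
toy_df = pd.DataFrame({"row_id": range(20)})

first_three = toy_df.head(3)   # top 3 rows
last_three = toy_df.tail(3)    # bottom 3 rows
default_head = toy_df.head()   # n defaults to 5 when omitted

print(first_three)
print(last_three)
print(len(default_head))
```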
[ ]: # show the first 15 rows using dataframe.head() method
df.head(15)

[ ]: symboling normalized-losses make fuel-type aspiration \


0 3 ? alfa-romero gas std
1 3 ? alfa-romero gas std
2 1 ? alfa-romero gas std
3 2 164 audi gas std
4 2 164 audi gas std
5 2 ? audi gas std
6 1 158 audi gas std
7 1 ? audi gas std
8 1 158 audi gas turbo
9 0 ? audi gas turbo
10 2 192 bmw gas std
11 0 192 bmw gas std
12 0 188 bmw gas std
13 0 188 bmw gas std
14 1 ? bmw gas std

num-of-doors body-style drive-wheels engine-location wheel-base … \


0 two convertible rwd front 88.6 …
1 two convertible rwd front 88.6 …
2 two hatchback rwd front 94.5 …
3 four sedan fwd front 99.8 …
4 four sedan 4wd front 99.4 …
5 two sedan fwd front 99.8 …
6 four sedan fwd front 105.8 …
7 four wagon fwd front 105.8 …
8 four sedan fwd front 105.8 …
9 two hatchback 4wd front 99.5 …
10 two sedan rwd front 101.2 …
11 four sedan rwd front 101.2 …
12 two sedan rwd front 101.2 …
13 four sedan rwd front 101.2 …
14 four sedan rwd front 103.5 …

engine-size fuel-system bore stroke compression-ratio horsepower \
0 130 mpfi 3.47 2.68 9.0 111
1 130 mpfi 3.47 2.68 9.0 111
2 152 mpfi 2.68 3.47 9.0 154
3 109 mpfi 3.19 3.40 10.0 102
4 136 mpfi 3.19 3.40 8.0 115
5 136 mpfi 3.19 3.40 8.5 110
6 136 mpfi 3.19 3.40 8.5 110
7 136 mpfi 3.19 3.40 8.5 110
8 131 mpfi 3.13 3.40 8.3 140
9 131 mpfi 3.13 3.40 7.0 160
10 108 mpfi 3.50 2.80 8.8 101
11 108 mpfi 3.50 2.80 8.8 101
12 164 mpfi 3.31 3.19 9.0 121
13 164 mpfi 3.31 3.19 9.0 121
14 164 mpfi 3.31 3.19 9.0 121

peak-rpm city-mpg highway-mpg price


0 5000 21 27 13495
1 5000 21 27 16500
2 5000 19 26 16500
3 5500 24 30 13950
4 5500 18 22 17450
5 5500 19 25 15250
6 5500 19 25 17710
7 5500 19 25 18920
8 5500 17 20 23875
9 5500 16 22 ?
10 5800 23 29 16430
11 5800 23 29 16925
12 4250 21 28 20970
13 4250 21 28 21105
14 4250 20 25 24565

[15 rows x 26 columns]

[ ]: #show the last 10 rows of the dataframe.


df.tail(10)

[ ]: symboling normalized-losses make fuel-type aspiration num-of-doors \


195 -1 74 volvo gas std four
196 -2 103 volvo gas std four
197 -1 74 volvo gas std four
198 -2 103 volvo gas turbo four
199 -1 74 volvo gas turbo four
200 -1 95 volvo gas std four
201 -1 95 volvo gas turbo four
202 -1 95 volvo gas std four
203 -1 95 volvo diesel turbo four
204 -1 95 volvo gas turbo four

body-style drive-wheels engine-location wheel-base … engine-size \


195 wagon rwd front 104.3 … 141
196 sedan rwd front 104.3 … 141
197 wagon rwd front 104.3 … 141
198 sedan rwd front 104.3 … 130
199 wagon rwd front 104.3 … 130
200 sedan rwd front 109.1 … 141
201 sedan rwd front 109.1 … 141
202 sedan rwd front 109.1 … 173
203 sedan rwd front 109.1 … 145
204 sedan rwd front 109.1 … 141

fuel-system bore stroke compression-ratio horsepower peak-rpm \


195 mpfi 3.78 3.15 9.5 114 5400
196 mpfi 3.78 3.15 9.5 114 5400
197 mpfi 3.78 3.15 9.5 114 5400
198 mpfi 3.62 3.15 7.5 162 5100
199 mpfi 3.62 3.15 7.5 162 5100
200 mpfi 3.78 3.15 9.5 114 5400
201 mpfi 3.78 3.15 8.7 160 5300
202 mpfi 3.58 2.87 8.8 134 5500
203 idi 3.01 3.40 23.0 106 4800
204 mpfi 3.78 3.15 9.5 114 5400

city-mpg highway-mpg price


195 23 28 13415
196 24 28 15985
197 24 28 16515
198 17 22 18420
199 17 22 18950
200 23 28 16845
201 19 25 19045
202 18 23 21485
203 26 27 22470
204 19 25 22625

[10 rows x 26 columns]

Basic Insight of Dataset


After reading the data into a Pandas dataframe, it is time to explore the dataset. There are several
ways to obtain essential insights into the data and better understand our dataset.

Data Types
Data comes in a variety of types. The main types stored in Pandas dataframes are object, float, int, bool
and datetime64. To better understand each attribute, it helps to know the
data type of each column. In Pandas:
Syntax: dataframe.dtypes
returns a Series with the data type of each column.
[ ]: # check the data type of data frame "df" by .dtypes
df.dtypes

[ ]: symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object

As shown above, “symboling” and “curb-weight” are int64, “normalized-losses” is object,
“wheel-base” is float64, and so on. Columns such as “normalized-losses” and “price” hold numbers
but are read as object because some of their entries are the placeholder “?”.
These data types can be changed; we will learn how to accomplish this in a later module.
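The effect of a “?” placeholder on the inferred type can be seen in a small sketch. The three-row sample and the names sample_csv, sample_df and column_types are assumptions made for the example; the exact name of the non-numeric dtype may differ between pandas versions.

```python
import io

import pandas as pd

# Hypothetical sample: the second column mixes numbers with a "?" placeholder.
sample_csv = io.StringIO(
    "3,?,88.6\n"
    "2,164,99.8\n"
    "-1,95,109.1\n"
)

sample_df = pd.read_csv(
    sample_csv,
    names=["symboling", "normalized-losses", "wheel-base"],
)

# dtypes returns a Series mapping each column name to its data type.
column_types = sample_df.dtypes
print(column_types)
```

Only the column containing “?” should come back with a non-numeric type; the other two are inferred as integer and float.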
Describe
If we would like a statistical summary of each column, such as the count, mean, and
standard deviation, we use the describe method:
Syntax: dataframe.describe()
This method provides various summary statistics, excluding NaN (Not a Number) values.
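A minimal sketch of how NaN values are left out of the summary, using a toy column; toy_df and its values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value out of four.
toy_df = pd.DataFrame({"engine-size": [100.0, 120.0, np.nan, 140.0]})

summary = toy_df.describe()

# The NaN entry is skipped: count is 3, and the mean is taken over
# the three remaining values.
print(summary.loc["count", "engine-size"])
print(summary.loc["mean", "engine-size"])
```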

[ ]: df.describe()

[ ]: symboling wheel-base length width height \


count 205.000000 205.000000 205.000000 205.000000 205.000000
mean 0.834146 98.756585 174.049268 65.907805 53.724878
std 1.245307 6.021776 12.337289 2.145204 2.443522
min -2.000000 86.600000 141.100000 60.300000 47.800000
25% 0.000000 94.500000 166.300000 64.100000 52.000000
50% 1.000000 97.000000 173.200000 65.500000 54.100000
75% 2.000000 102.400000 183.100000 66.900000 55.500000
max 3.000000 120.900000 208.100000 72.300000 59.800000

curb-weight engine-size compression-ratio city-mpg highway-mpg


count 205.000000 205.000000 205.000000 205.000000 205.000000
mean 2555.565854 126.907317 10.142537 25.219512 30.751220
std 520.680204 41.642693 3.972040 6.542142 6.886443
min 1488.000000 61.000000 7.000000 13.000000 16.000000
25% 2145.000000 97.000000 8.600000 19.000000 25.000000
50% 2414.000000 120.000000 9.000000 24.000000 30.000000
75% 2935.000000 141.000000 9.400000 30.000000 34.000000
max 4066.000000 326.000000 23.000000 49.000000 54.000000

This shows the statistical summary of all numeric-typed (int, float) columns. For example, the
attribute “symboling” has a count of 205, a mean of 0.83, a standard deviation
of 1.25, a minimum of -2, a 25th percentile of 0, a 50th percentile of 1, a 75th percentile of 2, and
a maximum of 3. However, what if we would also like to check all the columns, including
those of type object?
You can add the argument include = “all” inside the parentheses. Let’s try it again.
[ ]: # describe all the columns in "df"
df.describe(include = "all")

[ ]: symboling normalized-losses make fuel-type aspiration \


count 205.000000 205 205 205 205
unique NaN 52 22 2 2
top NaN ? toyota gas std
freq NaN 41 32 185 168
mean 0.834146 NaN NaN NaN NaN
std 1.245307 NaN NaN NaN NaN
min -2.000000 NaN NaN NaN NaN
25% 0.000000 NaN NaN NaN NaN
50% 1.000000 NaN NaN NaN NaN
75% 2.000000 NaN NaN NaN NaN
max 3.000000 NaN NaN NaN NaN

num-of-doors body-style drive-wheels engine-location wheel-base … \


count 205 205 205 205 205.000000 …
unique 3 5 3 2 NaN …
top four sedan fwd front NaN …
freq 114 96 120 202 NaN …
mean NaN NaN NaN NaN 98.756585 …
std NaN NaN NaN NaN 6.021776 …
min NaN NaN NaN NaN 86.600000 …
25% NaN NaN NaN NaN 94.500000 …
50% NaN NaN NaN NaN 97.000000 …
75% NaN NaN NaN NaN 102.400000 …
max NaN NaN NaN NaN 120.900000 …

engine-size fuel-system bore stroke compression-ratio horsepower \


count 205.000000 205 205 205 205.000000 205
unique NaN 8 39 37 NaN 60
top NaN mpfi 3.62 3.40 NaN 68
freq NaN 94 23 20 NaN 19
mean 126.907317 NaN NaN NaN 10.142537 NaN
std 41.642693 NaN NaN NaN 3.972040 NaN
min 61.000000 NaN NaN NaN 7.000000 NaN
25% 97.000000 NaN NaN NaN 8.600000 NaN
50% 120.000000 NaN NaN NaN 9.000000 NaN
75% 141.000000 NaN NaN NaN 9.400000 NaN
max 326.000000 NaN NaN NaN 23.000000 NaN

peak-rpm city-mpg highway-mpg price


count 205 205.000000 205.000000 205
unique 24 NaN NaN 187
top 5500 NaN NaN ?
freq 37 NaN NaN 4
mean NaN 25.219512 30.751220 NaN
std NaN 6.542142 6.886443 NaN
min NaN 13.000000 16.000000 NaN
25% NaN 19.000000 25.000000 NaN
50% NaN 24.000000 30.000000 NaN
75% NaN 30.000000 34.000000 NaN
max NaN 49.000000 54.000000 NaN

[11 rows x 26 columns]

Now it provides the statistical summary of all the columns, including object-typed attributes. For
the object-typed columns, we can see how many unique values there are, which value is the most
frequent (top), and how often it occurs (freq). Some values in the table above show as “NaN” because
that statistic is not defined for that column's type.
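The same pattern can be reproduced on a toy frame with one text column and one numeric column; toy_df and its values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical frame: one text column and one numeric column.
toy_df = pd.DataFrame({
    "fuel-type": ["gas", "gas", "diesel", "gas"],
    "city-mpg": [21, 19, 26, 24],
})

full_summary = toy_df.describe(include="all")

# unique/top/freq are filled in only for the text column,
# mean/std/min/... only for the numeric column; the rest is NaN.
print(full_summary)
```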
[ ]: #Replacing "?" with np.nan so that pandas can recognize the null values.
df.replace('?', np.nan, inplace=True)

We use the replace method to replace all occurrences of “?” with np.nan in the DataFrame. The
inplace=True argument ensures that the changes are made in place in the original DataFrame.
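A minimal sketch of the replacement on a toy frame, followed by a count of the resulting missing values. The names toy_df, cleaned_df and missing_per_column are assumptions made for the example, and the sketch uses the assignment form rather than inplace=True so the original frame can be compared afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where "?" marks a missing entry.
toy_df = pd.DataFrame({
    "normalized-losses": ["?", "164", "95"],
    "price": ["13495", "?", "22470"],
})

# Without inplace=True, replace returns a new frame and leaves toy_df untouched.
cleaned_df = toy_df.replace("?", np.nan)

# isna().sum() counts the missing values in each column.
missing_per_column = cleaned_df.isna().sum()
print(missing_per_column)
```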
Info
Another method you can use to check your dataset is:
Syntax: dataframe.info()
It provides a concise summary of your DataFrame.
This method prints information about a DataFrame, including the index dtype and columns, non-
null values, and memory usage.
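Because info() prints its summary rather than returning it, a sketch like the following can capture the text through the buf parameter when it needs to be inspected or logged. The names toy_df, buffer and info_text are assumptions made for the example.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame with one missing price.
toy_df = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": [13950.0, np.nan, 22470.0],
})

# info() writes to stdout by default; buf redirects the text to any writable buffer.
buffer = io.StringIO()
toy_df.info(buf=buffer)
info_text = buffer.getvalue()

print(info_text)
```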
[ ]: # look at the info of "df"
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 205 non-null int64
1 normalized-losses 164 non-null object
2 make 205 non-null object
3 fuel-type 205 non-null object
4 aspiration 205 non-null object
5 num-of-doors 203 non-null object
6 body-style 205 non-null object
7 drive-wheels 205 non-null object
8 engine-location 205 non-null object
9 wheel-base 205 non-null float64
10 length 205 non-null float64
11 width 205 non-null float64
12 height 205 non-null float64
13 curb-weight 205 non-null int64
14 engine-type 205 non-null object
15 num-of-cylinders 205 non-null object
16 engine-size 205 non-null int64
17 fuel-system 205 non-null object
18 bore 201 non-null object
19 stroke 201 non-null object
20 compression-ratio 205 non-null float64
21 horsepower 203 non-null object
22 peak-rpm 203 non-null object
23 city-mpg 205 non-null int64
24 highway-mpg 205 non-null int64
25 price 201 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
Save Dataset

Correspondingly, Pandas lets us save a dataset to CSV with the dataframe.to_csv()
method; pass the file path and name as a quoted string inside the parentheses.
For example, to save the dataframe df as automobile.csv on your local machine, you would
pass "automobile.csv" as the path to df.to_csv().
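A minimal sketch of such a call, assuming a small stand-in dataframe and a temporary directory so that nothing in the working directory is overwritten. The names toy_df, csv_path and reloaded_df are hypothetical, and index=False is an optional choice made here to keep the row index out of the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the automobile dataframe.
toy_df = pd.DataFrame({"make": ["audi", "volvo"], "price": [13950, 22470]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "automobile.csv")

    # index=False leaves the row index out of the written file.
    toy_df.to_csv(csv_path, index=False)

    # Read the file back to confirm what was written.
    reloaded_df = pd.read_csv(csv_path)

print(reloaded_df)
```

For the notebook's own dataframe, the equivalent call would be df.to_csv("automobile.csv"), optionally with index=False.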
We can also read and save other file formats using functions similar to pd.read_csv() and
df.to_csv(); they are listed in the following table:
Read/Save Other Data Formats

Data Format    Read               Save
csv            pd.read_csv()      df.to_csv()
json           pd.read_json()     df.to_json()
excel          pd.read_excel()    df.to_excel()
hdf            pd.read_hdf()      df.to_hdf()
sql            pd.read_sql()      df.to_sql()
…              …                  …
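As one example from the table, a JSON round trip might be sketched as follows. The names toy_df, json_text and restored_df are assumptions made for the example, and orient="records" is just one of several layouts that to_json supports.

```python
import io

import pandas as pd

# Hypothetical stand-in frame.
toy_df = pd.DataFrame({"make": ["audi", "volvo"], "price": [13950, 22470]})

# With no path given, to_json returns the JSON text as a string.
json_text = toy_df.to_json(orient="records")

# Wrapping the string in StringIO lets read_json treat it like a file.
restored_df = pd.read_json(io.StringIO(json_text), orient="records")

print(json_text)
print(restored_df)
```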
