Python Tips For Data Scientist

Contents

1 Preface
   1.1 About
      1.1.1 About this note
      1.1.2 About the authors
   1.2 Motivation for this note
   1.3 Feedback and suggestions
2 Python Installation
3 Notebooks
   3.1 Nteract
   3.2 Jupyter Notebook Viewer
   3.3 Apache Zeppelin
   3.4 Jupyter Notebook
4 Confidential Information
5 Primer Functions
   5.1 *
   5.2 range
   5.3 random
      5.3.1 random.random
      5.3.2 np.random
   5.4 round
   5.5 TODO..
6 Data Structures
   6.1 List
      6.1.1 Create list
      6.1.2 Unpack list
      6.1.3 Methods of list objects
   6.2 Tuple
   6.3 Dictionary
      6.3.1 Create dict from lists
      6.3.2 dict.get()
      6.3.3 Looping Techniques
      6.3.4 Update Values in Dict
      6.3.5 Update Keys in Dict
   6.4 One line if-else statement
      6.4.1 With filter
      6.4.2 Without filter
8 pd.DataFrame manipulation
   8.1 TODO..
9 rdd.DataFrame manipulation
   9.1 TODO..
10 pd.DataFrame vs rdd.DataFrame
   10.1 Create DataFrame
      10.1.1 From List
      10.1.2 From Dict
   10.2 Load DataFrame
      10.2.1 From DataBase
      10.2.2 From .csv
      10.2.3 From .json
   10.3 First n Rows
   10.4 Column Names
   10.5 Data types
   10.6 Replace Data types
   10.7 Fill Null
   10.8 Replace Values
   10.9 Rename Columns
      10.9.1 Rename all columns
      10.9.2 Rename one or more columns
   10.10 Drop Columns
   10.11 Filter
   10.12 With New Column
   10.13 Join
      10.13.1 Left Join
      10.13.2 Right Join
      10.13.3 Inner Join
      10.13.4 Full Join
   10.14 Concat Columns
   10.15 GroupBy
   10.16 Pivot
   10.17 Unixtime to Date
11 Kaggle Competitions
   11.1 TODO..
12 Package Wrapper
   12.1 Hierarchical Structure
   12.2 Set Up
   12.3 Requirements
   12.4 ReadMe
15 API Book
   15.1 Basics Module
      15.1.1 rnorm
      15.1.2 dnorm
      15.1.3 runif
   15.2 Tests Module
      15.2.1 T-test
16 Main Reference
Bibliography
Python Tips for Data Scientist
Welcome to my Python Tips for Data Scientist notes! In these notes, you will learn some useful tips for a Data Scientist's daily work. The PDF version can be downloaded from HERE.
CHAPTER ONE: PREFACE
Chinese proverb: The palest ink is better than the best memory.
1.1 About
This document is a summary of our valuable experience in using Python for a Data Scientist's daily work. The PDF version can be downloaded from HERE.
You may download and distribute it. Please be aware, however, that the note may contain typos as well as inaccurate or incorrect descriptions.
In this repository, we use detailed, Data Scientist-oriented demo code and examples to share some useful Python tips for Data Scientist work. If you find that your work wasn't cited in this note, please feel free to let me know.
Although we are by no means Python programming or data science experts, we decided that it would be useful for us to share what we learned about Python in the form of easy notes with detailed examples. We hope these notes will be a valuable tool for your studies.
The notes assume that the reader has preliminary knowledge of Python programming, LaTeX and Linux. This document is generated automatically by using Sphinx. More details can be found at [Georg2018].
• Wenqiang Feng
– Data Scientist and PhD in Mathematics
1.2 Motivation for this note
Whether you like it or not, Python has become one of the most popular programming languages. I have been using Python for almost 4 years. Frankly speaking, I wasn't impressed or attracted by Python at first. After I started working in industry, I had to use Python. Gradually I recognized the elegance of Python and adopted it as one of my main programming languages. But I found that:
• Most of the Python books or tutorials that emphasize programming will overwhelm beginners.
• Most of the Python books or tutorials for Data Science or Data Analysis don't cover some essential skills from the engineering side.
So I want to keep some of my valuable tips, which are heavily used in my daily work.
1.3 Feedback and suggestions
Your comments and suggestions are highly appreciated. I am more than happy to receive corrections, suggestions or feedback through email (Wenqiang Feng: von198@gmail.com, Xu Gao: duncangao@gmail.com) for improvements.
CHAPTER TWO: PYTHON INSTALLATION
Note: This Chapter Python Installation is for beginners. If you have some Python programming experience, you may skip this chapter.
No matter what your operating system is, I strongly recommend you install Anaconda, which contains Python, Jupyter, Spyder, NumPy, SciPy, Numba, pandas, Dask, Bokeh, HoloViews, Datashader, matplotlib, scikit-learn, H2O.ai, TensorFlow, conda and more.
Download link: https://www.anaconda.com/distribution/
CHAPTER THREE: NOTEBOOKS
Note: This Chapter Notebooks is for beginners. If you already know Nteract, Zeppelin and Jupyter, you may skip this chapter.
If you are a Data Scientist, it's not enough to just know Jupyter Notebook. You should also take a look at nbviewer, Nteract and Zeppelin notebooks.
3.1 Nteract
Nteract is an amazing .ipynb reader. You can open and run a .ipynb file by just double-clicking it.
Download from: https://nteract.io/
3.2 Jupyter Notebook Viewer
If you are a Mac user, you can also install the Jupyter Notebook Viewer, nbviewer-app, which is much faster than Nteract.
Download from: https://github.com/tuxu/nbviewer-app
3.3 Apache Zeppelin
Zeppelin (Apache Zeppelin) is an open-source, web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, PySpark, SQL, Scala and more.
Download from: https://zeppelin.apache.org/
3.4 Jupyter Notebook
The Jupyter Notebook (formerly IPython Notebook) is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
CHAPTER FOUR: CONFIDENTIAL INFORMATION
Chinese proverb: Be mindful of guarding against harm from others, and refrain from doing harm to others.
If you are a real Data Scientist, you have to share your code with your colleagues or release your code for Code Review or Quality Assurance (QA). You definitely do not want to have your user information in the code. So you can save it in a login.txt file in a safe folder, with the username on the first line and the password on the second:

runawayhorse001
PythonTips
# User Information
import pandas as pd

try:
    login = pd.read_csv(r'login.txt', header=None)
    user = login[0][0]
    pw = login[0][1]
    print('User information is ready!')
except FileNotFoundError:
    print('Login information is not available!!!')
You may also want to get the user information by using os.environ in Python. Note that a missing environment variable raises KeyError, so the lookups need to be nested:

import os

try:
    user = os.environ['LOGNAME']            # Unix / macOS
except KeyError:
    try:
        user = os.environ['USER']           # some Unix shells
    except KeyError:
        try:
            user = os.environ['USERNAME']   # Windows
        except KeyError as err:
            print(err)
            print('The user information is not available!!!')
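A shorter alternative (my addition, not from the original note) is the standard-library getpass module, whose getuser() checks the LOGNAME, USER, LNAME and USERNAME environment variables in order:

import getpass

# falls back to the system password database on Unix if no variable is set
user = getpass.getuser()
print(user)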
CHAPTER FIVE: PRIMER FUNCTIONS
Note: This Chapter Primer Functions is for beginners. If you have some Python programming experience, you may skip this chapter.
The following functions have been heavily used in my daily Data Scientist work.
5.1 *
A single asterisk in a function definition collects a variable number of positional arguments passed from the calling environment; inside the function they behave as a tuple. In a function call, it unpacks an iterable into separate arguments, as in the example below.
:: Python Code:
my_list = [1,2,3]
print(my_list)
print(*my_list)
:: Output:
[1, 2, 3]
1 2 3
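For the declaration side, a minimal sketch (my own illustration, not from the original note):

def my_sum(*nums):
    # nums arrives as a tuple of all positional arguments
    return sum(nums)

print(my_sum(1, 2, 3))      # three separate arguments -> 6
print(my_sum(*[1, 2, 3]))   # unpacking a list into the call -> 6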
5.2 range
:: Python Code:
print(range(5))
print(*range(5))
print(*range(3,8))
:: Output:
range(0, 5)
0 1 2 3 4
3 4 5 6 7
5.3 random
5.3.1 random.random
:: Python Code:
import random
random.random()
# (b - a) * random() + a
random.uniform(3,8)
:: Output:
0.33844051243073625
7.772024014335885
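The draws above change on every run. For reproducible results (my addition, not covered in the original note), seed the generator first:

import random

random.seed(42)          # fix the seed so runs are repeatable
print(random.random())   # 0.6394267984578837 on every run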
5.3.2 np.random
:: Python Code:
import numpy as np

np.random.random_sample()
np.random.random_sample(4)
np.random.random_sample([2,4])
# (b - a) * random_sample() + a
a = 3; b = 8
(b-a)*np.random.random_sample([2,4])+a
:: Output:
0.11919402208670005
array([0.07384755, 0.9005251 , 0.30030561, 0.38221819])
array([[0.76851156, 0.56973309, 0.47074505, 0.7814957 ],
[0.5778028 , 0.94653057, 0.51193493, 0.48693931]])
5.4 round
Sometimes we do not need that many decimal places in the output. In that case, you can use this function to round an array to the given number of decimals.
:: Python Code:
np.round(np.random.random_sample([2,4]),2)
:: Output: (a 2 x 4 array rounded to 2 decimals; the exact values vary from run to run)
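np.round generalizes Python's built-in round, which works on single numbers (a quick illustration of my own):

import numpy as np

print(round(3.14159, 2))                 # 3.14
print(np.round([3.14159, 2.71828], 2))   # [3.14 2.72]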
5.5 TODO..
CHAPTER SIX: DATA STRUCTURES
Note: This Chapter Data Structures is for beginners. If you have some Python programming experience, you may skip this chapter.
6.1 List

6.1.1 Create list

:: Python Code:

my_list = []
type(my_list)

:: Output:

list

I applied the empty list to initialize my silhouette score list when trying to find the optimal number of clusters.
:: Example: (the loop body below is only a sketch; compute your own score, e.g. a silhouette score, inside the loop)

min_cluster = 3
max_cluster = 8
scores = []                      # empty list to collect one score per k
for k in range(min_cluster, max_cluster):
    # e.g. scores.append(silhouette_score(X, KMeans(k).fit_predict(X)))
    scores.append(k)
print(scores)

:: Output:

[3, 4, 5, 6, 7]
6.1.2 Unpack list

:: Example:

num = [1,2,3,4,5,6,7,8,9,10]
print(*num)

:: Output:
1 2 3 4 5 6 7 8 9 10
6.1.3 Methods of list objects

Name                      Description
list.append(x)            Add an item to the end of the list
list.extend(iterable)     Extend the list by appending all items from the iterable
list.insert(i, x)         Insert an item at a given position
list.remove(x)            Remove the first item whose value is x
list.pop([i])             Remove and return the item at the given position
list.clear()              Remove all items from the list
list.index(x[,s[,e]])     Return the zero-based index of the first item whose value is x
list.count(x)             Return the number of times x appears in the list
list.sort(key, reverse)   Sort the items of the list in place
list.reverse()            Reverse the elements of the list in place
list.copy()               Return a shallow copy¹ of the list
¹ Shallow Copy vs Deep Copy (reference: https://stackoverflow.com/posts/184780/revisions):

Shallow copy: the variables A and B refer to different areas of memory; when B is assigned to A, the two variables come to refer to the same area of memory. Later modifications to the contents of either are instantly reflected in the contents of the other, as they share contents.

Deep copy: the variables A and B refer to different areas of memory; when B is assigned to A, the values in the memory area which A points to are copied into the memory area to which B points. Later modifications to the contents of either remain unique to A or B; the contents are not shared.
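A few of these methods in action (a small illustration of my own):

fruits = ['orange', 'apple', 'pear']
fruits.append('banana')        # add to the end
fruits.sort()                  # sort in place
print(fruits)                  # ['apple', 'banana', 'orange', 'pear']
print(fruits.index('pear'))    # 3
print(fruits.pop())            # removes and returns 'pear'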
6.2 Tuple
A tuple is an assortment of data, separated by commas, which makes it similar to the Python list; but a tuple is fundamentally different in that it is immutable. This means that it cannot be changed, modified, or manipulated in place.
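A quick demonstration of that immutability (my addition):

t = (1, 2, 3)
print(t[0])   # reading works: 1
t[0] = 9      # TypeError: 'tuple' object does not support item assignment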
6.3 Dictionary
dict is another data structure that I use heavily in my daily work. I applied dict extensively in my PyAudit package; more details can be found at PyAudit.
6.3.1 Create dict from lists

:: Example: (the dict construction below uses illustrative values that reproduce the printed frame)

import pandas as pd

col_name = ['A', 'B', 'C']
col_value = [[0, 1, 0], [1, 0, 1], [1, 0, 0]]

# create a dict from the two lists
d = dict(zip(col_name, col_value))

df = pd.DataFrame(d)
print(df)

:: Output:

   A  B  C
0  0  1  1
1  1  0  0
2  0  1  0
6.3.2 dict.get()
When get() is called, Python checks whether the specified key exists in the dict. If it does, get() returns the value for that key. If the key does not exist, get() returns the value specified in its second argument (the default). A good application of get() can be found in Update Keys in Dict.
:: Example: (a minimal sketch that reproduces the output)

d = {'name': 'Michael'}
print(d.get('name', 'George'))       # key exists -> 'Michael'
print(d.get('nickname', 'George'))   # key missing -> default 'George'

:: Output:

Michael
George
6.3.3 Looping Techniques

When looping through a dictionary, the key and the corresponding value can be retrieved at the same time with the items() method. A minimal sketch:

:: Example:

d = {'name': 'Michael', 'Age': '30', 'Sex': 'Male'}
for key, value in d.items():
    print(key, value)

:: Output:

name Michael
Age 30
Sex Male

6.3.4 Update Values in Dict

Values are updated in place by assigning to an existing key:

:: Example:

d['Age'] = '31'
print(d['Age'])

:: Output:

31
6.3.5 Update Keys in Dict

:: Example: (renaming keys with a mapping and dict.get; a sketch that reproduces the output)

d = {'name': 'Michael', 'Age': '30', 'Sex': 'Male',
     'Car': ['Tesla S', 'Tesla X'], 'Kid': ['Tom', 'Jim']}
print(d)

mapping = {'Car': 'Cars', 'Kid': 'Kids'}
d = {mapping.get(k, k): v for k, v in d.items()}
print(d)

:: Output:

{'name': 'Michael', 'Age': '30', 'Sex': 'Male', 'Car': ['Tesla S', 'Tesla X'], 'Kid': ['Tom', 'Jim']}
{'name': 'Michael', 'Age': '30', 'Sex': 'Male', 'Cars': ['Tesla S', 'Tesla X'], 'Kids': ['Tom', 'Jim']}
6.4 One line if-else statement

6.4.1 With filter

:: Syntax:

[ RESULT for x in seq if COND ]

:: Python Code: (a sketch that reproduces the output: keep the even numbers)

num = [1,2,3,4,5,6,7,8,9,10]
print([x for x in num if x % 2 == 0])

:: Output:

[2, 4, 6, 8, 10]
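The built-in filter gives the same result (an equivalent form, my addition):

num = [1,2,3,4,5,6,7,8,9,10]
print(list(filter(lambda x: x % 2 == 0, num)))   # [2, 4, 6, 8, 10]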
6.4.2 Without filter

:: Syntax:

[ RESULT1 if COND1 else RESULT2 if COND2 else RESULT3 for x in seq ]

:: Python Code: (a sketch that reproduces the output)

num = [1,2,3,4,5,6,7,8,9,10]
['Low' if x <= 3 else 'Median' if x <= 7 else 'High' for x in num]

:: Output:
['Low',
'Low',
'Low',
'Median',
'Median',
'Median',
'Median',
'High',
'High',
'High']
[VanderPlas2016] [McKinney2013]
CHAPTER SEVEN

This chapter shows how to connect to a database with psycopg2: writing a pd.DataFrame into a table, and reading a table back into a pd.DataFrame.
# User Information
import pandas as pd
import psycopg2

try:
    login = pd.read_csv(r'login.txt', header=None)
    user = login[0][0]
    pw = login[0][1]
    print('User information is ready!')
except FileNotFoundError:
    print('Login information is not available!!!')
# Database information
host = '##.###.###.##'
db_name = 'db_name'
table_name = 'table_name'
# Setup connection
conn = psycopg2.connect(host=host, database=db_name, user=user, password=pw)
cur = conn.cursor()
Note: You can also copy the dataframe from local memory into the database (e.g. Greenplum/PostgreSQL) through the cursor. psycopg2's copy_from expects a file-like object rather than a DataFrame, so stage the frame in a StringIO buffer first (a sketch):

import io

buf = io.StringIO()
df.to_csv(buf, header=False, index=False, sep='\t')
buf.seek(0)
cur.copy_from(buf, table_name)   # load the buffer into the table
conn.commit()
# User information
try:
    login = pd.read_csv(r'login.txt', header=None)
    user = login[0][0]
    pw = login[0][1]
    print('User information is ready!')
except FileNotFoundError:
    print('Login information is not available!!!')

# Database information
host = '##.###.###.##'
db_name = 'db_name'
table_name = 'table_name'
# Setup connection
conn = psycopg2.connect(host=host, database=db_name, user=user, password=pw)
cur = conn.cursor()
# Read table
sql = """
select *
from {table_name}
""".format(table_name=table_name)
dp = pd.read_sql(sql, conn)
CHAPTER EIGHT: PD.DATAFRAME MANIPULATION
Note: This Chapter pd.DataFrame manipulation is for beginners. If you have some Python programming experience, you may skip this chapter.
8.1 TODO..
CHAPTER NINE: RDD.DATAFRAME MANIPULATION
Note: This Chapter rdd.DataFrame manipulation is for beginners. If you have some Python programming experience, you may skip this chapter.
9.1 TODO..
CHAPTER TEN: PD.DATAFRAME VS RDD.DATAFRAME
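Throughout this chapter, dp denotes a pandas DataFrame and ds a PySpark DataFrame. The PySpark snippets assume a running SparkSession and the usual imports (my summary of the chapter's implicit setup, not code from the original):

import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('PythonTips').getOrCreate()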
10.1 Create DataFrame

10.1.1 From List

:: Python Code: (sample data matching the comparison below)

my_list = [['a', 1, 2], ['b', 2, 3], ['c', 3, 4]]
col_name = ['A', 'B', 'C']

# caution for the columns=
pd.DataFrame(my_list, columns=col_name)
#
spark.createDataFrame(my_list, col_name).show()
:: Comparison:

pandas:

   A  B  C
0  a  1  2
1  b  2  3
2  c  3  4

PySpark:

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  a|  1|  2|
|  b|  2|  3|
|  c|  3|  4|
+---+---+---+
:: Comparison: without the columns= keyword, the second positional argument is taken as the index, not the column names:

pd.DataFrame(my_list, columns=col_name):

   A  B  C
0  a  1  2
1  b  2  3
2  c  3  4

pd.DataFrame(my_list, col_name):

   0  1  2
A  a  1  2
B  b  2  3
C  c  3  4
10.1.2 From Dict

:: Python Code:

d = {'A': [0, 1, 0], 'B': [1, 0, 1], 'C': [1, 0, 0]}

pd.DataFrame(d)
# Tedious for PySpark
spark.createDataFrame(np.array(list(d.values())).T.tolist(), list(d.keys())).show()
:: Comparison:

pandas:

   A  B  C
0  0  1  1
1  1  0  0
2  0  1  0

PySpark:

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  0|  1|  1|
|  1|  0|  0|
|  0|  1|  0|
+---+---+---+
10.2 Load DataFrame

10.2.1 From DataBase

Most of the time, you need to share your code with your colleagues or release your code for Code Review or Quality Assurance (QA). You definitely do not want to have your user information in the code. So you can save it in login.txt:

runawayhorse001
PythonTips

and load it as before:

# User Information
try:
    login = pd.read_csv(r'login.txt', header=None)
    user = login[0][0]
    pw = login[0][1]
    print('User information is ready!')
except FileNotFoundError:
    print('Login information is not available!!!')

# Database information
host = '##.###.###.##'
db_name = 'db_name'
table_name = 'table_name'
:: Comparison:

pandas:

conn = psycopg2.connect(host=host, database=db_name, user=user, password=pw)
cur = conn.cursor()
sql = """
select *
from {table_name}
""".format(table_name=table_name)
dp = pd.read_sql(sql, conn)

PySpark (the read call here is a sketch; spark.read.jdbc is the standard route):

# connect to database
url = 'jdbc:postgresql://'+host+':5432/'+db_name+'?user='+user+'&password='+pw
properties = {'driver': 'org.postgresql.Driver'}
ds = spark.read.jdbc(url=url, table=table_name, properties=properties)
Attention: Reading tables from a database with PySpark needs the proper driver for the corresponding database. For example, the demo above needs org.postgresql.Driver, and you need to download it and put it in the jars folder of your Spark installation path. I downloaded postgresql-42.1.1.jar from the official website and put it in the jars folder.
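10.2.2 From .csv

:: Python Code: (the same loading pattern reappears in the Filter section below)

dp = pd.read_csv('Advertising.csv')
#
ds = spark.read.csv(path='Advertising.csv',
                    header=True,
                    inferSchema=True)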
10.2.3 From .json

:: Python Code:

dp = pd.read_json("data/data.json")
ds = spark.read.json('data/data.json')
:: Python Code:

dp[['id','timestamp']].head(4)
#
ds[['id','timestamp']].show(4)

:: Comparison: (output abbreviated; both print the id and timestamp columns of the first 4 rows)
10.3 First n Rows

:: Python Code:

dp.head(4)
#
ds.show(4)

:: Comparison:

pandas:

      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5

PySpark:

+-----+-----+---------+-----+
|   TV|Radio|Newspaper|Sales|
+-----+-----+---------+-----+
|230.1| 37.8|     69.2| 22.1|
| 44.5| 39.3|     45.1| 10.4|
| 17.2| 45.9|     69.3|  9.3|
|151.5| 41.3|     58.5| 18.5|
+-----+-----+---------+-----+
only showing top 4 rows
10.4 Column Names

:: Python Code:

dp.columns
#
ds.columns

:: Comparison:

pandas:

Index(['TV', 'Radio', 'Newspaper', 'Sales'], dtype='object')

PySpark:

['TV', 'Radio', 'Newspaper', 'Sales']
10.5 Data types

:: Python Code:

dp.dtypes
#
ds.dtypes

:: Comparison:

pandas:

TV           float64
Radio        float64
Newspaper    float64
Sales        float64
dtype: object

PySpark:

[('TV', 'double'), ('Radio', 'double'), ('Newspaper', 'double'), ('Sales', 'double')]
10.6 Replace Data types

:: Python Code: (sample data as used again in the Concat Columns section)

my_list = [('a', 2, 3), ('b', 5, 6), ('c', 8, 9)]
col_name = ['col1', 'col2', 'col3']
dp = pd.DataFrame(my_list, columns=col_name)
ds = spark.createDataFrame(dp)
dp.dtypes

:: Output:

col1    object
col2     int64
col3     int64
dtype: object
:: Python Code:

d = {'col2': 'string', 'col3': 'string'}

dp = dp.astype({'col2': 'str', 'col3': 'str'})
ds = ds.select(*list(set(ds.columns) - set(d.keys())),
               *(col(c[0]).astype(c[1]).alias(c[0]) for c in d.items()))

dp.dtypes
#
ds.dtypes

:: Comparison:

pandas:

col1    object
col2    object
col3    object
dtype: object

PySpark:

[('col1', 'string'), ('col2', 'string'), ('col3', 'string')]
10.7 Fill Null

:: Python Code: (sample data matching the comparison below)

my_list = [['male', 1, None], ['female', 2, 3], ['male', 3, 4]]
dp = pd.DataFrame(my_list, columns=['A', 'B', 'C'])
ds = spark.createDataFrame(my_list, ['A', 'B', 'C'])

dp
#
ds.show()

:: Comparison:

pandas:

        A  B    C
0    male  1  NaN
1  female  2  3.0
2    male  3  4.0

PySpark:

+------+---+----+
|     A|  B|   C|
+------+---+----+
|  male|  1|null|
|female|  2|   3|
|  male|  3|   4|
+------+---+----+
:: Python Code:

dp.fillna(-99)
#
ds.fillna(-99).show()

:: Comparison:

pandas:

        A  B     C
0    male  1 -99.0
1  female  2   3.0
2    male  3   4.0

PySpark:

+------+---+----+
|     A|  B|   C|
+------+---+----+
|  male|  1| -99|
|female|  2|   3|
|  male|  3|   4|
+------+---+----+
10.8 Replace Values

:: Python Code: (a sketch that reproduces the comparison: recode male/female as 1/0 in column A)

dp.replace(['male', 'female'], [1, 0])
#
ds.na.replace(['male', 'female'], ['1', '0']).show()

:: Comparison:

pandas:

   A  B    C
0  1  1  NaN
1  0  2  3.0
2  1  3  4.0

PySpark:

+---+---+----+
|  A|  B|   C|
+---+---+----+
|  1|  1|null|
|  0|  2|   3|
|  1|  3|   4|
+---+---+----+
10.9 Rename Columns

10.9.1 Rename all columns

:: Python Code:

dp.columns = ['a','b','c','d']
dp.head(4)
#
ds.toDF('a','b','c','d').show(4)

:: Comparison:

pandas:

       a     b     c     d
0  230.1  37.8  69.2  22.1
1   44.5  39.3  45.1  10.4
2   17.2  45.9  69.3   9.3
3  151.5  41.3  58.5  18.5

PySpark:

+-----+----+----+----+
|    a|   b|   c|   d|
+-----+----+----+----+
|230.1|37.8|69.2|22.1|
| 44.5|39.3|45.1|10.4|
| 17.2|45.9|69.3| 9.3|
|151.5|41.3|58.5|18.5|
+-----+----+----+----+
only showing top 4 rows
10.9.2 Rename one or more columns

mapping = {'Newspaper': 'C', 'Sales': 'D'}

:: Python Code:

dp.rename(columns=mapping).head(4)
#
new_names = [mapping.get(col, col) for col in ds.columns]
ds.toDF(*new_names).show(4)

:: Comparison:

pandas:

      TV  Radio     C     D
0  230.1   37.8  69.2  22.1
1   44.5   39.3  45.1  10.4
2   17.2   45.9  69.3   9.3
3  151.5   41.3  58.5  18.5

PySpark:

+-----+-----+----+----+
|   TV|Radio|   C|   D|
+-----+-----+----+----+
|230.1| 37.8|69.2|22.1|
| 44.5| 39.3|45.1|10.4|
| 17.2| 45.9|69.3| 9.3|
|151.5| 41.3|58.5|18.5|
+-----+-----+----+----+
only showing top 4 rows
Note: You can also use withColumnRenamed to rename one column in PySpark.

:: Python Code:

ds.withColumnRenamed('Newspaper', 'Paper').show(4)
:: Output:
+-----+-----+-----+-----+
| TV|Radio|Paper|Sales|
+-----+-----+-----+-----+
|230.1| 37.8| 69.2| 22.1|
| 44.5| 39.3| 45.1| 10.4|
| 17.2| 45.9| 69.3| 9.3|
|151.5| 41.3| 58.5| 18.5|
+-----+-----+-----+-----+
only showing top 4 rows
10.10 Drop Columns

drop_name = ['Newspaper', 'Sales']

:: Python Code:

dp.drop(drop_name, axis=1).head(4)
#
ds.drop(*drop_name).show(4)

:: Comparison:

pandas:

      TV  Radio
0  230.1   37.8
1   44.5   39.3
2   17.2   45.9
3  151.5   41.3

PySpark:

+-----+-----+
|   TV|Radio|
+-----+-----+
|230.1| 37.8|
| 44.5| 39.3|
| 17.2| 45.9|
|151.5| 41.3|
+-----+-----+
only showing top 4 rows
10.11 Filter

dp = pd.read_csv('Advertising.csv')
#
ds = spark.read.csv(path='Advertising.csv',
                    header=True,
                    inferSchema=True)

:: Python Code:

dp[dp.Newspaper < 20].head(4)
#
ds[ds.Newspaper < 20].show(4)

:: Comparison: (output abbreviated; both print the first 4 rows with Newspaper < 20 over the columns TV, Radio, Newspaper, Sales)
:: Python Code:

dp[(dp.Newspaper < 20) & (dp.TV > 100)].head(4)
#
ds[(ds.Newspaper < 20) & (ds.TV > 100)].show(4)

:: Comparison: (output abbreviated; both print the first 4 rows with Newspaper < 20 and TV > 100)
10.12 With New Column

:: Python Code: (a sketch reproducing the tv_norm column: each TV value divided by the column total)

dp['tv_norm'] = dp.TV / sum(dp.TV)
dp.head(4)
#
ds.withColumn('tv_norm', ds.TV / ds.groupBy().agg(F.sum("TV")).collect()[0][0]).show(4)

:: Comparison: (output abbreviated; both show the original columns plus tv_norm)
:: Python Code: (the pandas half is a sketch mirroring the PySpark logic)

dp['cond'] = dp.apply(lambda c:
                      1 if ((c.TV > 100) & (c.Radio < 40)) else
                      2 if c.Sales > 10 else 3, axis=1)
dp.head(4)
#
ds.withColumn('cond', F.when((ds.TV > 100) & (ds.Radio < 40), 1)\
                       .when(ds.Sales > 10, 2)\
                       .otherwise(3)).show(4)

:: Comparison: (output abbreviated; both show the original columns plus cond)
:: Python Code:

dp['log_tv'] = np.log(dp.TV)
dp.head(4)
#
ds.withColumn('log_tv', F.log(ds.TV)).show(4)

:: Comparison: (output abbreviated; both show the original columns plus log_tv)
:: Python Code: (a sketch reproducing the tv+10 column)

dp['tv+10'] = dp.TV.apply(lambda x: x + 10)
dp.head(4)
#
ds.withColumn('tv+10', ds.TV + 10).show(4)

:: Comparison: (output abbreviated; both show the original columns plus tv+10)
10.13 Join

:: Python Code: (the two sample frames as printed below)

leftp = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']},
                     index=[0, 1, 2, 3])
rightp = pd.DataFrame({'A': ['A0', 'A1', 'A6', 'A7'],
                       'F': ['B4', 'B5', 'B6', 'B7'],
                       'G': ['C4', 'C5', 'C6', 'C7'],
                       'H': ['D4', 'D5', 'D6', 'D7']},
                      index=[4, 5, 6, 7])

lefts = spark.createDataFrame(leftp)
rights = spark.createDataFrame(rightp)

:: Comparison:

leftp:                       rightp:

    A   B   C   D                A   F   G   H
0  A0  B0  C0  D0            4  A0  B4  C4  D4
1  A1  B1  C1  D1            5  A1  B5  C5  D5
2  A2  B2  C2  D2            6  A6  B6  C6  D6
3  A3  B3  C3  D3            7  A7  B7  C7  D7
10.13.1 Left Join

:: Python Code:

leftp.merge(rightp, on='A', how='left')
#
lefts.join(rights, on='A', how='left')\
     .orderBy('A', ascending=True).show()

:: Comparison:

pandas:

    A   B   C   D    F    G    H
0  A0  B0  C0  D0   B4   C4   D4
1  A1  B1  C1  D1   B5   C5   D5
2  A2  B2  C2  D2  NaN  NaN  NaN
3  A3  B3  C3  D3  NaN  NaN  NaN

PySpark:

+---+---+---+---+----+----+----+
|  A|  B|  C|  D|   F|   G|   H|
+---+---+---+---+----+----+----+
| A0| B0| C0| D0|  B4|  C4|  D4|
| A1| B1| C1| D1|  B5|  C5|  D5|
| A2| B2| C2| D2|null|null|null|
| A3| B3| C3| D3|null|null|null|
+---+---+---+---+----+----+----+
10.13.2 Right Join

:: Python Code:

leftp.merge(rightp, on='A', how='right')
#
lefts.join(rights, on='A', how='right')\
     .orderBy('A', ascending=True).show()

:: Comparison:

pandas:

    A    B    C    D   F   G   H
0  A0   B0   C0   D0  B4  C4  D4
1  A1   B1   C1   D1  B5  C5  D5
2  A6  NaN  NaN  NaN  B6  C6  D6
3  A7  NaN  NaN  NaN  B7  C7  D7

PySpark:

+---+----+----+----+---+---+---+
|  A|   B|   C|   D|  F|  G|  H|
+---+----+----+----+---+---+---+
| A0|  B0|  C0|  D0| B4| C4| D4|
| A1|  B1|  C1|  D1| B5| C5| D5|
| A6|null|null|null| B6| C6| D6|
| A7|null|null|null| B7| C7| D7|
+---+----+----+----+---+---+---+
10.13.3 Inner Join

:: Python Code:

leftp.merge(rightp, on='A', how='inner')
#
lefts.join(rights, on='A', how='inner')\
     .orderBy('A', ascending=True).show()

:: Comparison:

pandas:

    A   B   C   D   F   G   H
0  A0  B0  C0  D0  B4  C4  D4
1  A1  B1  C1  D1  B5  C5  D5

PySpark:

+---+---+---+---+---+---+---+
|  A|  B|  C|  D|  F|  G|  H|
+---+---+---+---+---+---+---+
| A0| B0| C0| D0| B4| C4| D4|
| A1| B1| C1| D1| B5| C5| D5|
+---+---+---+---+---+---+---+
10.13.4 Full Join

:: Python Code:

# pandas calls a full join 'outer'
leftp.merge(rightp, on='A', how='outer')
#
lefts.join(rights, on='A', how='full')\
     .orderBy('A', ascending=True).show()

:: Comparison:

pandas:

    A    B    C    D    F    G    H
0  A0   B0   C0   D0   B4   C4   D4
1  A1   B1   C1   D1   B5   C5   D5
2  A2   B2   C2   D2  NaN  NaN  NaN
3  A3   B3   C3   D3  NaN  NaN  NaN
4  A6  NaN  NaN  NaN   B6   C6   D6
5  A7  NaN  NaN  NaN   B7   C7   D7

PySpark:

+---+----+----+----+----+----+----+
|  A|   B|   C|   D|   F|   G|   H|
+---+----+----+----+----+----+----+
| A0|  B0|  C0|  D0|  B4|  C4|  D4|
| A1|  B1|  C1|  D1|  B5|  C5|  D5|
| A2|  B2|  C2|  D2|null|null|null|
| A3|  B3|  C3|  D3|null|null|null|
| A6|null|null|null|  B6|  C6|  D6|
| A7|null|null|null|  B7|  C7|  D7|
+---+----+----+----+----+----+----+
10.14 Concat Columns

:: Python Code: (a sketch reproducing the concat column: col1 concatenated with col2)

my_list = [('a', 2, 3), ('b', 5, 6), ('c', 8, 9),
           ('a', 2, 3), ('b', 5, 6), ('c', 8, 9)]
col_name = ['col1', 'col2', 'col3']
dp = pd.DataFrame(my_list, columns=col_name)
ds = spark.createDataFrame(my_list, col_name)

dp['concat'] = dp.apply(lambda x: '%s%s' % (x['col1'], x['col2']), axis=1)
dp
#
ds.withColumn('concat', F.concat('col1', 'col2')).show()

:: Comparison:

pandas:

  col1  col2  col3 concat
0    a     2     3     a2
1    b     5     6     b5
2    c     8     9     c8
3    a     2     3     a2
4    b     5     6     b5
5    c     8     9     c8

PySpark:

+----+----+----+------+
|col1|col2|col3|concat|
+----+----+----+------+
|   a|   2|   3|    a2|
|   b|   5|   6|    b5|
|   c|   8|   9|    c8|
|   a|   2|   3|    a2|
|   b|   5|   6|    b5|
|   c|   8|   9|    c8|
+----+----+----+------+
10.15 GroupBy

:: Python Code:

dp.groupby(['col1']).agg({'col2': 'min', 'col3': 'mean'})
#
ds.groupBy(['col1']).agg({'col2': 'min', 'col3': 'avg'}).show()

:: Comparison:

pandas:

      col2  col3
col1
a        2     3
b        5     6
c        8     9

PySpark:

+----+---------+---------+
|col1|min(col2)|avg(col3)|
+----+---------+---------+
|   c|        8|      9.0|
|   b|        5|      6.0|
|   a|        2|      3.0|
+----+---------+---------+
10.16 Pivot

:: Python Code: (the pandas half is a sketch using pivot_table)

pd.pivot_table(dp, values='col3', index='col1', columns='col2', aggfunc=np.sum)
#
ds.groupBy(['col1']).pivot('col2').sum('col3').show()

:: Comparison:

pandas:

col2    2     5     8
col1
a     6.0   NaN   NaN
b     NaN  12.0   NaN
c     NaN   NaN  18.0

PySpark:

+----+----+----+----+
|col1|   2|   5|   8|
+----+----+----+----+
|   c|null|null|  18|
|   b|null|  12|null|
|   a|   6|null|null|
+----+----+----+----+
10.17 Unixtime to Date

:: Python Code: (a sketch; the sample frame has a column A and a unix-timestamp column ts)

dp['datetime'] = pd.to_datetime(dp.ts, unit='s')
dp
#
spark.conf.set("spark.sql.session.timeZone", "UTC")
from pyspark.sql.types import DateType
ds.withColumn('date', F.from_unixtime('ts')).show()  # .cast(DateType())

:: Comparison: (output abbreviated; both show A, ts and the converted date column)
CHAPTER ELEVEN: KAGGLE COMPETITIONS
Chinese proverb: Practice makes perfect.
11.1 TODO..
CHAPTER TWELVE: PACKAGE WRAPPER
It’s super easy to wrap your own package in Python. I packed some functions which I frequently
used in my daily work. You can download and install it from My ststspy library. The hierarchical
structure and the directory structure of this package are as follows.
12.1 Hierarchical Structure

.
├── README.md
├── __init__.py
├── requirements.txt
├── setup.py
├── statspy
│   ├── __init__.py
│   ├── basics.py
│   └── tests.py
└── test
    ├── nb
    │   └── t.test.ipynb
    └── test1.py

3 directories, 9 files
From the above hierarchical structure, you will find that you need an __init__.py file in each package directory; it is what marks a directory as an importable Python package.
12.2 Set Up

The key file is setup.py. Its opening blocks read the long description and the requirements from disk (the import and README blocks below are a sketch of the standard setuptools pattern; find_packages and long_description are used further down):

from setuptools import setup, find_packages

try:
    with open("README.md") as f:
        long_description = f.read()
except IOError:
    long_description = ""

try:
    with open("requirements.txt") as f:
        requirements = [x.strip() for x in f.read().splitlines() if x.strip()]
except IOError:
    requirements = []
setup(name='statspy',
install_requires=requirements,
version='1.0',
description='Statistics python library',
author='Wenqiang Feng',
author_email='von198@gmail.com',
license="MIT",
url='git@github.com:runawayhorse001/statspy.git',
packages=find_packages(),
long_description=long_description,
long_description_content_type="text/markdown",
classifiers=[
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 2",
"Programming Language :: Python :: 3",
],
include_package_data=True
)
12.3 Requirements
pandas
numpy
scipy
patsy
matplotlib
12.4 ReadMe
# StatsPy
- clone
```{bash}
git clone git@github.com:runawayhorse001/statspy.git
```
- install
```{bash}
cd statspy
pip install -r requirements.txt
python setup.py install
```
- uninstall
```{bash}
pip uninstall statspy
```
- test
```{bash}
cd statspy/test
python test1.py
```
CHAPTER THIRTEEN

In this chapter, you'll learn how to upload your own package to PyPI.

If you do not have a PyPI account, you need to register an account at https://pypi.org/account/register.

After building the package, the distribution files live under dist, for example:

.
├── PyAudit-1.0-py3-none-any.whl
├── PyAudit-1.0-py3.6.egg
└── PyAudit-1.0.tar.gz
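The upload itself can be done with twine (one common choice, my suggestion; the note itself does not prescribe a tool):

# run from the project root after building the artifacts above
# twine upload dist/*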
During the upload process, you need to provide your PyPI account username and password:
CHAPTER FOURTEEN

In this chapter, you'll learn how to deploy your model with Flask. The main idea and code (I made some essential modifications to make it work for Python 3) are from the Git repo: https://github.com/llSourcell/how_to_deploy_a_keras_model_to_production. So the copyright belongs to the original author.

You can use the following code to train and save your CNN model:
65
Python Tips for Data Scientist
# imports and constants (restored as a sketch of the standard Keras MNIST setup)
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D

batch_size, num_classes, epochs = 128, 10, 12

# the data, shuffled and split between train and test sets
# if only all datasets were this easy to import and format
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# more reshaping: to 4D tensors, then scale pixel values to [0, 1]
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1).astype('float32')
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1).astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# one-hot encode the labels for categorical_crossentropy
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
# build the CNN (the convolutional head is a sketch; the tail matches the original)
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=(28, 28, 1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
# fully connected to get all relevant data
model.add(Dense(128, activation='relu'))
# one more dropout for convergence's sake :)
model.add(Dropout(0.5))
# output a softmax to squash the matrix into output probabilities
model.add(Dense(num_classes, activation='softmax'))

# Adaptive learning rate (adaDelta) is a popular form of gradient
# descent rivaled only by adam and adagrad
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
# train
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))

# how well did it do?
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# save the trained model so the Flask app can load it later (a sketch)
model.save('model.h5')
# Generating HTML from within Python is not fun, and actually pretty
# cumbersome because you have to do the HTML escaping on your own;
# Flask therefore configures the Jinja2 template engine for you
# (the app setup below is restored as a sketch)
from flask import Flask, render_template, request
app = Flask(__name__)

@app.route('/')
def index():
    # initModel()
    # render out pre-built HTML file right on the index page
    return render_template("index.html")
@app.route('/predict/', methods=['GET', 'POST'])
def predict():
    # whenever the predict method is called, we're going
    # to input the user drawn character as an image into the model,
    # perform inference, and return the classification

    # get the raw data format of the image
    imgData = request.get_data()
    # print(imgData)
    # encode it into a suitable format
    convertImage(imgData)
    print("debug")
    # read the image into memory
    x = cv2.imread('output/output.png', 0)
    # compute a bit-wise inversion so black becomes white and vice versa
    x = np.invert(x)
    # make it the right size
    x = cv2.resize(x, (28, 28))
    # imshow(x)
    # convert to a 4D tensor to feed into our model
    x = x.reshape(1, 28, 28, 1)
    print("debug2")
    # in our computation graph
    with graph.as_default():
        # perform the prediction
        out = model.predict(x)
        # print(out)
        print(np.argmax(out, axis=1))
        print("debug3")
        # convert the response to a string
        response = np.array_str(np.argmax(out, axis=1))
        return response
Then start the server:

python app.py
CHAPTER FIFTEEN: API BOOK
If you developed an amazing library or tool, you need to teach the users how to use it. An API book is then necessary, and a good API book will save the users a lot of time. Sphinx provides an awesome automatic API book generator. The following is the API demo book for statspy, my statistics Python library:
15.1 Basics Module

15.1.1 rnorm

15.1.2 dnorm
15.1.3 runif

15.2 Tests Module

15.2.1 T-test
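The generated API pages are not reproduced here; as a rough illustration, the functions mirror their R counterparts (the exact signatures below are my assumption from the names, not taken from the generated book):

import statspy.basics as sb
import statspy.tests as st

x = sb.rnorm(10, mean=0, sd=1)   # 10 draws from N(0, 1), like R's rnorm
res = st.t_test(x, mu=0)         # one-sample t-test against mean 0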
CHAPTER SIXTEEN: MAIN REFERENCE
BIBLIOGRAPHY

[VanderPlas2016] Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly, 2016.
[McKinney2013] Wes McKinney. Python for Data Analysis. O'Reilly, 2013.
[Georg2018] Georg Brandl. Sphinx Documentation, Release 1.7.10+, 2018.