Python Pandas Interview Questions
o Memory Efficient
o Data Alignment
o Reshaping
o Merge and join
o Time Series
o Lists
o Dict of ndarrays
import pandas as pd
# a list of strings
a = ['Python', 'Pandas']
# Calling DataFrame constructor on list
info = pd.DataFrame(a)
print(info)
Output:
0
0 Python
1 Pandas
import pandas as pd
info = {'ID': [101, 102, 103], 'Department': ['B.Sc', 'B.Tech', 'M.Tech']}
info = pd.DataFrame(info)
print(info)
Output:
ID Department
0 101 B.Sc
1 102 B.Tech
2 103 M.Tech
o It is useful for a string variable that consists of only a few different values. If we
want to save some memory, we can convert the string variable to a categorical
variable.
o It is useful when the lexical order of a variable is not the same as its logical
order ('one', 'two', 'three'). By converting the variable to a categorical and specifying
an order on the categories, sorting and min/max use the logical order instead of the
lexical order.
o It is useful as a signal to other Python libraries that this column should be
treated as a categorical variable.
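The first two points above can be illustrated with a minimal sketch. The sample values and the variable names (words, as_category, ordered) are made up for illustration, and the exact memory figures depend on the pandas version and string storage in use.

```python
import pandas as pd

# A string column with only a few distinct values, repeated many times.
words = pd.Series(["one", "three", "two", "one", "two"] * 1000)
as_category = words.astype("category")

# The categorical version stores small integer codes plus one copy of each label.
print(words.memory_usage(deep=True), as_category.memory_usage(deep=True))

# Declare the logical order explicitly so sorting and min/max follow it.
ordered = pd.Series(
    pd.Categorical(["three", "one", "two"],
                   categories=["one", "two", "three"],
                   ordered=True)
)
print(ordered.sort_values().tolist())   # logical order
print(sorted(["three", "one", "two"]))  # lexical order, for comparison
print(ordered.min(), ordered.max())
```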
We can also create a Series from a dict. If a dictionary is passed as input and the
index is not specified, the dictionary keys are used to construct the index. Recent
pandas versions (0.23 and later, on Python 3.6+) keep the keys in insertion order;
older versions sorted them.
If an index is passed, the values that correspond to each label in the index are
extracted from the dictionary.
import pandas as pd
import numpy as np
info = {'x': 0., 'y': 1., 'z': 2.}
a = pd.Series(info)
print(a)
Output:
x 0.0
y 1.0
z 2.0
dtype: float64
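For the case where an index is passed, here is a small sketch that reuses the same dictionary. The label 'w' is a made-up key that is not in the dictionary, included to show what happens to a label with no matching value.

```python
import pandas as pd

info = {'x': 0., 'y': 1., 'z': 2.}

# The index decides both the order and which keys are looked up.
a = pd.Series(info, index=['y', 'z', 'w', 'x'])
print(a)

# 'w' is not a key of the dictionary, so its value is missing (NaN).
print(a.isna())
```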
pandas.Series.copy
Series.copy(deep=True)
With deep=True (the default), copy() creates a new object with its own copy of the
data and the index. With deep=False, neither the index nor the data is copied; the
new object only references the original's data and index.
Note: With deep=True the data is copied, but Python objects stored in the Series are not
copied recursively; only the references to those objects are copied.
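One way to see the difference is to check whether the copies share memory with the original. This is a minimal sketch; the variable names are illustrative, and np.shares_memory is used here only as a convenient check, not as something the copy() API requires.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
deep = s.copy()               # deep=True is the default
shallow = s.copy(deep=False)

# A deep copy owns its data; a shallow copy still points at the original buffer.
print(np.shares_memory(s.to_numpy(), deep.to_numpy()))
print(np.shares_memory(s.to_numpy(), shallow.to_numpy()))

# deep=True does not recurse into Python objects stored in the Series:
# both Series hold references to the same list objects.
nested = pd.Series([[1, 2], [3, 4]])
nested_copy = nested.copy(deep=True)
print(nested_copy.iloc[0] is nested.iloc[0])
```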
Output:
Empty DataFrame
Columns: []
Index: []
# importing the pandas library
import pandas as pd
info = {'one': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
        'two': pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}

info = pd.DataFrame(info)

# Add a new column to an existing DataFrame object

print("Add new column by passing series")
info['three'] = pd.Series([20, 40, 60], index=['a', 'b', 'c'])
print(info)
print("Add new column using existing DataFrame columns")
info['four'] = info['one'] + info['three']
print(info)
Output:
Pandas lets you pass labels to the index argument when you create a DataFrame, which
makes sure that the DataFrame has the index you want. If you don't specify an index,
the DataFrame gets, by default, an integer index that starts at 0 and ends at the
number of rows minus 1.
If you want to remove the index from the DataFrame, you can do the following:
Remove duplicate index values by resetting the index and dropping the duplicate values
from the resulting index column.
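The index handling described above might look like the following sketch. The column name 'value' and the labels are invented for illustration; the column created by reset_index() is called 'index' here only because the original index has no name.

```python
import pandas as pd

# Pass explicit row labels through the index argument (note the repeated 'a').
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "a"])

# Replace the labels with the default 0..n-1 integer index, discarding the old one.
plain = df.reset_index(drop=True)

# Move the index into a column, drop rows whose label was already seen,
# then restore that column as the index.
deduped = (
    df.reset_index()
      .drop_duplicates(subset="index", keep="first")
      .set_index("index")
)
print(plain)
print(deduped)
```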
You can use the drop() method to delete a column from the DataFrame.
The axis argument passed to drop() is 0 to drop rows and 1 to drop columns.
You can pass inplace=True to delete the column without reassigning the DataFrame.
You can also delete duplicate values from a column by using the drop_duplicates()
method.
You can also use drop() to remove rows by passing the index labels of the rows you
want to remove from the DataFrame.
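A short sketch of these calls on a small, made-up DataFrame (the column names A, B, C and the values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2], "B": [4, 5, 5], "C": [7, 8, 8]})

# axis=1 drops a column; the original DataFrame is left unchanged.
without_c = df.drop("C", axis=1)

# axis=0 drops a row, identified by its index label.
without_row0 = df.drop(0, axis=0)

# inplace=True modifies the object itself and returns None.
modified = df.copy()
modified.drop("B", axis=1, inplace=True)

# drop_duplicates() keeps the first occurrence of each repeated value or row.
unique_a = df["A"].drop_duplicates()
unique_rows = df.drop_duplicates()
print(without_c, without_row0, modified, unique_a, unique_rows, sep="\n\n")
```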
17) How to Rename the Index or Columns of a Pandas
DataFrame?
You can use the rename() method to assign new labels to the columns or to the index
of a DataFrame.
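As a minimal sketch, rename() accepts a mapping from old labels to new ones for columns, for the index, or for both. The labels used here are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# Map old labels to new ones; labels not mentioned are left as they are.
renamed = df.rename(columns={"A": "alpha"}, index={"x": "row1"})
print(renamed)
```

rename() returns a new DataFrame by default, so df itself keeps its original labels.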
import pandas as pd
p1 = pd.Series([2, 4, 6, 8, 10])
p2 = pd.Series([8, 10, 12, 14, 16])
p1[~p1.isin(p2)]
Output:
0 2
1 4
2 6
dtype: int64
20) How to get the items not common to both series A and series
B?
The example below gets all the items of p1 and p2 that are not common to both:
import pandas as pd
import numpy as np
p1 = pd.Series([2, 4, 6, 8, 10])
p2 = pd.Series([8, 10, 12, 14, 16])
p_u = pd.Series(np.union1d(p1, p2))      # union
p_i = pd.Series(np.intersect1d(p1, p2))  # intersection
# Keep the items of the union that are not in the intersection
p_u[~p_u.isin(p_i)]
Output:
0 2
1 4
2 6
5 12
6 14
7 16
dtype: int64
21) How to get the minimum, 25th percentile, median, 75th, and
max of a numeric series?
We can compute the minimum, 25th percentile, median, 75th percentile, and maximum of p
as in the example below:
import pandas as pd
import numpy as np
# Use a seeded random state so the series is reproducible
state = np.random.RandomState(120)
p = pd.Series(state.normal(14, 6, 22))
np.percentile(p, q=[0, 25, 50, 75, 100])
Output:
import pandas as pd
import numpy as np
# Build a series of 17 letters drawn at random from 'pqrstu'
p = pd.Series(np.take(list('pqrstu'), np.random.randint(6, size=17)))
p.value_counts()
Output:
s 4
r 4
q 3
p 3
u 3
import pandas as pd
import numpy as np
# Input: 35 random integers between 1 and 6
p = pd.Series(np.random.randint(1, 7, 35))
# Reshape the values into 7 rows and 5 columns
info = pd.DataFrame(p.values.reshape(7, 5))
print(info)
Output:
0 1 2 3 4
0 3 2 5 5 1
1 3 2 5 5 5
2 1 3 1 2 6
3 1 1 1 2 2
4 3 5 3 3 3
5 2 5 3 6 4
6 3 6 6 6 5
name: An object; its default value is None. If a name is passed, it is used in place of
the Series name (if the Series has one) as the label of the resulting column.
import pandas as pd
s = pd.Series(["a", "b", "c"],
              name="vals")
s.to_frame()
Output:
vals
0 a
1 b
2 c
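To show the effect of the name parameter, here is a small sketch that passes a name explicitly. The label "letters" is a made-up value chosen for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], name="vals")

# Without name, the column takes the Series name.
default_frame = s.to_frame()

# With name, the passed value replaces the Series name as the column label.
named_frame = s.to_frame(name="letters")
print(default_frame.columns.tolist(), named_frame.columns.tolist())
```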
DataFrame.to_numpy(dtype=None, copy=False)
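A minimal sketch of how this signature might be used. The DataFrame contents are invented for illustration; with mixed integer and float columns, the resulting array is assumed to use a common dtype that can hold both.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

# Default call: pandas picks a common dtype for all columns.
arr = df.to_numpy()
print(arr.dtype, arr.shape)

# dtype forces a specific type; copy=True guarantees the result
# does not share memory with the DataFrame.
arr32 = df.to_numpy(dtype="float32", copy=True)
print(arr32.dtype)
```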
o By label
o By Actual value
By label
The DataFrame can be sorted by label with the sort_index() method. The axis to sort
along and the sort order are passed as arguments. By default, sorting is done on the
row labels in ascending order.
By Actual Value
The DataFrame can also be sorted by its values. Like sort_index() for labels,
sort_values() is the method for sorting by values.
It lets us specify the name of the DataFrame column whose values are used for
sorting, which is done by passing the 'by' argument.
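Both kinds of sorting can be sketched on a small, made-up DataFrame (the column names and values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"name": ["c", "a", "b"], "score": [2, 3, 1]},
                  index=[2, 0, 1])

# By label: row labels, ascending by default.
by_label = df.sort_index()
by_label_desc = df.sort_index(ascending=False)

# By label along the other axis: sort the column labels instead.
by_column_label = df.sort_index(axis=1)

# By actual value: sort the rows using the values of the 'score' column.
by_value = df.sort_values(by="score")
print(by_label, by_label_desc, by_column_label, by_value, sep="\n\n")
```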
Time series forecasting is a machine learning task that fits a model to time series
data in order to predict future values.
from datetime import datetime

# Define dates as strings
dmy_str1 = 'Saturday, July 14, 2018'
dmy_str2 = '14/7/17'
dmy_str3 = '14-07-2017'

# Convert the strings to datetime objects
dmy_dt1 = datetime.strptime(dmy_str1, '%A, %B %d, %Y')
dmy_dt2 = datetime.strptime(dmy_str2, '%d/%m/%y')
dmy_dt3 = datetime.strptime(dmy_str3, '%d-%m-%Y')

# Print the converted dates
print(dmy_dt1)
print(dmy_dt2)
print(dmy_dt3)
Output:
2018-07-14 00:00:00
2017-07-14 00:00:00
2017-07-14 00:00:00
33) What is Data Aggregation?
Data aggregation applies an aggregation function to one or more columns. Commonly
used functions include the following:
o sum: It returns the sum of the values for the requested axis.
o min: It returns the minimum of the values for the requested axis.
o max: It returns the maximum of the values for the requested axis.
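The aggregations listed above might be applied as in the following sketch. The DataFrame is made up for illustration, and agg() is shown as one way to apply several functions at once.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Aggregate down the rows (axis=0, the default): one result per column.
column_sums = df.sum()

# Aggregate across the columns (axis=1): one result per row.
row_minimums = df.min(axis=1)

# Apply several aggregations at once, or a different one per column.
summary = df.agg(["sum", "min", "max"])
per_column = df.agg({"A": "sum", "B": "max"})
print(column_sums, row_minimums, summary, per_column, sep="\n\n")
```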
We can select any row or column of the DataFrame by passing its label. A single row
or column selected from the DataFrame is one-dimensional and is returned as a
Series.
o Filter Data
We can filter the data by indexing the DataFrame with a boolean expression.
o Null values
A null value occurs when no data is provided for an item. Columns may contain
missing values, which are usually represented as NaN.
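Selection, filtering, and null values can be sketched together on one small DataFrame. The column names, labels, and values are invented for illustration, and the threshold of 90 is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [101, 102, 103],
                   "score": [88.0, np.nan, 95.0]},
                  index=["a", "b", "c"])

# Selecting one column or one row returns a one-dimensional Series.
score_column = df["score"]
row_a = df.loc["a"]

# Filter rows with a boolean expression; NaN compares as False.
high_scores = df[df["score"] > 90]

# Detect missing values, then fill or drop them.
is_missing = df["score"].isna()
filled = df["score"].fillna(0)
complete_rows = df.dropna()
print(high_scores, is_missing, filled, complete_rows, sep="\n\n")
```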