Python Interview Prep Doc
1) LIBRARIES
2) Matplotlib vs Seaborn (Seaborn is a data visualization library built on top of Matplotlib)
1) Ease of syntax
- Matplotlib - requires more code to create visualizations.
- Seaborn - simplifies the process of creating appealing plots.
Matplotlib (import matplotlib.pyplot as plt): requires explicit commands for setting the
figure size, scatter points, title, labels, grid, and lines. More code and customization
steps are involved.
Seaborn: combines multiple steps into a single function call and automatically handles
color coding for the categories with the hue parameter, leading to more concise and
readable code.
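The contrast above can be sketched with one small dataset. This is only an illustration: the DataFrame and its column names (x, y, category) are made up, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 5, 4, 6, 7],
    "category": ["a", "a", "a", "b", "b", "b"],
})

# Matplotlib: one scatter call per category, then title, labels, grid, legend.
fig_mpl, ax_mpl = plt.subplots(figsize=(6, 4))
for name, group in df.groupby("category"):
    ax_mpl.scatter(group["x"], group["y"], label=name)
ax_mpl.set_title("Scatter by category (Matplotlib)")
ax_mpl.set_xlabel("x")
ax_mpl.set_ylabel("y")
ax_mpl.grid(True)
ax_mpl.legend(title="category")

# Seaborn: a single call; hue assigns a color per category and adds the legend.
fig_sns, ax_sns = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=df, x="x", y="y", hue="category", ax=ax_sns)
ax_sns.set_title("Scatter by category (Seaborn)")
```

Both figures show the same points. The Seaborn version needs no loop because hue does the grouping.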
2) Default Aesthetics
3) Statistical Visualizations
- Matplotlib - limited built-in statistical plotting functions.
- Seaborn - specializes in statistical visualizations and offers a wide array of built-in
statistical plotting functions.
Matplotlib: plt.boxplot() computes the quartiles itself, but splitting the data by
category, labeling the groups, and styling the result are manual steps. While powerful for
general plotting, it has fewer built-in statistical plots, so more preparation and code
are needed.
Seaborn: creating a grouped box plot requires just one function call, for example
sns.boxplot(x='variable1', y='variable2', data=data).
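A rough side-by-side sketch of the box plot case, assuming a made-up DataFrame with columns group and value. In current Matplotlib, boxplot() computes the quartiles from raw samples. The manual part here is splitting the data per group.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: four values in each of three groups.
df = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "value": [1, 2, 3, 4, 2, 4, 6, 8, 5, 5, 6, 9],
})

# Matplotlib: split the column into one array per group, then label the ticks.
names = sorted(df["group"].unique())
samples = [df.loc[df["group"] == n, "value"].to_numpy() for n in names]
fig_mpl, ax_mpl = plt.subplots()
boxes = ax_mpl.boxplot(samples)
ax_mpl.set_xticklabels(names)
ax_mpl.set_xlabel("group")
ax_mpl.set_ylabel("value")

# Seaborn: grouping, tick labels, and axis labels come from the column names.
fig_sns, ax_sns = plt.subplots()
sns.boxplot(data=df, x="group", y="value", ax=ax_sns)
```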
5) Customization
Can you explain a scenario where you would choose Seaborn over Matplotlib?
Answer: I would choose Seaborn when I need to create statistical plots, such as pair plots or
violin plots, that require quick visualization of relationships and distributions. Seaborn
simplifies the process with built-in themes and better default aesthetics, allowing me to
focus on the analysis rather than customization.
What are the common plot types in Matplotlib?
Answer: Common plot types in Matplotlib include line plots, scatter plots, bar charts,
histograms, pie charts, box plots, and error bar plots. These cover a wide range of
visualization needs, from simple trends to complex distributions.
How can you customize plots in Matplotlib?
Answer: Customization in Matplotlib is done through function calls and keyword arguments.
You can change the color and style of lines, adjust marker types, set titles and labels,
customize axis limits, and modify ticks. For example, plt.title(), plt.xlabel(), and
plt.ylabel() set the title and axis labels of the current plot.
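A minimal sketch of these customization calls on a single line plot. The data values and label text are made up for illustration, and the Agg backend is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

plt.figure()
# Line color, line style, and marker type are keyword arguments of plot().
plt.plot(x, y, color="tab:red", linestyle="--", marker="o")
plt.title("Squares")
plt.xlabel("n")
plt.ylabel("n squared")
plt.xlim(0, 5)                    # axis limits
plt.ylim(0, 20)
plt.xticks([0, 1, 2, 3, 4, 5])    # explicit tick positions
ax = plt.gca()                    # current axes, kept for inspection
```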
Matplotlib-Specific Questions
How do you save a plot in Matplotlib?
o Answer: You can save a plot using the plt.savefig("filename.png") function. The file
extension selects the output format, such as PNG, JPG, PDF, or SVG. You can also pass
dpi (for example dpi=300) to raise the resolution of raster output.
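A small sketch of savefig, assuming a throwaway figure. It writes to in-memory buffers instead of files so nothing is left on disk. With a real path such as "plot.pdf" the extension selects the format, while a buffer needs format= spelled out.

```python
import io
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])

# Raster output: dpi controls the pixel resolution.
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png", dpi=300)

# Vector output: same figure, different format.
svg_buffer = io.BytesIO()
fig.savefig(svg_buffer, format="svg")
```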
What are subplots, and how do you create them?
o Answer: Subplots allow you to create multiple plots in a single figure. You can add them
one at a time with plt.subplot(nrows, ncols, index), where index counts from 1, or
create the whole grid with plt.subplots(nrows, ncols), which returns a figure and an
array of axes. For example, plt.subplots(2, 2) creates a 2x2 grid of subplots.
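How do you display multiple plots in one figure?
o Answer: You can display multiple plots in one figure using subplots. For example, after
fig, axs = plt.subplots(2, 2), each plot is drawn on its own axes, such as
axs[1, 1].hist(y4) for the bottom-right panel, and plt.show() displays the whole figure.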
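A fuller sketch of a 2x2 subplot grid. Only axs[1, 1].hist(y4) comes from the notes; the other three panels and all data (x, y1, y2, y3) are assumptions added to fill the grid.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y1 = np.sin(x)
y2 = np.cos(x)
y3 = rng.normal(size=50)    # hypothetical noisy values for the scatter panel
y4 = rng.normal(size=200)   # hypothetical sample for the histogram panel

fig, axs = plt.subplots(2, 2, figsize=(8, 6))  # axs is a 2x2 array of Axes
axs[0, 0].plot(x, y1)
axs[0, 1].plot(x, y2)
axs[1, 0].scatter(x, y3)
axs[1, 1].hist(y4)
fig.tight_layout()          # keep titles and tick labels from overlapping
```

In an interactive session, plt.show() after the last line opens the figure.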
What is the purpose of the figure() function in Matplotlib?
Answer: The figure() function creates a new figure object, allowing you to manage the size
(figsize), resolution (dpi), and background color (facecolor) of your plots. It’s important
for organizing multiple plots in one window and setting properties that apply to the whole
figure.
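A short sketch of figure() with the three properties mentioned above. The specific values are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# figsize is in inches, dpi in dots per inch, facecolor is the background.
fig = plt.figure(figsize=(8, 4), dpi=120, facecolor="lightgray")
ax = fig.add_subplot(1, 1, 1)   # one axes inside the figure
ax.plot([0, 1, 2], [0, 1, 4])
```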
Seaborn-Specific Questions
What is a pair plot, and when is it useful?
o Answer: A pair plot (sns.pairplot()) is a grid of scatter plots that displays the
relationship between every pair of numeric variables in a dataset, with each variable's
own distribution on the diagonal. It’s useful for spotting correlations and unusual
distributions when a dataset has many variables.
What does the hue parameter do in Seaborn?
Answer: The hue parameter in Seaborn colors the data points based on a categorical
variable. This lets you differentiate between groups within the same plot, making it
easier to compare relationships across groups.
How do you create a heatmap in Seaborn?
Answer: You can create a heatmap in Seaborn using the sns.heatmap() function. This
function draws a matrix as a grid of colored cells, where color encodes the value, making
it useful for displaying correlations or frequencies.
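A minimal heatmap sketch for the correlation use case. The DataFrame is made up, and the color map and value range are illustrative choices rather than requirements.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: b rises with a, c falls as a rises.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 4, 3, 2, 1],
})
corr = df.corr()  # 3x3 matrix of pairwise correlations

fig, ax = plt.subplots(figsize=(4, 3))
# annot=True writes each value into its cell; vmin/vmax fix the color scale.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```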
Practical Questions
13. Given a dataset, how would you visualize the distribution of a numeric variable?
o Answer: I would use a histogram to visualize the distribution. In Matplotlib, I would
use plt.hist(data); in Seaborn, I would use sns.histplot(data), passing kde=True to
overlay a density curve if needed.
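A small Matplotlib sketch of the histogram answer, using a randomly generated sample as stand-in data. The bin count of 20 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=500)  # hypothetical numeric variable

fig, ax = plt.subplots()
# hist() returns the per-bin counts, the bin edges, and the drawn bars.
counts, bin_edges, bars = ax.hist(data, bins=20, edgecolor="black")
ax.set_xlabel("value")
ax.set_ylabel("count")
```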
14. How would you visualize the relationship between two continuous variables?
o Answer: I would use a scatter plot for this purpose. In Matplotlib, I would use
plt.scatter(x, y), or in Seaborn, I could use sns.scatterplot(x='variable1', y='variable2',
data=data) to visualize the relationship and observe patterns.
15. Can you write code to create a bar chart using either library?
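o Answer: In Matplotlib, plt.bar(categories, values) draws the bars, plt.xlabel(),
plt.ylabel(), and plt.title() label the chart, and plt.show() displays it.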
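A minimal Matplotlib sketch for the bar chart question. The category names and values are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]   # hypothetical category labels
values = [5, 3, 8]             # hypothetical bar heights

fig, ax = plt.subplots()
bars = ax.bar(categories, values)
ax.set_xlabel("category")
ax.set_ylabel("value")
ax.set_title("Values by category")
```

In an interactive session, plt.show() after the last line displays the chart.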
Scenario-Based Questions
16. If your plots are cluttered and hard to read, what steps would you take to improve them?
o Answer: I would simplify the plot by reducing the number of elements, using fewer
colors, and ensuring adequate spacing. I would also adjust the size of the plot, add
labels and legends for clarity, and consider using faceting to break down the data
into smaller visualizations.
How would you visualize time series data?
o Answer: For time series data, I would typically use a line plot to visualize trends over
time. In Matplotlib, I would use plt.plot(x_dates, y_values), or in Seaborn, I could use
sns.lineplot(x='date', y='value', data=data), which also draws error bands when there are
repeated observations per date.
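A short Matplotlib sketch of the time series answer, using the x_dates and y_values names from above with made-up daily values.

```python
from datetime import date, timedelta

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical daily series: seven consecutive dates and one value per date.
x_dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(7)]
y_values = [10, 12, 11, 15, 14, 18, 17]

fig, ax = plt.subplots(figsize=(7, 3))
(line,) = ax.plot(x_dates, y_values, marker="o")
ax.set_xlabel("date")
ax.set_ylabel("value")
fig.autofmt_xdate()  # rotate date tick labels so they do not overlap
```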
19. What are the differences between a Python list and a NumPy array?
- NumPy arrays are homogeneous (all elements of the same type), while lists
can contain mixed types.
- NumPy arrays use memory more efficiently and run operations faster because
the data is stored contiguously and processed by optimized C code.
- NumPy offers a wide range of element-wise mathematical operations that are
not available for lists.
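A quick sketch of the type and arithmetic differences. The values are arbitrary.

```python
import numpy as np

mixed_list = [1, "two", 3.0]        # a list can hold mixed types
arr = np.array([1, 2, 3.5])         # NumPy converts everything to one dtype
print(arr.dtype)                    # float64

numbers = [1, 2, 3]
print(numbers * 2)                  # [1, 2, 3, 1, 2, 3]  (list repetition)
print(np.array(numbers) * 2)        # [2 4 6]             (element-wise)
print(np.array(numbers).mean())     # 2.0
```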
Intermediate Questions
Advanced Questions
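How do you handle missing data in NumPy arrays?
o Answer: You can handle missing data in NumPy arrays by using np.nan to represent
missing values. Because np.nan is a float, the array needs a floating-point dtype.
Functions like np.nanmean() can compute the mean while ignoring NaN values.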
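A small sketch of the NaN handling described above, on a made-up array.

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0, 5.0])

print(np.mean(arr))            # nan  (a single NaN poisons the plain mean)
print(np.nanmean(arr))         # 3.0  (mean of 1.0, 3.0, 5.0)
print(np.isnan(arr).sum())     # 1    (number of missing values)

clean = arr[~np.isnan(arr)]    # boolean mask drops the NaN entries
```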
NumPy vs Pandas
NumPy:
Primarily uses arrays (ndarray), which are homogeneous (all elements of the same type).
Pandas:
Uses Series (1D) and DataFrames (2D). A DataFrame can hold mixed data types, with each
column having its own type (e.g., integers, floats, strings).
Designed for data manipulation and analysis, particularly for labeled tabular data (like
spreadsheets or SQL tables).
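A brief sketch of the contrast, with made-up data and column names.

```python
import numpy as np
import pandas as pd

# NumPy: one dtype for the whole array; the ints are upcast to float.
arr = np.array([[1, 2.5], [3, 4.5]])
print(arr.dtype)                       # float64

# Pandas: each column keeps its own dtype, and rows and columns are labeled.
df = pd.DataFrame({
    "id": [1, 2],
    "score": [2.5, 4.5],
    "name": ["a", "b"],
})
print(df.dtypes)                       # one dtype per column
high = df.loc[df["score"] > 3, "name"].tolist()   # label-based filtering
```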
Statsmodels is a Python module that provides classes and functions for the estimation of many
different statistical models, as well as for conducting statistical tests, and statistical data exploration.
SERIES
What is a Pandas Series?
Answer: A Pandas Series is a one-dimensional array-like object that can hold data of any type
(integers, floats, strings, etc.) and is associated with an index. It behaves like a list
whose elements also have labels, or like a dictionary with ordered keys, and adds
vectorized operations for data manipulation and analysis.
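A minimal sketch of a Series, with made-up values and labels.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s["b"])                      # 20  (lookup by label, like a dictionary)
print(s.iloc[0])                   # 10  (lookup by position, like a list)
print(s[s > 15].index.tolist())    # ['b', 'c']  (boolean filtering)
print(s.mean())                    # 20.0

from_dict = pd.Series({"x": 1.5, "y": 2.5})   # dict keys become the index
```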
If you have a Series with duplicate indices, how would you handle it?
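Answer: You can aggregate values that share an index label with groupby(level=0), or keep
one entry per label with s[~s.index.duplicated()]. Note that drop_duplicates() removes
duplicate values, not duplicate index labels. You might also consider resetting the index
with reset_index(drop=True) if the labels carry no meaning.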
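A small sketch of the common options for a Series with repeated index labels. The data is made up, and sum() is just one possible aggregation.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "a", "c"])
print(s.index.is_unique)                       # False

# Option 1: aggregate the values that share a label.
summed = s.groupby(level=0).sum()              # a -> 4, b -> 2, c -> 4

# Option 2: keep only the first entry for each label.
first = s[~s.index.duplicated(keep="first")]   # a -> 1, b -> 2, c -> 4

# Option 3: discard the labels and renumber from 0.
renumbered = s.reset_index(drop=True)
```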
How do you handle missing values in a Series?
Answer: You can handle missing values using methods like fillna() to fill them with a specific
value, dropna() to remove them, or interpolate() to estimate them from neighboring values.
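A short sketch of the three methods on a made-up Series. interpolate() is used with its default linear method.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

filled = s.fillna(0)           # 1.0, 0.0, 3.0, 0.0, 5.0
dropped = s.dropna()           # 1.0, 3.0, 5.0 (original index labels kept)
interpolated = s.interpolate() # 1.0, 2.0, 3.0, 4.0, 5.0
```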