Python CA2
Declaration:
I declare that this Assignment is my individual work. I have not copied it from any other student’s
work or from any other source except where due acknowledgement is made explicitly in the text,
nor has any part been written for me by any other person.
1 PART 1: PANDAS
1.1 Q1. LIST AT LEAST THREE REAL-WORLD SCENARIOS WHERE PANDAS CAN BE USED FOR DATA ANALYSIS.
Some real-world scenarios where Pandas can be used for data analysis:
1. Sports: Analyzing player performance statistics, tracking team trends, identifying factors that
contribute to wins and losses, and optimizing training and strategies.
Pandas use:
• Loading and cleaning data from various sources (game scores, player statistics, sensor
readings, etc.).
• Calculating key performance metrics (averages, shooting percentages, assists, rebounds,
etc.).
• Visualizing trends and patterns (player performance over time, team comparisons, win-loss
distributions).
• Building predictive models to forecast player performance, game outcomes, or injury risks.
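A minimal sketch of this kind of workflow, assuming a hypothetical player_stats.csv file and invented column names:

import pandas as pd

# Hypothetical file and columns, used purely for illustration
df = pd.read_csv("player_stats.csv")          # load raw game data
df = df.dropna(subset=["points", "minutes"])  # clean: drop incomplete rows

# Key performance metric: average points per game for each player
ppg = df.groupby("player")["points"].mean().sort_values(ascending=False)
print(ppg.head())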
2. Social Media: Understanding user behaviour, identifying popular topics and trends, analyzing
sentiment and engagement, optimizing marketing campaigns.
Pandas use:
• Collecting and preparing social media data (tweets, posts, comments, likes, shares).
• Cleaning and preprocessing text data (removing noise, handling
emojis, stemming/lemmatizing words).
• Conducting sentiment analysis (classifying positive, negative, or neutral sentiment in text).
• Identifying trending topics and influencers.
• Visualizing social network structures and interactions.
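A small sketch of the cleaning and engagement steps, using a few invented posts in place of real social media data:

import pandas as pd

# Invented posts standing in for data pulled from an API or export
posts = pd.DataFrame({
    "text":  ["Love this product!!!", "worst update ever :(", "Pretty good overall"],
    "likes": [120, 45, 60],
})

# Basic preprocessing with vectorized string methods: lowercase, strip punctuation
posts["clean"] = (posts["text"]
                  .str.lower()
                  .str.replace(r"[^a-z\s]", "", regex=True)
                  .str.strip())

# Simple engagement summary, sorted by likes
print(posts.sort_values("likes", ascending=False)[["clean", "likes"]])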
1.2 Q2. DESCRIBE THE PRIMARY DATA STRUCTURES IN PANDAS, NAMELY SERIES AND DATAFRAME. EXPLAIN
THE DIFFERENCES AND USE CASES FOR EACH.
Here's a description of Series and DataFrame, the primary data structures in Pandas, along with their
differences and use cases:
Series:
• One-dimensional labelled array: A single sequence of values with an associated index.
o Holds any data type: Numbers, strings, dates, booleans, or even custom objects.
o Index: A label for each value, often used for selection and alignment.
Use cases:
• Representing a single column or variable, such as a time series of daily prices or temperature readings.
• Label-based lookups and element-wise calculations on one variable.
DataFrame:
• Two-dimensional labelled table: Data organised into rows and columns.
o Collection of Series objects: Each column is a Series, and each row represents an observation.
o Index: Labels for both rows and columns, enabling flexible access and manipulation.
Use cases:
• Representing tabular data, such as datasets imported from CSV, Excel, or databases.
• Filtering, grouping, joining, and aggregating data across multiple related columns.
The key difference is dimensionality: a Series holds a single labelled sequence of values, while a DataFrame is a table of multiple aligned Series that share a row index.
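A short illustration of both structures, with invented values:

import pandas as pd

# Series: a one-dimensional labelled array
prices = pd.Series([10.5, 12.0, 9.8], index=["mon", "tue", "wed"])
print(prices["tue"])          # access by label -> 12.0

# DataFrame: a two-dimensional table whose columns are Series
sales = pd.DataFrame({
    "product": ["A", "B", "C"],
    "units":   [30, 45, 12],
})
print(sales["units"].sum())   # column-wise operation -> 87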
2 PART 2: NUMPY
2.1 Q1. WRITE A BRIEF DESCRIPTION OF WHAT NUMPY IS AND WHY IT IS IMPORTANT FOR SCIENTIFIC
COMPUTING AND DATA ANALYSIS IN PYTHON
NumPy, short for Numerical Python, is a fundamental library for numerical computing in Python. It
provides support for large, multi-dimensional arrays and matrices, along with a collection of high-
level mathematical functions to operate on these arrays.
Key features and reasons why NumPy is important for scientific computing and data analysis in
Python include:
Efficient Array Operations: NumPy provides a powerful N-dimensional array object (ndarray), which
allows for efficient storage and manipulation of large datasets. The ndarray supports a variety of
data types and enables vectorized operations, which significantly enhances the performance of
numerical computations.
Broadcasting: NumPy's broadcasting capability enables operations on arrays of different shapes and
sizes, making it easier to perform element-wise operations without the need for explicit loops. This
enhances code readability and reduces the need for unnecessary duplication of data.
Memory Management: NumPy efficiently manages memory and provides tools for creating views on
arrays without copying data, saving both time and resources. This is particularly beneficial when
working with large datasets, as it minimizes memory overhead.
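A brief sketch of these features, using nothing beyond NumPy itself:

import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)  # efficient N-dimensional array

# Vectorized, element-wise operation (no explicit Python loop)
squared = a ** 2

# Broadcasting: the 1-D row is applied across both rows of `a`
shifted = a + np.array([10.0, 20.0, 30.0])

# A view shares memory with `a` instead of copying the data
first_row = a[0]
first_row[0] = 99.0
print(a[0, 0])  # 99.0 -> the change is visible in the original array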
2.2 Q2. EXPLAIN THE SIGNIFICANCE OF NUMPY IN TERMS OF PERFORMANCE AND EFFICIENCY WHEN WORKING
WITH LARGE DATASETS AND NUMERICAL COMPUTATIONS.
When it comes to handling large datasets and complex numerical computations in Python, NumPy
offers substantial advantages in performance and efficiency. Here's why:
1. Memory Efficiency:
▪ Contiguous Memory Layout: NumPy stores data in contiguous blocks of memory, unlike
Python lists which can be scattered. This allows for faster access and manipulation of
elements as data doesn't need to be searched across memory fragments.
▪ Optimized Data Types: NumPy offers specialized data types like float64 or int32 designed for
numerical operations. These are more compact and efficient than generic Python types like
"float" or "int", reducing memory footprint and boosting processing speed.
2. Vectorized Operations:
▪ Single Instruction, Multiple Data (SIMD): NumPy leverages vectorized operations, utilizing
SIMD instructions on modern CPUs. This allows performing the same operation on multiple
data elements simultaneously, leading to significant speedups compared to looping over
elements one by one.
▪ Broadcasting: NumPy automatically broadcasts operations between arrays of different sizes,
eliminating the need for manual loop-based iteration and further enhancing performance.
3. C-optimized Backend:
NumPy relies heavily on optimized C code under the hood, making it significantly faster than pure
Python implementations. This C code takes advantage of hardware capabilities and low-level
memory access, further pushing the boundaries of performance.
4. Concise Syntax:
NumPy provides vectorized functions and operators that eliminate the need for long and intricate
loops, simplifying code and making it more readable. This not only improves efficiency but also
reduces the risk of errors.
In practice, these advantages translate into:
• Faster execution times: Analyzing large datasets and performing complex calculations
become significantly faster with NumPy compared to pure Python or other less optimized
libraries.
• Reduced CPU and memory usage: Smaller memory footprint and efficient computations
translate to lower resource consumption, enabling smooth processing of even massive
datasets on smaller machines.
• Simplified code and easier maintenance: Concise and readable code thanks to vectorization
improves maintainability and reduces debugging time.
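A small comparison illustrating the speed difference; exact timings depend on the machine, but the vectorized version is typically orders of magnitude faster:

import time
import numpy as np

data = np.random.rand(1_000_000)

# Pure-Python loop over one million elements
start = time.perf_counter()
total = 0.0
for x in data:
    total += x * x
loop_time = time.perf_counter() - start

# Vectorized NumPy equivalent of the same computation
start = time.perf_counter()
total_np = np.dot(data, data)
numpy_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s  numpy: {numpy_time:.4f}s")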
UNIT 5
3 DATA VISUALIZATION
3.1 Q1. CREATE A MATPLOTLIB BAR PLOT SHOWING THE SALES OF PRODUCTS IN A STORE FOR A GIVEN
MONTH. LABEL THE AXES, ADD A TITLE, AND CUSTOMIZE THE APPEARANCE (E.G., COLOUR, WIDTH)
Code:
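One possible implementation, using invented product names and sales figures for the month:

import matplotlib.pyplot as plt

# Hypothetical products and unit sales for one month
products = ["Laptops", "Phones", "Tablets", "Headphones"]
sales = [120, 250, 90, 180]

plt.bar(products, sales, color="steelblue", width=0.6, edgecolor="black")
plt.xlabel("Product")
plt.ylabel("Units sold")
plt.title("Product Sales - January")
plt.tight_layout()
plt.show()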
3.2 Q2. PROVIDE AT LEAST THREE EXAMPLES OF DATA VISUALIZATION SCENARIOS WHERE SEABORN IS THE
PREFERRED LIBRARY OVER MATPLOTLIB. DESCRIBE THE TYPE OF PLOTS OR CHARTS INVOLVED AND WHY
SEABORN IS A BETTER CHOICE.
Here are three examples where Seaborn often excels over Matplotlib for specific visualization tasks:
1. Visualizing Statistical Relationships:
• Plot types: Pair plots, joint plots, distributions, heatmaps, violin plots
Example: Visualizing correlations between multiple variables in a dataset using a pair plot, revealing
patterns and potential interactions.
2. Exploring Categorical Data:
• Plot types: Bar plots, box plots, violin plots, strip plots, point plots
Example: Comparing distributions of customer satisfaction scores across different product categories
using box plots to identify potential issues.
3. Handling Data with Facets:
• Plot types: Faceted grids of line plots, scatter plots, or histograms (e.g., via relplot, catplot, or FacetGrid)
Example: Comparing sales trends across regions and product categories using a faceted line plot to
identify regional differences and potential market opportunities.
In summary, Seaborn shines when:
• Aesthetic appeal and concise code are desired for effective communication.
• Statistical relationships, distributions, or categorical comparisons need to be shown with minimal boilerplate.
• The data already lives in a Pandas DataFrame, which Seaborn's plotting functions accept directly (see the sketch below).
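A brief sketch of the second scenario, using Seaborn's built-in tips dataset as a stand-in for customer satisfaction data:

import seaborn as sns
import matplotlib.pyplot as plt

# Built-in example dataset used in place of real categorical survey data
tips = sns.load_dataset("tips")

# One call produces grouped box plots with sensible styling by default
sns.boxplot(data=tips, x="day", y="total_bill", hue="sex")
plt.title("Distribution of bills by day")
plt.show()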
UNIT 6
4 PLOTLY
4.1 Q1. DESCRIBE THE THREE KEY STRUCTURES IN PLOTLY: FIGURE, DATA, AND LAYOUT. EXPLAIN THE
PURPOSE OF EACH STRUCTURE IN CREATING VISUALIZATIONS.
1. Figure:
• The overall container: It acts as the canvas or window that holds all the elements of your
visualization.
• Foundation for visual elements: It provides the space where you'll create and arrange
plots, axes, titles, legends, annotations, and other visual components.
• Management and customization: It allows you to manage the overall size, aspect
ratio, background color, and other stylistic properties of the entire visualization.
2. Data:
• The heart of the visualization: It consists of the numerical values, categorical information, or
text that you want to visualize.
• Source and format: It can come from various sources like arrays, dataframes, or external
files, and it's typically structured in a format that visualization libraries can understand.
• Mapping to visual elements: It's used to create the visual representations within the
figure, such as bars in a bar chart, lines in a line plot, or points in a scatter plot.
3. Layout:
• Organization and arrangement: It determines the spatial arrangement of visual elements
within the figure, ensuring clarity and readability.
• Customization and control: It allows you to adjust spacing, margins, alignment, and the
overall visual hierarchy of elements to effectively guide the viewer's attention.
How they work together:
1. Create a figure: You typically start by creating a figure object to establish the overall
container for your visualization.
2. Load and prepare data: You then load your data, ensuring it's in a suitable format for the
visualization library you're using.
3. Map data to visual elements: You create visual elements like plots, axes, and
markers, mapping the data to their properties (e.g., x-axis values, y-axis values, colors, sizes).
4. Arrange elements within layout: You position and organize these visual elements within the
figure using layout tools, ensuring a clear and informative presentation.
5. Customize appearance: You can apply stylistic choices to both the figure and individual
elements to enhance readability and visual appeal.
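A minimal sketch of how the three structures fit together in Plotly's graph_objects interface, with invented values:

import plotly.graph_objects as go

# Data: one or more "trace" objects holding the values to plot
trace = go.Bar(x=["A", "B", "C"], y=[10, 15, 7])

# Figure: the container that combines the traces and the layout
fig = go.Figure(data=[trace])

# Layout: title, axis labels, and other presentation choices
fig.update_layout(title="Example bar chart",
                  xaxis_title="Category",
                  yaxis_title="Value")
fig.show()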
4.2 Q2. LOAD A SALES DATASET WITH COLUMNS 'SALES,' CREATE A PLOTLY LINE CHART TO VISUALIZE THE
TOTAL SALES TREND. INCLUDE AXIS LABELS, A TITLE, AND CUSTOMIZE THE APPEARANCE.
Code:
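The chart can be produced along these lines, assuming a hypothetical monthly sales table in place of the real dataset (in practice the data would be loaded with pd.read_csv):

import pandas as pd
import plotly.express as px

# Invented monthly figures standing in for the real sales dataset
df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May"],
    "Sales": [200, 240, 310, 280, 350],
})

fig = px.line(df, x="Month", y="Sales",
              title="Total Sales Trend",
              labels={"Month": "Month", "Sales": "Total Sales"},
              markers=True)
fig.update_traces(line_color="firebrick", line_width=3)
fig.show()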