
DMV Unit-4-1.pdf


DMV Unit-4

Data Wrangling
Hierarchical Indexing:-
● Hierarchical indexing, also known as multi-level indexing, is a powerful feature in pandas, a
popular data manipulation library in Python.
● It allows you to have multiple levels of indices on both the rows and columns of a DataFrame.
● This is particularly useful when dealing with higher-dimensional data that can be naturally
represented in a hierarchical structure.
● Both pandas Series and DataFrames support it through the MultiIndex class.
● Each level of the index corresponds to a different dimension or category of the data.
● Because the levels are explicit, data can be selected, grouped, and plotted level by level, which
helps when exploring hierarchical data.
● Resembles a tree structure, enabling a hierarchy of data representation.
● Code:-
1. Creation:
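import pandas as pd
# Example values for a single column (any data with four rows would do)
data = {'column_name': [10, 20, 30, 40]}
# Creating a DataFrame with a hierarchical (two-level) row index
index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)],
names=['index_1', 'index_2'])
df = pd.DataFrame(data, index=index)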

2. Indexing and Slicing:


# Accessing data with hierarchical indexing
df.loc['A'] # Access all rows whose outer index is 'A'
df.loc[('A', 1)] # Access the row with index 'A' and sub-index 1 (a tuple avoids row/column ambiguity)
df.loc[('A', 1), 'column_name'] # Access a specific element

3. Stacking and Unstacking:


# Convert the innermost index level ('index_2') to columns
wide = df.unstack(level='index_2')

# Convert that column level back to the innermost index level
wide.stack(level='index_2')

4. Visualization:
# Plotting with hierarchical index
df.plot(kind='bar', rot=0)
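
The snippets above can be put together into one small, self-contained sketch. The column name `sales` and its values are made up for illustration; only the index labels come from the notes.

```python
import pandas as pd

# Hypothetical example data: one 'sales' value per (index_1, index_2) pair
index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("B", 1), ("B", 2)],
    names=["index_1", "index_2"],
)
df = pd.DataFrame({"sales": [10, 20, 30, 40]}, index=index)

# All rows whose outer level is 'A' (the result is indexed by index_2 only)
rows_a = df.loc["A"]

# A single element: full (outer, inner) key plus a column label
value = df.loc[("B", 2), "sales"]

# Cross-section on the inner level: every row where index_2 == 1
first = df.xs(1, level="index_2")

# Move the inner level into the columns, then back into the index
wide = df.unstack(level="index_2")
restored = wide.stack(level="index_2")
```

Passing the key as a tuple keeps the row lookup separate from the column lookup, which is why `df.loc[("B", 2), "sales"]` is preferred over chained indexing.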
Combining and Merging Data Sets, Reshaping and Pivoting:-
Combining and merging datasets, along with reshaping and pivoting, are essential operations in data
manipulation, especially when working with relational or structured data.

1. Concatenation:
Combines datasets along a particular axis (rows or columns). In Python, tools like Pandas
provide the concat function for this purpose.
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2], axis=0) # Concatenating along rows (axis=0)

2. Merging:
Combining datasets based on a common column (key). The Pandas merge function is
commonly used for this.
df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'B'], 'value': [3, 4]})
# Merging on the 'key' column; the overlapping 'value' columns become 'value_x' and 'value_y'
result = pd.merge(df1, df2, on='key')
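
As a hedged illustration of how concatenation and merging differ, the sketch below uses two made-up frames whose keys only partly overlap. The frame names and values are assumptions chosen for the example; `how` and `suffixes` are standard parameters of `pd.merge`.

```python
import pandas as pd

# Hypothetical frames: keys 'A' and 'B' appear in both, 'C' and 'D' in only one
left = pd.DataFrame({"key": ["A", "B", "C"], "value": [1, 2, 3]})
right = pd.DataFrame({"key": ["A", "B", "D"], "value": [4, 5, 6]})

# Row-wise concatenation; ignore_index=True renumbers the rows 0..n-1
stacked = pd.concat([left, right], axis=0, ignore_index=True)

# Inner join: keeps only the keys present in both frames
inner = pd.merge(left, right, on="key", how="inner", suffixes=("_left", "_right"))

# Outer join: keeps every key and fills the missing side with NaN
outer = pd.merge(left, right, on="key", how="outer", suffixes=("_left", "_right"))
```

Concatenation simply stacks the rows, while a merge aligns rows by key, so the choice of `how` decides which unmatched keys survive.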

3. Reshaping and Pivoting:


a) Reshaping with melt:
● Transforming wide-format data into long-format.
wide_df = pd.DataFrame({'Name': ['John', 'Jane'], 'Math': [90, 85], 'History': [88, 92]})
long_df = pd.melt(wide_df, id_vars=['Name'], var_name='Subject', value_name='Score')

b) Reshaping with pivot:


● Transforming long-format data into wide-format.
long_df = pd.DataFrame({'Name': ['John', 'Jane', 'John', 'Jane'],
'Subject': ['Math', 'Math', 'History', 'History'],
'Score': [90, 85, 88, 92]})
wide_df = long_df.pivot(index='Name', columns='Subject', values='Score')

These operations are crucial for preparing data for analysis and visualization. The choice of method
depends on the structure of the original data and the desired format for analysis or presentation.
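
A minimal sketch of the round trip between the two formats, reusing the example data above. The duplicated frame at the end is an assumption added to show when `pivot_table` is needed instead of `pivot`.

```python
import pandas as pd

wide_df = pd.DataFrame({"Name": ["John", "Jane"], "Math": [90, 85], "History": [88, 92]})

# Wide -> long: one row per (Name, Subject) pair
long_df = pd.melt(wide_df, id_vars=["Name"], var_name="Subject", value_name="Score")

# Long -> wide: one row per Name, one column per Subject
round_trip = long_df.pivot(index="Name", columns="Subject", values="Score")

# pivot() raises a ValueError when an (index, columns) pair occurs more than once;
# pivot_table() aggregates the duplicates instead (here with the mean)
dupes = pd.concat([long_df, long_df], ignore_index=True)
averaged = dupes.pivot_table(index="Name", columns="Subject", values="Score", aggfunc="mean")
```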

Basics of matplotlib:-
Matplotlib is a popular Python library for creating static, interactive, and animated visualizations in
data analysis. Its basic usage is outlined below:

1. Importing Matplotlib:
Import the matplotlib.pyplot module.
import matplotlib.pyplot as plt

2. Creating a Simple Plot:


Use the plot function to create a basic line plot.
x = [1, 2, 3, 4]
y = [10, 15, 13, 17]
plt.plot(x, y)
plt.show()

3. Adding Labels and Title:


Set labels for the x-axis, y-axis, and title.
plt.plot(x, y)
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Title of the Plot')
plt.show()

4. Customizing Line Styles and Colors:


Customize the appearance of the plot with different line styles and colors.
plt.plot(x, y, linestyle='--', color='red', marker='o')

5. Legends:
Add a legend to distinguish between multiple plots.
plt.plot(x, y, label='Line 1')
plt.plot(x, [20, 10, 15, 5], label='Line 2')
plt.legend()

6. Grid:
Display a grid in the plot.
plt.grid(True)

7. Plotting Different Types of Figures:


Matplotlib supports various plot types such as bar plots, scatter plots, histograms, etc.
# Call plt.show() after each one to draw it on its own figure
plt.bar(x, y) # Bar plot
plt.scatter(x, y) # Scatter plot
plt.hist(y, bins=10) # Histogram of the values in y
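
The individual calls above can be combined into one figure. The sketch below uses Matplotlib's object-oriented interface (`fig`, `ax`) rather than the `plt.*` calls shown in the notes; the `Agg` backend and the in-memory buffer are assumptions made so the script also runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 15, 13, 17]

fig, ax = plt.subplots(figsize=(6, 4))

# Two lines with different styles; the labels feed the legend
ax.plot(x, y, linestyle="--", color="red", marker="o", label="Line 1")
ax.plot(x, [20, 10, 15, 5], label="Line 2")

# Axis labels, title, legend, and grid
ax.set_xlabel("X-axis Label")
ax.set_ylabel("Y-axis Label")
ax.set_title("Title of the Plot")
ax.legend()
ax.grid(True)

# Render to an in-memory PNG; pass a file path instead to write to disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Working on an explicit `ax` object makes it clear which plot each call modifies, which matters once a figure contains several subplots.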

Plotting with pandas and seaborn:-


Plotting with Pandas:
Pandas provides built-in plotting functionality through the plot method, which is a wrapper around
Matplotlib. You can create various types of plots directly from a Pandas DataFrame or Series.
1. Line Plot:
import pandas as pd
# Assuming df is a DataFrame with columns 'x' and 'y'
df.plot(x='x', y='y', kind='line')

2. Scatter Plot:
df.plot(x='x', y='y', kind='scatter')

3. Bar Plot:
df.plot(x='categories', y='values', kind='bar')

4. Histogram:
df['column'].plot(kind='hist', bins=20)
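
Because the pandas `plot` method returns a Matplotlib `Axes` object, the result can be customized further with ordinary Matplotlib calls. A small sketch, assuming a made-up DataFrame with columns `categories` and `counts` and a non-interactive backend:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"categories": ["a", "b", "c"], "counts": [3, 7, 5]})

# DataFrame.plot returns the Axes it drew on
ax = df.plot(x="categories", y="counts", kind="bar", rot=0, legend=False)

# Further tweaks use the regular Matplotlib API
ax.set_ylabel("counts")
ax.set_title("Counts per category")
```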

Plotting with Seaborn:


Seaborn is a statistical data visualization library built on top of Matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics.
1. Distribution Plot:
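Seaborn's histplot can be used for univariate distribution plots. It replaces distplot, which has
been deprecated since seaborn 0.11.
import seaborn as sns
# Assuming df is a DataFrame with a column 'values'
sns.histplot(data=df, x='values', bins=20, kde=False)
sns.rugplot(data=df, x='values') # Rug marks along the x-axis, as rug=True did in distplot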

2. Scatter Plot with Regression Line:


sns.regplot(x='x', y='y', data=df)

3. Box Plot:
sns.boxplot(x='categories', y='values', data=df)

Other python visualization tools:-


1. Plotly:
Plotly is a versatile library for interactive and web-based visualizations. It supports a
wide range of chart types, including scatter plots, line charts, bar charts, and 3D plots.
Plotly can be used in both Python and JavaScript.
2. Geopandas:
Geopandas extends Pandas to handle spatial data. It allows for the creation of maps
and spatial visualizations with ease.

3. NetworkX:
NetworkX is a library for the creation, analysis, and visualization of complex networks
or graphs. It is particularly useful for visualizing relationships and structures in graph
data.
4. WordCloud:
The WordCloud library is specialized for creating word clouds from text data. It's useful
for visually representing the frequency of words in a corpus.
5. Dash:
Dash is a framework for building interactive web applications with Python. It enables
the creation of dashboards with interactive plots and widgets.
6. Matplotlib Basemap Toolkit:
Matplotlib's Basemap Toolkit allows for the creation of geographic maps and
visualizations. Cartopy is a commonly used alternative for new projects.

Data Visualization Through Their Graph Representations

Data and Graphs:


Data visualization is a critical aspect of data analysis, enabling the exploration and communication of
patterns, trends, and insights within datasets. Visualizing data often involves creating graphs and
charts to represent information in a more digestible and interpretable form. Here are some common
types of data and corresponding graphs used in data visualization:
1. Continuous Data:
● Graphs: Line Plot, Scatter Plot
● Line Plot: Ideal for showing trends over a continuous interval, such as time series data.
● Scatter Plot: Useful for visualizing the relationship between two continuous variables.
2. Categorical Data:
● Graphs: Bar Chart, Pie Chart
● Bar Chart: Displays categorical data with rectangular bars. Suitable for comparing
values across different categories.
● Pie Chart: Represents parts of a whole. Useful for showing the proportion of each
category in a dataset.
3. Time Series Data:
● Graphs: Line Plot, Area Chart
● Line Plot: Tracks changes over time, illustrating trends and patterns.
● Area Chart: Similar to a line plot but emphasizes the area under the curve, making it
effective for showing cumulative values over time.
4. Hierarchical Data:
● Graphs: Tree Map, Sunburst Chart
● Tree Map: Displays hierarchical data as nested rectangles, allowing for the
visualization of hierarchies and proportions.
● Sunburst Chart: Radial version of a tree map, suitable for displaying hierarchical data
with a circular layout.
5. Multivariate Data:
● Graphs: Scatter Plot Matrix, Parallel Coordinates Plot
● Scatter Plot Matrix: Useful for visualizing relationships between multiple variables in a
matrix of scatter plots.
● Parallel Coordinates Plot: Represents multivariate data by plotting each variable on a
separate parallel axis.
6. Distribution of Data:
● Graphs: Histogram, Box Plot
● Histogram: Displays the distribution of a single variable.
● Box Plot: Illustrates the summary statistics of a dataset, including the median,
quartiles, and potential outliers.
7. Geospatial Data:
● Graphs: Choropleth Map, Bubble Map
● Choropleth Map: Uses color to represent a variable's value across geographical
regions.
● Bubble Map: Displays data points on a map with the size of each point representing a
variable.
8. Network Data:
● Graphs: Network Graph, Sankey Diagram
● Network Graph: Visualizes relationships between entities as nodes and edges.
● Sankey Diagram: Illustrates flow and connections between entities, often used for
visualizing energy or resource flows.

Graph Layout Techniques:-


Graph layout techniques determine how nodes and edges are arranged in a visual representation,
making it easier to understand the structure and relationships within the data.

1. Force-Directed Layout:
● Description:
Nodes repel each other, and edges act as springs, creating a balanced layout.
● Use Case:
Effective for visualizing relationships and clusters in large networks.
2. Hierarchical Layout (Tree Layout):
● Description:
Organizes nodes in a hierarchical tree structure, often used for representing parent-
child relationships.
● Use Case:
Ideal for visualizing organizational structures.

3. Radial Layout:
● Description:
Nodes are placed in concentric circles, and edges connect them radially.
● Use Case:
Suitable for emphasizing a central node or representing radial hierarchies.
4. Grid Layout:
● Description:
Places nodes on a regular grid, simplifying the layout.
● Use Case:
Well-suited for small graphs with a clear structure.
5. Random Layout:
● Description:
Nodes are placed randomly without any specific organization.
● Use Case:
Quick visualization for exploratory analysis but not suitable for conveying specific
patterns.
6. Spatial Layout:
● Description:
Utilizes spatial coordinates (e.g., latitude and longitude) to position nodes.
● Use Case:
Commonly used for geospatial networks or graphs.
7. 3D Layout:
● Description:
Extends 2D layouts into three dimensions for more complex visualizations.
● Use Case:
Useful when additional depth information is needed.
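
Several of these layouts are available in the NetworkX library mentioned earlier. The sketch below computes node positions for a small example tree; the choice of graph, the seed values, and the shell assignment are assumptions made for illustration. Each layout function returns a dictionary that maps a node to its coordinates.

```python
import networkx as nx

# Example graph: a balanced binary tree of height 3 (15 nodes, numbered level by level)
G = nx.balanced_tree(r=2, h=3)

# Nodes grouped by depth, used for the radial (shell) layout
shells = [[0], [1, 2], [3, 4, 5, 6], list(range(7, 15))]

layouts = {
    "force-directed": nx.spring_layout(G, seed=42),   # Fruchterman-Reingold
    "circular": nx.circular_layout(G),                # all nodes on one circle
    "radial": nx.shell_layout(G, nlist=shells),       # concentric circles by depth
    "random": nx.random_layout(G, seed=42),           # no structure at all
}

# The positions can then be passed to a drawing call, for example:
# nx.draw(G, pos=layouts["radial"], with_labels=True)
```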

Force-directed Techniques:
Force-directed techniques are widely used in data visualization, especially for representing complex
networks or graphs. These techniques simulate physical forces between nodes to determine their
positions in a layout. The nodes are treated as charged particles that repel one another, while the
edges act like springs that pull connected nodes together.

Force-Directed Algorithm Steps:


1. Initialization:
Nodes are assigned initial positions in the layout.
2. Force Calculation:
For each iteration, forces (attraction and repulsion) between nodes are calculated
based on their positions.
Forces:
● Repulsive Forces: Nodes repel each other to avoid overlapping and achieve a
more spread-out layout.
● Attractive Forces: Edges act as springs, pulling connected nodes closer
together.
3. Node Movement:
Nodes are moved based on the calculated forces. The movement is typically
proportional to the force acting on the node.
Equations of Motion:
● Newtonian Physics: The algorithms often use equations inspired by Newtonian
physics to model the motion of nodes in response to the forces acting on them.
● Damping: To prevent oscillations and achieve stability, a damping factor may
be introduced.
4. Iteration:
Steps 2 and 3 are repeated for multiple iterations until the layout converges to a stable state.
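
The four steps can be sketched in a few lines of NumPy. This is a toy illustration, not a reference implementation: the force formulas (repulsion k²/d, attraction d²/k) follow the Fruchterman-Reingold style, and the function name, parameter names, and default values are assumptions chosen for the example.

```python
import numpy as np

def force_directed_layout(n_nodes, edges, iterations=200, k=1.0,
                          temperature=0.5, cooling=0.97, seed=0):
    """Toy force-directed layout; returns an (n_nodes, 2) array of positions."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: random starting positions
    pos = rng.uniform(-1.0, 1.0, size=(n_nodes, 2))
    for _ in range(iterations):
        # 2. Force calculation
        delta = pos[:, None, :] - pos[None, :, :]   # delta[i, j] = pos[i] - pos[j]
        dist = np.maximum(np.linalg.norm(delta, axis=-1), 1e-9)  # avoid division by zero
        # Repulsion: every pair pushes apart with magnitude k^2 / distance
        force = (delta * (k ** 2 / dist ** 2)[..., None]).sum(axis=1)
        # Attraction: each edge pulls its endpoints together with magnitude distance^2 / k
        for i, j in edges:
            pull = delta[i, j] * dist[i, j] / k
            force[i] -= pull
            force[j] += pull
        # 3. Node movement: step along the force, capped by the current temperature
        length = np.maximum(np.linalg.norm(force, axis=1, keepdims=True), 1e-9)
        pos += force / length * np.minimum(length, temperature)
        # Damping: cool the system so that movements shrink and the layout settles
        temperature *= cooling
    # 4. Iteration: this sketch stops after a fixed number of rounds
    return pos

# Example: a cycle of four nodes
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
positions = force_directed_layout(4, edges)
```

A production implementation would normally test for convergence instead of running a fixed number of iterations, and would avoid the all-pairs repulsion loop on large graphs.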

Techniques:
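1. Fruchterman-Reingold Algorithm:
Balances attractive forces along edges against repulsive forces between all pairs of
nodes, and limits each move with a "temperature" that cools over time. It tends to
distribute nodes evenly, which often reduces edge crossings.
2. Kamada-Kawai Algorithm:
Models the graph as a system of springs whose ideal lengths are the shortest-path
distances between nodes, and minimizes the total energy of that system.
3. Barnes-Hut Algorithm:
An approximation scheme rather than a layout on its own. It uses a quadtree or octree
data structure to approximate the repulsive forces from distant groups of nodes,
making force-directed layouts practical for large graphs.
4. ForceAtlas2 Algorithm:
An improved version of the ForceAtlas algorithm, developed for the Gephi software,
with additional parameters for better control and stability.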

Multidimensional Scaling:-
Multidimensional Scaling (MDS) is a technique used in data visualization to represent the similarity or
dissimilarity between objects in a dataset in a reduced-dimensional space. MDS aims to preserve the
pairwise distances or dissimilarities between data points, providing a visual representation that
reflects the underlying structure of the data.
Key Concepts:
1. Dissimilarity Matrix:
MDS starts with a dissimilarity matrix that represents the pairwise dissimilarities
between data points. This matrix can be computed directly, for example as Euclidean
distances, or derived from similarity measures such as correlation coefficients by
converting them to dissimilarities (for example, 1 - correlation).

2. Stress:
Stress is a measure of how well the distances in the reduced-dimensional space
preserve the original dissimilarities. Lower stress values indicate a better fit. Stress can
be calculated using different formulas, such as Kruskal's stress formula.
3. Configuration:
The configuration is the arrangement of points in the reduced-dimensional space. MDS
aims to find a configuration that minimizes stress, effectively preserving the pairwise
dissimilarities as much as possible.
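
As a rough illustration of the stress concept, the sketch below computes Kruskal's stress-1 for a given configuration. It assumes the metric case, in which the raw dissimilarities are compared with the configuration distances directly; the function name is hypothetical.

```python
import numpy as np

def kruskal_stress(dissimilarities, config):
    """Kruskal's stress-1 between a dissimilarity matrix and a point configuration."""
    # Pairwise Euclidean distances between the configuration points
    diff = config[:, None, :] - config[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))
    # Use each pair once (upper triangle, diagonal excluded)
    upper = np.triu_indices_from(distances, k=1)
    numerator = ((dissimilarities[upper] - distances[upper]) ** 2).sum()
    denominator = (distances[upper] ** 2).sum()
    return float(np.sqrt(numerator / denominator))

# Three points forming a 3-4-5 right triangle, and their exact distances
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.array([[0.0, 3.0, 4.0],
              [3.0, 0.0, 5.0],
              [4.0, 5.0, 0.0]])

perfect = kruskal_stress(D, points)        # configuration reproduces D exactly
shrunk = kruskal_stress(D, 0.5 * points)   # every distance is half of what it should be
```

A stress of zero means the configuration reproduces the dissimilarities exactly; larger values indicate a poorer fit.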

Steps in Multidimensional Scaling:


1. Input Dissimilarity Matrix:
Start with a dissimilarity matrix representing the pairwise dissimilarities between data
points.
2. Double-Centered Matrix:
Square the dissimilarities, then double-center the result (subtract the row and column
means, add back the grand mean, and multiply by -1/2) to obtain a cross-product matrix.
3. Eigenvalue Decomposition:
Perform eigenvalue decomposition on the double-centered matrix to extract
eigenvectors and eigenvalues.
4. Select Dimensions:
Select the eigenvectors with the largest eigenvalues, one for each dimension of the
reduced space.
5. Configuration Matrix:
Construct a configuration matrix by scaling each selected eigenvector by the square root
of its eigenvalue. Each row corresponds to a data point, and each column corresponds
to a dimension.
6. Stress Minimization:
Steps 2 to 5 give the classical MDS solution directly. Metric and non-metric MDS go
further and optimize the configuration to minimize stress by adjusting the positions of
points in the reduced-dimensional space, often using iterative optimization algorithms.
7. Visualization:
Visualize the points in the reduced-dimensional space, where distances between
points approximately reflect the dissimilarities in the original data.
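
Steps 1 to 5 correspond to classical (Torgerson) MDS and can be sketched directly in NumPy. The function name and the example points are assumptions; the iterative stress minimization of step 6 is not included.

```python
import numpy as np

def classical_mds(dissimilarities, n_components=2):
    """Classical MDS: returns an (n_points, n_components) configuration matrix."""
    D = np.asarray(dissimilarities, dtype=float)
    n = D.shape[0]
    # Step 2: double-center the squared dissimilarities
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Step 3: eigendecomposition (eigh suits symmetric matrices, ascending eigenvalues)
    eigvals, eigvecs = np.linalg.eigh(B)
    # Step 4: keep the eigenvectors with the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)   # guard against tiny negative values
    # Step 5: scale each eigenvector by the square root of its eigenvalue
    return eigvecs[:, order] * np.sqrt(top_vals)

# Example: four corners of a 3 x 4 rectangle and their Euclidean distances
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(D, n_components=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

The recovered coordinates may be rotated or mirrored relative to the original points, since only the pairwise distances are preserved.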

The Pulling Under Constraints Model:-


The "Pulling Under Constraints" model is a variant of Multidimensional Scaling (MDS) used in data
visualization. In the context of MDS, the goal is to represent objects in a lower-dimensional space
such that their pairwise dissimilarities are preserved as much as possible. The Pulling Under
Constraints model introduces additional constraints to guide the layout of the points in the reduced
space.

Key features:
1. Objective:
Similar to classical MDS, the primary objective is to find a configuration of points in a
lower-dimensional space that minimizes the stress between the observed
dissimilarities and the distances in the reduced space.
2. Constraints:
The model incorporates additional constraints. These constraints can be used to
impose specific relationships or preferences on the arrangement of points.
3. Optimization:
The optimization process involves adjusting the positions of points in the reduced
space to simultaneously minimize the stress (discrepancy) and satisfy the specified
constraints.
4. Applications:
The Pulling Under Constraints model is particularly useful when there is additional
information or prior knowledge about the relationships between some points. This
information can be incorporated as constraints to guide the layout.
5. Linear Constraints:
Constraints are typically linear equations or inequalities that relate the coordinates of
the points. These constraints might represent known distances, relative positions, or
other geometric relationships.
6. Use Cases:
This model is employed when there is a need to balance the preservation of pairwise
dissimilarities with the incorporation of additional structural constraints.

Bipartite Graphs:-
Bipartite graphs are a type of graph where the set of nodes can be divided into two disjoint sets, and
edges only connect nodes from different sets. In other words, there are no edges between nodes
within the same set. Bipartite graphs are commonly used to represent relationships between two
distinct classes of objects.

4. Graph Algorithms:
■ Apply graph algorithms specifically designed for bipartite graphs, such as the
Hopcroft–Karp algorithm for maximum cardinality bipartite matching.
■ Algorithms can reveal important structural properties of the bipartite graph.
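
As a hedged illustration, the sketch below first checks whether a graph is bipartite by two-coloring it with breadth-first search, and then finds a maximum matching. The matching uses the simple augmenting-path method (often called Kuhn's algorithm) rather than Hopcroft–Karp, which reaches the same result faster on large graphs. The function names and the student/course data are made up for the example.

```python
from collections import deque

def two_color(adjacency):
    """Return {node: 0 or 1} if the undirected graph is bipartite, else None."""
    color = {}
    for start in adjacency:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in color:
                    color[neighbor] = 1 - color[node]
                    queue.append(neighbor)
                elif color[neighbor] == color[node]:
                    return None  # an edge inside one set: not bipartite
    return color

def maximum_matching(left_to_right):
    """Maximum bipartite matching via augmenting paths; returns {left: right}."""
    match_of_right = {}

    def try_assign(u, visited):
        for v in left_to_right[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be moved elsewhere
            if v not in match_of_right or try_assign(match_of_right[v], visited):
                match_of_right[v] = u
                return True
        return False

    for u in left_to_right:
        try_assign(u, set())
    return {u: v for v, u in match_of_right.items()}

# Hypothetical example: students (s*) and the courses (c*) they can take
adjacency = {
    "s1": ["c1", "c2"], "s2": ["c1"], "s3": ["c2", "c3"],
    "c1": ["s1", "s2"], "c2": ["s1", "s3"], "c3": ["s3"],
}
coloring = two_color(adjacency)
triangle = two_color({1: [2, 3], 2: [1, 3], 3: [1, 2]})
matching = maximum_matching({"s1": ["c1", "c2"], "s2": ["c1"], "s3": ["c2", "c3"]})
```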
