DMV Unit-4-1.pdf
Data Wrangling
Hierarchical Indexing:-
● Hierarchical indexing, also known as multi-level indexing, is a powerful feature in pandas, a
popular data manipulation library in Python.
● It allows you to have multiple levels of indices on both the rows and columns of a DataFrame.
● This is particularly useful when dealing with higher-dimensional data that can be naturally
represented in a hierarchical structure.
● In pandas it is implemented by the MultiIndex class, which works with both Series and DataFrame objects.
● Each level of the index corresponds to a different dimension or category of the data.
● Makes it straightforward to select, group, and aggregate data by level, which helps when exploring and plotting hierarchical data.
● Resembles a tree structure, enabling a hierarchy of data representation.
● Code:-
1. Creation:
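import pandas as pd
# Creating a DataFrame with a hierarchical (two-level) row index
data = {'value': [10, 20, 30, 40]}  # sample values, assumed for illustration
index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)],
                                  names=['index_1', 'index_2'])
df = pd.DataFrame(data, index=index)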
4. Visualization:
# Plotting with hierarchical index
df.plot(kind='bar', rot=0)
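The following is a minimal sketch of how a hierarchical index is typically used once it exists. The column name "value" and the numbers are assumptions made for illustration; the index labels follow the creation snippet above.

```python
import pandas as pd

# Hypothetical sample data; the column name "value" is an assumption.
index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("B", 1), ("B", 2)],
    names=["index_1", "index_2"],
)
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=index)

# Select every row under the outer label "A".
rows_a = df.loc["A"]

# Select a single cell by giving one label per level.
single = df.loc[("B", 2), "value"]

# Cross-section: all rows whose inner level equals 1.
inner_1 = df.xs(1, level="index_2")

# Aggregate per outer label.
totals = df.groupby(level="index_1")["value"].sum()

# Move the inner level into the columns (long to wide).
wide = df["value"].unstack("index_2")
```

Each level can be addressed by name, which is why naming the levels at creation time tends to pay off later.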
Combining and Merging Data Sets, Reshaping and Pivoting:-
Combining and merging datasets, along with reshaping and pivoting, are essential operations in data
manipulation, especially when working with relational or structured data.
1. Concatenation:
Combines datasets along a particular axis (rows or columns). In pandas, the concat function
serves this purpose.
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2], axis=0) # Concatenating along rows (axis=0)
2. Merging:
Combines datasets based on a common column (key). The pandas merge function is
commonly used for this.
df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'B'], 'value': [3, 4]})
# Both frames have a 'value' column, so the result holds 'value_x' and 'value_y'
result = pd.merge(df1, df2, on='key') # Merging on the 'key' column
These operations are crucial for preparing data for analysis and visualization. The choice of method
depends on the structure of the original data and the desired format for analysis or presentation.
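The reshaping and pivoting operations named in this section's heading can be sketched as follows. The city names, years, and sales figures are made-up sample data, and the variable names are assumptions chosen for readability.

```python
import pandas as pd

# Hypothetical long-format data: one row per (city, year) observation.
long_df = pd.DataFrame({
    "city": ["Pune", "Pune", "Mumbai", "Mumbai"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [100, 120, 200, 210],
})

# Pivot: long to wide, one row per city and one column per year.
wide_df = long_df.pivot(index="city", columns="year", values="sales")

# Melt: wide back to long, one row per (city, year) pair.
back_to_long = wide_df.reset_index().melt(
    id_vars="city", var_name="year", value_name="sales"
)

# Stack and unstack do the same kind of reshaping through the index levels.
stacked = wide_df.stack()       # Series indexed by (city, year)
unstacked = stacked.unstack()   # back to the wide layout
```

The wide layout is usually easier to read and plot, while the long layout is usually easier to group, filter, and merge.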
Basics of matplotlib:-
Matplotlib is a popular Python library for creating static, interactive, and animated visualizations for
data analysis. The basic steps are:
1. Importing Matplotlib:
Import the matplotlib.pyplot module.
import matplotlib.pyplot as plt
4. Legends:
Add a legend to distinguish between multiple plots.
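# Sample values, assumed here so that the snippet runs on its own
x = [1, 2, 3, 4]
y = [10, 15, 13, 17]
plt.plot(x, y, label='Line 1')
plt.plot(x, [20, 10, 15, 5], label='Line 2')
plt.legend()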
5. Grid:
Display a grid in the plot.
plt.grid(True)
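Put together, the steps above might look like the sketch below. The data values, title, and axis labels are assumptions made for illustration, and the Agg backend is selected only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Sample values assumed for illustration.
x = [1, 2, 3, 4]
y1 = [10, 15, 13, 17]
y2 = [20, 10, 15, 5]

plt.figure()
plt.plot(x, y1, label="Line 1")
plt.plot(x, y2, label="Line 2")
plt.title("Two sample lines")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()      # uses the label given to each plot call
plt.grid(True)
```

In an interactive session, drop the backend line and finish with plt.show(); to write the figure to disk, call plt.savefig with a file name instead.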
2. Scatter Plot:
df.plot(x='x', y='y', kind='scatter')
3. Bar Plot:
df.plot(x='categories', y='values', kind='bar')
4. Histogram:
df['column'].plot(kind='hist', bins=20)
3. Box Plot:
import seaborn as sns
sns.boxplot(x='categories', y='values', data=df)
3. NetworkX:
NetworkX is a library for the creation, analysis, and visualization of complex networks
or graphs. It is particularly useful for visualizing relationships and structures in graph
data.
4. WordCloud:
The WordCloud library is specialized for creating word clouds from text data. It's useful
for visually representing the frequency of words in a corpus.
5. Dash:
Dash is a framework for building interactive web applications with Python. It enables
the creation of dashboards with interactive plots and widgets.
6. Matplotlib Basemap Toolkit:
Matplotlib's Basemap Toolkit allows for the creation of geographic maps and
visualizations.
1. Force-Directed Layout:
● Description:
Nodes repel each other, and edges act as springs, creating a balanced layout.
● Use Case:
Effective for visualizing relationships and clusters in large networks.
2. Hierarchical Layout (Tree Layout):
● Description:
Organizes nodes in a hierarchical tree structure, often used for representing parent-
child relationships.
● Use Case:
Ideal for visualizing organizational structures.
3. Radial Layout:
● Description:
Nodes are placed in concentric circles, and edges connect them radially.
● Use Case:
Suitable for emphasizing a central node or representing radial hierarchies.
4. Grid Layout:
● Description:
Places nodes on a regular grid, simplifying the layout.
● Use Case:
Well-suited for small graphs with a clear structure.
5. Random Layout:
● Description:
Nodes are placed randomly without any specific organization.
● Use Case:
Quick visualization for exploratory analysis but not suitable for conveying specific
patterns.
6. Spatial Layout:
● Description:
Utilizes spatial coordinates (e.g., latitude and longitude) to position nodes.
● Use Case:
Commonly used for geospatial networks or graphs.
7. 3D Layout:
● Description:
Extends 2D layouts into three dimensions for more complex visualizations.
● Use Case:
Useful when additional depth information is needed.
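Several of the layouts listed above are available in NetworkX as functions that return node coordinates. The sketch below assumes a small made-up graph (a hub with five leaves) and covers only the force-directed, radial, circular, and random cases; the dictionary keys are labels chosen for this example.

```python
import networkx as nx

# Small example graph: a hub (node 0) connected to five leaves, plus one extra edge.
G = nx.star_graph(5)   # nodes 0..5, node 0 is the centre
G.add_edge(1, 2)

# Each layout function returns a dict mapping node -> array of (x, y) coordinates.
layouts = {
    "force-directed": nx.spring_layout(G, seed=42),
    "radial/shell": nx.shell_layout(G, nlist=[[0], [1, 2, 3, 4, 5]]),
    "circular": nx.circular_layout(G),
    "random": nx.random_layout(G, seed=42),
}

for name, pos in layouts.items():
    rounded = {node: tuple(round(float(c), 2) for c in xy) for node, xy in pos.items()}
    print(name, rounded)
```

Any of these position dictionaries can be passed to nx.draw(G, pos) to render the graph with Matplotlib. Fixing the seed keeps the force-directed and random layouts reproducible between runs.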
Force-directed Techniques:
Force-directed techniques are widely used in data visualization, especially for representing complex
networks or graphs. These techniques simulate physical forces between nodes to determine their
positions in a layout. The nodes are treated as charged particles that repel one another, while edges
act as springs that pull connected nodes together. The layout is obtained by moving the nodes step by
step until these attractive and repulsive forces roughly balance.
Techniques:
1. Fruchterman-Reingold Algorithm:
Balances attractive forces along edges against repulsive forces between all nodes, which
tends to reduce edge crossings and distribute nodes evenly.
2. Kamada-Kawai Algorithm:
Models the graph as a system of springs whose ideal lengths are the shortest-path
distances between nodes, and minimizes the total energy of that system.
3. Barnes-Hut Algorithm:
Approximates the repulsive forces with a quadtree (2D) or octree (3D) by grouping
distant nodes together. This reduces the cost per iteration from O(n^2) to roughly
O(n log n), making it suitable for large graphs.
4. ForceAtlas2 Algorithm:
An improved version of the ForceAtlas algorithm, with additional parameters for better
control and stability.
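To make the idea of simulated forces concrete, here is a simplified Fruchterman-Reingold-style iteration written with NumPy. It is a sketch under stated assumptions, not a reference implementation: the function name, the starting temperature, the linear cooling schedule, and the example graph (two triangles joined by one edge) are all choices made for illustration.

```python
import numpy as np

def force_directed_layout(n_nodes, edges, iterations=200, seed=0):
    """Simplified Fruchterman-Reingold-style layout from random start positions."""
    rng = np.random.default_rng(seed)
    pos = rng.random((n_nodes, 2))       # random starting positions
    k = np.sqrt(1.0 / n_nodes)           # ideal distance between nodes
    temperature = 0.1                    # caps how far a node moves per step
    cooling = temperature / (iterations + 1)

    for _ in range(iterations):
        # Pairwise displacement vectors and distances.
        delta = pos[:, None, :] - pos[None, :, :]
        dist = np.linalg.norm(delta, axis=-1)
        np.fill_diagonal(dist, 1.0)      # avoid division by zero on the diagonal
        dist = np.maximum(dist, 1e-9)

        # Repulsion between every pair of nodes: magnitude k^2 / d.
        disp = (delta * (k**2 / dist**2)[:, :, None]).sum(axis=1)

        # Attraction along each edge: magnitude d^2 / k.
        for u, v in edges:
            pull = delta[u, v] * dist[u, v] / k
            disp[u] -= pull
            disp[v] += pull

        # Move each node along its net force, limited by the temperature.
        length = np.maximum(np.linalg.norm(disp, axis=1), 1e-9)
        pos += disp / length[:, None] * np.minimum(length, temperature)[:, None]
        temperature -= cooling
    return pos

# Hypothetical example: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
positions = force_directed_layout(6, edges)
```

The pairwise repulsion step costs O(n^2) per iteration, which is the part that a Barnes-Hut approximation is meant to speed up.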
Multidimensional Scaling:-
Multidimensional Scaling (MDS) is a technique used in data visualization to represent the similarity or
dissimilarity between objects in a dataset in a reduced-dimensional space. MDS aims to preserve the
pairwise distances or dissimilarities between data points, providing a visual representation that
reflects the underlying structure of the data.
Key Concepts:
1. Dissimilarity Matrix:
MDS starts with a dissimilarity matrix that represents the pairwise dissimilarities
between data points. This matrix can be derived from distance measures such as
Euclidean distance, or from similarity measures such as correlation coefficients once
they are converted into dissimilarities (for example, 1 minus the correlation).
2. Stress:
Stress is a measure of how well the distances in the reduced-dimensional space
preserve the original dissimilarities. Lower stress values indicate a better fit. Stress can
be calculated using different formulas, such as Kruskal's stress formula.
3. Configuration:
The configuration is the arrangement of points in the reduced-dimensional space. MDS
aims to find a configuration that minimizes stress, effectively preserving the pairwise
dissimilarities as much as possible.
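The three concepts above can be illustrated with a short NumPy sketch. It uses classical (Torgerson) MDS, which finds the configuration by eigendecomposition rather than by minimizing stress directly, and then evaluates Kruskal's stress-1 on the result with the dissimilarities used as targets. The function names and the four-point example are assumptions made for illustration.

```python
import numpy as np

def classical_mds(dissimilarities, n_components=2):
    """Classical (Torgerson) MDS via the double-centred squared-distance matrix."""
    d = np.asarray(dissimilarities, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering        # Gram matrix of the configuration
    eigvals, eigvecs = np.linalg.eigh(b)               # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # keep the largest ones
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

def kruskal_stress(dissimilarities, config):
    """Kruskal's stress-1, using the dissimilarities directly as target distances."""
    d = np.asarray(dissimilarities, dtype=float)
    diff = config[:, None, :] - config[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    upper = np.triu_indices_from(d, k=1)
    return np.sqrt(((dist[upper] - d[upper]) ** 2).sum() / (dist[upper] ** 2).sum())

# Hypothetical example: four points on the corners of a unit square.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dissimilarity = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

config = classical_mds(dissimilarity, n_components=2)
stress = kruskal_stress(dissimilarity, config)
```

Because the example dissimilarities are exact Euclidean distances in two dimensions, the recovered configuration reproduces them up to rotation and reflection, and the stress is close to zero. With real data the stress is usually larger and is used to compare candidate dimensionalities.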
Key features of the Pulling Under Constraints model:
1. Objective:
Similar to classical MDS, the primary objective is to find a configuration of points in a
lower-dimensional space that minimizes the stress between the observed
dissimilarities and the distances in the reduced space.
2. Constraints:
The model incorporates additional constraints. These constraints can be used to
impose specific relationships or preferences on the arrangement of points.
3. Optimization:
The optimization process involves adjusting the positions of points in the reduced
space to simultaneously minimize the stress (discrepancy) and satisfy the specified
constraints.
4. Applications:
The Pulling Under Constraints model is particularly useful when there is additional
information or prior knowledge about the relationships between some points. This
information can be incorporated as constraints to guide the layout.
5. Linear Constraints:
Constraints are typically linear equations or inequalities that relate the coordinates of
the points. These constraints might represent known distances, relative positions, or
other geometric relationships.
6. Use Cases:
This model is employed when there is a need to balance the preservation of pairwise
dissimilarities with the incorporation of additional structural constraints.
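One possible way to combine stress minimization with linear constraints is projected gradient descent: take a gradient step on the raw stress, then project the coordinates back onto the set that satisfies the constraints. The sketch below is an assumption about how such a model could be solved, not a description of how the Pulling Under Constraints model is implemented in any particular tool. The function names, step size, and the example constraint (points 0 and 1 share the same second coordinate) are hypothetical.

```python
import numpy as np

def pairwise_distances(config):
    diff = config[:, None, :] - config[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def raw_stress(target, config):
    """Sum of squared differences between layout distances and target dissimilarities."""
    upper = np.triu_indices_from(target, k=1)
    return float(((pairwise_distances(config)[upper] - target[upper]) ** 2).sum())

def constrained_mds(target, A, b, n_components=2, steps=500, lr=0.01, seed=0):
    """Projected gradient descent on raw stress subject to A @ x = b,
    where x is the configuration flattened row by row."""
    n = target.shape[0]
    rng = np.random.default_rng(seed)
    pinv = A.T @ np.linalg.inv(A @ A.T)

    def project(x):
        # Orthogonal projection onto the affine set {x : A x = b}.
        return x - pinv @ (A @ x - b)

    x = project(rng.standard_normal(n * n_components))
    for _ in range(steps):
        config = x.reshape(n, n_components)
        diff = config[:, None, :] - config[None, :, :]
        dist = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(dist, 1.0)
        dist = np.maximum(dist, 1e-9)
        coeff = (dist - target) / dist
        np.fill_diagonal(coeff, 0.0)
        # Gradient of the raw stress with respect to each point.
        grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
        x = project(x - lr * grad.ravel())
    return x.reshape(n, n_components)

# Hypothetical example: unit-square dissimilarities, with the constraint
# that points 0 and 1 have the same second coordinate (y0 - y1 = 0).
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
target = pairwise_distances(points)
A = np.zeros((1, 8))
A[0, 1], A[0, 3] = 1.0, -1.0   # flattened layout: [x0, y0, x1, y1, ...]
b = np.zeros(1)
config = constrained_mds(target, A, b)
```

The projection keeps the constraint satisfied after every step, so the trade-off between fit and constraints shows up only in the final stress value, which can be inspected with raw_stress(target, config).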
Bipartite Graphs:-
Bipartite graphs are a type of graph where the set of nodes can be divided into two disjoint sets, and
edges only connect nodes from different sets. In other words, there are no edges between nodes
within the same set. Bipartite graphs are commonly used to represent relationships between two
distinct classes of objects.
4. Graph Algorithms:
■ Apply graph algorithms specifically designed for bipartite graphs, such as the
Hopcroft–Karp algorithm for maximum cardinality bipartite matching.
■ Algorithms can reveal important structural properties of the bipartite graph.
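The definition of a bipartite graph and the matching problem mentioned above can be sketched in plain Python. The matching routine below uses the simpler single augmenting-path method (often called Kuhn's algorithm), not Hopcroft-Karp. It finds a matching of the same maximum size, while Hopcroft-Karp reaches it faster by augmenting along many shortest paths per phase. The function names and the applicant/job example are assumptions made for illustration.

```python
from collections import deque

def two_colour(adjacency):
    """Return a dict node -> 0/1 if the graph is bipartite, otherwise None."""
    colour = {}
    for start in adjacency:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in colour:
                    colour[neighbour] = 1 - colour[node]
                    queue.append(neighbour)
                elif colour[neighbour] == colour[node]:
                    return None   # an edge inside one set: not bipartite
    return colour

def maximum_matching(left_nodes, adjacency):
    """Maximum cardinality matching by repeated augmenting paths."""
    match_of_right = {}   # right node -> matched left node

    def try_augment(left, visited):
        for right in adjacency[left]:
            if right in visited:
                continue
            visited.add(right)
            if right not in match_of_right or try_augment(match_of_right[right], visited):
                match_of_right[right] = left
                return True
        return False

    for left in left_nodes:
        try_augment(left, set())
    return {left: right for right, left in match_of_right.items()}

# Hypothetical example: applicants on one side, jobs on the other.
graph = {
    "alice": ["job1", "job2"],
    "bob": ["job1"],
    "carol": ["job2", "job3"],
    "job1": ["alice", "bob"],
    "job2": ["alice", "carol"],
    "job3": ["carol"],
}
colours = two_colour(graph)
applicants = [n for n, c in colours.items() if c == colours["alice"]]
matching = maximum_matching(applicants, graph)
```

The two-colouring doubles as the bipartiteness test: it fails exactly when some edge joins two nodes of the same set, which happens whenever the graph contains an odd cycle.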