DWDV notes
UNIT-1
Syllabus:

Data Wrangling: Need of data cleanup, data cleanup basics – formatting, outliers, duplicates, normalizing and standardizing data.

Data Wrangling:

These days, almost anything can be a valuable source of information. The primary challenge lies in extracting insights from that information and making sense of it, which is the point of Big Data. However, you also need to prep the data first, which is data wrangling in a nutshell.

The nature of the information is that it requires a certain kind of organization to be adequately assessed. This process requires a crystal clear understanding of which operations need what sort of data. Let's look closer at wrangling data and explain why it is so important.

What is Data Wrangling?

Data wrangling is sometimes referred to as data munging. It is the process of transforming and mapping data from one "raw" form into another format to make it more appropriate and valuable for various downstream purposes such as analytics. The goal of data wrangling is to ensure quality, useful data. Data analysts typically spend the majority of their time on data wrangling rather than on the actual analysis of the data.

The process of data wrangling may include further munging, data visualization,
data aggregation, training a statistical model, and many other potential uses.
Data wrangling typically follows a set of general steps, which begin with
extracting the raw data from the data source, "munging" the raw data (e.g.,
sorting) or parsing the data into predefined data structures, and finally
depositing the resulting content into a data sink for storage and future use.

Wrangling the data is usually accompanied by mapping. The term "data mapping" refers to the element of the wrangling process that involves matching source data fields to their respective target data fields. While wrangling is dedicated to transforming data, mapping is about connecting the dots between different elements.
Importance of Data Wrangling

Some may question whether the amount of work and time devoted to data wrangling is worth the effort. A simple analogy helps. The foundation of a skyscraper is expensive and time-consuming to build before the above-ground structure starts, yet this solid foundation is what allows the building to stand tall and serve its purpose for decades. Similarly, once the code and infrastructure foundation for data handling is in place, it delivers results (sometimes almost instantly) for as long as the process remains relevant. Skipping necessary data wrangling steps, however, leads to significant setbacks, missed opportunities, and erroneous models that damage the reputation of analysis within the organization.

Data wrangling software has become an indispensable part of data processing. The primary importance of using data wrangling tools can be described as follows:

o Making raw data usable. Accurately wrangled data guarantees that quality
data is entered into the downstream analysis.
o Getting all data from various sources into a centralized location so it can
be used.
o Piecing together raw data according to the required format and
understanding the business context of data.
o Automated data integration tools are used as data wrangling techniques
that clean and convert source data into a standard format that can be used
repeatedly according to end requirements. Businesses use this
standardized data to perform crucial, cross-data set analytics.
o Cleansing the data of noise and of flawed or missing elements.
o Data wrangling acts as a preparation stage for the data mining process,
which involves gathering data and making sense of it.
o Helping business users make concrete, timely decisions.

Data Wrangling Process

Data wrangling is one of those technical terms that are more or less self-descriptive. The term "wrangling" refers to rounding up information in a certain way. This operation includes a sequence of the following processes:

1. Discovery: Before starting the wrangling process, it is critical to think
about what may lie beneath your data. Think critically about what results
you anticipate from your data and what you will use it for once the
wrangling process is complete. Once you've determined your objectives,
you can gather your data.
2. Organization: After you've gathered your raw data within a particular
dataset, you must structure your data. Due to the variety and complexity
of data types and sources, raw data is often overwhelming at first glance.
3. Cleaning: When your data is organized, you can begin cleaning your
data. Data cleaning involves removing outliers, formatting nulls, and
eliminating duplicate data. It is important to note that cleaning data
collected from web scraping methods might be more tedious than
cleaning data collected from a database. Essentially, web data can be
highly unstructured and require more time than structured data from a
database.
4. Data enrichment: This step requires that you take a step back from your
data to determine if you have enough data to proceed. Finishing the
wrangling process without enough data may compromise insights
gathered from further analysis. For example, investors looking to analyze
product review data will want a significant amount of data to portray the
market and increase investment intelligence.
5. Validation: After determining you gathered enough data, you will need
to apply validation rules to your data. Validation rules, performed in
repetitive sequences, confirm that your data is consistent throughout your
dataset. Validation rules will also ensure quality as well as security. This
step follows similar logic utilized in data normalization, a data
standardization process involving validation rules.
6. Publishing: The final step of the data munging process is data
publishing. Data publishing involves preparing the data for future use.
This may include providing notes and documentation of your wrangling
process and creating access for other users and applications.
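The six steps above can be sketched end to end on a toy dataset. This is a minimal illustration in plain Python, not a production pipeline; all field names and values are hypothetical.

```python
# A minimal sketch of the wrangling steps on a tiny hypothetical dataset.

raw = [
    {"name": "Alice", "age": "34", "city": "Austin"},
    {"name": "Alice", "age": "34", "city": "Austin"},   # duplicate record
    {"name": "Bob",   "age": "",   "city": "boston"},   # missing age, bad case
]

# Organization + cleaning: drop duplicates, fix casing, handle missing values.
seen, cleaned = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key in seen:                      # de-duplicate exact repeats
        continue
    seen.add(key)
    row["city"] = row["city"].title()   # fix structural inconsistencies
    row["age"] = int(row["age"]) if row["age"] else None  # type + nulls
    cleaned.append(row)

# Validation: every record must satisfy simple rules before publishing.
assert all(r["age"] is None or 0 < r["age"] < 120 for r in cleaned)

print(len(cleaned))  # 2 records survive
```

Publishing, in this sketch, would mean writing `cleaned` to a shared store along with notes on the rules applied.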

Use Case of Data Wrangling

Data munging is used for diverse use-cases as follows:

1. Fraud Detection: Using a data wrangling tool, a business can perform the
following:

o Distinguish corporate fraud by identifying unusual behavior through
examining detailed information like multi-party and multi-layered emails
or web chats.
o Support data security by allowing non-technical operators to examine and
wrangle data quickly to keep pace with billions of daily security tasks.
o Ensure precise and repeatable modeling outcomes by standardizing and
quantifying structured and unstructured data sets.
o Enhance compliance by ensuring your business complies with industry
and government standards by following security protocols during
integration.
2. Customer Behavior Analysis: A data-munging tool can quickly give your
business precise insights via customer behavior analysis. It empowers the
marketing team to take ownership of business decisions and make the best of
them. You can use data wrangling tools to:

o Decrease the time spent on data preparation for analysis
o Quickly understand the business value of your data
o Allow your analytics team to utilize the customer behavior data directly
o Empower data scientists to discover data trends via data discovery and
visual profiling

Data Wrangling Tools

There are different tools for data wrangling that can be used for gathering,
importing, structuring, and cleaning data before it can be fed into analytics and
BI apps. You can use automated tools for data wrangling, where the software
allows you to validate data mappings and scrutinize data samples at every step
of the transformation process. This helps to detect and correct errors in data
mapping quickly.

Automated data cleaning becomes necessary in businesses dealing with exceptionally large data sets, where the data team or data scientists are responsible for the cleaning process. In smaller setups, however, non-data professionals are often responsible for cleaning data before leveraging it.

Various data wrangling methods range from munging data with scripts to
spreadsheets. Additionally, with some of the more recent all-in-one tools,
everyone utilizing the data can access and utilize their data wrangling tools.
Here are some of the more common data wrangling tools available.

o Spreadsheets / Excel Power Query – the most basic manual data wrangling tool
o OpenRefine – an automated data cleaning tool that requires programming skills
o Tabula – a tool suited for all data types
o Google DataPrep – a data service that explores, cleans, and prepares data
o Data Wrangler – a data cleaning and transforming tool
o Plotly (data wrangling with Python) – useful for maps and chart data
o CSVKit – converts data

Benefits of Data Wrangling

As previously mentioned, big data has become an integral part of business and finance today. However, the full potential of said data is not always clear. Data processes, such as data discovery, are useful for recognizing your data's potential, but to fully unleash the power of your data, you will need to implement data wrangling. Here are some of the key benefits of data wrangling.

o Data consistency: The organizational aspect of data wrangling yields a
more consistent resulting dataset. Data consistency is crucial for
business operations that involve collecting data input by consumers or
other human end-users. For example, if a human end-user submits
personal information incorrectly, such as creating a duplicate customer
account, that error will consequently distort further performance analysis.
o Improved insights: Data wrangling can provide statistical insights about
metadata by transforming the metadata to be more consistent. These
insights are often the result of increased data consistency, as consistent
metadata allows automated tools to analyze the data faster and more
accurately. For example, if one were to build a model of projected
market performance, data wrangling would clean the metadata to allow
the model to run without errors.
o Cost efficiency: As previously mentioned, because data-wrangling
allows for more efficient data analysis and model-building processes,
businesses will ultimately save money in the long run. For instance,
thoroughly cleaning and organizing data before sending it off for
integration will reduce errors and save developers time.
o Data wrangling helps to improve data usability as it converts data
into a compatible format for the end system.
o It helps to quickly build data flows within an intuitive user
interface and easily schedule and automate the data-flow process.
o Integrates various types of information and sources (like databases,
web services, files, etc.)
o Helps users process very large volumes of data easily and share
data-flow techniques with ease.
Data Wrangling Formats

Depending on the type of data you are using, your final result will fall into four
final formats: de-normalized transactions, analytical base table (ABT), time
series, or document library. Let's take a closer look at these final formats, as
understanding these results will inform the first few steps of the data wrangling
process, which we discussed above.

o Transactional data: Transactional data refers to business operation
transactions. This data type involves detailed subjective information
about particular transactions, including client documentation, client
interactions, receipts, and notes regarding any external transactions.
o Analytical Base Table (ABT): Analytical Base Table data involves data
within a table with unique entries for each attribute column. ABT data is
the most common business data type as it involves various data types that
contribute to the most common data sources. Even more notable is that
ABT data is primarily used for AI and ML, which we will examine later.
o Time-series: Time series data involves data that has been divided by a
particular amount of time or data that has a relation with time,
particularly sequential time. For example, tracking data regarding an
application's downloads over a year or tracking traffic data over a month
would be considered time series data.
o Document library: Lastly, document library data is information that
involves a large amount of textual data, particularly text within a
document. While document libraries contain rather massive amounts of
data, automated data mining tools specifically designed for text mining
can help extract entire texts from documents for further analysis.

Data Wrangling Examples

Data wrangling techniques are used for various use cases. The most commonly
used examples of data wrangling are for:

o Merging several data sources into one data set for analysis
o Identifying gaps or empty cells in data and either filling or removing
them
o Deleting irrelevant or unnecessary data
o Identifying severe outliers in data and either explaining the
inconsistencies or deleting them to facilitate analysis
Businesses also use data wrangling tools to:

o Detect corporate fraud
o Support data security
o Ensure accurate and recurring data modeling results
o Ensure business compliance with industry standards
o Perform Customer Behavior Analysis
o Reduce time spent on preparing data for analysis
o Promptly recognize the business value of your data
o Find out data trends
What is Formatting?

Formatting, in the context of data management, refers to the process of structuring and arranging data to conform to certain rules or guidelines. It is a necessary step in data analysis, ensuring consistent, clean, and ready-to-use data. It serves as the linchpin for operations like data extraction, transformation, and loading (ETL), thereby laying the groundwork for subsequent data analysis and insights.

Functionality and Features

Formatting allows for standardization and normalization of data. It aids in error detection and data cleaning, setting the stage for reliable data analytics. In addition, it supports diverse data types, encompassing structured and unstructured data, facilitating seamless interoperability amongst various data systems.
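As a concrete illustration of the standardization and normalization mentioned above, here is a small sketch using only the Python standard library; the sample values are made up.

```python
# Two common ways to put a numeric column on a shared scale.
from statistics import mean, stdev

values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Min-max normalization rescales values into the [0, 1] interval.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization recenters to mean 0 and unit standard deviation.
mu, sigma = mean(values), stdev(values)
standardized = [(v - mu) / sigma for v in values]

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Normalization is preferable when the bounds matter (e.g., pixel or score ranges); standardization is preferable when outliers or differing units make a mean-centered scale more meaningful.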

Benefits and Use Cases

Formatting provides numerous benefits, including improved data quality, increased efficiency in data processing and analytics, and enhanced compatibility between different systems and platforms. Its uses extend across industries, enabling efficient data analysis for business intelligence, predictive modeling, machine learning algorithms, and more.

Challenges and Limitations

Despite its benefits, formatting comes with challenges, such as handling massive data volumes, managing complex data types, and maintaining data integrity during transformation. In addition, it requires sophisticated tools and technical expertise to manage effectively.

Integration with Data Lakehouse

Formatting plays a vital role in a data lakehouse environment. It facilitates the ingestion of diverse data types into the lakehouse, transforming them into a structured form suitable for querying and analysis. By organizing data effectively in a data lakehouse, formatting operations enable efficient BI reporting, AI modeling, and advanced analytics.

Security Aspects

While handling data formatting, it's critical to consider security. Ensuring data
privacy, access control, and data governance are crucial in the formatting
process. Innovative solutions like Dremio provide built-in data protection
measures, offering robust security during data formatting.

Performance

Efficient formatting significantly impacts data processing performance, allowing for faster queries, smoother ETL processes, and optimized analytics. Dremio's technology excels in this area, providing high-speed data formatting and transformation capabilities.

FAQs

1. What is data formatting? Data formatting is the process of structuring data
according to certain guidelines to facilitate data usage and analysis.
2. Why is data formatting important? Formatting is critical in ensuring data
quality and consistency, enabling efficient data processing, analysis, and
interoperability.
3. How does formatting integrate into a data lakehouse? Formatting assists
with data ingestion into the lakehouse, transforming diverse data types into a
structured form for querying and analytics.
4. What are the challenges in data formatting? The primary challenges
include handling large data volumes, managing complex data types, and
maintaining data integrity during the transformation process.
5. How does Dremio assist with data formatting? Dremio offers high-speed
data formatting and transformation, along with robust security features,
providing a highly performant and secure approach to data formatting.

Glossary

Data Lakehouse: A hybrid architecture that combines the best features of data
lakes and data warehouses.

ETL: Extract, Transform, Load – a process in data warehousing.

Data Formatting: The process of structuring and arranging data according to
certain guidelines or rules.
Data Security: Measures to protect stored data from unauthorized access, data
corruption, or data breaches.

Data Performance: The speed and efficiency with which data can be processed
and analyzed.

Data cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset. When
combining multiple data sources, there are many opportunities for data to be
duplicated or mislabeled. If data is incorrect, outcomes and algorithms are
unreliable, even though they may look correct. There is no one absolute way to
prescribe the exact steps in the data cleaning process because the processes will
vary from dataset to dataset. But it is crucial to establish a template for your
data cleaning process so you know you are doing it the right way every time.

What is the difference between data cleaning and data transformation?

Data cleaning is the process that removes data that does not belong in your
dataset. Data transformation is the process of converting data from one format
or structure into another. Transformation processes can also be referred to as
data wrangling, or data munging, transforming and mapping data from one
"raw" data form into another format for warehousing and analyzing. This article
focuses on the processes of cleaning that data.

How to clean data

While the techniques used for data cleaning may vary according to the types of
data your company stores, you can follow these basic steps to map out a
framework for your organization.

Step 1: Remove duplicate or irrelevant observations

1. Duplicate observations will happen most often during data collection.
When you combine data sets from multiple places, scrape data, or receive
data from clients or multiple departments, there are opportunities to
create duplicate data.
2. De-duplication is one of the largest areas to be considered in this process.
3. For example, if you want to analyze data regarding millennial customers,
but your dataset includes older generations, you might remove those
irrelevant observations. This can make analysis more efficient, minimize
distraction from your primary target, and create a more manageable and
more performant dataset.
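Step 1 can be sketched in a few lines of Python on hypothetical customer records: drop exact duplicates first, then filter out rows outside the target segment (the birth-year range for "millennial" is the commonly cited 1981–1996 and is only illustrative here).

```python
# De-duplicate records, then remove observations irrelevant to the analysis.
customers = [
    {"id": 1, "birth_year": 1990},
    {"id": 1, "birth_year": 1990},   # exact duplicate row
    {"id": 2, "birth_year": 1955},   # irrelevant for a millennial analysis
    {"id": 3, "birth_year": 1987},
]

# Dicts are unhashable, so convert each to a tuple of items for set-based dedup.
unique = [dict(t) for t in {tuple(c.items()) for c in customers}]

# Keep only the segment of interest.
millennials = [c for c in unique if 1981 <= c["birth_year"] <= 1996]
print(len(millennials))  # 2
```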

Step 2: Fix structural errors

1. Structural errors are when you measure or transfer data and notice strange
naming conventions, typos, or incorrect capitalization.
2. These inconsistencies can cause mislabeled categories or classes.
3. For example, you may find “N/A” and “Not Applicable” both appear, but
they should be analyzed as the same category.
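A common way to fix such structural errors is to map every label to a canonical form before analysis; the alias table below is a hypothetical sketch.

```python
# Collapse inconsistent labels ("N/A", "Not Applicable", stray capitalization)
# into one canonical category.
ALIASES = {"n/a": "not applicable", "na": "not applicable"}

def canonical(label: str) -> str:
    key = label.strip().lower()          # fix capitalization and whitespace
    return ALIASES.get(key, key)         # fold known aliases together

print(canonical("N/A") == canonical("Not Applicable"))  # True
```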

Step 3: Filter unwanted outliers

1. Often, there will be one-off observations that, at a glance, do not
appear to fit within the data you are analyzing.
2. If you have a legitimate reason to remove an outlier, like improper data
entry, doing so will help the performance of the data you are working
with.
3. However, sometimes it is the appearance of an outlier that will prove a
theory you are working on.
4. Remember: just because an outlier exists, doesn’t mean it is incorrect.
This step is needed to determine the validity of that number. If an outlier
proves to be irrelevant for analysis or is a mistake, consider removing it.

Step 4: Handle missing data

You can’t ignore missing data because many algorithms will not accept missing
values. There are a few ways to deal with missing data. None of them is
optimal, but all can be considered.

1. As a first option, you can drop observations that have missing values, but
doing this will drop or lose information, so be mindful of this before you
remove them.
2. As a second option, you can impute missing values based on other
observations; again, there is an opportunity to lose integrity of the data
because you may be operating from assumptions rather than actual
observations.
3. As a third option, you might alter the way the data is used to effectively
navigate null values.
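The three options can be demonstrated side by side on a made-up column of ages with `None` standing in for missing values.

```python
# Three ways to handle a column with missing (None) values.
from statistics import mean

ages = [25, None, 31, 40, None, 28]

# Option 1: drop observations with missing values (loses information).
dropped = [a for a in ages if a is not None]

# Option 2: impute missing values from other observations (here, the mean).
fill = mean(dropped)
imputed = [a if a is not None else fill for a in ages]

# Option 3: keep the nulls and make downstream code null-aware instead.
print(len(dropped), imputed[1])  # 4 observations kept; gaps filled with 31
```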
Step 5: Validate and QA

At the end of the data cleaning process, you should be able to answer these
questions as a part of basic validation:

o Does the data make sense?
o Does the data follow the appropriate rules for its field?
o Does it prove or disprove your working theory, or bring any insight to light?
o Can you find trends in the data to help you form your next theory?
o If not, is that because of a data quality issue?

False conclusions because of incorrect or “dirty” data can inform poor business
strategy and decision-making. False conclusions can lead to an embarrassing
moment in a reporting meeting when you realize your data doesn’t stand up to
scrutiny. Before you get there, it is important to create a culture of quality data
in your organization. To do this, you should document the tools you might use
to create this culture and what data quality means to you.

Outliers
Outliers are extreme values that differ from most other data points in a dataset.
They can have a big impact on your statistical analyses and skew the results of
any hypothesis tests.

It’s important to carefully identify potential outliers in your dataset and deal
with them in an appropriate manner for accurate results.

There are four ways to identify outliers:

1. Sorting method
2. Data visualization method
3. Statistical tests (z scores)
4. Interquartile range method

What are outliers?

Outliers are values at the extreme ends of a dataset.

Some outliers represent true values from natural variation in the population.
Other outliers may result from incorrect data entry, equipment malfunctions, or
other measurement errors.

An outlier isn’t always a form of dirty or incorrect data, so you have to be
careful with them in data cleansing. What you should do with an outlier depends
on its most likely cause.

True outliers
True outliers should always be retained in your dataset because these just
represent natural variations in your sample.

Example: True outlier. You measure 100-meter running times for a
representative sample of 560 college students. Your data are normally
distributed with a couple of outliers on either end.
Most values are centered around the middle, as expected. But these extreme
values also represent natural variation, because a variable like running time is
influenced by many other factors.
True outliers are also present in variables with skewed distributions where many
data points are spread far from the mean in one direction. It’s important to
select appropriate statistical tests or measures when you have
a skewed distribution or many outliers.

Other outliers
Outliers that don’t represent true values can come from many possible sources:

o Measurement errors
o Data entry or processing errors
o Unrepresentative sampling

Example: Other outliers. You repeat your running time measurements for a new
sample.
For one of the participants, you accidentally start the timer midway through
their sprint. You record this timing as their running time.
This data point is a big outlier in your dataset because it’s much lower than all
of the other times.
This type of outlier is problematic because it’s inaccurate and can distort
your research results.

Example: Distortion of results due to outliers. You calculate the average
running time for all participants using your data.
The average is much lower when you include the outlier than when you exclude
it. Your standard deviation also increases when you include the outlier, so
your statistical power is lower as well.
In practice, it can be difficult to tell different types of outliers apart. While you
can use calculations and statistical methods to detect outliers, classifying them
as true or false is usually a subjective process.

Four ways of calculating outliers

You can choose from several methods to detect outliers depending on your time
and resources.

o Sorting method
You can sort quantitative variables from low to high and scan for extremely
low or extremely high values. Flag any extreme values that you find.

This is a simple way to check whether you need to investigate certain data
points before using more sophisticated methods.

Example: Sorting method. Your dataset for a pilot experiment consists of 8
values.

180  156  9  176  163  1872  166  171

You sort the values from low to high and scan for extreme values.

9  156  163  166  171  176  180  1872

o Using visualizations
You can use software to visualize your data with a box plot, or a box-and-
whisker plot, so you can see the data distribution at a glance. This type of chart
highlights minimum and maximum values (the range), the median, and the
interquartile range for your data.
Many computer programs highlight an outlier on a chart with an asterisk, and
these points lie beyond the whiskers of the box plot.

o Statistical outlier detection

Statistical outlier detection involves applying statistical tests or procedures to
identify extreme values.

You can convert extreme data points into z scores that tell you how many
standard deviations away they are from the mean.

If a value has a high enough or low enough z score, it can be considered an
outlier. As a rule of thumb, values with a z score greater than 3 or less than –3
are often determined to be outliers.
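The z-score rule of thumb is easy to apply in code. The running times below are synthetic (twenty similar values plus one measurement error), chosen only so the |z| > 3 cutoff actually fires; with very small samples an extreme value often cannot reach |z| = 3 at all.

```python
# Z-score outlier detection with the |z| > 3 rule of thumb.
from statistics import mean, stdev

times = [12.0] * 20 + [50.0]   # synthetic data; 50.0 is an entry error

mu, sigma = mean(times), stdev(times)
z_scores = [(t - mu) / sigma for t in times]

outliers = [t for t, z in zip(times, z_scores) if abs(z) > 3]
print(outliers)  # [50.0]
```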

o Using the interquartile range

The interquartile range (IQR) tells you the range of the middle half of your
dataset. You can use the IQR to create “fences” around your data and then
define outliers as any values that fall outside those fences.

This method is helpful if you have a few values on the extreme ends of your
dataset, but you aren’t sure whether any of them might count as outliers.

Interquartile range method

1. Sort your data from low to high.
2. Identify the first quartile (Q1), the median, and the third quartile (Q3).
3. Calculate your IQR = Q3 – Q1.
4. Calculate your upper fence = Q3 + (1.5 * IQR).
5. Calculate your lower fence = Q1 – (1.5 * IQR).
6. Use your fences to highlight any outliers: all values that fall outside your fences.

Your outliers are any values greater than your upper fence or less than your
lower fence.

Example: Using the interquartile range to find outliers

We’ll walk you through the popular IQR method for identifying outliers using a
step-by-step example.

Your dataset has 11 values. You have a couple of extreme values in your
dataset, so you’ll use the IQR method to check whether they are outliers.

26  37  24  28  35  22  31  53  41  64  29

Step 1: Sort your data from low to high

First, you’ll simply sort your data in ascending order.

22  24  26  28  29  31  35  37  41  53  64

Step 2: Identify the median, the first quartile (Q1), and the third quartile
(Q3)
The median is the value exactly in the middle of your dataset when all values
are ordered from low to high.

Since you have 11 values, the median is the 6th value. The median value is 31.

22  24  26  28  29  31  35  37  41  53  64

Next, we’ll use the exclusive method for identifying Q1 and Q3. This means we
remove the median from our calculations.

Q1 is the value in the middle of the first half of your dataset, excluding the
median. The first quartile value is 26.

22 24 26 28 29

Your Q3 value is in the middle of the second half of your dataset, excluding the
median. The third quartile value is 41.
35 37 41 53 64

Step 3: Calculate your IQR

The IQR is the range of the middle half of your dataset. Subtract Q1 from Q3 to
calculate the IQR.

Formula: IQR = Q3 – Q1
Calculation: Q1 = 26, Q3 = 41
IQR = 41 – 26 = 15

Step 4: Calculate your upper fence

The upper fence is the boundary around the third quartile. Any values exceeding
the upper fence are outliers.

Formula: Upper fence = Q3 + (1.5 * IQR)
Calculation: Upper fence = 41 + (1.5 * 15) = 41 + 22.5 = 63.5

Step 5: Calculate your lower fence

The lower fence is the boundary around the first quartile. Any values less than
the lower fence are outliers.

Formula: Lower fence = Q1 – (1.5 * IQR)
Calculation: Lower fence = 26 – (1.5 * 15) = 26 – 22.5 = 3.5
Step 6: Use your fences to highlight any outliers

Go back to your sorted dataset from Step 1 and highlight any values that are
greater than the upper fence or less than your lower fence. These are your
outliers.

o Upper fence = 63.5
o Lower fence = 3.5

22  24  26  28  29  31  35  37  41  53  64

You find one outlier, 64, in your dataset.
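The six steps can be reproduced in code on the same 11-value dataset. The quartile logic below implements the exclusive method used in the worked example (median removed, middle of each half taken directly); other quartile conventions can give slightly different fences.

```python
# IQR outlier detection, following the worked example's exclusive method.
data = [26, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29]

values = sorted(data)                     # Step 1: sort low to high
n = len(values)
lower_half = values[: n // 2]             # first half, median excluded
upper_half = values[n // 2 + 1 :]         # second half, median excluded
q1 = lower_half[len(lower_half) // 2]     # Step 2: Q1 = 26
q3 = upper_half[len(upper_half) // 2]     #         Q3 = 41
iqr = q3 - q1                             # Step 3: IQR = 15
upper_fence = q3 + 1.5 * iqr              # Step 4: 63.5
lower_fence = q1 - 1.5 * iqr              # Step 5: 3.5
# Step 6: values outside the fences are outliers.
outliers = [v for v in values if v > upper_fence or v < lower_fence]
print(outliers)  # [64]
```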

Dealing with outliers

Once you’ve identified outliers, you’ll decide what to do with them. Your main
options are retaining or removing them from your dataset. This is similar to the
choice you’re faced with when dealing with missing data.

For each outlier, think about whether it’s a true value or an error before
deciding.

o Does the outlier line up with other measurements taken from the same
participant?
o Is this data point completely impossible, or could it reasonably come from
your population?
o What’s the most likely source of the outlier? Is it a natural variation or an
error?

In general, you should try to accept outliers as much as possible unless it’s clear
that they represent errors or bad data.

Retain outliers
Just like with missing values, the most conservative option is to keep outliers in
your dataset. Keeping outliers is usually the better option when you’re not sure
if they are errors.

With a large sample, outliers are expected and more likely to occur. But each
outlier has less of an effect on your results when your sample is large enough.
The central tendency and variability of your data won’t be as affected by a
couple of extreme values when you have a large number of values.
If you have a small dataset, you may also want to retain as much data as
possible to make sure you have enough statistical power. If your dataset ends up
containing many outliers, you may need to use a statistical test that’s more
robust to them. Non-parametric statistical tests perform better for these data.

Remove outliers
Outlier removal means deleting extreme values from your dataset before you
perform statistical analyses. You aim to delete any dirty data while retaining
true extreme values.

It’s a tricky procedure because it’s often impossible to tell the two types apart
for sure. Deleting true outliers may lead to a biased dataset and an inaccurate
conclusion.

For this reason, you should only remove outliers if you have legitimate reasons
for doing so. It’s important to document each outlier you remove and your
reasons so that other researchers can follow your procedures.

What is Data Deduplication?

Data deduplication is a computational technique that removes repeated copies of
data. If the method is successfully used, storage efficiency improves, which can
save capital cost because less storage media is needed overall to fulfill storage
capacity requirements.

Data deduplication is a technique that lowers storage overhead by getting rid of


duplicate data. It guarantees that only one distinct instance of data is kept on a
storage medium, such as disc, flash, or tape. Redundant data blocks are replaced
with a pointer to the unique data copy. Deduplication and incremental backup are
similar in that both copy only the data that has changed since the last backup.

How Does Data Deduplication Work?


 Inline and post-processing deduplication are the two main types of deduplication
techniques. They are designed for different backup situations.
 Inline deduplication analyzes data as it enters the backup system. Redundancies
are detected and eliminated while the data is being written to the backup store.
Inline deduplication needs less backup storage, but it can become a bottleneck, so
it is often advised to disable it for high-performance primary storage activities.
 Post-processing deduplication eliminates redundant data after the data is written
to storage. Any duplicate data that is found is removed and replaced with a
pointer to the first iteration of the data block. Users may rapidly recover the
most recent backup and deduplicate specific workloads using the post-processing
method.
 More storage space is needed for post-processing deduplication than for inline
deduplication.
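The pointer-based scheme described above can be sketched in a few lines, assuming SHA-256 content hashes serve as the pointers (real systems chunk data into fixed- or variable-size blocks and keep the pointer table on disk):

```python
import hashlib

def deduplicate(blocks):
    """Keep one physical copy per unique block; repeats become pointers (hashes)."""
    store = {}      # content hash -> the single stored copy of the block
    pointers = []   # one pointer per logical block, in original order
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:   # first time we see this content: store it
            store[digest] = block
        pointers.append(digest)   # duplicates only add a pointer, not a copy
    return store, pointers

def restore(store, pointers):
    """Rebuild the original block sequence by following the pointers."""
    return [store[p] for p in pointers]

blocks = [b"week1-data", b"week2-data", b"week1-data", b"week1-data"]
store, pointers = deduplicate(blocks)
print(len(blocks), len(store))  # -> 4 2  (4 logical blocks, only 2 stored)
```

Because `restore` reproduces the original sequence exactly, the technique is lossless: only the physical layout changes, not the data.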

Use Cases of Data Deduplication


 Scalable identity resolution: The capacity to store and retrieve individual
data sets in a compressed manner is essential for entity or identity resolution
over big data collections. Deduplication can streamline, accelerate, and
enhance entity resolution procedures.
 Virtual Desktop Infrastructure (VDI): Companies may easily supply their
employees with computers by using Remote Desktop Services and
other VDI servers. An organization may use such technology for a
number of purposes, such as remote access, consolidation, and application
deployment.
 Big data marketing: Businesses that conduct extensive data-collection
marketing initiatives stand to gain a great deal from deduplication. Big
data marketing is ideal for deduplication since it requires the archiving and
storage of all acquired data, allowing for the lossless reduction of file and data
sizes.
 Cloud storage backup: For businesses with large volumes of data stored in the
cloud, cloud storage backups can be quite expensive. By reducing the footprint
of the data being saved, deduplication can result in considerable cost savings.

Advantages of Data Deduplication


 Reduced expenses: Businesses can maximize the use of their storage
equipment by allocating storage more effectively. This can save your company
a significant amount of money because you are not spending as much on
hardware upgrades.
 Improved capacity for backups and storage: Since deduplication stores only
unique data, it frees up more space for backups and drastically
reduces the amount of space required for storage.
 Better data recovery: By eliminating superfluous data from the mix,
deduplication accelerates backup recovery. It helps keep business continuity plans
viable while cutting down on downtime.
 Network optimization: Deduplication optimizes storage locally without
requiring network transmission. This frees the bandwidth needed to
keep the network operating at peak speed, reliability, and performance.

Disadvantages of Duplicate Data


 Inaccurate reporting: Proper reporting necessitates precise and duplicate-free
data. This is hampered by duplicate data. Reports produced from redundant data
are less trustworthy and unsuitable for decision-making.
 Lack of Personalization: For every business, tailoring experiences for
individual customers is crucial. You risk losing clients to other businesses if
you don't take action. Duplicate records can undermine your faith in your data,
which will make personalization challenging to use in your company.
 Storage Costs: Depending on the type of data you keep, duplicate records may
need a lot of space, which might raise storage expenses. Imagine you get an
email attachment of one megabyte that was sent by one hundred employees of
your organization. 100 MB of storage space will be needed to hold 100
instances of the attachment.
 Increases Bandwidth Requirements: Large amounts of network bandwidth
are needed when replicating data across several servers. Data transfers between
servers might put a load on your network and can raise operating expenses.

Difference Between Data Deduplication and Compression


1. Data deduplication lowers storage overhead by getting rid of duplicate data,
while compression encodes, reorganizes, or otherwise alters data to make it smaller.

2. In deduplication, data is grouped according to shared blocks; compression
reduces the size of a data file by removing extraneous data, whitespace, etc.

3. In deduplication, insignificant data loss happens; in compression, data loss is
minimal.

4. Deduplication ratios can range from as low as 4:1 to as high as 20:1, and in
certain cases as high as 200:1; compression typically reduces data size at a ratio
of 2:1 to 2.5:1.

5. In deduplication, hash values and pointers cause significant changes to how the
data is laid out; in compression, the fundamental information does not change.

Normalization vs Standardization

Feature scaling is one of the most important data preprocessing steps in machine
learning. Algorithms that compute distances between features are biased
towards numerically larger values if the data is not scaled.
Tree-based algorithms, by contrast, are fairly insensitive to the scale of the features.
Feature scaling also helps machine learning and deep learning algorithms train and
converge faster.
Normalization and Standardization are the two most popular feature scaling
techniques, and at the same time the most commonly confused ones.
Normalization or Min-Max Scaling is used to transform features to be on a
similar scale. The new point is calculated as:
X_new = (X - X_min)/(X_max - X_min)
This scales the range to [0, 1] or sometimes [-1, 1]. Geometrically speaking, the
transformation squishes the n-dimensional data into an n-dimensional unit
hypercube. Normalization is useful when there are no outliers, as it cannot cope
with them. For example, we would usually scale age but not income, because only
a few people have very high incomes while ages are close to uniformly distributed.
Standardization or Z-Score Normalization is the transformation of features by
subtracting the mean and dividing by the standard deviation. The result is often
called the Z-score.
X_new = (X - mean)/Std
Standardization can be helpful in cases where the data follows a Gaussian
distribution, though this does not necessarily have to be true. Geometrically
speaking, it translates the mean vector of the data to the origin and squishes or
expands the points so that the standard deviation becomes 1. We are only changing
the mean and standard deviation; a normal distribution remains normal, so the
shape of the distribution is not affected.
Standardization is much less affected by outliers because there is no predefined
range for the transformed features.
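Both formulas can be sketched in a few lines of plain Python (scikit-learn's MinMaxScaler and StandardScaler implement the same transformations for arrays); the ages list is just an illustrative sample:

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Normalization: X_new = (X - X_min) / (X_max - X_min), mapped into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: X_new = (X - mean) / std, i.e. the Z-score."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

ages = [18, 22, 25, 30, 40]
print(min_max_scale(ages))  # every value lands in [0, 1]
print(standardize(ages))    # mean ~0, standard deviation ~1
```

Running both on the same list makes the difference visible: min-max output is bounded to [0, 1], while the Z-scores are unbounded but centered at zero with unit spread.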

Difference between Normalization and Standardization


1. Normalization uses the minimum and maximum values of the features for
scaling; standardization uses the mean and standard deviation.

2. Normalization is used when features are on different scales; standardization is
used when we want to ensure zero mean and unit standard deviation.

3. Normalization scales values to [0, 1] or [-1, 1]; standardization is not bounded
to a certain range.

4. Normalization is strongly affected by outliers; standardization is much less
affected by outliers.

5. Scikit-Learn provides a transformer called MinMaxScaler for normalization and
a transformer called StandardScaler for standardization.

6. Normalization squishes the n-dimensional data into an n-dimensional unit
hypercube; standardization translates the mean of the data to the origin and
squishes or expands the points.

7. Normalization is useful when we don't know the distribution of the data;
standardization is useful when the feature distribution is normal (Gaussian).

8. Normalization is often called Min-Max Scaling; standardization is often called
Z-Score Normalization.

DWDV
UNIT-II
Syllabus:
Introduction of visual perception, visual representation of data, Gestalt principles,
information overloads. Creating visual representations, visualization reference
model, visual mapping, visual analytics, Design of visualization applications.

2.1 Introduction of Visual Perception


The role of visual perception

It is said that a picture is worth a thousand words. Why is it that we can understand complex
information on a visual but not from rows of tabular data? The answer to this lies in
understanding visual perception and a little bit about human memory.

What is visual perception?

Visual perception is the ability to interpret the surrounding environment by


processing information that is contained in visible light. The resulting perception is
also known as eyesight, sight, or vision.
How does visual perception affect data visualization?

The main purpose of data visualization is to aid in good decision making. To make good
decisions, we need to be able to understand trends, patterns, and relationships from a visual.
This is also known as drawing insights from data. Now here is the tricky part, we don’t see
images with our eyes; we see them with our brains. The experience of visual perception is in
fact what goes on inside our brains when we see a visual.

There are 3 key points to note:

1. Visual perception is selective. As you can imagine, if we tune our awareness to everything,
we will be very soon overwhelmed. So we selectively pay attention to things that catch our
attention.
2. Our eyes are drawn to familiar patterns. We see what we expect to see. Hence
visualization must take into account what people know and expect.

3. Our working memory is very limited. We will go in depth about memory in a bit, but just
understand that we can hold a very limited amount of information in our memory when
looking at a visual. Data visualization is in many ways an external aid that supports our
working memory.

 Visual perception is the act of seeing a visual or an image. This is handled by visual
cortex located at the rear of the brain. The visual cortex is extremely fast and
efficient.
 Cognition is the act of thinking, of processing information, making comparisons and
examining relationships. This is handled by the cerebral cortex located at the front of
the brain. The cerebral cortex is much slower and less efficient.
 Data visualization shifts the balance between perception and cognition to use our
brain’s capabilities to its advantage. This means more use of visual perception and
lesser use of cognition.

How human memory works and why is this important to visualization?

There are 3 types of memories that process information in our brain-

1. Iconic memory or Sensory memory


2. Working memory
3. Long term memory
The long term memory is where things we memorize or remember are stored. The iconic and
working memories are the ones that interact with visualizations, so let’s look at them in
depth.

Iconic memory or Sensory memory:

 When we see a visual, the information remains in the iconic memory for a tiny period
of time, less than a second.
 We process and store information automatically in this fraction of a second. This
process is called preattentive processing and it happens automatically, even before
we pay attention to the information.
 The preattentive process detects several visual attributes. Hence understanding how to
make a particular attribute stand out can help us create visuals that emphasize on the
more important information.
Working memory or Short term memory:

 This is the memory we use when we are actually working with a visual. The sensory
information that is of interest to us is processed in the working memory.
 Information stays here for about a minute, and the capacity of our working memory is
between 5 and 9 similar items (Miller’s Law).

 The capacity of our working memory can be increased by a process called Chunking,
which is grouping similar items together.
 Data visualizations take advantage of chunking. When information is displayed in
the form of visuals that show meaningful patterns, more information can be chunked
together.
 Hence, when we look at a visual, we can process a great deal more information than
what we can when looking at the data in the form of a table.

2.2 Visual Representation of Data.

Data visualization is the representation of data through use of common graphics, such as
charts, plots, infographics and even animations. These visual displays of information
communicate complex data relationships and data-driven insights in a way that is easy to
understand.
What is Data Visualization?
 Data visualization translates complex data sets into visual formats that are easier for
the human brain to comprehend.
 The primary goal of data visualization is to make data more accessible and easier
to interpret, allowing users to identify patterns, trends, and outliers quickly.
 This is particularly important in the context of big data, where the sheer volume of
information can be overwhelming without effective visualization techniques.

Types of Data for Visualization


Performing accurate visualization of data is very critical to market research where both
numerical and categorical data can be visualized, which helps increase the impact of insights
and also helps in reducing the risk of analysis paralysis. So, data visualization is categorized
into the following categories:
 Numerical Data
 Categorical Data
Let’s understand the visualization of data across all of these categories.

Why is Data Visualization Important?


1. Let’s take an example. Suppose you compile visualization data of the company’s
profits from 2013 to 2023 and create a line chart.
2. It would be very easy to see the line going constantly up with a drop in just 2018. So
you can observe in a second that the company has had continuous profits in all the
years except a loss in 2018.
3. It would not be that easy to get this information so fast from a data table. This is just
one demonstration of the usefulness of data visualization.
Let’s see some more reasons why visualization of data is so important.

1. Data Visualization Discovers the Trends in Data


1. The most important thing that data visualization does is discover the trends in data.
After all, it is much easier to observe data trends when all the data is laid out in front
of you in a visual form as compared to data in a table.
2. For example, the screenshot below on visualization on Tableau demonstrates the sum
of sales made by each customer in descending order.
3. However, the color red denotes loss while grey denotes profits.
4. So it is very easy to observe from this visualization that even though some customers
may have huge sales, they are still at a loss.
5. This would be very difficult to observe from a table.
2. Data Visualization Provides a Perspective on the Data
1. Visualizing Data provides a perspective on data by showing its meaning in the larger
scheme of things.
2. It demonstrates how particular data references stand concerning the overall data
picture.
3. In the data visualization below, the data between sales and profit provides a data
perspective concerning these two measures.
4. It also demonstrates that there are very few sales above 12K and higher sales do not
necessarily mean a higher profit.
3. Data Visualization Puts the Data into the Correct Context
1. It isn’t easy to understand the context of the data without data visualization. Since
context provides the whole circumstances of the data, it is very difficult to grasp by
just reading numbers in a table.
2. In the below data visualization on Tableau, a TreeMap is used to demonstrate the
number of sales in each region of the United States.
3. It is very easy to understand from this data visualization that California has the largest
number of sales out of the total number since the rectangle for California is the
largest.
4. But this information is not easy to understand outside of context without visualizing
data.

4. Data Visualization Saves Time


1. It is definitely faster to gather insights from the data using data visualization
rather than just studying a table of numbers.
2. In the screenshot below on Tableau, it is very easy to identify the states that have
suffered a net loss rather than a profit.
3. This is because all the cells with a loss are coloured red using a heat map, so it is
obvious which states have suffered a loss.
4. Compare this to a normal table where you would need to check each cell to see if it
has a negative value to determine a loss.
5. Visualizing Data can save a lot of time in this situation!

5. Data Visualization Tells a Data Story


1. Data visualization is also a medium to tell a data story to the viewers. The
visualization can be used to present the data facts in an easy-to-understand form while
telling a story and leading the viewers to an inevitable conclusion.
2. This data story, like any other type of story, should have a good beginning, a basic
plot, and an ending that it is leading towards.
3. For example, if a data analyst has to craft a data visualization for company executives
detailing the profits of various products, then the data story can start with the profits
and losses of multiple products and move on to recommendations on how to tackle the
losses.
4. Now that we have understood the basics of data visualization and its importance,
we will discuss its advantages, disadvantages, and the data science pipeline, which
shows how data passes through various checkpoints before it is visualized.

Benefits of data visualization

Data visualization can be used in many contexts in nearly every field, like public policy,
finance, marketing, retail, education, sports, history, and more. Here are the benefits of data
visualization:
 Storytelling: People are drawn to colours and patterns in clothing, arts and culture,
architecture, and more. Data is no different—colours and patterns allow us to visualize the
story within the data.
 Accessibility: Information is shared in an accessible, easy-to-understand manner for a variety
of audiences.
 Visualize relationships: It’s easier to spot the relationships and patterns within a data set when
the information is presented in a graph or chart.
 Exploration: More accessible data means more opportunities to explore, collaborate, and
inform actionable decisions.

Tools for visualizing data

There are plenty of data visualization tools out there to suit your needs. Before committing to
one, consider researching whether you need an open-source site or could simply create a graph
using Excel or Google Charts. The following are common data visualization tools that could
suit your needs.

 Tableau
 Google Charts
 Dundas BI
 Power BI
 JupyteR
 Infogram
 ChartBlocks
 D3.js
 FusionCharts
 Grafana

Types of data visualization

Visualizing data can be as simple as a bar graph or scatter plot but becomes powerful when
analysing, for example, the median age of the United States Congress vis-a-vis the median age
of Americans. Here are some common types of data visualizations:
 Table: A table is data displayed in rows and columns, which can be easily created in a Word
document or Excel spreadsheet.
 Chart or graph: Information is presented in graphical form with data displayed along an x and y
axis, usually with bars, points, or lines, to represent data in comparison. An infographic is a
special type of chart that combines visuals and words to illustrate the data.
o Gantt chart: A Gantt chart is a bar chart that portrays a timeline and tasks specifically used in
project management.
o Pie chart: A pie chart divides data into percentages featured in “slices” of a pie, all adding up
to 100%.
 Geospatial visualization: Data is depicted in map form with shapes and colors that illustrate
the relationship between specific locations, such as a choropleth or heat map.
 Dashboard: Data and visualizations are displayed, usually for business purposes, to help
analysts understand and present data.

Data visualization examples


Using data visualization tools, different types of charts and graphs can be created to illustrate
important data. These are a few examples of data visualization in the real world:
 Data science: Data scientists and researchers have access to libraries using programming
languages or tools such as Python or R, which they use to understand and identify patterns in
data sets. Tools help these data professionals work more efficiently by coding research with
colors, plots, lines, and shapes.
 Marketing: Tracking data such as web traffic and social media analytics can help marketers
analyze how customers find their products and whether they are early adopters or more of a
laggard buyer. Charts and graphs can synthesize data for marketers and stakeholders to better
understand these trends.
 Finance: Investors and advisors focused on buying and selling stocks, bonds, dividends, and
other commodities will analyze the movement of prices over time to determine which are worth
purchasing for short- or long-term periods. Line graphs help financial analysts visualize this
data, toggling between months, years, and even decades.
 Health policy: Policymakers can use choropleth maps, which are divided by geographical area
(nations, states, continents) by colors. They can, for example, use these maps to demonstrate
the mortality rates of cancer or ebola in different parts of the world.

Types of Data Visualization Techniques


Various types of visualizations cater to diverse data sets and analytical goals.
1. Bar Charts: Ideal for comparing categorical data or displaying frequencies, bar charts
offer a clear visual representation of values.
2. Line Charts: Perfect for illustrating trends over time, line charts connect data points to
reveal patterns and fluctuations.
3. Pie Charts: Efficient for displaying parts of a whole, pie charts offer a simple way to
understand proportions and percentages.
4. Scatter Plots: Showcase relationships between two variables, identifying patterns and
outliers through scattered data points.
5. Histograms: Depict the distribution of a continuous variable, providing insights into the
underlying data patterns.
6. Heatmaps: Visualize complex data sets through color-coding, emphasizing variations
and correlations in a matrix.
7. Box Plots: Unveil statistical summaries such as median, quartiles, and outliers, aiding in
data distribution analysis.
8. Area Charts: Similar to line charts but with the area under the line filled, these charts
accentuate cumulative data patterns.
9. Bubble Charts: Enhance scatter plots by introducing a third dimension through varying
bubble sizes, revealing additional insights.
10. Treemaps: Efficiently represent hierarchical data structures, breaking down categories
into nested rectangles.
11. Violin Plots: Violin plots combine aspects of box plots and kernel density plots,
providing a detailed representation of the distribution of data.
12. Word Clouds: Word clouds are visual representations of text data where words are sized
based on their frequency.
13. 3D Surface Plots: 3D surface plots visualize three-dimensional data, illustrating how a
response variable changes in relation to two predictor variables.
14. Network Graphs: Network graphs represent relationships between entities using nodes
and edges. They are useful for visualizing connections in complex systems, such as social
networks, transportation networks, or organizational structures.
15. Sankey Diagrams: Sankey diagrams visualize flow and quantity relationships between
multiple entities. Often used in process engineering or energy flow analysis.
Visualization of data not only simplifies complex information but also enhances decision-
making processes. Choosing the right type of visualization helps to unveil hidden patterns
and trends within the data, making informed and impactful conclusions.

Advantages and Disadvantages of Data Visualization


Advantages of Data Visualization:
 Enhanced Comparison: Visualizing performances of two elements or scenarios
streamlines analysis, saving time compared to traditional data examination.
 Improved Methodology: Representing data graphically offers a superior understanding
of situations, exemplified by tools like Google Trends illustrating industry trends in
graphical forms.
 Efficient Data Sharing: Visual data presentation facilitates effective communication,
making information more digestible and engaging compared to sharing raw data.
 Sales Analysis: Data visualization aids sales professionals in comprehending product
sales trends, identifying influencing factors through tools like heat maps, and
understanding customer types, geography impacts, and repeat customer behaviours.
 Identifying Event Relations: Discovering correlations between events helps businesses
understand external factors affecting their performance, such as online sales surges during
festive seasons.
 Exploring Opportunities and Trends: Data visualization empowers business leaders to
uncover patterns and opportunities within vast datasets, enabling a deeper understanding
of customer behaviours and insights into emerging business trends.

Disadvantages of Data Visualization:


 Can be time-consuming: Creating visualizations can be a time-consuming process,
especially when dealing with large and complex datasets.
 Can be misleading: While data visualization can help identify patterns and relationships
in data, it can also be misleading if not done correctly. Visualizations can create the
impression of patterns or trends that may not exist, leading to incorrect conclusions and
poor decision-making.
 Can be difficult to interpret: Some types of visualizations, such as those that involve
3D or interactive elements, can be difficult to interpret and understand.
 May not be suitable for all types of data: Certain types of data, such as text or audio
data, may not lend themselves well to visualization. In these cases, alternative methods of
analysis may be more appropriate.
 May not be accessible to all users: Some users may have visual impairments or other
disabilities that make it difficult or impossible for them to interpret visualizations. In
these cases, alternative methods of presenting data may be necessary to ensure
accessibility.

Best Practices for Visualization Data


Effective data visualization is crucial for conveying insights accurately. Follow these best
practices to create compelling and understandable visualizations:
1. Audience-Centric Approach: Tailor visualizations to your audience’s knowledge level,
ensuring clarity and relevance. Consider their familiarity with data interpretation and
adjust the complexity of visual elements accordingly.
2. Design Clarity and Consistency: Choose appropriate chart types, simplify visual
elements, and maintain a consistent color scheme and legible fonts. This ensures a clear,
cohesive, and easily interpretable visualization.
3. Contextual Communication: Provide context through clear labels, titles, annotations,
and acknowledgments of data sources. This helps viewers understand the significance of
the information presented and builds transparency and credibility.
4. Engaging and Accessible Design: Design interactive features thoughtfully, ensuring they
enhance comprehension. Additionally, prioritize accessibility by testing visualizations for
responsiveness and accommodating various audience needs, fostering an inclusive and
engaging experience.

Use-Cases and Applications of Data Visualization

1. Business Intelligence and Reporting


In the realm of Business Intelligence and Reporting, organizations leverage sophisticated
tools to enhance decision-making processes. This involves the implementation of
comprehensive dashboards designed for tracking key performance indicators (KPIs) and
essential business metrics. Additionally, businesses engage in thorough trend analysis to
discern patterns and anomalies within sales, revenue, and other critical datasets. These visual
insights play a pivotal role in facilitating strategic decision-making, empowering stakeholders
to respond promptly to market dynamics.
2. Financial Analysis
Financial Analysis in the corporate landscape involves the utilization of visual
representations to aid in investment decision-making. Visualizing stock prices and market
trends provides valuable insights for investors. Furthermore, organizations conduct
comparative analyses of budgeted versus actual expenditures, gaining a comprehensive
understanding of financial performance. Visualizations of cash flow and financial statements
contribute to a clearer assessment of overall financial health, aiding in the formulation of
robust financial strategies.
3. Healthcare
Within the Healthcare sector, the adoption of visualizations is instrumental in conveying
complex information. Visual representations are employed to communicate patient outcomes
and assess treatment efficacy, fostering a more accessible understanding for healthcare
professionals and stakeholders. Moreover, visual depictions of disease spread and
epidemiological data are critical in supporting public health efforts. Through visual analytics,
healthcare organizations achieve efficient allocation and utilization of resources, ensuring
optimal delivery of healthcare services.
4. Marketing and Sales
In the domain of Marketing and Sales, data visualization becomes a powerful tool for
understanding customer behavior. Segmentation and behavior analysis are facilitated through
visually intuitive charts, providing insights that inform targeted marketing strategies.
Conversion funnel visualizations offer a comprehensive view of the customer journey,
enabling organizations to optimize their sales processes. Visual analytics of social media
engagement and campaign performance further enhance marketing strategies, allowing for
more effective and targeted outreach.
5. Human Resources
Human Resources departments leverage data visualization to streamline processes and
enhance workforce management. The development of employee performance dashboards
facilitates efficient HR operations. Workforce demographics and diversity metrics are
visually represented, supporting inclusive practices within organizations. Additionally,
analytics for recruitment and retention strategies are enhanced through visual insights,
contributing to more effective talent management.

Data Visualization in Big Data


In the contemporary landscape of information management, the synergy between data
visualization and big data has become increasingly crucial for organizations seeking
actionable insights from vast and complex datasets. Data visualization, through graphical
representation techniques such as charts, graphs, and heatmaps, plays a pivotal role in
distilling intricate patterns and trends inherent in massive datasets.
 It acts as a transformative bridge between raw data and meaningful insights, enabling
stakeholders to comprehend complex relationships and make informed decisions.
 In tandem, big data, characterized by the exponential growth and diversity of information,
provides the substantive foundation for these visualizations.
As organizations grapple with the challenges and opportunities presented by the sheer
volume, velocity, and variety of data, the integration of data visualization becomes an
indispensable strategy for extracting value and fostering a deeper understanding of complex
information. The marriage of data visualization and big data not only enhances
interpretability but also empowers decision-makers to derive actionable intelligence from the
vast reservoirs of information available in today’s data-driven landscape.

2.3 Gestalt Principles

Gestalt principles for perceptual grouping and figure-ground segregation. From ‘Gestalt
Principles for Attention and Segmentation in Natural and Artificial Vision Systems’ by G.
Kootstra, N. Bergstrom, D. Kragic (2011).
What Are the Gestalt Principles?
Developed by German psychologists, the Gestalt principles—also known as the Gestalt laws
of perceptual organization—describe how we interpret the complex world around us. They
explain why a series of flashing lights appear to be moving, for instance, and why we can
read this sentence: notli ket his ort hat.

The six Gestalt principles or laws are:

1. Law of similarity
2. Law of prägnanz
3. Law of proximity
4. Law of continuity
5. Law of closure
6. Law of common region

History of the Gestalt Principles


Have you noticed how alternately flashing lights, such as neon signs or strands of lights, can
look like a single light that is moving back and forth? This optical illusion is known as the phi
phenomenon. Discovered by German psychologist Max Wertheimer, this illusion of
movement became a basis for Gestalt psychology.

Gestalt psychology focuses on how our minds organize and interpret visual data. It
emphasizes that the whole of anything is greater than its parts.

Based upon this belief, Wertheimer along with Gestalt psychologists Wolfgang Köhler and
Kurt Koffka, developed a set of rules to explain how we group smaller objects to form larger
ones (perceptual organization). They called these rules the Gestalt laws of perceptual
organization.

Law of Similarity
The law of similarity states that similar things tend to appear grouped together. Grouping can
occur in both auditory and visual stimuli.

In the image at the top of this page, for example, you probably see two separate groupings of
colored circles as rows rather than just a collection of dots.

Law of Prägnanz

The law of prägnanz is sometimes called the law of simplicity. This law holds that when
you're presented with a set of ambiguous or complex objects, your brain will make them
appear as simple as possible.

An example of this can be experienced with the Olympic logo. When you look at the logo,
you see overlapping circles rather than an assortment of curved, connected lines.

Law of Proximity
According to the law of proximity, things that are close together seem more related than
things that are spaced farther apart. Put another way, when objects are close to each other,
we also tend to group them together.

To see this Gestalt principle in action, look at the image at the top of the page. The circles on
the left appear to be part of one grouping while those on the right appear to be part of another.
This is due to the law of proximity.

Law of Continuity

The law of continuity holds that points that are connected by straight or curving lines are seen
in a way that follows the smoothest path. In other words, elements in a line or curve seem
more related to one another than those positioned randomly.

Law of Closure
According to the law of closure, we perceive elements as belonging to the same group if they
seem to complete some entity. Our brains often ignore contradictory information and fill in
gaps in information.

In the image at the top of the page, you probably see the shape of a diamond. This is because,
according to this Gestalt principle, your brain fills in the missing gaps in order to create a
meaningful image.

Law of Common Region

The Gestalt law of common region says that when elements are located in the same closed
region, we perceive them as belonging to the same group. What does this mean?

Look at the last image at the top of the page. The circles are right next to each other so that
the dot at the end of one circle is actually closer to the dot at the end of the neighbouring
circle. Despite how close those two dots are, we see the dots inside the circles as belonging
together.
Takeaways
The Gestalt principles help us understand some of the ways in which perception works.
Research continues to offer insights into our perception and how we see the world. These
principles play a role in perception, but it is also important to remember that they can
sometimes lead to incorrect perceptions.

It is also important to recognize that while these principles are referred to as laws of
perceptual organization, they are actually heuristics or shortcuts. Heuristics are usually
designed for speed, which is why our perceptual systems sometimes make mistakes and we
experience perceptual inaccuracies.

2.4 Information Overloads


What Is Information Overload?
Information overload, also known as the information explosion, is when you attempt to take
in too much information at once. It occurs while multitasking, conversing or using the
internet. Learning more about such overload and how to prevent it can help you improve the
way you gather and process data and information.
1. Information overload refers to the feeling of having too much information to process
or pay attention to within a short time span.
2. The development of modern information technology has been a major contributor to
such overload on several fronts, including the volume of content available, the ease of
disseminating it and the size of the audience it has reached.
3. Prioritising information and exercising greater discretion can prevent such overload.
For instance, when you receive an email, you can usually determine from the subject
line whether you need to respond right away or can review it later.

What Causes It?

The different causes behind information explosion are:

More data than is manageable

Any topic you search for online returns millions of results. You rapidly experience an
overload of information when you combine this with the countless regularly published books
and hundreds of e-books that are available for purchase. It can be tough to process all of this
information in a finite time.

Inundation with unwanted information

Social media and email are two major sources of unsolicited information. Users
regularly receive advertisements and spam emails, social media notifications for
feeds they may not even follow, and messages from email groups that are no longer
pertinent to their business. All of this increases the amount of unsolicited
information they must sift through to reach the information they actually want.

Increased information speed

Besides the sheer volume of information, the speed of the information flow is also
becoming difficult to follow. Before you can review a piece of information, a new
update replaces it. Often, you may not even need to read through all the material. It
is important to set limits on how much information you consume in a day.

Reduced information value

The idea that information was valuable formerly served as the foundation for the information
age. The abundance of information that is currently available has affected the perceived
worth of this information. This may be true for all kinds of information since we may not
always be able to determine what is crucial, what is merely redundant and what is useless.

Why Avoid It?


Here are some reasons why it is important to avoid overloading yourself with information:

Boost productivity

Trying to do too many things at once can hinder your capacity to work quickly and
consistently generate high-calibre work. Consider preparing a to-do list of important tasks for
the day before you start working. Try to keep the list unchanged as the day passes. At the end
of the day, you might have completed all the tasks on your list, which may make you feel
satisfied with your productivity.

Gain mental clarity and focus

Too much information can be difficult to process. It may also lead to more distractions since
you may not be clear about your priorities. When you eliminate distractions, lower your stress
level and permit yourself to focus, you can obtain mental clarity.

Improve decision making

Trying to consider too many points at one time can slow down decision-making and possibly
prevent you from finding the best solution. After determining your top priorities, work on
related projects at regular intervals to reduce the strain of excess information. If you
schedule similar tasks next to one another in 30-minute blocks, you are less likely to
become overwhelmed or stuck making decisions because of an excessive amount of information.

Communicate clearly

When you have too many priorities, it can affect your ability to communicate efficiently.
Instead of focusing on all things at the same time, consider performing one task at a time.
This can help you assimilate information at a suitable pace. Finally, getting rid of excess
information can help you focus and feel confident about your duties.

10 Tips for Avoiding Overload

Here are ten tips you can try to avoid information explosion and regain your focus:

1. Modify your information consumption

Though there are some types of information this does not apply to, like emails from your
team leader or notices from your doctor, there are other times you can choose not to consume
information. This includes the time you spend on social media sites, checking news outlets or
reading articles. There may be instances where you can reduce your overall consumption of
information or the types of information you allow yourself to process.

2. Set an information limit for yourself

Setting an information limit means being intentional about the knowledge you gain and
understanding why you do it. Consider tracking details about the information you take in
each week and record how it makes you feel. This can help you understand what has been
contributing to any feelings of information explosion.
3. Consume information with a plan

Understanding what you want to know before you search for data can reduce the data load.
This can help you avoid getting side-tracked and spending too much time reading through
irrelevant information. Adopting this practice can also prevent constant scrolling through
social media feeds or browsing unnecessary pages on the internet.

4. Schedule a time to curate sources

Rather than engaging with each source of knowledge as you see it, consider first making an
itemised list of the sources you intend to visit and consume. Ensure you are not
referring to anything else beyond the list. This can help you avoid stopping on too many
sources that not only take your time but may distract you from the information you require
for processing and understanding.

5. Settle on sources before searching

In your search, start to identify a few reliable sources for each type of information you are
likely to consume on a regular basis. Allow yourself two sources per type of information you
seek. These may be sources that consistently provide reliable information, so you can
eliminate time spent searching and auditing sources and overloading yourself with
unnecessary information.

6. Skim articles

Skimming an article means quickly scanning through it to identify key points and returning to
those for extra details. Many sources organise themselves into headers and bullet points.
Skimming articles saves time and allows you to only consume the information that is most
relevant to you.

7. Subscribe to summary emails or newsletters

Summary emails and newsletters convey a lot of information within a few words. There are
many sources that provide summary emails or newsletters that can save you time. For
example, you can subscribe to a news email that quickly briefs you on recent developments,
rather than reading through several news articles on a topic.

8. Turn off your phone and web notifications

Many phone applications and websites can send notifications directly to your phone's home
screen or overlay them on your workspace. They can also display notifications with a ping.
You might consider disabling these notifications during off-hours from work. For non-work-
related notifications, you can turn them off completely.

9. Use filters and blockers while browsing

You can remove different kinds of information from your computer and browser by
installing a variety of filters and blockers. Use filters to avoid visiting social media
sites, or ban particular websites or languages, to encourage more concentrated work.
Advertisements can disrupt your workflow and break your concentration, which blockers
may be able to prevent.

10. Take a walk outside or practice meditation

If you find it challenging to concentrate or perhaps feel you are being exposed to too much
information, think about finding techniques to relax your mind. You can regain focus by
going for a walk outside without your phone or while listening to music. To focus and regain
a sound mental state, you can also consider engaging in guided or unguided meditation.

2.5 Creating Visual Representations


Creating visual representations of data is essential for communicating insights effectively.
Here are key steps and techniques to help you create impactful visualizations:

Steps to Create Visual Representations

1. Define Your Objective:


o Identify the purpose of the visualization. What question are you trying to
answer? What insights do you want to convey?
2. Know Your Audience:
o Understand who will be viewing the visualization. Tailor the complexity and
style based on their expertise and needs.
3. Choose the Right Type of Visualization:
o Bar Charts: Great for comparing categorical data.
o Line Graphs: Ideal for showing trends over time.
o Scatter Plots: Useful for exploring relationships between two continuous
variables.
o Heat Maps: Effective for displaying data density or intensity across two
dimensions.
o Pie Charts: Best for showing proportions (use sparingly).
4. Gather and Prepare Data:
o Ensure your data is clean and structured. Remove duplicates, handle missing
values, and ensure consistency.
5. Select Tools:
o Use visualization tools that suit your needs. Common options include:
 Tableau: For interactive dashboards.
 Power BI: For business analytics.
 Python (Matplotlib, Seaborn): For customizable visualizations.
 R (ggplot2): For statistical graphics.
 Excel: For quick and easy charts.
6. Design the Visualization:
o Use Color Wisely: Employ color to highlight key data points, but avoid
overwhelming the viewer.
o Label Clearly: Ensure axes, titles, and legends are clear and descriptive.
o Maintain Simplicity: Avoid unnecessary elements that may distract from the
main message.
o Incorporate Interactivity: If applicable, allow users to interact with the data
(e.g., filters, tooltips).
7. Review and Iterate:
o Get feedback on your visualization. Make adjustments based on audience
reactions and clarity.
8. Tell a Story:
o Frame your visualizations within a narrative to guide viewers through your
findings. Use annotations to highlight key insights.
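
The selection logic in step 3 can be sketched as a small heuristic function. The rules and names below are illustrative assumptions, not fixed conventions; real choices also depend on audience and data size.

```python
# Hypothetical sketch of step 3: choosing a chart type from the data's shape.
# The heuristics below are illustrative assumptions, not hard rules.

def suggest_chart(x_type: str, y_type: str, over_time: bool = False) -> str:
    """Suggest a visualization for one x variable and one y variable.

    x_type / y_type are 'categorical' or 'continuous'.
    """
    if over_time:
        return "line graph"        # trends over time
    if x_type == "categorical" and y_type == "continuous":
        return "bar chart"         # comparing categories
    if x_type == "continuous" and y_type == "continuous":
        return "scatter plot"      # relationship between two measures
    if x_type == "categorical" and y_type == "categorical":
        return "heat map"          # density across two dimensions
    return "table"                 # fall back to a plain table

print(suggest_chart("categorical", "continuous"))                 # bar chart
print(suggest_chart("continuous", "continuous", over_time=True))  # line graph
```

A function like this is only a starting point; step 7 (review and iterate) still decides whether the suggested form actually communicates the message.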

2.6. A visualization reference model

A visualization reference model is a model that represents a wide range of data in a cohesive
way, and is used for information visualization. It's a software architecture pattern that models
the process of visualization as a series of steps. The model includes: Collecting source data,
transforming data to appropriate formats, Mapping data to visual representations, and
Supporting view transformation through user interactions.
The result of the process is an interactive visualization that helps users complete tasks or gain
insights from their data.
Ed Chi developed the information visualization reference model in 1999, originally calling it
the data state model.

Reference Model for Visualization: visualization can be described as a mapping of data
to visual form that supports human interaction for making visual sense of the data. The
stages of the mapping are:

Raw Data: idiosyncratic formats
Data Tables: relations (cases by variables) + metadata
Visual Structures: spatial substrates + marks + graphical properties
Views: graphical parameters (position, scaling, clipping, etc.)
A visualization reference model helps structure the process of creating visual representations
of data. It typically outlines the key components and steps involved in effective data
visualization. Here’s a breakdown of a common visualization reference model:

Visualization Reference Model description

1. Data Understanding
o Data Collection: Gather data from various sources (databases, APIs, surveys).
o Data Exploration: Analyze data characteristics, distributions, and patterns to
identify insights.
2. Data Preparation
o Cleaning: Remove duplicates, handle missing values, and correct
inconsistencies.
o Transformation: Normalize data, aggregate values, or derive new metrics as
needed.
o Integration: Combine data from different sources to create a comprehensive
dataset.
3. Choosing Visualization Types
o Purpose-Based Selection: Determine the type of visualization based on the
data and the questions being addressed.
 Comparative: Bar charts, column charts.
 Trends: Line charts, area charts.
 Distribution: Histograms, box plots.
 Relationship: Scatter plots, bubble charts.
 Composition: Pie charts, stacked bar charts.
4. Design Principles
o Clarity: Ensure that visualizations convey information clearly and intuitively.
o Simplicity: Avoid unnecessary complexity; focus on key messages.
o Color Theory: Use color strategically to enhance understanding and maintain
accessibility.
o Typography: Use readable fonts and appropriate sizes for clarity.
5. Interactivity
o Dynamic Elements: Incorporate features like tooltips, filters, and zooming to
allow users to explore data.
o Responsive Design: Ensure visualizations are adaptable to different screen
sizes and devices.
6. Storytelling with Data
o Contextualization: Frame visualizations within a narrative to guide the
audience through insights.
o Annotations: Use text and markers to highlight key points or trends in the
data.
7. Feedback and Iteration
o User Testing: Gather feedback from users to identify areas for improvement.
o Iteration: Refine visualizations based on feedback and changing data needs.
8. Deployment and Sharing
o Publishing: Share visualizations via dashboards, reports, or web applications.
o Collaboration: Enable collaboration and discussion around the visualized
data.
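
The pipeline at the heart of the reference model (Raw Data, Data Tables, Visual Structures, Views) can be illustrated with a toy example. The dataset, field names, and the 100-pixel view size below are assumptions made for the sketch, not part of the model itself.

```python
# Illustrative sketch of the reference-model pipeline:
# Raw Data -> Data Tables -> Visual Structures -> Views.

raw_data = ["Jan,12", "Feb,30", "Mar,21"]      # raw data: idiosyncratic format

# Data Tables: cases by variables, with implicit metadata
# (month is categorical, sales is quantitative).
table = [{"month": m, "sales": int(s)}
         for m, s in (row.split(",") for row in raw_data)]

# Visual Structures: map each case onto a mark with graphical properties.
max_sales = max(r["sales"] for r in table)
marks = [{"mark": "bar", "label": r["month"],
          "height": r["sales"] / max_sales}    # normalized to 0..1
         for r in table]

# Views: apply graphical parameters (here, scale heights to a 100-pixel view).
view = [(m["label"], round(m["height"] * 100)) for m in marks]
print(view)  # [('Jan', 40), ('Feb', 100), ('Mar', 70)]
```

User interactions (the "view transformation" step) would then adjust only the last stage, e.g. rescaling or clipping the view without touching the data tables.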

2.7. Visual Mapping

Mapping Data Properties to Visual Properties

Creating a data visualization involves mapping variables in a dataset onto visual elements in
the data visualization. The structural similarities between the dataset and the visual elements
let us ‘look through’ the visualization to perceive the data structure.

Mapping Data Onto Visual Objects

1. This takes care of the visual object by itself. But how does a mapping work between
the visual properties of this object and the values and relationships – i.e. the structure
– of a dataset?
2. First, notice that many of these visual properties are continuous and, as such, can be
used to represent continuous variables.
3. For example, a single data value of a continuous variable might be mapped on to the
angle of a single short line, or onto the horizontal position of a point.
4. Values of categorical variables can be represented by discrete visual variables – for
example, each month might be represented by a shape with a different number of
sides
5. If we are dealing with fewer data variables than visual variables, we may assign
constant values to those visual variables that are not used (e.g. make all shapes the
same colour if colour is not being used to represent a variable).
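
The mappings in points 2 to 4 can be sketched in a few lines. The chosen visual ranges (an angle from 0 to 90 degrees, shapes with three to five sides) are illustrative assumptions.

```python
# A minimal sketch of mapping data variables onto visual variables.

def linear_map(value, dmin, dmax, vmin, vmax):
    """Map a continuous data value onto a continuous visual value."""
    return vmin + (value - dmin) / (dmax - dmin) * (vmax - vmin)

# Continuous variable -> angle of a short line (0..90 degrees).
temps = [10.0, 20.0, 30.0]
angles = [linear_map(t, 0.0, 40.0, 0.0, 90.0) for t in temps]
print(angles)  # [22.5, 45.0, 67.5]

# Categorical variable -> discrete visual variable
# (here, number of sides of a shape: triangle, square, pentagon).
shape_for = {"Jan": 3, "Feb": 4, "Mar": 5}
print([shape_for[m] for m in ["Feb", "Jan"]])  # [4, 3]
```

The same `linear_map` helper would serve equally for horizontal position, size, or lightness; only the `vmin`/`vmax` range changes.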

Proximity

1. The proximity variable deserves some further consideration, with respect to how it
represents relationships between two or more data variables, as well as relationships
in the dataset more generally.
2. In such a case, we naturally assume that there is a relationship between the values
because they are combined in the same object.
3. However, in the case of the proximity variable, relationships can be represented more
explicitly, either as their own objects (e.g. lines) or as a spatial relationship between
objects (e.g. distance).
4. The proximity property is unique in that it effectively has both a discrete mode and a
continuous mode.
5. In the case of categorical variables, or any variables with discrete values, we can
represent the relationship between the variable values as its own visual object, or set
of objects, with their own specified properties.
6. For example, we might use a set of lines to connect a series of point objects,
indicating that all of these objects are related to a particular value of a categorical
variable.
7. And we might then connect the values in another category using lines with different
visual properties. This is a commonly used strategy when creating line graphs.
8. Alternatively, we might place all of the visual objects connected to a particular value
of a categorical variable in their own square, using different squares for each value of
the variable.

Tables

1. A table is another interesting example of using spatial relationships to indicate data
relationships.
2. Although we might not typically think of it in this light, a table is at least partially a
visualization, because it uses the position of the variable values in the table to indicate
which values are connected to a particular data point (by virtue of the values sharing
the same row), and also to a particular variable (by virtue of the values sharing the
same column).
3. These spatial relationships – and by extension data relationships – are typically made
even more obvious by adding framing grid lines or colors to the table visualization.
4. Here, we can think of the specific data point labels and the specific variable labels
themselves as values of two more fundamental categorical variables – data point and
variable – with every individual value in a dataset describable as being in
relationships with the other dataset values based on the value of these variables.
5. Additional relationships between variables are derived from these more fundamental
relationships.
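
The row and column encoding described above can be sketched directly: each value's row identifies its data point and its column identifies its variable. The dataset and labels below are hypothetical.

```python
# Sketch of a table as a (row, column) position encoding.

dataset = {
    ("Alice", "age"): 34, ("Alice", "score"): 88,
    ("Bob", "age"): 29, ("Bob", "score"): 92,
}

rows = ["Alice", "Bob"]     # values of the fundamental 'data point' variable
cols = ["age", "score"]     # values of the fundamental 'variable' variable

# Position each value by (row index, column index) - the spatial mapping.
cells = {(rows.index(p), cols.index(v)): val
         for (p, v), val in dataset.items()}

print(cells[(0, 1)])  # 88  (Alice's score: row 0, column 1)
print(cells[(1, 0)])  # 29  (Bob's age: row 1, column 0)
```

Grid lines and alternating row colours, as the text notes, simply make this spatial encoding more visible; they add no new data relationships.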

Unconventional Mappings

Data visualization practice has a number of traditional ‘go to’ strategies for setting up the
mappings between data and visual variables, as well as the framing for these mappings. The
result is the common stable of workhorse visualizations we are so familiar with.

For example, when presented with two continuous variables, the default strategy is to
visualize the relationship by mapping these onto the horizontal and vertical positions on the
page, choosing points as the shape, and framing these visual objects using two axes lines,
labelled with information showing the relationship between the horizontal and vertical space
and the values of each variable.

However, there are many other options for representing two continuous variables.

For example, using the data shown in Figure 1, one numeric variable could be represented by
the diameter of a circle, and the other numeric variable could be represented by the shade of
the circle.
To map the variables we carry out a transformation of the data variable values, mapping them
on to the visual variable values. The resulting shapes are framed in a grid. This visualization,
shown in Figure 2, is quite distinct from the traditional scatterplot, but represents the same
information.
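
As a sketch of this idea, the code below computes the diameter and shade each circle would receive in the grid. The four-point dataset, the value ranges, and the 5-to-20-unit diameter range are assumptions for illustration.

```python
# Illustrative sketch of the Figure 2 idea: encode one numeric variable as
# circle diameter and the other as grey shade, framing the circles in a grid.

data = [(2, 10), (4, 30), (6, 20), (8, 40)]   # (var1, var2) pairs

def scale(v, lo, hi):
    return (v - lo) / (hi - lo)               # normalize to 0..1

circles = []
for i, (v1, v2) in enumerate(data):
    circles.append({
        "row": i // 2, "col": i % 2,              # frame: a 2x2 grid
        "diameter": 5 + 15 * scale(v1, 2, 8),     # var1 -> 5..20 units
        "shade": round(scale(v2, 10, 40), 2),     # var2 -> 0 (dark)..1 (light)
    })

print(circles[0])  # {'row': 0, 'col': 0, 'diameter': 5.0, 'shade': 0.0}
print(circles[3])  # {'row': 1, 'col': 1, 'diameter': 20.0, 'shade': 1.0}
```

A renderer would then draw each circle at its grid cell; the encoding carries the same information as a scatterplot of the two variables.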

Figure 3 shows some variations created by visualizing a smaller portion of the dataset (from
Figure 2, the first four rows and columns of the grid). The top left shows the original
representation from Figure 2. On the top right, four, rather than two, visual variables have
been recruited – circle diameter, colour hue, colour saturation and colour lightness. Here,
circle diameter and colour saturation both represent the values of the first variable, and colour
hue and lightness both represent the values of the second variable. The bottom two
visualizations show only the values of the first and second variables respectively, with other
visual variables held constant.

By breaking down the nature of the mappings between dataset variables on the one hand, and
visual variables on the other, I hope to encourage experimentation with the way that these
variables are mapped and combined. This framework should also make it easier to automate
the generation of visualizations, and allow for the generation of novel visualizations by, for
example, randomly generating mappings between these two types of variables. However, that
will be a topic for another blog article.

2.8. What is Visual Analytics:



Definition: Visual Analytics, according to Thomas, J. and Cook, K. in their report
titled Illuminating the Path: Research and Development Agenda for Visual Analytics (2005),
“is the science of analytical reasoning supported by interactive visual interfaces.”

1. Visual Analytics is like using pictures to understand lots of data.


2. It’s like drawing maps, charts, or graphs from this data to spot patterns, trends, or
weird things.
3. It helps people make decisions based on what they can see and understand easily.
4. In simple terms, Visual Analytics turns numbers into pictures that tell a story, making
it easier for people to understand and use the information they have.
5. Visual Analytics combines computer science, statistics, and art to turn large amounts
of complex data into understandable, interactive visuals like charts, maps, and graphs.

Purpose:

1. To put it in simpler terms, visual analytics may be explained as a kind of inquiry in
which data that provides insight into solving a problem is displayed in an interactive,
graphical manner.
2. Visual Analytics can be perceived as an integrated approach that combines
visualization, human factors, and data analysis.
3. Visual Analytics in the context of visualization relates to the areas of Information
Visualization and Computer Graphics, and with respect to data analysis, it benefits
largely from methodologies of information retrieval, data management & knowledge
representation as well as data mining.
4. A Visual Analytic system often uses a specific software dashboard to present
analytics results visually.

For example, the dashboard screens might combine different display components, such as
visual graphs, pie charts, or infographic tools; after the computational algorithms run,
the results are displayed on the screen. The Visual Analytics interface makes it easy for
a human user to understand the results, and also to make changes that further direct the
computer’s algorithmic process.

How Visual Analytics Works

Visual Analytics typically involves four stages:


1. Data Preparation: Data is collected, cleaned, and prepared for analysis.
2. Visual Exploration: Data is visualized using various techniques such as scatter plots,
histograms, and heat maps.
3. Data Analysis: Data is analyzed using machine learning and other analytics tools to
identify patterns and trends.
4. Interpretation: Insights are interpreted and communicated to stakeholders through
interactive dashboards, reports, and presentations.
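
The four stages can be walked through on a toy dataset. The numbers below and the two-standard-deviation outlier rule are assumptions for illustration, not part of any particular tool.

```python
# A toy walk-through of the four Visual Analytics stages in plain Python.
import statistics

raw = [12, 15, None, 14, 13, 16, 11, 95, 14]

# 1. Data Preparation: collect and clean (drop missing values).
clean = [x for x in raw if x is not None]

# 2. Visual Exploration: a real tool would draw a histogram or scatter plot;
# here we just compute the spread such a chart would reveal.
lo, hi = min(clean), max(clean)

# 3. Data Analysis: flag values beyond 2 standard deviations from the mean
# (a simple stand-in for the machine learning step).
mu = statistics.mean(clean)
sd = statistics.pstdev(clean)
outliers = [x for x in clean if abs(x - mu) > 2 * sd]

# 4. Interpretation: communicate the insight to stakeholders.
print(f"range {lo}-{hi}, outliers: {outliers}")
```

In a real system, a human would see the flagged point on a dashboard, decide whether it is an error or a genuine event, and that judgment would steer the next round of analysis, which is exactly the human-in-the-loop aspect the definition emphasizes.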

Visual Analytics is not data visualization. Visual Analytics utilizes Data visualization.

Visual Analytics uses machine learning and other tools to automatically sort through these
datasets and find patterns or trends. But it also relies on human judgment, as people can use
the visuals to explore the data for themselves, asking their own questions and looking for
their own answers.

With Visual Analytics, you can find patterns that you might not see with traditional analysis
because it’s easier to spot these patterns when they’re represented visually. It can help with
anything from tracking sales to predicting the weather, making it a powerful tool for
decision-making.

Benefits of Visual Analytics

1. Faster Insights: By using interactive visualizations, you can quickly identify


patterns, trends, and outliers in data, leading to faster insights.
2. Improved Decision Making: Visual Analytics can help users make better decisions
by providing more accurate and relevant information.
3. Increased Efficiency: By streamlining the data analysis process, Visual Analytics
can save time and resources while improving the quality of insights.
4. Enhanced Collaboration: Visual Analytics enables teams to work together and
share insights more effectively, leading to better outcomes.

Tools and Techniques for Visual Analytics

There are several tools and techniques that can be used for Visual Analytics, including:

1. Tableau: A popular data visualization tool that allows users to create interactive
dashboards and reports.
2. D3.js: A JavaScript library for creating interactive and dynamic visualizations.
3. Python: A popular programming language for data analysis and machine learning,
with libraries such as Pandas and Matplotlib for data visualization.
4. Machine Learning Algorithms: Techniques such as clustering, regression, and
classification can be used to identify patterns and trends in data.

Data visualization and visual analytics are both important tools for understanding
data. However, there are some key differences between the two.
 Data visualization is the process of representing data in a visual format, such as
charts, graphs, or maps. The goal of data visualization is to make data more
understandable and accessible to humans.
 Visual analytics is a more complex process that involves using interactive visual
interfaces to explore, analyze, and understand large and complex datasets. Visual
analytics can be used to identify patterns, trends, and anomalies in data. It can also be
used to make predictions and to support decision-making.

In other words, data visualization is about showing data, while visual analytics is
about understanding data. Here is a table that summarizes the key differences between data
visualization and visual analytics:

Feature | Data Visualization | Visual Analytics
Purpose | To make data more understandable and accessible to humans | To explore, analyze, and understand large and complex datasets
Process | Creating visual representations of data | Using interactive visual interfaces to explore, analyze, and understand data
Tools | Charts, graphs, maps, etc. | Data mining algorithms, statistical analysis, machine learning, etc.
Output | Visual representations of data | Insights into data

2.9 Design of visualization applications


Different applications of data visualisation

1. Healthcare Industries
A dashboard that visualises a patient's history can aid a current or new doctor in
comprehending the patient's health. In the event of an emergency, it can enable faster
care based on the illness. Instead of sifting through hundreds of pages of information,
data visualisation can assist in finding trends.
Health care is a time-consuming procedure, and the majority of it is spent evaluating
prior reports. By boosting response time, data visualisation provides a superior selling
point. It gives metrics that make analysis easier, resulting in a faster reaction time.
2. Business intelligence
When compared to local options, cloud connection can provide the cost-effective
“heavy lifting” of processor-intensive analytics, allowing users to see bigger volumes
of data from numerous sources to help speed up decision-making.
Because such systems can be diverse, comprised of multiple components, and may use
their own data storage and interfaces for access to stored data, additional integrated
tools, such as those geared toward business intelligence (BI), help provide a cohesive
view of an organization's entire data system (e.g., web services, databases, historians,
etc.).
Multiple datasets can be correlated using analytics/BI tools, which allow for searches
using a common set of filters and/or parameters. The acquired data may then be
displayed in a standardised manner using these technologies, giving logical "shaping"
and better comparison grounds for end users.

3. Military
For the military it is a matter of life and death: taking the appropriate action
requires clear, actionable data from which insights can be pulled quickly.
The adversary is present in the field today, as well as posing a danger through digital

warfare and cybersecurity. It is critical to collect data from a variety of sources, both

organised and unstructured. The volume of data is enormous, and data visualisation

technologies are essential for rapid delivery of accurate information in the most

condensed form feasible. A greater grasp of past data allows for more accurate

forecasting.

4. Finance Industries
For exploring/explaining data of linked customers, understanding consumer behaviour,

having a clear flow of information, the efficiency of decision making, and so on, data

visualisation tools are becoming a requirement for financial sectors.


For associated organisations and businesses, data visualisation aids in the creation of

patterns, which aids in better investment strategy. For improved business prospects,

data visualisation emphasises the most recent trends.

5. Data science
Data scientists generally create visualisations for their personal use or to communicate

information to a small group of people. Visualization libraries for the specified

programming languages and tools are used to create the visual representations.

Open source programming languages, such as Python, and proprietary tools built for

complicated data analysis are commonly used by data scientists and academics. These

data scientists and researchers use data visualisation to better comprehend data sets and

spot patterns and trends that might otherwise go undiscovered.

6. Marketing
In marketing analytics, data visualisation is a boon. We may use visuals and reports to

analyse various patterns and trends analysis, such as sales analysis, market research

analysis, customer analysis, defect analysis, cost analysis, and forecasting. These

studies serve as a foundation for marketing and sales.


Visual aids can help your audience grasp your main message by engaging them
visually. The major advantage of visualising data is that it can communicate a point
faster than a plain spreadsheet.

In B2B firms, data-heavy yearly reports and presentations often fail the people
viewing them because they do not engage the audience in a meaningful or memorable
way. Presented as visual statistics, the same facts hold the audience's interest, and
readers are more inclined to act on the findings.

7. Food delivery apps


When you place an order for food on your phone, it is given to the nearest delivery

person. There is a lot of math involved here, such as the distance between the delivery

executive's present position and the restaurant, as well as the time it takes to get to the

customer's location.

Customer orders, delivery location, GPS service, tweets, social media messages, verbal

comments, pictures, videos, reviews, comparative analyses, blogs, and updates have all

become common ways of data transmission.

Users may obtain data on average wait times, delivery experiences, other records,

customer service, meal taste, menu options, loyalty and reward point programmes, and

product stock and inventory data with the help of the data.

8. Real estate business


Brokers and agents seldom have the time to undertake in-depth research and analysis

on their own. Showing a buyer or seller comparable home prices in their

neighbourhood on a map, illustrating average time on the market, creating a sense of

urgency among prospective buyers and managing sellers' expectations, and attracting

viewers to your social media sites are all examples of common data visualisation

applications.
If a chart is difficult to understand, it is likely to be misinterpreted or disregarded. It is

also seen to be critical to offer data that is as current as feasible. The market may not

alter overnight, but if the data is too old, seasonal swings and other trends may be

overlooked.

If you display data in a compelling and straightforward fashion, clients who perceive
that you know the market will be drawn to the graphics, and to you as a broker or
agent.

9. Education
Users may visually engage with data, answer questions quickly, make more accurate,

data-informed decisions, and share their results with others using intuitive, interactive

dashboards.

Dashboards let advisers monitor students' progress throughout the semester and act
quickly with outreach to failing students. By giving end users access to interactive,
self-service analytic visualisations as well as ad hoc visual data discovery and
exploration, they make quick insights accessible to everyone, even those with little
prior experience with analytics.

10. E-commerce
In e-commerce, any chance to improve the customer experience should be taken. The

key to running a successful internet business is getting rapid insights. This is feasible

with data visualisation because crossing data shows features that would otherwise be

hidden.

Your marketing team may use data visualisation to produce excellent content for your

audience that is rich in unique information. Data may be utilised to produce attractive

narrative through the use of infographics, which can easily and quickly communicate

findings.
Patterns may be seen all throughout the data. You can immediately and readily detect

them if you make them visible. These behaviours indicate a variety of consumer trends,

providing you with knowledge to help you attract new clients and close sales.

DWDV

UNIT - III:

Syllabus:
Classification of visualization systems, Interaction and visualization techniques
misleading, Visualization of one, two and multi-dimensional data, text and text
documents.
3.1 Classification of visualization systems

There are several ways to categorize and think about different kinds of visualizations. Here
are four of the most useful.

The first two are unrelated to the others; the last two are related to each other.

(i) Complexity
1. One way to classify a data visualization is by counting how many different data
dimensions it represents.
2. By this we mean the number of discrete types of information that are visually encoded
in a diagram.
3. For example, a simple line graph may show the price of a company's stock on
different days: that's two data dimensions. If multiple companies are shown (and
therefore compared), there are now three dimensions; if trading volume per day is
added to the graph, there are four.

1. The above figure shows four data dimensions in this graph. Adding more points
within any of these dimensions won't change the graph's complexity.
2. This count of the number of data dimensions can be described as the level
of complexity of the visualization.
3. As visualizations become more complex, they are more challenging to design well,
and can be more difficult to learn from.
4. For that reason, visualizations with no more than three or four dimensions of data are
the most common though visualizations with six, seven, or more dimensions can be
found.
5. The way to succeed in the face of this challenge is to be intentional about which
property to use for each dimension, and iterate or change encodings as the design
evolves.
6. The second challenge for designing more complex visualizations is that there are
relatively few well-known conventions, metaphors, defaults, and best practices to rely
on.
7. Because the safety net of convention may not exist, there is more of a burden on the
designer to make good choices that can be easily understood by the reader.

(ii) Infographics versus Data Visualization


1. You may have heard the terms infographics and data visualization used in different
ways, or interchangeably in different contexts, or even casually by the same person in
a single sentence.
2. Some use infographic to refer to representations of information perceived as casual,
funny, or frivolous, and visualization to refer to designs perceived to be more serious,
rigorous, or academic.
3. The truth is, even though the art of representing statistical information visually is
hundreds of years old, the vocabulary of the field is still evolving and settling.
4. Among the general public, there is still confusion over what these two terms mean,
but within the information design community, definitions for these terms are
solidifying.
5. In short: The distinction between infographics and data visualizations (or information
visualizations) is based on both form and origin (see the figure below)
The above figure shows that the difference between infographics and data visualization
may be loosely determined by the method of generation, the quantity of data represented,
and the degree of aesthetic treatment applied.

Infographics

The term infographics is useful for referring to any visual representation of data that is:

 manually drawn (and therefore a custom treatment of the information);


 specific to the data at hand (and therefore nontrivial to recreate with different data);
 aesthetically rich (strong visual content meant to draw the eye and hold interest); and
 relatively data-poor (because each piece of information must be manually encoded).

Data Visualization

By contrast, we suggest that the terms data visualization and information


visualization (casually, data viz and info viz) are useful for referring to any visual
representation of data that is:

 algorithmically drawn (may have custom touches but is largely rendered with the help
of computerized methods);
 easy to regenerate with different data (the same form may be repurposed to represent
different datasets with similar dimensions or characteristics);
 often aesthetically barren (data is not decorated); and
 relatively data-rich (large volumes of data are welcome and viable, in contrast to
infographics).
Data visualizations are initially designed by a human, but are then drawn algorithmically with
graphing, charting, or diagramming software. The advantage of this approach is that it is
relatively simple to update or regenerate the visualization with more or new data. While they
may show great volumes of data, information visualizations are often less aesthetically rich
than infographics.

(iii) Exploration versus Explanation

There are two categories of data visualization: exploration and explanation. The two serve
different purposes, and so there are tools and approaches that may be appropriate only for one
and not the other.

For this reason, it is important to understand the distinction, so that you can be sure you are
using tools and approaches appropriate to the task at hand.

Exploration

1. Exploratory data visualizations are appropriate when you have a whole bunch of data
and you're not sure what's in it.
2. When you need to get a sense of what's inside your data set, translating it into a visual
medium can help you quickly identify its features, including interesting curves, lines,
trends, or anomalous outliers.
3. Exploration is generally best done at a high level of granularity. There may be a
whole lot of noise in your data, but if you oversimplify or strip out too much
information, you could end up missing something important.
4. This type of visualization is typically part of the data analysis phase, and is used to
find the story the data has to tell you.

Explanation

1. By contrast, explanatory data visualization is appropriate when you already know


what the data has to say, and you are trying to tell that story to somebody else. It
could be the head of your department, a grant committee, or the general public.
2. Whoever your audience is, the story you are trying to tell (or the answer you are
trying to share) is known to you at the outset, and therefore you can design to
specifically accommodate and highlight that story.
3. In other words, you’ll need to make certain editorial decisions about which
information stays in, and which is distracting or irrelevant and should come out.
4. This is a process of selecting focused data that will support the story you are trying to
tell.
5. If exploratory data visualization is part of the data analysis phase, then explanatory
data visualization is part of the presentation phase.
6. Such a visualization may stand on its own, or may be part of a larger presentation,
such as a speech, a newspaper article, or a report. In these scenarios, there is some
supporting narrative written or verbal that further explains things.

(iv) Informative versus Persuasive versus Visual Art


1. There are three main categories of explanatory visualizations based on the
relationships between the three necessary players: the designer, the reader, and the
data.
2. This section discusses designing visualizations of data with known parameters and stories.

The Designer-Reader-Data Trinity

1. It is useful to think of an effective explanatory data visualization as being supported


by a three-legged stool consisting of the designer, the reader, and the data.
2. Each of these legs exerts a force, or contributes a separate perspective, that must be
taken into consideration for a visualization to be stable and successful
3. Each of the three legs of the stool has a unique relationship to the other two.
4. While it is necessary to account for the needs and perspective of all three in each
visualization project, the dominant relationship will ultimately determine which
category of visualization is needed, see Figure below

Figure: The nature of the visualization depends on which relationship (between two of the
three components) is dominant.

Informative
1. An informative visualization primarily serves the relationship between the reader and
the data. It aims for a neutral presentation of the facts in such a way that will educate
the reader (though not necessarily persuade him).
2. Informative visualizations are often associated with broad data sets, and seek to distill
the content into a manageably consumable form.
3. Ideally, they form the bulk of visualizations that the average person encounters on a
day-to-day basis, whether that's at work, in the newspaper, or on a service provider's
website. The Burning Man infographic is an example of informative visualization.

Persuasive

1. A persuasive visualization primarily serves the relationship between the designer and
the reader.
2. It is useful when the designer wishes to change the reader's mind about something.
3. It represents a very specific point of view, and advocates a change of opinion or
action on the part of the reader.
4. In this category of visualization, the data represented is specifically chosen for the
purpose of supporting the designers point of view, and is presented carefully so as to
convince the reader of same. See also: propaganda.
5. A good example of persuasive visualization is the Joint Economic Committee
minority's rendition of the proposed Democratic health care plan in 2010.

Visual Art

1. The third category, visual art, primarily serves the relationship between the designer
and the data.
2. Visual art is unlike the previous two categories in that it often
entails unidirectional encoding of information, meaning that the reader may not be
able to decode the visual presentation to understand the underlying information.
3. Whereas both informative and persuasive visualizations are meant to be easily
decodable (bidirectional in their encoding), visual art merely translates the data into a
visual form.
4. The designer may intend only to condense it, translate it into a new medium, or make
it beautiful; she may not intend for the reader to be able to extract anything from it
other than enjoyment.
5. This category of visualization is sometimes more easily recognized than others. For
example, Nora Ligorano and Marshall Reese designed a project that converts Twitter
streams into a woven fiber-optic tapestry.
6. A worthy pursuit in its own right, perhaps, but better clearly labelled as visual art,
and not confused with informative visualization.

3.2 Interaction and visualization techniques misleading


Data visualization is a critical aspect of data analysis, as it helps organizations make sense of
large amounts of data and gain insights that are not immediately obvious. However, data
visualization can also be misleading if not done correctly. Misleading data visualizations can
lead to incorrect conclusions, misinterpretations, and ultimately, poor decision-making.

Understanding the factors that contribute to misleading data visualizations is critical for
organizations that want to gain meaningful insights from their data and make informed
decisions. By avoiding these examples of misleading data visualization, organizations can
ensure that their data visualizations are accurate, meaningful, and actionable.

We will explore five common examples of misleading data visualization and provide
guidelines for avoiding these pitfalls. Data visualization is the graphical representation
of data in the form of charts, graphs, maps, and other interactive visual elements. The
purpose of data visualization is to help users understand, analyze, and communicate data
insights more effectively.

By converting raw data into a visual format, data visualization enables users to identify
patterns, trends, and relationships in the data, making it easier to identify key insights and
make informed decisions.
Data Visualization Dashboard

A data visualization dashboard is a visual display of data that provides real-time insights
into business performance and trends. The goal of a dashboard is to present data in a way that
is easy to understand, meaningful, and actionable.

Some common features of a data visualization dashboard include:

1. Dashboards display real-time data updates to give users an up-to-date view of the business.
2. Dashboards often include interactive features, such as drill-down and drill-up capabilities, to allow
users to explore the data more deeply.
3. Dashboards can be customized to display the data that is most important to the user, such as specific
metrics, KPIs, or business goals.
4. Dashboards often include multiple visualizations, such as bar charts, line charts, pie charts, and
tables, to provide a comprehensive view of the data.
5. Dashboards often include data filtering capabilities, such as date ranges and other filters, to allow
users to view specific subsets of the data.
6. Dashboards should be designed to be accessible to all users, including those with disabilities, to
ensure that everyone can gain insights from the data.
7. Dashboards should be optimized for viewing on mobile devices to allow users to access the data
from anywhere, at any time.

Below are some of the most common examples of misleading visualizations and how they can be
avoided:

1. Truncated Y-Axis

A truncated Y-axis is a common mistake in data visualization where the scale of the Y-axis is
artificially shortened to make changes in the data appear more significant. This can lead to
misleading visualizations and incorrect conclusions.

Example:

For example, if a Y-axis is truncated, the differences displayed are overstated or
understated, which directly affects the user's answer to "How much bigger do you
think Y is than X?" This can give the impression that the changes in the data are
more significant than they really are.

Solution:

To avoid this, it is important to use an appropriate scale for the Y-axis that accurately reflects
the data. This means that the Y-axis should be wide enough to show all relevant changes in
the data, regardless of how small they may seem. Additionally, organizations should consider
using annotations and other contextual information, such as error bars or confidence intervals.
By avoiding truncated Y-axis, organizations can ensure that their data visualizations are
accurate, meaningful, and actionable.
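The exaggeration can be seen with simple arithmetic. In this sketch (toy numbers, and an `apparent_ratio` helper of our own), two bars that differ by only 4% look three times apart when the axis starts at 98 instead of 0:

```python
def apparent_ratio(a, b, axis_min):
    """Ratio of the drawn bar heights when the Y-axis starts at axis_min."""
    return (b - axis_min) / (a - axis_min)

# Two bars with true values 100 and 104: a 4% real difference.
full = apparent_ratio(100, 104, axis_min=0)        # axis starts at 0
truncated = apparent_ratio(100, 104, axis_min=98)  # axis truncated to start at 98

print(full)       # 1.04 -> the bars look nearly equal
print(truncated)  # 3.0  -> the second bar looks three times taller
```

The data has not changed; only the mapping from value to bar height has.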

2. Cherry-Picking Data

Cherry-picking data is the act of selecting only the data that supports a desired conclusion
while ignoring or downplaying data that contradicts it. This is a common mistake in data
visualization and can lead to misleading visualizations and incorrect conclusions. It is
important to consider the context and limitations of the data when creating a visualization.

Example:

This graphic is particularly misleading because of how pronounced the lines


are. Additionally, even though the results seem to be given as a percentage, not all of them
add up to 100. As a result, this picture is out of proportion and provides a poor depiction of
the available data.

Solution:

To avoid cherry-picking data, it is important to consider all relevant data when creating a
visualization. This includes data that supports and data that contradicts the desired
conclusion. By including all relevant data, organizations can ensure that their visualizations
accurately reflect the full picture. Finally, organizations should consider using appropriate
statistical methods, such as regression analysis or hypothesis testing, to ensure that their
visualizations are accurate and not influenced by outliers or other factors.
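One scriptable guard against the "percentages don't add up to 100" problem described above is to validate the slices before charting them. A minimal sketch with toy numbers chosen so the check fails:

```python
# Sanity check before charting percentage data: shares should sum to ~100.
shares = {"Option A": 60, "Option B": 26, "Option C": 10}

total = sum(shares.values())
consistent = abs(total - 100) <= 0.5  # small tolerance for rounding

print(total, consistent)  # 96 False -> do not present these as percentages
```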

3. Dueling Data

Dueling data refers to the practice of comparing two or more sets of data in a way that creates
a misleading or incorrect conclusion. This can occur when data is presented in a way that
gives an unfair advantage to one set of data over the other.

Example:

Dueling data can occur when different sets of data are plotted on different scales, or when
one set of data is highlighted or emphasized while the other is not, as in the example above.
The findings demonstrate an increase in abortions and a decrease in cancer-related health
treatments. This misleading image only depicts a vague trend or pattern without any
meaningful context and lacks any values on its axes. This can give a distorted picture of the
relationship between the data sets and lead to incorrect conclusions.
Solution:

To avoid dueling data, it is important to present data in a fair and unbiased way. This can
include using the same scales and axes for all sets of data and providing equal emphasis and
attention to all data sets. Additionally, organizations should consider using appropriate
statistical methods, such as regression analysis or hypothesis testing, to ensure that their data
visualizations are not influenced by outliers or other factors that may distort the relationship
between the data sets.

4. Using The Wrong Chart Type

Using the wrong chart type is a common mistake in data visualization that can lead to
misleading or incorrect conclusions. Different chart types are designed to visualize different
types of data and relationships, and using the wrong chart type can result in a distorted or
inaccurate picture of the data.

Example:

For example, using a bar chart to display continuous data or using a pie chart to display a
large number of categories can result in a confusing or misleading visualization.

Solution:

To avoid using the wrong chart type, it is important to carefully consider the data and the
relationship that needs to be visualized. Additionally, organizations should consider using
multiple chart types to visualize different aspects of the data, such as using a bar chart to
show the distribution of a categorical variable and a line chart to show changes in a
continuous variable over time.
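As a rough illustration of that decision process, here is a hypothetical helper (the rules and the function name are our own, not a standard API) that maps data characteristics to a chart type:

```python
def suggest_chart(var_type, n_categories=None, over_time=False):
    """Hypothetical rule-of-thumb mapping from data shape to chart type."""
    if var_type == "continuous":
        # Trends over time suit lines; plain distributions suit histograms.
        return "line chart" if over_time else "histogram"
    if var_type == "categorical":
        # Pie charts become unreadable with many slices; prefer bars.
        return "bar chart" if n_categories and n_categories > 5 else "pie or bar chart"
    return "table"

print(suggest_chart("continuous", over_time=True))    # line chart
print(suggest_chart("categorical", n_categories=12))  # bar chart
```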

5. Correlation VS Causation

Correlation and causation are two important concepts in data analysis. Correlation refers to a
statistical relationship between two variables, indicating that as one variable changes, the
other variable also changes. Causation, on the other hand, refers to a causal relationship
between two variables, indicating that a change in one variable directly causes a change in the
other variable. It is important to understand the difference between correlation and causation
because confusing the two can lead to incorrect conclusions and misleading visualizations.

Example:

For example, a strong correlation between two variables does not necessarily imply causation
and vice versa. To ensure that data visualizations accurately reflect the relationship between
variables, it is important to carefully consider the data and to consider other potential factors
that may influence the relationship. This can include using regression analysis or hypothesis
testing to test for causal relationships. Additionally, organizations should always consider the
context and limitations of the data when creating visualizations and drawing conclusions.
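The distinction is easy to demonstrate with a classic toy example: ice-cream sales and drowning incidents correlate strongly (both track summer weather) without either causing the other. A stdlib-only sketch of the correlation side, with made-up monthly figures:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Population Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Toy monthly figures: both rise with summer temperature.
ice_cream = [20, 35, 50, 70, 90]
drownings = [2, 4, 5, 8, 10]

# Strong positive correlation (close to 1), yet neither causes the other.
print(round(pearson(ice_cream, drownings), 2))
```

A chart of these two series would show them moving in lockstep; only domain knowledge (the shared summer driver) reveals that the causal story is elsewhere.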

Following the fundamentals of data visualization is the only way we can make
sure effective data visualization has been achieved.

1. Understand The Data:

To avoid misleading visualizations, it’s important to have a good understanding of the data.
This includes understanding the structure, types, and distribution of the data. This will help
you choose the right type of visualization, scales, and axis labels that accurately represent the
data.

2. Choose The Right Type of Visualization:

The type of visualization used should match the type of data and the message that needs to be
conveyed. For example, bar charts are often used for comparing quantities, while line charts
are often used to show trends over time.

3. Use Appropriate Labels:

Using appropriate scales and axis labels is critical to accurately represent the data. For
example, using a logarithmic scale instead of a linear scale can make it difficult to accurately
compare data.
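For instance, equal spacing on a logarithmic axis encodes multiplicative change, which a reader expecting a linear axis will misjudge. A quick sketch:

```python
import math

ticks = [1, 10, 100, 1000]
positions = [math.log10(t) for t in ticks]  # where each tick lands on a log axis
gaps = [round(b - a, 10) for a, b in zip(positions, positions[1:])]

# Every gap is the same size on the page, but each step is a factor of 10.
print(gaps)
```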

4. Provide Context And Annotations:

Adding contexts such as annotations, captions, and reference lines can help users understand
the data and its significance.

5. Test And Iterate:

It’s important to test and iterate the visualization to make sure it effectively conveys the
desired message. Get feedback from the audience and make necessary changes.

6. Consider Accessibility:

Make sure the visualization is accessible to all users, including those with disabilities. This
can be done by using clear, concise text, appropriate colors, and avoiding clutter.
7. Use A Large Sample Pool:

Using a small sample size can lead to inaccurate representations of the data and can lead to
incorrect conclusions.

8. Avoid Cherry-Picking Data:

Don’t try to fit a preconceived narrative or to show a desired outcome. This can lead to
misleading visualizations and incorrect conclusions.

9. Consider Outliers:

In data visualization, outliers can have a significant impact on the overall picture that is
presented. To represent the data accurately, plot it and look for points that are significantly
different from the rest. Once outliers have been identified, consider how to handle them in
the visualization.
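A common way to identify such points before deciding how to display them is the 1.5 × IQR rule, sketched here with the stdlib on toy data:

```python
from statistics import quantiles

data = [12, 13, 13, 14, 15, 15, 16, 17, 18, 95]  # 95 looks suspicious

q1, _, q3 = quantiles(data, n=4)  # first and third quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [low, high] is flagged for a closer look.
outliers = [x for x in data if x < low or x > high]
print(outliers)
```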

10. Use DNR's Reporting Tool:

DotNet Report is a reporting tool that allows organizations to create, customize, and embed
reports in their applications. It provides several features and tools to help organizations
avoid misleading data visualizations.

3.3 Visualization of one, two and multi-dimensional data


Visualizing one, two, and multi-dimensional data involves different techniques depending on
the complexity and nature of the data. Below are examples of how data with varying
dimensions can be visualized effectively.

1. One-Dimensional Data Visualization

One-dimensional data consists of a single variable or attribute. It is the simplest type of data
and is often visualized in ways that allow us to see the distribution, frequency, or trends of a
single variable.

Common Visualizations:

 Histograms: Shows the distribution of a single numeric variable by dividing data into
intervals (bins) and displaying the frequency of observations in each bin.
o Use Case: Visualizing the distribution of exam scores.
 Line Charts: Used to display data points connected by straight lines. Typically used for time-
series data.
o Use Case: Tracking stock prices or temperature over time.
 Bar Charts: Represents categorical data with rectangular bars. Each bar's length represents the
value of a particular category.
o Use Case: Showing sales of different product categories.

Examples:

 Height Distribution of People: Visualized using a histogram.


 Daily Temperature: Visualized using a line chart.
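To make the binning behind a histogram concrete, here is a minimal stdlib sketch (toy exam scores; the bin width of 10 is our own choice):

```python
from collections import Counter

scores = [55, 62, 67, 71, 74, 78, 81, 83, 88, 91, 95]
bin_width = 10

# Bucket each score by the lower edge of its bin, then count per bin.
bins = Counter((s // bin_width) * bin_width for s in scores)

# Crude text histogram: one '#' per observation in the bin.
for start in sorted(bins):
    print(f"{start}-{start + bin_width - 1}: {'#' * bins[start]}")
```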

2. Two-Dimensional Data Visualization

Two-dimensional data consists of two variables or attributes, often referred to as bivariate


data. The goal is to display the relationship between two variables.

Common Visualizations:

 Scatter Plot: Plots two numerical variables as points on an x-y coordinate plane to show
correlations or relationships.
o Use Case: Plotting the relationship between hours studied and exam scores.

 Heatmaps: Uses color to represent the values of two variables in a grid format. Often used for
showing the intensity or concentration of values.
o Use Case: Visualizing correlations between multiple variables.
 Bubble Chart: An extension of a scatter plot, where the size of the bubble represents a third
variable.
o Use Case: Plotting the relationship between population size and GDP, with bubble
size representing life expectancy.

 Bar Plot with Two Axes: Shows two variables where one axis represents categories and the
other represents the numerical values.
o Use Case: Comparing the revenue and profit of companies.

Examples:

 Weight vs. Height: Visualized using a scatter plot to show the relationship between these two
variables.
 Temperature Across Regions: Visualized using a heatmap where regions are on one axis and
time on another.
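The core idea of a scatter plot, mapping each (x, y) pair to a position in the plane, can be sketched without any plotting library using a small character grid (toy points, e.g. hours studied vs. a score band):

```python
points = [(1, 1), (2, 3), (3, 2), (4, 4)]  # toy (x, y) observations
width, height = 5, 5

grid = [[" "] * width for _ in range(height)]
for x, y in points:
    grid[height - 1 - y][x] = "*"  # flip y so larger values sit higher

for row in grid:
    print("".join(row))
```

Real scatter plots add axes, scales, and labels, but the positional encoding shown here is the whole trick.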

3. Multi-Dimensional Data Visualization


Multi-dimensional data, often referred to as high-dimensional data, consists of more than
two variables. Visualizing such data requires more advanced techniques to represent multiple
relationships simultaneously.

Common Visualizations:

 Parallel Coordinates Plot: Used for visualizing high-dimensional data by drawing


each data point as a line that crosses multiple parallel axes. Each axis represents one
dimension.
o Use Case: Comparing multiple features of different cars (e.g., weight, horsepower,
fuel efficiency).

 3D Scatter Plot: Extends the scatter plot to three dimensions, where x, y, and z
represent three variables.
o Use Case: Plotting relationships between three financial metrics (e.g., revenue, profit,
and market share).

 Radar Chart (Spider Chart): Displays multivariate data as lines or polygons on a


radial grid, where each axis represents one variable.
o Use Case: Visualizing performance metrics across multiple categories.

 Heatmap with Dendrogram (Clustered Heatmap): Combines heatmaps with
hierarchical clustering, where the rows and columns are clustered based on
similarities.
o Use Case: Gene expression data analysis in bioinformatics.

 Dimensionality Reduction Techniques (PCA, t-SNE): Used to reduce the number


of dimensions for easier visualization. Data is projected into 2D or 3D while
preserving relationships between points.
o Use Case: Visualizing high-dimensional datasets like image data or customer
attributes.

 Treemaps: Used to represent hierarchical data where nested rectangles represent


different dimensions and their sizes are proportional to a numeric value.
o Use Case: Visualizing the market share of different sectors and companies.

 Scatter Plot Matrix (Pair Plot): Displays all pairs of variables in a
multi-dimensional dataset as individual scatter plots in a grid.
o Use Case: Visualizing relationships between multiple numerical variables in a
dataset.

Examples:

 Iris Dataset (4D): Visualized using a parallel coordinates plot or PCA to reduce dimensions
and create a 2D scatter plot.
 Customer Segmentation: Visualized using t-SNE to reduce high-dimensional customer
features into 2D space for clustering.
 Multi-factor Stock Analysis: Use of a radar chart to show stock performance based on
different factors like price-to-earnings ratio, dividend yield, etc.
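Parallel coordinates plots (and most multi-dimensional charts) first rescale every feature to a shared range so the axes are comparable. A minimal sketch of that per-axis min-max normalization, using a hypothetical three-car dataset (weight, horsepower, fuel efficiency):

```python
def minmax_normalize(rows):
    """rows: list of equal-length numeric tuples; rescale each feature to [0, 1]."""
    cols = list(zip(*rows))                      # transpose to per-feature columns
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h != l else 0.0    # guard against constant columns
         for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]

# hypothetical (weight kg, horsepower, fuel efficiency km/l) for three cars
cars = [(1200, 90, 18), (1500, 120, 14), (1800, 150, 10)]
norm = minmax_normalize(cars)
print(norm[0])   # the lightest, most efficient car maps to [0.0, 0.0, 1.0]
```

After this step each row can be drawn as one polyline across the parallel axes; in practice a plotting routine such as pandas' parallel-coordinates helper performs the drawing.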

Summary Table:

Dimension           Visualization Techniques                          Example Use Cases

One-Dimensional     Histograms, Line Charts, Bar Charts               Distribution of scores, time-series
                                                                      data, sales by category

Two-Dimensional     Scatter Plot, Heatmap, Bubble Chart, Bar Plot     Correlation between variables,
                    with Two Axes                                     comparisons across categories

Multi-Dimensional   Parallel Coordinates, 3D Scatter Plot, Radar      High-dimensional data analysis,
                    Chart, Heatmap with Dendrogram, PCA, Treemaps     clustering, and comparisons
3.4 Text and Text Documents in Data Visualization
Text and text documents in data visualization refer to the process of transforming
unstructured textual data into visual formats that help users understand the content, patterns,
and relationships within the text. Since text data is vast and complex, visualizing it can
provide significant insights that are otherwise difficult to derive from plain text. Below are
some common techniques and approaches for text visualization.

1. Word Cloud

 Description: A word cloud (or tag cloud) is a visual representation of word frequency in a
text, where the size of each word indicates its frequency or importance in the document.
 Use Case: Quickly summarizing key themes or topics in a document, such as analyzing
customer reviews, social media posts, or research papers.
 Strengths: Simple to create, gives a quick snapshot of frequently occurring terms.
 Limitations: Doesn't show the relationship between words or the context in which they
appear.
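The computation underneath a word cloud is just a word-frequency table: each word's font size is proportional to its count. A minimal sketch with the Python standard library (the stopword list and sample review text are made up for illustration; a library such as wordcloud would consume this kind of frequency table to render the image):

```python
import re
from collections import Counter

def word_frequencies(text, stopwords=frozenset({"the", "a", "is", "and"})):
    """Count word occurrences, ignoring case and a small stopword list."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

reviews = "the battery is great and the battery lasts; great phone"
freq = word_frequencies(reviews)
print(freq["battery"], freq["great"])   # 2 2 — these words would render largest
```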

2. Word Tree

 Description: A word tree shows a hierarchical view of words, focusing on how specific words
(usually the root word) are followed by other words in a sequence.
 Use Case: Useful for understanding phrases, common word associations, and exploring
repeated patterns or themes in documents.
 Strengths: Displays context around keywords, shows relationships between words.
 Limitations: Limited to analyzing short phrases and small-scale text.

3. Document Term Matrix (Heatmap)

 Description: A Document Term Matrix (DTM) is a table where each row corresponds to a
document, and each column corresponds to a term, showing the frequency of each term in
each document. Visualizing this matrix as a heatmap highlights word usage across multiple
documents.
 Use Case: Analyzing the frequency and distribution of specific terms across a large set of
documents, such as comparing themes across different research papers or news articles.
 Strengths: Highlights frequent terms and compares term occurrence across multiple
documents.
 Limitations: Does not account for semantics, limited by the number of terms and documents.
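The matrix itself is straightforward to construct: tokenize each document, collect a vocabulary, and count term occurrences per document. A minimal sketch with two hypothetical documents (real pipelines would add stopword removal and stemming):

```python
from collections import Counter

def doc_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted(set(t for doc in tokenized for t in doc))
    counts = [Counter(doc) for doc in tokenized]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

vocab, dtm = doc_term_matrix(["data beats opinion", "data data everywhere"])
print(vocab)     # ['beats', 'data', 'everywhere', 'opinion']
print(dtm)       # [[1, 1, 0, 1], [0, 2, 1, 0]]
```

Rendered as a heatmap, each cell's count becomes a color intensity, so heavily used terms stand out as dark columns.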
4. Topic Modeling (LDA) Visualization

 Description: Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm that
identifies topics in a set of documents. LDA visualizations often present these topics as
clusters or show how different topics are related.
 Use Case: Analyzing large collections of text to discover underlying themes or topics without
manually reading all the content, such as discovering topics in a large collection of news
articles or product reviews.
 Strengths: Shows hidden structure and thematic relationships in large sets of unstructured
text.
 Limitations: Requires tuning, may not work well with small datasets.
5. Sentiment Analysis Visualization

 Description: Sentiment analysis visualizes the emotional tone of text data by assigning
sentiment scores (e.g., positive, negative, or neutral) to documents, sentences, or phrases. The
results are often visualized through line charts (over time), pie charts (distribution), or bar
charts.
 Use Case: Tracking customer sentiment in social media posts, reviews, or survey responses.
 Strengths: Helps gauge overall mood or opinion from a large collection of text.
 Limitations: Sentiment detection can be inaccurate due to sarcasm, ambiguity, or language
nuances.
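A minimal lexicon-based sentiment scorer illustrates where the plotted scores come from: count positive and negative words and report a label. The tiny word lists are illustrative only; real tools use far richer lexicons and handle negation and sarcasm (imperfectly, as noted above):

```python
POSITIVE = {"good", "great", "love", "excellent"}   # toy lexicon, not exhaustive
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment(text):
    """Label text by the balance of positive vs. negative lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("great phone excellent screen"))  # positive
print(sentiment("terrible battery"))              # negative
```

Scoring every post in a stream this way, then aggregating by day, yields the time-series that a sentiment line chart displays.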

6. Network Diagrams (Text Relationship Networks)

 Description: Network diagrams visualize relationships between words or topics by treating


them as nodes connected by edges, which represent associations or co-occurrences in text.
 Use Case: Mapping relationships between entities in a document, such as exploring character
interactions in literature or tracking frequently mentioned products or terms in news articles.
 Strengths: Highlights connections and dependencies between terms.
 Limitations: May become overly complex for large datasets.
7. Text Summarization Visualization

 Description: Automatic text summarization tools extract the most important sentences or
phrases from a document, which can be visually presented to highlight key points, either in a
condensed list form or as a visual timeline of document events.
 Use Case: Summarizing long reports, news articles, or academic papers to quickly understand
the most critical points.
 Strengths: Reduces the need to read large amounts of text.
 Limitations: May miss nuances or important details.
8. Timeline Visualization (for Documents)

 Description: Timelines can be used to track and visualize key events, discussions, or changes
in sentiment over time in text data, such as social media posts, news reports, or journal
entries.
 Use Case: Monitoring the progression of a specific topic or issue over time, such as the
unfolding of a political debate or a brand’s reputation.
 Strengths: Shows temporal patterns in data, such as trends and shifts in tone or frequency.
 Limitations: Limited to datasets with clear time markers.
9. N-gram Analysis

 Description: N-gram visualizations display sequences of "n" words that occur together in text,
typically shown in charts or graphs that highlight frequent word combinations.
 Use Case: Analyzing common phrases or word combinations in text documents (e.g.,
common product features in reviews, frequent phrases in customer complaints).
 Strengths: Reveals patterns in word usage that can indicate key themes or topics.
 Limitations: Works best for shorter text fragments or corpora.
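Extracting n-grams is a sliding-window count over the token stream; the most frequent sequences are what an n-gram chart plots. A minimal bigram sketch (the complaint text is hypothetical):

```python
from collections import Counter

def ngrams(text, n=2):
    """Count every length-n word sequence in the text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

complaints = "battery life short battery life poor screen quality poor"
print(ngrams(complaints).most_common(1))  # [(('battery', 'life'), 2)]
```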


10. Hierarchical Document Visualization (Tree-based)

 Description: Hierarchical text visualizations use tree structures to represent the structure of a
document or collection of documents. For instance, large text collections (e.g., books, reports)
can be visualized as hierarchical trees, where nodes represent chapters, sections, or topics.
 Use Case: Visualizing the structure of a long document (e.g., books, legal documents) to
understand its organization or topic hierarchy.
 Strengths: Useful for visualizing and navigating large, complex documents.
 Limitations: Can be difficult to understand with very large datasets or poorly structured
documents.
Challenges of Visualizing Text Data:

 Dimensionality: Text data is inherently high-dimensional (with each word representing a
different dimension). Reducing this complexity without losing meaning can be difficult.
 Ambiguity: Words may have multiple meanings depending on the context, which can make
accurate visualization challenging.
 Scalability: As text data increases, visualizations can become cluttered, making it hard to
extract meaningful patterns.

Tools for Text Data Visualization:

1. Wordle / WordItOut: For generating word clouds.
2. Voyant Tools: A suite of text analysis tools with visualizations.
3. TensorFlow's Embedding Projector: For visualizing high-dimensional text embeddings.
4. D3.js: A JavaScript library for creating custom visualizations, including text-based ones.
5. LDAvis: A tool for visualizing topics generated by topic models like LDA.
6. Gephi: For network visualization, often used to explore relationships between words or
entities.

DWDV

UNIT - IV
Syllabus:
Visualization of groups, trees, graphs, clusters, networks, software,
Metaphorical visualization

4.2 Visualization of Trees:


 A tree visualization displays hierarchical data with a collection of nodes (data points)
and edges (hierarchical relations between nodes).
 Visualizing tree structures is essential for understanding hierarchical data,
relationships, and dependencies.
 Trees are commonly used in various fields, including computer science (data
structures), biology (phylogenetic trees), and organizational structures. Here are some
common techniques for visualizing trees:

1. Tree Diagrams:

A tree diagram visually represents hierarchical relationships using branches. Each node
represents an element, and branches connect parent nodes to child nodes.

Tree diagrams are methods for illustrating hierarchy.

● The diagram is a tree-like structure, with connecting lines extending from a central
node.
● The central node is often referred to as the "root node," the connecting lines as
"branches," the connected members as "nodes," and finally the "leaf nodes" being the
members with no further extensions.
● Simple shapes such as rectangles or circles are commonly used as nodes, with
descriptive text within or underneath the shape.
● Tree diagrams are effective means to display organizational hierarchy, such as a chain
of command, and they provide clear information regarding reporting lines and
departmental structure.
● They can also be used to visualize family relations and descent, which is known as a
"family tree."
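In data form, a tree diagram is just a node with a label and a list of child nodes, and a recursive walk reproduces the hierarchy. A minimal sketch that prints an indented organizational chart (the org structure is hypothetical):

```python
def print_tree(node, depth=0, lines=None):
    """node is (label, children); collect one indented line per node, root first."""
    lines = [] if lines is None else lines
    label, children = node
    lines.append("  " * depth + label)           # indentation encodes the branch depth
    for child in children:
        print_tree(child, depth + 1, lines)
    return lines

org = ("CEO", [("CTO", [("Dev Team", [])]),
               ("CFO", [("Accounts", [])])])
print("\n".join(print_tree(org)))
```

Here "CEO" is the root node, "CTO" and "CFO" are internal nodes, and "Dev Team" and "Accounts" are leaf nodes, matching the terminology above.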
1. Tree Map

Tree maps are a variation of the tree diagram, used to represent hierarchy. A tree map
represents hierarchical structure while using the size or area of rectangles to represent
quantity. Each category is assigned a rectangle, with subcategories displayed inside the large
rectangle in proportionate size against each other. The area of the parent category is thus the
sum of its subcategories, giving a clear part-to-whole relationship. No connecting
lines or branches are required, as in the case of a tree diagram. The shapes within the tree
map are created using coding algorithms, and thus need special software. A tree map can be
used to illustrate relative expenditures within a budget, with the area of the rectangles
representing the amount allocated to each budget category.

2. Mind Map

Mind maps are a kind of network diagram representing ideas and concepts and how they are
connected. The mind map, sometimes called a "brainstorm," begins with a central idea, with
categories extending out from this node. Further subcategories are extended from these
categories, and so on. The diagram thus acts like a tree, with ideas stemming out from its
branches, and sub-branches. This tool is useful for idea generation, organizing thoughts, and
structuring information and is thus useful in the initial stages of a project.

Mind maps can be used for a simple task such as writing a letter, or a complex task such as
strategic analysis. Mind maps can be created alone or in groups. In a workshop setting,
collaborative mind maps are also effective in improving team work and generating
consensus.

3. Radial tree

A radial tree chart is a radial layout for the visual organization of information, a variant of the
mind map. Such a chart always has one central element (idea, phrase, keyword) in focus,
which starts the search for new related ideas / topics / keywords.

The radial tree layout is a variation of the dendrogram layout and is an appropriate layout
method for tree data structures: nodes are placed on concentric circles around the root, with
each level of the hierarchy occupying its own ring.

4. Phylogenetic Trees:

A phylogenetic tree is a visual representation of the relationship between different organisms,
showing the path through evolutionary time from a common ancestor to different
descendants. Trees can represent relationships ranging from the entire history of life on
earth, down to individuals in a population.

The diagram below shows a tree of 3 taxa (a singular taxon is a taxonomic unit; could be a
species or a gene).
1. This is a bifurcating tree. The vertical lines, called branches, represent a lineage,
and nodes are where they diverge, representing a speciation event from a common
ancestor. The trunk at the base of the tree is actually called the root. The root node
represents the most recent common ancestor of all of the taxa represented on the
tree.
2. Time is also represented, proceeding from the oldest at the bottom to the most recent
at the top. What this particular tree tells us is that taxon A and taxon B are more
closely related to each other than either taxon is to taxon C.
3. The reason is that taxon A and taxon B share a more recent common ancestor than
they do with taxon C.
4. A group of taxa that includes a common ancestor and all of its descendants is called
a clade. A clade is also said to be monophyletic. A group that excludes one or
more descendants is paraphyletic; a group that excludes the common ancestor is said
to be polyphyletic.

4.3 Visualization of Graphs:


Graph visualization is a way to visually represent data and the relationships between entities
in that data. It's also known as network visualization or link analysis.
Graph visualization can help you understand patterns and structures in data that might be
difficult to identify using other methods. For example, you can use graph visualization to
show how goods are transported in a supply chain, how parts of an IT system are connected,
or transactions between accounts.
A graph visualization is displayed as a network, with individual data points connected to
others via links that represent how they are connected.

Graph: The basics


The first step to understanding graph visualization is understanding what a graph is. Also
called a network, a graph is a collection of nodes (or vertices) and edges, also called links or
relationships. Each node represents a single data point such as a person, a phone number, a
supplier, a bank account, a contract, etc. Each edge represents how two nodes are connected:
a person possesses a bank account, for example.
Graph data is stored in a graph database such as Neo4j, Azure Cosmos DB or Memgraph.
Graph analytics provides algorithms that help data scientists and data-driven analysts answer
questions or make predictions. This way of representing data is well suited for scenarios
involving connections and networks of entities, like supply chain networks,
telecommunication networks, networks of suspected fraudsters, and much more.
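The node-and-edge model described above is commonly stored as an adjacency list: each node maps to the set of nodes it connects to. A minimal sketch with hypothetical people and bank accounts (the kind of connected data a fraud-detection graph would hold):

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency list from (node, node) edge pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)        # undirected: record the edge in both directions
        adj[b].add(a)
    return adj

edges = [("Alice", "Account-1"), ("Bob", "Account-1"), ("Bob", "Account-2")]
g = build_graph(edges)
print(sorted(g["Account-1"]))   # ['Alice', 'Bob'] — two people share one account
```

A layout algorithm would then assign each node in `adj` an (x, y) position for display; graph databases such as Neo4j store essentially this structure at scale.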

Visualizing data as a graph


When the nodes and edges of a graph are displayed as a graph visualization, it becomes
intuitive to explore the connections within this data.
Dedicated algorithms, called layouts, calculate the node positions and display the data on two
(sometimes three) dimensional spaces. Some examples of layouts are force-directed where
larger or more important elements are closer to the center, or radial layout, where nodes are
arranged in concentric circles, showing dependencies.

(Figure: a graph visualization from Linkurious Enterprise showing individual data points
(nodes) and how they are connected (edges).)
These visualizations are data modeled as graphs. Any type of data asset that contains
information about connections can be modeled and visualized as a graph, even data initially
stored in a tabular way. For instance, the data from our example above could be extracted
from a simple spreadsheet as depicted below.
The data could also be stored in a relational database or in a graph database, a system
optimized for the storage and analysis of complex and connected data.
In the end, graph visualization is a way to better understand and manipulate connected data.

Different Types of Graphs for Data Visualization


Data can be a jumble of numbers and facts. Charts and graphs turn that jumble into pictures
that make sense. Types are:
1. Bar Graphs
Bar graphs are one of the most commonly used types of graphs for data visualization. They
represent data using rectangular bars where the length of each bar corresponds to the value it
represents. Bar graphs are effective for comparing data across different categories or groups.

Bar Graph Example


Advantages of Bar Graphs

● Highlighting Trends: Bar graphs are effective at highlighting trends and patterns in data,
making it easy for viewers to identify relationships and comparisons between different
categories or groups.
● Customizations: Bar graphs can be easily customized to suit specific visualization needs,
such as adjusting colors, labels, and styles to enhance clarity and aesthetics.
● Space Efficiency: Bar graphs can efficiently represent large datasets in a compact space,
allowing for the visualization of multiple variables or categories without overwhelming
the viewer.
Disadvantages of Bar Graphs

● Limited Details: Bar graphs may not provide detailed information about individual data
points within each category, limiting the depth of analysis compared to other visualization
methods.
● Misleading Scaling: If the scale of the y-axis is manipulated or misrepresented, bar
graphs can potentially distort the perception of data and lead to misinterpretation.
● Overcrowding: When too many categories or variables are included in a single bar
graph, it can become overcrowded and difficult to read, reducing its effectiveness in
conveying clear insights.
2. Line Graphs
Line graphs are used to display data over time or continuous intervals. They consist of points
connected by lines, with each point representing a specific value at a particular time or
interval. Line graphs are useful for showing trends and patterns in data. Perfect for showing
trends over time, like tracking website traffic or how something changes.
Line Graph Example
Advantages of Line Graphs

● Clarity: Line graphs provide a clear representation of trends and patterns over time or
across continuous intervals.
● Visual Appeal: The simplicity and elegance of line graphs make them visually appealing
and easy to interpret.
● Comparison: Line graphs allow for easy comparison of multiple data series on the same
graph, enabling quick insights into relationships and trends.
Disadvantages of Line Graphs

● Data Simplification: Line graphs may oversimplify complex data sets, potentially
obscuring nuances or outliers.
● Limited Representation: Line graphs are most effective for representing continuous data
over time or intervals and may not be suitable for all types of data, such as categorical or
discrete data.

3. Pie Charts
Pie charts are circular graphs divided into sectors, where each sector represents a proportion
of the whole. The size of each sector corresponds to the percentage or proportion of the total
data it represents. Pie charts are effective for showing the composition of a whole and
comparing different categories as parts of a whole.
Pie Chart Example
Advantages of Pie Charts

● Easy to create: Pie charts can be quickly generated using various software tools or even
by hand, making them accessible for visualizing data without specialized knowledge or
skills.
● Visually appealing: The circular shape and vibrant colors of pie charts make them
visually appealing, attracting the viewer’s attention and making the data more engaging.
● Simple and easy to understand: Pie charts present data in a straightforward manner,
making it easy for viewers to grasp the relative proportions of different categories at a
glance.
Disadvantages of Using a Pie Chart

● Limited trend analysis: Pie charts are not ideal for showing trends or changes over time
since they represent static snapshots of data at a single point in time.
● Limited data slice: Pie charts become less effective when too many categories are
included, as smaller slices can be difficult to distinguish and interpret accurately. They
are best suited for representing a few categories with distinct differences in proportions.
4. Scatter Plots
Scatter plots are used to visualize the relationship between two variables. Each data point in
a scatter plot represents a value for both variables, and the position of the point on the graph
indicates the values of the variables. Scatter plots are useful for identifying patterns and
relationships between variables, such as correlation or trends.
Scatter Chart Example
Advantages of Using Scatter Plots

● Revealing Trends and Relationships: Scatter plots are excellent for visually identifying
patterns, trends, and relationships between two variables. They allow for the exploration
of correlations and dependencies within the data.
● Easy to Understand: Scatter plots provide a straightforward visual representation of data
points, making them easy for viewers to interpret and understand without requiring
complex statistical knowledge.
● Highlight Outliers: Scatter plots make it easy to identify outliers or anomalous data
points that deviate significantly from the overall pattern. This can be crucial for detecting
unusual behavior or data errors within the dataset.
Disadvantages of Using Scatter Plot Charts

● Limited to Two Variables: Scatter plots are limited to visualizing relationships between
two variables. While this simplicity can be advantageous for focused analysis, it also
means they cannot represent interactions between more than two variables
simultaneously.
● Not Ideal for Precise Comparisons: While scatter plots are excellent for identifying
trends and relationships, they may not be ideal for making precise comparisons between
data points. Other types of graphs, such as bar charts or box plots, may be better suited for
comparing specific values or distributions within the data.
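The relationship a scatter plot makes visible can be quantified with the Pearson correlation coefficient: values near +1 or -1 mean the points lie close to a straight line. A minimal sketch of the standard formula:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))    # ≈ 1.0: points fall on a rising line
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))    # ≈ -1.0: points fall on a falling line
```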

5. Area Charts:
Area charts are similar to line graphs but with the area below the line filled in with
color. They are used to represent cumulative totals or stacked data over time. Area charts are
effective for showing changes in composition over time and comparing the contributions of
different categories to the total.
Area Chart Example

Advantages of Using Area Charts

● Visually Appealing: Area charts are aesthetically pleasing and can effectively capture
the audience’s attention due to their colorful and filled-in nature.
● Great for Trends: They are excellent for visualizing trends over time, as the filled area
under the line emphasizes the magnitude of change, making it easy to identify patterns
and fluctuations.
● Compares Well: Area charts allow for easy comparison between different categories or
datasets, especially when multiple areas are displayed on the same chart. This
comparative aspect aids in highlighting relative changes and proportions.
Disadvantages of Using Area Charts

● Limited Data Sets: Area charts may not be suitable for displaying large or complex
datasets, as the filled areas can overlap and obscure details, making it challenging to
interpret the data accurately.
● Not for Precise Values: Area charts are less effective for conveying precise numerical
values, as the emphasis is on trends and proportions rather than exact measurements. This
can be a limitation when precise data accuracy is crucial for analysis or decision-making.

6. Radar Charts:
A radar chart, also known as a spider chart or a web chart, is a graphical method of displaying
multivariate data in the form of a two-dimensional chart. It is particularly useful for
visualizing the relative values of multiple quantitative variables across several categories.
Radar charts compare things across many aspects, like how different employees perform in
various skills.
Radar Chart Example
Advantages of Using Radar Chart

● Highlighting Strengths and Weaknesses: Radar charts allow for the clear visualization
of strengths and weaknesses across multiple variables, making it easy to identify areas of
excellence and areas for improvement.
● Easy Comparisons: The radial nature of radar charts facilitates easy comparison of
different variables or categories, as each axis represents a different dimension of the data,
enabling quick visual assessment.
● Handling Many Variables: Radar charts are particularly useful for handling datasets
with many variables, as each variable can be represented by a separate axis, allowing for
comprehensive visualization of multidimensional data.
Disadvantages of Using Radar Chart

● Scaling Issues: Radar charts can present scaling issues, especially when variables have
different units or scales. Inaccurate scaling can distort the representation of data, leading
to misinterpretation or misunderstanding.
● Misleading Comparisons: Due to the circular nature of radar charts, the area enclosed by
each shape can be misleading when comparing variables. Small differences in values can
result in disproportionately large visual differences, potentially leading to
misinterpretation of data.
7. Histograms:
Histograms are similar to bar graphs but are used specifically to represent the distribution of
continuous data. In histograms, the data is divided into intervals, or bins, and the height of
each bar represents the frequency or count of data points within that interval.
Example of Histogram
Advantages of using Histogram

● Easy to understand: Histograms provide a visual representation of the distribution of
data, making it easy for viewers to grasp the overall pattern.
● Identify Patterns: Histograms allow for the identification of patterns and trends within
the data, such as skewness, peaks, or gaps.
● Compare Data Sets: Histograms enable comparisons between different datasets, helping
to identify similarities or differences in their distributions.
Disadvantages of using Histogram

● Not for small datasets: Histograms may not be suitable for very small datasets as they
require a sufficient amount of data to accurately represent the distribution.
● Limited details: Histograms provide a summary of the data distribution but may lack
detailed information about individual data points, such as specific values or outliers.
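The binning step that turns raw values into histogram bars can be sketched directly: divide the data range into equal-width intervals and count how many values land in each. The exam-score data is hypothetical:

```python
def histogram(values, bins=4):
    """Count values per equal-width bin across [min, max]; returns bar heights."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp the max value into the last bin
        counts[i] += 1
    return counts

scores = [35, 42, 48, 55, 61, 67, 74, 88, 91, 95]
print(histogram(scores))   # [3, 2, 2, 3] — bar heights for four 15-point-wide bins
```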
8. Pareto Charts:
A Pareto chart is a specific type of chart that combines both bar and line graphs. It’s named
after Vilfredo Pareto, an Italian economist who first noted the 80/20 principle, which states
that roughly 80% of effects come from 20% of causes. Pareto charts are used to highlight the
most significant factors among a set of many factors.
Pareto Chart Example
Advantages of using a Pareto Chart

● Simple to understand: Pareto charts present data in a straightforward manner, making it
easy for viewers to grasp the most significant factors at a glance.
● Visually identify key factors: By arranging data in descending order of importance,
Pareto charts allow users to quickly identify the most critical factors contributing to a
problem or outcome.
● Focus resources effectively: With the ability to prioritize factors based on their impact,
Pareto charts help organizations allocate resources efficiently by addressing the most
significant issues first.
Disadvantages of Using a Pareto Chart

● Limited Data Exploration: Pareto charts primarily focus on identifying the most critical
factors, which may lead to overlooking nuances or subtle trends present in the data.
● Assumes 80/20 rule applies: The Pareto principle, which suggests that roughly 80% of
effects come from 20% of causes, is a foundational concept behind Pareto charts.
However, this assumption may not always hold true in every situation, potentially leading
to misinterpretation or oversimplification of complex data relationships.
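The two series a Pareto chart draws, descending bars and a cumulative-percentage line, come from one simple computation: sort the categories by count and take a running percentage. A minimal sketch with made-up complaint data:

```python
def pareto(counts):
    """Sort categories by count descending and attach cumulative percentages."""
    items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(counts.values())
    cum, out = 0, []
    for name, n in items:
        cum += n
        out.append((name, n, round(100 * cum / total, 1)))  # (bar, line value)
    return out

complaints = {"late delivery": 60, "damaged item": 25, "wrong item": 10, "other": 5}
for row in pareto(complaints):
    print(row)
# first row: ('late delivery', 60, 60.0) — the top cause alone explains 60%
```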
9. Waterfall Charts
Waterfall charts are a type of data visualization tool that display the cumulative effect of
sequentially introduced positive or negative values. They are particularly useful for
understanding the cumulative impact of different factors contributing to a total or final value.
Waterfall Charts Example
Advantages of Using a Waterfall Chart

● Clear Breakdown of Changes: Waterfall charts provide a clear and visual breakdown of
changes in data over a series of categories or stages, making it easy to understand the
cumulative effect of each change.
● Easy to Identify the Impact: By displaying the incremental additions or subtractions of
values, waterfall charts make it easy to identify the impact of each component on the
overall total.
● Focus on the Journey: Waterfall charts emphasize the journey of data transformation,
showing how values evolve from one stage to another, which can help in understanding
the flow of data changes.
Disadvantages of Using a Waterfall Chart

● Complexity with Too Many Categories: Waterfall charts can become complex and
cluttered when there are too many categories or stages involved, potentially leading to
confusion and difficulty in interpreting the data.
● Not Ideal for Comparisons: While waterfall charts are effective for illustrating changes
over a sequence of categories, they may not be suitable for direct comparisons between
different datasets or groups, as they primarily focus on showing the cumulative effect of
changes rather than individual values.
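The cumulative logic behind a waterfall chart is a running total: each bar starts where the previous one ended, so positive and negative steps visibly build toward the final value. A minimal sketch with a hypothetical profit breakdown:

```python
def waterfall(start, changes):
    """Return (bars, final total); each bar is (label, bottom, top) of the rectangle."""
    total, bars = start, []
    for label, delta in changes:
        bars.append((label, total, total + delta))  # bar spans old total -> new total
        total += delta
    return bars, total

steps = [("revenue", 500), ("cost of goods", -200), ("operating costs", -150), ("tax", -30)]
bars, final = waterfall(0, steps)
print(final)       # 120 — the net result after all increments and decrements
```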

4.4 Visualization of Clusters:


Cluster visualization is a tool that displays data clusters in a visual format, making it easier to
understand and interpret. It can be used to:
● Uncover patterns
By using appropriate clustering algorithms and visualization best practices, you can find
hidden patterns and insights in your data.
● Present results
You can use cluster visualization to present the results of clustering to people from the
application side.
● Explore data
You can use cluster visualization to explore the data and tune clustering algorithms.

Here are some ways to visualize clusters:


● Interactive map:

You can use an interactive map to see an overview of your cluster sets and drill into each
cluster set to view subclusters.

An interactive map paints geographic details over a basic map tile layer provided by a third-
party provider. CDP Data Visualization offers several overlay layers for data display. It is an
excellent choice for displaying large amounts of geo-based information in relevant detail.

● The base Map Servers include either Google Maps or Mapbox.


● The overlay layers that display data include the Heatmap, Cluster, Circles, and Routes
and pins options.

To learn how to use the interactive maps in their various forms, read the following topics.

● Basic interactive map


CDP Data Visualization enables you to create a basic Interactive Map visual.
● Choropleth interactive maps
In an Interactive Map visual, CDP Data Visualization enables you to create
choropleth maps. A choropleth map shows geographical areas that are shaded in
proportion to the value of the measurement that is displayed on the map.
● Interactive map with multiple dimensions
CDP Data Visualization enables you to display multiple dimensions in Interactive
Map visuals.
● Interactive map with multiple measures
CDP Data Visualization enables you to display multiple measures in Interactive Map
visuals.
● Changing the map server for interactive maps
CDP Data Visualization uses two third-party options for its Interactive Map visuals:
Google and Mapbox.
● Changing layer options for interactive maps
CDP Data Visualization enables you to make adjustments to all layers of an
Interactive Map visual.
● Plotting routes on interactive maps
CDP Data Visualization enables you to plot routes over an Interactive Map.
● Using alphabetic values in interactive maps
CDP Data Visualization Interactive Map visuals support both numeric and non-
numeric values on the Color shelf. As a result, aspects of named categories can be
viewed simultaneously on the same visual as a distinct series.
● Segmenting data qualitatively in interactive maps
CDP Data Visualization enables you to plot qualitative ranges of data by specifying
the data segmentation on the Color shelf of an Interactive Map visual.
● Shelves for interactive maps
Overview of shelves for CDP Data Visualization Interactive Map visuals.
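The interactive-map features above are specific to CDP Data Visualization, but the underlying idea of plotting cluster sets over geographic coordinates can be sketched with generic tools. A minimal, tool-agnostic example using matplotlib, with hypothetical coordinates and cluster labels:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical (longitude, latitude) points around three city centers
rng = np.random.default_rng(0)
centers = [(-74.0, 40.7), (-0.1, 51.5), (139.7, 35.7)]  # NYC, London, Tokyo
lons, lats, labels = [], [], []
for label, (clon, clat) in enumerate(centers):
    pts = rng.normal([clon, clat], 0.5, size=(50, 2))
    lons.extend(pts[:, 0]); lats.extend(pts[:, 1]); labels.extend([label] * 50)

fig, ax = plt.subplots()
scatter = ax.scatter(lons, lats, c=labels, cmap="tab10", s=10)
ax.set_xlabel("Longitude"); ax.set_ylabel("Latitude")
ax.legend(*scatter.legend_elements(), title="Cluster")
fig.savefig("cluster_map.png")
```

A real interactive map would layer these points over map tiles, but the cluster-coloring idea is the same.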

● Dendrogram plot:
You can use a dendrogram plot to visualize the hierarchy of clusters. It shows the order in
which clusters have been merged or divided and shows the similarity or distance between
data points.
A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering, where it records the order in which observations were merged into clusters. Its main use is to work out the best way to allocate objects to clusters. (Note that "dendrogram" is often miswritten as "dendogram".)

Here are some things to know about dendrogram plots:


● U-shaped lines
A dendrogram is made up of many U-shaped lines that connect data points. The top of the U-
shaped line indicates a cluster merge, and the two legs of the U indicate which clusters were
merged.
● Distance
The height of each U-shaped line represents the distance between the two data points it
connects.
● Leaf nodes
In an untruncated dendrogram, each leaf corresponds to one data point. Many plotting tools truncate large dendrograms (for example, beyond roughly 30 points by default in some libraries), in which case some leaves represent more than one data point.
● Use
The main use of a dendrogram is to determine how to best allocate objects to clusters.
● Computational biology
Dendrograms are often used in computational biology to illustrate the clustering of genes or
samples.
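The concepts above (U-shaped links, merge distance, leaves, and cluster allocation) can be demonstrated with SciPy's hierarchical-clustering utilities. A minimal sketch on six hypothetical two-dimensional observations:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Six hypothetical 2-D observations forming three well-separated pairs
points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1],
                   [5.2, 4.9], [9.0, 1.0], [9.1, 1.2]])

# Ward linkage records the merge order and merge distances
Z = linkage(points, method="ward")

fig, ax = plt.subplots()
dendrogram(Z, ax=ax)  # U-shaped links; the height of each U is the merge distance
ax.set_ylabel("Distance")
fig.savefig("dendrogram.png")

# Cutting the tree at 3 clusters allocates each observation to a cluster
clusters = fcluster(Z, t=3, criterion="maxclust")
```

Cutting the tree at different heights is exactly the "allocate objects to clusters" use described above.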
● Annotate visualization:
You can annotate the visualization with meta-information to put the derived clusters into the
context of the application.
Annotations provide a way to mark points on a visualization with rich events. They are
visualized as vertical lines and icons on all graph panels. When you hover over an annotation,
you can get event description and event tags. The text field can include links to other systems
with more detail.

You can annotate visualizations in three ways:


● Directly in the panel, using the built-in annotations query
● Using the HTTP API
● Configuring annotation queries in the dashboard settings
Annotations are supported for the following visualization types:
● Time series
● State timeline

● Candlestick
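The description above follows Grafana-style annotations; the same idea, marking events as vertical lines with descriptive text on a time-series panel, can be sketched generically in matplotlib (the event names here are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical time series with two annotated events
t = np.arange(100)
y = np.sin(t / 10.0)
events = {30: "deploy v2.1", 70: "failover"}

fig, ax = plt.subplots()
ax.plot(t, y)
for x, text in events.items():
    ax.axvline(x, color="red", linestyle="--")   # vertical marker line
    ax.annotate(text, xy=(x, 1.0), rotation=90)  # event description
fig.savefig("annotated.png")
```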

4.5 Visualization of Networks:


Network visualization, also known as graph visualization or link analysis, is the process of
creating visual representations of networks of connected data. It's used to identify
relationships between data points, which can help improve performance visibility, root cause
analysis, and efficiency.
Network visualization can be used for a variety of purposes, including:
● Network topology mapping
Understanding the physical layout and current status of a network, such as in-office devices
or data centers
● Tracking network health and performance
Monitoring the real-time health and performance of network components like servers and
routers
● Onboarding new network engineers
Quickly providing a network engineer with an understanding of the network's logical and
physical connections
● Communicating complex information
Clearly explaining a situation to customers, managers, and engineers so they can address
specific problems
Network visualization can be performed at many scales, and there are many different types of
network visualization to choose from. The type of visualization that's best depends on the
type of data and the relationships that need to be shown.

⮚ Types of network visualization:


Outlined below are some of the common types of network visualization:
Network maps: Network maps help you visually depict your entire network architecture,
including device connections and data flows. There are two main types of network maps:
static and dynamic network maps.
o Static network maps provide an unalterable view of your network layout. These
maps are helpful in tracking your network topology.
o Dynamic network maps provide up-to-date information about the recent changes in
your network. These maps offer real-time information about your network device
health, uptime status, configuration settings, and data flows.
Charts and graphs: Charts and graphs help you keep tabs on your overall network health by allowing you to compare different network metrics on a common timeline. For instance,
you can compare various metrics like CPU utilization and throughput over a specific time
interval to identify trends and optimize your network accordingly. Similarly, you can
compare the performance metrics of network devices via charts.
Dashboards: Modern network visualization tools provide intuitive dashboards designed to
aid your network visualization efforts. These dashboards help you view and analyze your
network's critical performance metrics in a single place. The network-wide view offered by
these dashboards allows you to quickly identify faulty devices, configuration errors, and
malicious traffic patterns and take corrective action immediately to ensure uninterrupted
operations.
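As a minimal sketch of a network map, the topology and device-health coloring described above can be drawn with networkx (the topology and status data are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical topology: a core router, two switches, and edge devices
G = nx.Graph()
G.add_edges_from([
    ("router", "switch-1"), ("router", "switch-2"),
    ("switch-1", "server-a"), ("switch-1", "server-b"),
    ("switch-2", "printer"), ("switch-2", "workstation"),
])

# Color nodes by health status (hypothetical monitoring data)
status = {"printer": "red"}  # one faulty device
colors = [status.get(n, "green") for n in G.nodes]

pos = nx.spring_layout(G, seed=42)  # automatic layout
nx.draw(G, pos, node_color=colors, with_labels=True, font_size=8)
plt.savefig("network_map.png")
```

A dynamic map would redraw this as monitoring data changes; the faulty node stands out immediately by color.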

Importance of network visualization:

With the increasing size and complexity of enterprise networks, identifying faulty devices, incorrect configurations, and malicious network traffic has become a considerable challenge. Network visualization helps you precisely understand your network architecture, including device connections and data flows. For instance, visualizing your network via graphs, charts, and maps allows you to rapidly identify faulty or misconfigured devices.

Network visualization is also critical from a network monitoring and analysis perspective. Any slight change in network layout or device uptime status is instantly visible on the network maps generated by visualization tools, helping streamline your network management efforts. Gaining comprehensive knowledge of your network layout through visualization tools helps you proactively resolve issues before they lead to network disruption. Additionally, charts and graphs based on your network data allow you to assess resource utilization, traffic volumes, and other key trends for better capacity planning.

You can manually create network maps to visualize your network, but it takes a lot of effort and time. Therefore, using automated network visualization tools is preferable to obtain an up-to-date view of your network.

Benefits of network visualization:

Faster troubleshooting: Network visuals, such as maps and performance graphs, provide an accurate graphical picture of device interdependencies and data flows in your network and can be vital for accelerated troubleshooting. Modern network visualization tools have a reliable alerting system that gives you instant notifications about critical network issues so you can remediate them in a timely manner and ensure uninterrupted operations.
Improved network visibility: The enhanced network visibility provided by automated network visualization tools allows you to monitor your network effectively. Dynamic maps offer
you even better clarity by providing real-time information regarding your network's data
flows and architectural changes.
Simplified network planning: Network visualization tools help you rapidly identify faulty
network devices and replace them with new equipment to maintain a healthy infrastructure.
These tools also help you successfully lay out a blueprint for future network upgrades by
offering necessary information regarding the current device dependencies and resource
utilization rate in your network.
Higher staff productivity: Network operations staff can perform more effectively when they
have a solid understanding of your network architecture. Network visualization through maps
and charts helps them quickly understand a network's logical and physical layout. It allows
them to rapidly identify the problematic areas in a network and take corrective action rather
than wasting time on problem identification, helping to improve network staff productivity.
Improved inventory management: Network visualization enables you to perform
better network inventory management. The dynamic maps and reports generated by network
visualization tools help you identify all the devices connected to your network and maintain
an up-to-date record of your network assets. These maps are updated automatically whenever
a device is added or removed from the network to help you maintain accurate inventory
records.
Simplified network presentation: Network visualization allows you to use interactive charts
and diagrams to display critical information about a complex, distributed network in an
intuitive manner. It can be handy while training newly recruited network engineers in your
organization. A network topology map can help them quickly understand the complexity of
your network architecture. Similarly, network visualization can be helpful during compliance
audits because it allows you to explain your network's compliance status via rich visuals.

Uses of network visualization:


Outlined below are some of the scenarios when network visualization can be helpful for your
business:
Rapid network expansion: The size and complexity of your network grow significantly
during business expansion. Discovering new devices, configuration settings, and data flows
in a large, distributed network can be challenging. Network visualization via maps can help
you understand how different devices are arranged in your network. These maps or diagrams
also allow you to understand the complex relationships between various network
components.
Network troubleshooting: Network visualization can be helpful when you want to quickly
locate the underlying cause of network faults or issues in an extensive network. Good network visualization tools can help you troubleshoot problems much more quickly by highlighting the complex device dependencies in your network. These tools also allow you to analyze the
uptime and performance of devices connected to your network. Furthermore, the data-rich
reports combined with rich visuals of your network layout provided by these tools help you
identify faulty parts of your network in one glance. Modern network visualization tools can
also offer instant notifications regarding critical network issues for quick incident response.

4.6. Visualization of Software:


Software visualization refers to the graphical representation of software systems,
code, and related information to aid in understanding, analyzing, and communicating various
aspects of software development.
Software visualization tools are crucial for various reasons in modern software
development. Firstly, these tools provide developers with a clear and comprehensive
understanding of complex software systems. By visualizing code structure, dependencies, and
execution paths, developers can identify potential bottlenecks, design flaws, or areas for
optimization more efficiently. This understanding ultimately leads to improved code quality,
reduced technical debt, and enhanced maintainability of the software.
Secondly, software visualization tools aid in communication and collaboration among
team members. Visual representations of code are often more accessible and understandable
than lines of text, making it easier for developers to convey ideas, discuss architectural
decisions, and onboard new team members. Additionally, these tools facilitate
interdisciplinary collaboration by enabling developers to communicate effectively with non-
technical stakeholders such as project managers, designers, and clients. Overall, the use of
software visualization tools fosters better teamwork, reduces misunderstandings, and
accelerates the development process.

⮚ Types of Software Visualization:


● Code Visualization

Code Structure Visualization: Represents the organization and structure of code, including
classes, modules, and their relationships.
Code Dependency Visualization: Illustrates dependencies between different components or
modules in a software system.
● Execution Visualization

Runtime Behavior Visualization: Shows the dynamic behavior of a program during execution, helping developers understand the flow of control, data, and interactions between different components.

● Data Flow Visualization

Data Flow Diagrams: Depicts how data moves through a system, showing the flow of
information between various components.

● System Architecture Visualization

System Overview Diagrams: Provide a high-level view of the entire software system,
including its components and their interactions.

● Version Control Visualization

Version History Graphs: Represents the evolution of a codebase over time, including
branches, merges, and changes made by different contributors.

● Performance Visualization

Performance Profiling Charts: Visualizes the performance characteristics of a software system, helping identify bottlenecks and areas for optimization.

● Debugging Visualization

Debugging Visualizations: Aids developers in understanding the execution flow, variable values, and program state during the debugging process.

● Security Visualization

Security Flow Diagrams: Illustrates potential security vulnerabilities and attack vectors
within a software system.

● User Interface (UI) Visualization

User Interface Prototypes: Visualizes the layout and design of user interfaces, helping
designers and developers collaborate on the visual aspects of software.
Software visualization tools are designed to help developers, architects, and other
stakeholders understand, analyze, and communicate various aspects of software systems.
These tools often present information about code structure, dependencies, runtime behavior,
and other relevant metrics in a visual format.
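As one small, self-contained sketch of static code visualization, a module dependency graph can be built by parsing import statements with Python's ast module and feeding the edges to networkx (the module names and source snippets here are hypothetical):

```python
import ast
import networkx as nx

# Hypothetical source files mapped to their code (normally read from disk)
sources = {
    "app": "import db\nimport api\n",
    "api": "import db\nimport auth\n",
    "db": "import sqlite3\n",
    "auth": "import hashlib\n",
}

# Build a directed dependency graph: module -> imported module
G = nx.DiGraph()
for module, code in sources.items():
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                G.add_edge(module, alias.name)

# Modules with many incoming edges are heavily depended upon
fan_in = {n: G.in_degree(n) for n in G.nodes}
```

The resulting graph can then be rendered with any graph-drawing tool; high fan-in nodes are the components where a change has the widest impact.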

⮚ Top Software Visualization Tools:

● Code Maps
Visual Studio, a widely used integrated development environment (IDE) designed for
Microsoft technologies, incorporates a functionality known as Code Maps. This feature
empowers developers to graphically represent code dependencies, call hierarchies and
relationships among various components within the codebase.

● SonarQube

SonarQube serves as a continuous inspection tool, offering diverse visualizations to assess code quality and security. It provides code metrics, identifies issues, and can generate visual reports, aiding teams in comprehending the overall health of their codebase.

● JArchitect

JArchitect, a static analysis tool designed for Java, delivers a range of visualizations to assist developers in comprehending code structure and dependencies and in pinpointing areas that require enhancement. It seamlessly integrates with both Visual Studio and Eclipse.

● Eclipse MAT

Eclipse Memory Analyzer (MAT) is a robust tool designed for scrutinizing Java heap dumps. While its main emphasis is memory analysis, it offers visualizations that help developers pinpoint memory leaks and understand memory-consumption patterns.

● D3.js

D3.js is a JavaScript library crafted for generating dynamic, interactive data visualizations
within web browsers. While it is not explicitly tailored for software visualization, developers
can harness its capabilities to construct personalized visualizations for data related to code.
Software visualization tools and techniques can include static visualizations (based on
code analysis without execution) and dynamic visualizations (based on runtime behavior).
These software visualization tools enhance comprehension, collaboration, and decision-
making in software development processes.

4.7 Metaphorical Visualization:


Metaphorical visualization is a powerful technique that employs metaphors to represent
complex ideas, data, or systems, making them more accessible and relatable. By translating
abstract concepts into concrete imagery, metaphorical visualization can help audiences grasp
intricate information more intuitively. This method is particularly beneficial in fields like data
science, education, and project management, where understanding nuances is critical.

The Power of Metaphors

Metaphors create connections between unfamiliar concepts and familiar experiences. By framing a complex idea within the context of something more recognizable, they facilitate
comprehension and retention. For example, comparing a company's growth to a "tree" allows
stakeholders to visualize branches of development, showing how different departments or
products contribute to the overall health of the organization.

Common Metaphorical Visualizations:

1. Tree Metaphor

The tree metaphor is a compelling way to represent relationships, hierarchies, and connections in various contexts. It serves as a powerful visualization technique that can
simplify complex data and make abstract concepts more tangible. Here’s a detailed
exploration of how the tree metaphor can be effectively used in metaphorical visualizations.

Structure of the Tree Metaphor

1. Root:
o Description: The root of the tree represents the foundational idea, concept, or
primary entity from which all other branches emerge.
o Example: In an organizational chart, the root might symbolize the company itself,
highlighting its core mission and values.
2. Trunk:
o Description: The trunk symbolizes the main support structure, connecting the root to
the branches. It represents the key functions or divisions that support the overall
entity.
o Example: For a software project, the trunk could represent major components like
"Frontend," "Backend," and "Database."
3. Branches:
o Description: Branches extend from the trunk, representing sub-categories,
departments, or specific aspects of the main entity. Each branch can further divide
into smaller branches.
o Example: In an educational setting, branches could represent different subjects or
departments, such as "Mathematics," "Science," and "Humanities."
4. Leaves:
o Description: Leaves symbolize the finer details or individual components within
each branch. They represent specific tasks, projects, or elements within a category.
o Example: Under the "Science" branch, leaves could represent subjects like
"Biology," "Chemistry," and "Physics."
5. Fruits/Flowers:
o Description: Fruits or flowers can represent the outcomes, achievements, or goals
that result from the healthy growth of the tree.
o Example: In a business context, fruits might represent successful projects, increased
revenue, or customer satisfaction metrics.
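The root/branch/leaf structure above maps naturally onto a nested data structure. A minimal sketch that renders an organization as an indented tree (all names are hypothetical):

```python
# Root = company, branches = departments, leaves = teams (hypothetical names)
tree = {
    "Acme Corp": {
        "Engineering": {"Frontend": {}, "Backend": {}},
        "Sales": {"EMEA": {}, "APAC": {}},
    }
}

def render(node: dict, depth: int = 0) -> list[str]:
    """Recursively indent each branch under its parent."""
    lines = []
    for name, children in node.items():
        lines.append("  " * depth + name)
        lines.extend(render(children, depth + 1))
    return lines

output = render(tree)
print("\n".join(output))
```

The same nesting could be fed to a dedicated tree-diagram tool; the recursion mirrors how branches divide into smaller branches.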

Applications of the Tree Metaphor

1. Organizational Structure:
o Visual Representation: A tree diagram can show the hierarchy within an
organization. The CEO at the root, departments as branches, and teams as leaves
create a clear visual of how the organization operates.
o Benefit: This visualization helps employees understand their role within the larger
context and clarifies reporting structures.
2. Project Management:
o Visual Representation: A project tree can represent major project phases as
branches, with individual tasks as leaves. This layout allows project managers to
visualize dependencies and workflows.
o Benefit: It enables teams to track progress and identify areas that require attention or
resources.
3. Knowledge Representation:
o Visual Representation: In education, a tree metaphor can illustrate the relationships
between topics. For instance, a subject like "History" could have branches for
"Ancient," "Modern," and "Contemporary," with further subdivisions.
o Benefit: This method aids in knowledge retention by visually linking related concepts
and helping learners see the bigger picture.
4. Software Architecture:
o Visual Representation: A software system can be visualized as a tree, where the root
represents the main application, branches represent modules, and leaves represent
individual functions or classes.
o Benefit: This visualization clarifies how different components interact and
dependencies within the system.
5. Decision-Making Processes:
o Visual Representation: A decision tree can guide users through a series of choices,
with each branch representing a different decision path.
o Benefit: This approach simplifies complex decision-making processes and helps
users visualize potential outcomes.

Creating Effective Tree Visualizations

1. Simplicity and Clarity:


o Ensure that the tree is not overly complex. Limit the number of branches to maintain
clarity, focusing on essential elements to communicate the core message.
2. Consistent Styling:
o Use consistent colors, shapes, and sizes to differentiate branches and leaves. This
consistency helps users quickly understand the structure and hierarchy.
3. Interactive Features:
o Where applicable, incorporate interactive elements that allow users to expand or
collapse branches. This interactivity can enhance engagement and exploration.
4. Descriptive Labels:
o Clearly label each branch and leaf with meaningful titles or descriptions. This
practice enhances comprehension and aids in navigation.
5. Iterate and Gather Feedback:
o Test your tree visualization with users to gather feedback. Iteration based on user
input can lead to improvements and a more effective representation.

2. River/Flow Metaphor:
The river or flow metaphor is a powerful visualization technique used to represent processes,
data flows, or the movement of elements over time. This metaphor leverages the natural
imagery of rivers—smooth, flowing, and often meandering—to convey concepts related to
progression, direction, and change. Here’s a comprehensive exploration of the river/flow
metaphor, its applications, and best practices for creating effective visualizations.

Structure of the River Metaphor

1. Source:
o Description: The starting point of the river symbolizes the origin of a process or flow
of information.
o Example: In a user journey, the source might represent the initial interaction a user
has with a product or service.
2. Flow Path:
o Description: The main body of the river represents the journey or process itself,
illustrating how data or elements move through various stages.
o Example: In a workflow visualization, the flow path could represent the steps taken
from project initiation to completion.
3. Bends and Turns:
o Description: Bends in the river can indicate decision points, changes in direction, or
alternative paths.
o Example: In project management, bends may represent pivot points where teams
decide to adjust their approach based on feedback or new information.
4. Branches:
o Description: Just as rivers may have tributaries, flow visualizations can include
branches that represent diverging paths or processes.
o Example: In a marketing funnel, branches could show different customer segments
or strategies leading to various outcomes.
5. Obstacles and Challenges:
o Description: Rocks, waterfalls, or dams in the river can symbolize obstacles or
challenges encountered along the way.
o Example: In software development, these might represent bottlenecks or hurdles that
teams need to address.
6. Mouth:
o Description: The endpoint of the river signifies the conclusion of the process or flow,
where outcomes are realized.
o Example: In a sales pipeline, the mouth could represent the final sale or conversion.
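One common concrete rendering of the flow metaphor is a Sankey diagram, where branch widths are proportional to flow volume. A minimal sketch using matplotlib's Sankey class and a hypothetical marketing funnel:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from matplotlib.sankey import Sankey

# Hypothetical funnel: 100 visitors flow in, then branch into outcomes.
# Positive values are inflows, negative values are outflows; they sum to 0.
flows = [100, -60, -30, -10]
labels = ["visitors", "bounced", "browsed", "purchased"]

fig, ax = plt.subplots()
sankey = Sankey(ax=ax, unit=" users")
sankey.add(flows=flows, labels=labels, orientations=[0, 1, 0, -1])
sankey.finish()
fig.savefig("funnel_flow.png")
```

The diverging branches correspond to the tributaries in the river metaphor, and the branch widths immediately show where most of the flow goes.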

Applications of the River/Flow Metaphor

1. User Journey Mapping:


o Visual Representation: A river can depict the flow of a user's interactions with a
product, highlighting key touchpoints and experiences.
o Benefit: This visualization helps identify pain points and opportunities for improving
user experience.
2. Workflow and Process Visualization:
o Visual Representation: Flowcharts can take on a river-like form to represent the
steps in a process, showing how tasks move from one stage to another.
o Benefit: This approach clarifies the progression of tasks, helping teams understand
dependencies and improve efficiency.
3. Data Flow Diagrams:
o Visual Representation: In data management, rivers can represent the flow of
information between systems or components, illustrating how data moves and
transforms.
o Benefit: This visualization aids in identifying data sources, sinks, and transformation
processes.
4. Project Management:
o Visual Representation: A project timeline can be visualized as a river, showing
phases of the project with milestones represented as landmarks along the way.
o Benefit: This layout allows stakeholders to track progress and foresee potential
challenges.
5. Change Management:
o Visual Representation: Rivers can illustrate the process of change within an
organization, from initiation through adoption to stabilization.
o Benefit: This metaphor helps communicate the fluidity of change and the steps
necessary to navigate it successfully.

Creating Effective River/Flow Visualizations

1. Clarity of Flow:
o Ensure that the flow path is clear and intuitive. Avoid cluttering the visualization with
unnecessary details that could confuse the audience.
2. Use of Color and Style:
o Utilize colors to represent different aspects of the flow. For example, a gradient might
indicate progression or status (e.g., red for challenges, green for successful steps).
3. Labeling:
o Clearly label key points along the flow, including sources, branches, and obstacles.
This practice enhances understanding and navigation.
4. Interactive Elements:
o Incorporate interactive features that allow users to explore the flow, such as clicking
on branches for more details or hovering over obstacles to see explanations.
5. Iterate and Gather Feedback:
o After creating the visualization, seek feedback from users to ensure that the metaphor
effectively communicates the intended message. Iteration based on user input can
improve clarity and usability.

The river/flow metaphor is a versatile and effective way to visualize processes, journeys, and data
flows. By utilizing this metaphor, individuals and organizations can communicate complex
information more clearly and intuitively. Whether in user journey mapping, workflow visualization,
or change management, the river metaphor provides a framework that enhances understanding and
fosters better decision-making. As we continue to navigate complex systems and processes, the river
metaphor remains a valuable tool in our visualization toolkit, guiding us toward clarity and insight.
3. Map Metaphor:

The map metaphor is a powerful visualization technique that represents information spatially,
allowing users to navigate complex data and relationships through familiar geographic or
schematic representations. This metaphor can simplify abstract concepts by grounding them
in a more relatable context, enhancing understanding and engagement. Here’s an in-depth
exploration of the map metaphor, its applications, and best practices for effective
visualizations.

Structure of the Map Metaphor

1. Landmarks:
o Description: Key features or points of interest on the map symbolize significant data
points, entities, or concepts.
o Example: In a customer journey map, landmarks could represent critical touchpoints
such as onboarding, purchase, or customer support.
2. Paths/Routes:
o Description: Lines or routes connecting landmarks represent relationships, processes,
or the flow of information.
o Example: In a workflow visualization, paths could illustrate the sequence of steps
taken to complete a task.
3. Regions/Zones:
o Description: Areas on the map can represent different categories, segments, or
phases, allowing for grouping and comparison.
o Example: In a marketing strategy map, regions could delineate target demographics
or market segments.
4. Scale and Compass:
o Description: A scale indicates the level of detail or scope of the map, while a
compass provides orientation, helping users understand directionality.
o Example: In a project roadmap, the scale might indicate the timeline of phases, while
the compass helps orient the viewer to key milestones.
5. Symbols and Icons:
o Description: Visual symbols can convey specific meanings or characteristics of data
points, enhancing comprehension.
o Example: In a sales territory map, different symbols could represent various sales
channels or customer types.

Applications of the Map Metaphor

1. Geographic Data Visualization:


o Visual Representation: Maps can represent spatial data, such as population density,
sales territories, or resource distribution.
o Benefit: This approach allows users to easily identify geographic patterns, trends, and
opportunities.
2. Customer Journey Mapping:
o Visual Representation: A map can illustrate the journey a customer takes through
different touchpoints, from awareness to purchase.
o Benefit: This visualization helps identify pain points and opportunities for enhancing
customer experience.
3. Project Management:
o Visual Representation: A project roadmap can be visualized as a map, highlighting
different phases, milestones, and dependencies.
o Benefit: This layout provides a clear overview of project progress and helps
stakeholders understand timelines and objectives.
4. Knowledge Representation:
o Visual Representation: Concept maps can visually represent relationships between
ideas, with nodes as concepts and links as relationships.
o Benefit: This method enhances understanding of complex subjects by illustrating
connections between topics.
5. Strategic Planning:
o Visual Representation: Strategic plans can be mapped out, showing goals,
initiatives, and how they relate to one another.
o Benefit: This visualization helps align team efforts and clarifies the path toward
achieving strategic objectives.

Creating Effective Map Visualizations

1. Clarity and Simplicity:


o Ensure that the map is clear and easy to navigate. Avoid overcrowding with too many
landmarks or routes that can overwhelm the viewer.
2. Consistent Design Elements:
o Use consistent symbols, colors, and styles throughout the map to enhance recognition
and understanding. This consistency aids in navigation and comprehension.
3. Interactive Features:
o Incorporate interactive elements that allow users to zoom in/out, click on landmarks
for details, or filter information based on specific criteria. Interactivity enhances
engagement and exploration.
4. Descriptive Labels:
o Clearly label key features, routes, and regions. Descriptive labels help users
understand the significance of each element and navigate the map more effectively.
5. Feedback and Iteration:
o After creating the map, gather feedback from users to ensure it effectively
communicates the intended message. Iterative improvements based on user input can
enhance clarity and usability.

The map metaphor is a versatile and impactful way to visualize complex information.
By employing spatial representations, individuals and organizations can communicate
relationships, processes, and concepts more clearly and intuitively.

Whether in geographic data visualization, customer journey mapping, or project management, the map metaphor enhances understanding and engagement, guiding viewers through intricate data landscapes.

4. Lego Metaphor:
The Lego metaphor is a creative and engaging visualization technique that uses the familiar
imagery of Lego blocks to represent components, ideas, or processes. This metaphor is
particularly effective in illustrating modularity, integration, and relationships among different
parts, making complex concepts more accessible and relatable. Here’s an in-depth look at the
Lego metaphor, its applications, and best practices for effective visualizations.

Structure of the Lego Metaphor

1. Blocks:
o Description: Each Lego block represents a discrete component, idea, or data point.
The size, shape, and color of the blocks can convey different meanings or attributes.
o Example: In a software architecture diagram, blocks might represent different
modules or functions within the system.
2. Connections:
o Description: The way blocks are connected illustrates relationships, dependencies, or
interactions between components.
o Example: In a project plan, blocks can represent tasks that connect to show
dependencies, with lines indicating the flow of work.
3. Layers:
o Description: Stacking blocks can represent hierarchies or layers of complexity,
where each layer adds depth to the overall structure.
o Example: In an organizational chart, different layers of blocks could represent
various levels of management or departments.
4. Customization:
o Description: The ability to mix and match blocks allows for flexibility and
customization in visualizations, showing how different components can be combined
or altered.
o Example: In product development, different blocks might represent features that can
be added or removed based on user feedback.
5. Color Coding:
o Description: Different colors can represent categories, statuses, or types of
components, enhancing visual clarity and understanding.
o Example: In a project management visualization, blocks could be color-coded to
indicate task status (e.g., red for overdue, green for completed).
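The color-coding idea in point 5 can be sketched in plain Python as a simple status-to-color lookup; the status names and colors below are illustrative assumptions, not tied to any particular tool:

```python
# Map task statuses to block colors, mirroring the color-coded Lego example.
# The specific statuses and colors are illustrative assumptions.
STATUS_COLORS = {
    "completed": "green",
    "in_progress": "yellow",
    "overdue": "red",
}

def block_color(status: str) -> str:
    """Return the display color for a task block; grey for unknown statuses."""
    return STATUS_COLORS.get(status, "grey")

# Each task becomes one "block" whose color signals its status at a glance.
tasks = [("Design", "completed"), ("Build", "in_progress"), ("Test", "overdue")]
for name, status in tasks:
    print(f"{name}: {block_color(status)}")
```

The same mapping could feed the fill color of blocks in any drawing or dashboard tool.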
Applications of the Lego Metaphor

1. Software Development:
o Visual Representation: The Lego metaphor can illustrate software architecture,
where individual blocks represent different modules or services.
o Benefit: This approach clarifies how components fit together and interact, aiding
developers in understanding system design.
2. Project Management:
o Visual Representation: In project planning, blocks can represent tasks or milestones,
showing how they connect and depend on one another.
o Benefit: This visualization helps teams visualize workflows and identify potential
bottlenecks.
3. Organizational Structure:
o Visual Representation: An organizational chart can use the Lego metaphor to show
departments as blocks, illustrating how they interconnect.
o Benefit: This layout highlights collaboration and communication pathways within the
organization.
4. Product Development:
o Visual Representation: Different features of a product can be represented as Lego
blocks, illustrating how they can be combined to create a final product.
o Benefit: This approach encourages brainstorming and flexibility in feature selection
based on user needs.
5. Education and Learning:
o Visual Representation: The Lego metaphor can be used in educational settings to
represent concepts and their relationships, fostering interactive learning.
o Benefit: This hands-on approach enhances student engagement and understanding of
complex topics.
Creating Effective Lego Visualizations

1. Simplicity and Clarity:
o Ensure that the visualization remains simple and easy to understand. Avoid
overcrowding with too many blocks or connections that can overwhelm the viewer.
2. Consistent Design:
o Use consistent block sizes, shapes, and colors to enhance recognition and
understanding. This consistency helps users navigate the visualization more easily.
3. Interactive Features:
o Incorporate interactive elements that allow users to click on blocks for more
information or to rearrange them. This interactivity fosters engagement and
exploration.
4. Descriptive Labels:
o Clearly label each block with meaningful titles or descriptions. This practice
enhances comprehension and aids in navigation.
5. Feedback and Iteration:
o Gather feedback from users after creating the visualization to ensure it communicates
the intended message effectively. Iterative improvements based on user input can
enhance clarity and usability.

5. Cloud Metaphor:

The cloud metaphor is a popular visualization technique that uses cloud imagery to represent
concepts such as ideas, data, or relationships in a way that is both engaging and intuitive.
This metaphor often conveys the notion of vastness, connectivity, and the dynamic nature of
information. Here’s an in-depth exploration of the cloud metaphor, its applications, and best
practices for creating effective visualizations.
Structure of the Cloud Metaphor


1. Cloud Shape:
o Description: The cloud itself serves as a central figure, representing a collective
concept, theme, or data set. Its amorphous shape suggests fluidity and adaptability.
o Example: A cloud could symbolize a topic in a brainstorming session, where various
sub-ideas branch out from it.
2. Words and Phrases:
o Description: Words or phrases can be embedded within or around the cloud,
indicating key themes or elements related to the central idea. The size of each word
often reflects its significance or frequency.
o Example: In a word cloud, larger words represent more frequently mentioned
concepts, providing a quick visual summary of key topics.
3. Connections and Lines:
o Description: Lines can connect different elements within the cloud or link multiple
clouds together, illustrating relationships and interactions.
o Example: In a knowledge map, connections may show how different concepts are
interrelated, enhancing understanding of complex ideas.
4. Color Variations:
o Description: Different colors can indicate categories, statuses, or types of data,
adding another layer of meaning to the visualization.
o Example: In a cloud metaphor representing customer feedback, different colors
might categorize comments by sentiment (positive, negative, neutral).
5. Background Elements:
o Description: Background imagery or patterns can enhance the visualization,
providing context without overwhelming the central cloud concept.
o Example: A gradient or abstract pattern behind a cloud can symbolize trends over
time, such as growth or decline.

Applications of the Cloud Metaphor

1. Word Clouds:
o Visual Representation: Word clouds are a common application of the cloud
metaphor, visually representing the frequency of terms in a body of text.
o Benefit: They quickly highlight key themes and topics, making it easy to understand
the essence of large text data at a glance.
2. Brainstorming and Ideation:
o Visual Representation: During brainstorming sessions, clouds can represent main
ideas, with sub-ideas branching out to illustrate connections and relationships.
o Benefit: This approach encourages creative thinking and helps teams visualize the
flow of ideas.
3. Data Representation:
o Visual Representation: Clouds can represent datasets, with variations in size, color,
or shape indicating different characteristics or metrics.
o Benefit: This visualization method helps convey complex data in a visually appealing
manner, making it easier to identify trends.
4. Customer Feedback Analysis:
o Visual Representation: A cloud metaphor can represent customer sentiments or
feedback, with the size of each comment reflecting its frequency or significance.
o Benefit: This visualization allows teams to quickly gauge customer opinions and
identify areas for improvement.
5. Knowledge Mapping:
o Visual Representation: In knowledge management, clouds can illustrate
relationships between concepts, showing how ideas connect and influence one
another.
o Benefit: This approach helps in understanding the broader context of information and
enhances knowledge sharing.
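The sizing rule behind the word-cloud application above — word size proportional to frequency — can be sketched in plain Python with `collections.Counter`; dedicated tools (e.g., the `wordcloud` library) add the actual layout and rendering:

```python
from collections import Counter

def word_sizes(text: str, min_size: int = 10, max_size: int = 40) -> dict:
    """Scale each word's font size linearly with its frequency,
    the basic sizing rule of a word cloud."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    counts = Counter(w for w in words if w)
    top = max(counts.values())
    return {w: min_size + (max_size - min_size) * c // top
            for w, c in counts.items()}

sizes = word_sizes("data data data cloud cloud metaphor")
print(sizes)  # the most frequent word gets the largest size
```

A rendering layer would then draw each word at its computed size, packing larger (more significant) words toward the center of the cloud.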

Creating Effective Cloud Visualizations

1. Clarity and Readability:
o Ensure that the text within the cloud is legible, with clear font choices and appropriate
sizing. Avoid overcrowding the cloud with too many words, which can dilute the
message.
2. Consistent Color Scheme:
o Use a consistent color palette to enhance visual appeal while ensuring that colors
have meaningful distinctions (e.g., different sentiments or categories).
3. Interactive Features:
o Incorporate interactive elements that allow users to click on words or concepts for
more information, enhancing engagement and exploration.
4. Descriptive Labels:
o Clearly label the central cloud concept and any sub-elements to provide context and
enhance understanding.
5. Feedback and Iteration:
o After creating the visualization, gather feedback to ensure it effectively
communicates the intended message. Iterative adjustments based on user input can
lead to improvements.

6. Garden Metaphor

The garden metaphor is a rich and evocative visualization technique that represents growth,
diversity, and interconnectedness. This metaphor draws on the imagery of a garden, where
various plants, flowers, and elements coexist and thrive together, symbolizing ideas,
processes, or relationships in a visually engaging way. Here’s an in-depth exploration of the
garden metaphor, its applications, and best practices for creating effective visualizations.
Structure of the Garden Metaphor

1. Plants and Flowers:
o Description: Each plant or flower represents an idea, component, or data point. The
type, size, and color of the plants can convey different meanings or attributes.
o Example: In a project management visualization, flowers might represent individual
tasks, with larger blooms signifying more critical tasks.
2. Soil:
o Description: The soil symbolizes the foundational elements that support growth, such
as core values, resources, or knowledge.
o Example: In an organizational context, soil could represent the company culture or
foundational strategies that nurture growth.
3. Paths:
o Description: Paths in the garden illustrate the journey or process taken to achieve
goals, guiding viewers through the landscape of ideas.
o Example: In a customer journey map, paths could represent the steps a customer
takes from awareness to purchase.
4. Garden Features:
o Description: Elements like fences, benches, or water features can represent barriers,
support structures, or key milestones within the process.
o Example: In strategic planning, a fence might symbolize constraints, while a bench
could represent reflection points where teams assess progress.
5. Seasons:
o Description: Different seasons in the garden can symbolize various stages of
development or change, highlighting the dynamic nature of growth.
o Example: In a product development visualization, spring could represent the ideation
phase, while autumn could represent the launch.
Applications of the Garden Metaphor

1. Project Management:
o Visual Representation: A garden can illustrate project timelines, where different
plants represent tasks, and their growth signifies progress.
o Benefit: This visualization helps teams understand the interdependencies of tasks and
the overall health of the project.
2. Organizational Growth:
o Visual Representation: In an organizational chart, different plants can represent
departments, showcasing how they contribute to the overall mission.
o Benefit: This metaphor emphasizes collaboration and the importance of each
department in the organizational ecosystem.
3. Idea Development:
o Visual Representation: During brainstorming sessions, ideas can be visualized as
seeds that grow into plants, showing how concepts develop over time.
o Benefit: This approach encourages creative thinking and illustrates the evolution of
ideas.
4. Knowledge Mapping:
o Visual Representation: A garden metaphor can represent interconnected knowledge
areas, with plants symbolizing different concepts and their relationships.
o Benefit: This visualization aids in understanding how various pieces of knowledge
contribute to a larger framework.
5. Customer Journey Mapping:
o Visual Representation: A garden can illustrate the customer journey, with different
plants representing stages of engagement and growth in the relationship.
o Benefit: This metaphor highlights the nurturing aspect of customer relationships and
the importance of care at each stage.
Creating Effective Garden Visualizations


1. Visual Clarity:
o Ensure that the visualization remains clear and easy to navigate. Avoid overcrowding
the garden with too many elements that may confuse the viewer.
2. Consistent Symbolism:
o Use consistent colors, sizes, and types of plants to represent different categories or
statuses. This consistency helps viewers quickly interpret the visualization.
3. Interactive Features:
o Incorporate interactive elements that allow users to click on plants for more
information or to explore different pathways in the garden. This engagement
enhances exploration.
4. Descriptive Labels:
o Clearly label key components in the garden, including plants, paths, and features, to
provide context and improve understanding.
5. Feedback and Iteration:
o Gather feedback from users to ensure that the visualization effectively communicates
the intended message. Iterative improvements can lead to enhanced clarity and
usability.
Tools for Creating Metaphorical Visualizations

● Tableau: Offers capabilities for creating various types of visualizations, including
metaphorical ones with custom shapes.
● D3.js: A JavaScript library that allows for the creation of complex, interactive metaphorical
visualizations.
● Power BI: Useful for creating visually appealing dashboards that can include metaphorical
elements.
● Mind Mapping Tools: Tools like MindMeister or XMind can be used to create metaphorical
visualizations that represent ideas and their relationships.
Best Practices

● Clarity and Relevance: Ensure the metaphor is clear and directly relates to the concept being
visualized.
● Audience Consideration: Tailor the metaphor to resonate with your audience's experiences
and knowledge.
● Simplicity: Avoid overly complex metaphors that can confuse rather than clarify.
● Iterate and Test: Get feedback on your metaphorical visualizations to ensure they
communicate the intended message effectively.
DWDV

UNIT-V

Syllabus:
Visualization of volumetric data, vector fields, processes and simulations, Visualization
of maps, geographic information, GIS systems, collaborative visualizations, evaluating
visualizations
