Chapter 1: Introduction to Data Science
Data science enables businesses to process huge amounts of structured and unstructured big data to detect patterns. This in turn allows companies to increase efficiencies, manage costs, and identify new market opportunities.
Data science is the practice of mining large sets of raw data, both structured and unstructured, to identify patterns and extract actionable insights. It is an interdisciplinary field whose foundations include statistics, inference, computer science, predictive analytics, machine learning algorithm development, and new technologies for gaining insight from big data.
Why Data Science is Important
By 2020, there will be around 40 zettabytes of data, or 40 trillion gigabytes. The amount of data in existence grows exponentially: at any given time, about 90 percent of it was generated within the most recent two years, according to sources like IBM and SINTEF.
In fact, internet users generate about 2.5 quintillion bytes of data every day. By 2020, every person on Earth will be generating about 146,880 MB of data every day, and by 2025 the world will produce about 165 zettabytes every year.
This means there is a huge amount of work for data science, and much left to uncover: according to The Guardian, in 2012 only about 0.5 percent of all data had ever been analyzed.
Simple data analysis can interpret data from a single source, or a limited amount of data. However, data science tools are critical to understanding big data and data from multiple sources in a meaningful way. A look at some specific data science applications in business illustrates this point.
Data science applications are frequently used in healthcare, marketing, banking and finance, and
policy work. Here are some common examples of data science services in action in trending data
science fields:
How Data Science is Transforming Health Care
Data science is transforming healthcare as consumers and healthcare providers alike use data that wearables generate to monitor and prevent health problems and emergencies. In 2018, McKinsey described a "big data revolution" in healthcare. In fact, according to McKinsey, applying data science to the US healthcare system could reduce healthcare spending by $300 billion to $450 billion, or 12 to 17 percent of baseline healthcare costs.
Data science and analytics come together when data science is applied in a business setting. Data
science helps businesses better understand the specific needs customers have based on existing
data. For example, with customer age, purchase history, past browsing history, income, and other
demographics, a data scientist can more effectively train models for search and product
recommendation.
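To make this concrete, here is a minimal Python sketch of the idea, not any particular company's system: a classifier trained on the kinds of customer features listed above. All feature names and data here are synthetic placeholders.

```python
# A minimal sketch (synthetic data): predicting whether a customer will click a
# recommended product from demographic and behavioral features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 75, n)                 # hypothetical customer age
income = rng.normal(55_000, 15_000, n)        # hypothetical income
past_purchases = rng.poisson(5, n)            # purchases in the last year
browsing_minutes = rng.exponential(30, n)     # browsing history, in minutes
X = np.column_stack([age, income, past_purchases, browsing_minutes])

# Synthetic label: clicks are more likely for heavy browsers and frequent buyers.
logits = 0.04 * browsing_minutes + 0.3 * past_purchases - 3.0
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```

With real customer data, the same pattern applies: engineered features go in, and the trained model ranks products or search results for each user.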
Data Science in Finance
Data science is a powerful tool for fraud detection and prevention, honing the ability of financial institutions to recognize problematic patterns in data faster. Data science can also help reduce non-payment risk.
For example, institutions that provide loans or other money on credit need to limit the probability of customers defaulting on payments. To do this, they might use data science to create a model that can perform predictive analytics on customer payment history data. This would allow the institution to flag high-risk customers and make better-informed lending decisions.
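A hedged sketch of such a model, using logistic regression on synthetic payment-history features; the feature names and coefficients below are invented purely for illustration.

```python
# Predicting default probability from (synthetic) payment-history features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
late_payments = rng.poisson(2, n)       # number of late payments on record
utilization = rng.uniform(0, 1, n)      # credit utilization ratio
tenure_years = rng.uniform(0, 20, n)    # length of credit history
X = np.column_stack([late_payments, utilization, tenure_years])

# Synthetic label: default is more likely with late payments and high utilization.
logits = 0.8 * late_payments + 2.0 * utilization - 0.1 * tenure_years - 2.5
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression().fit(X, y)
new_customer = [[4, 0.9, 1.0]]          # 4 late payments, 90% utilization, 1 year
print(f"Estimated default probability: {model.predict_proba(new_customer)[0, 1]:.2f}")
```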
Government policymakers can use data science to better shape policies to meet the needs of their constituents, combat census undercount using big data and ML, and more. For example, policymakers might use geospatial data science and related data to drive decisions about when to evacuate an area based on historical weather patterns, with correlative or descriptive analysis of that data informing the decision.
Data scientists can collect and analyze data sets from aircraft, ships, satellites, and radar to create models. These models can help forecast the weather more effectively every day. They also enable scientists to predict natural disasters with greater precision, improve vegetation management, prevent the next Paradise disaster, and help disaster response decision makers know where resources are needed most.
To understand how data science helps marketing, consider the levels of insight that big data can offer
into consumer behavior. Companies can refine things like pricing and other marketing strategies using
data science.
For example, data science can drive pricing for e-commerce companies. This allows them to refine their sense of what the market will bear for their product or service, and increase their profits. Data science also helps businesses more effectively develop and market their products, because it allows them to tailor offerings to what customers actually want.
Data science and data analytics offer insight into purchasing patterns. Businesses can use descriptive
analytics to describe data sets surrounding how consumers buy under different conditions.
They can also deploy correlative analysis to predict when relationships might exist between given data sets or variables. For example, data might reveal that a subset of consumers who purchase certain kinds of products is very likely to try a product like the company's new offering.
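A toy example of this kind of correlative analysis with pandas, on a hypothetical purchase table:

```python
# Checking whether customers who buy one product category also tend to buy
# another. The categories and counts below are hypothetical.
import pandas as pd

purchases = pd.DataFrame({
    "fitness_gear":    [5, 0, 3, 8, 1, 0, 6, 2],
    "supplements":     [4, 1, 2, 7, 0, 0, 5, 3],
    "office_supplies": [0, 6, 1, 0, 5, 7, 1, 2],
})
# A strong positive correlation suggests a cross-selling opportunity.
print(purchases.corr().round(2))
```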
But data science goes further than this kind of analysis. It can also predict future patterns, identifying actions that could meaningfully affect overall business strategy. For instance, data scientists can uncover optimal price points, bids for programmatic advertising, or ways to generate new customers in untapped markets.
Data scientists need tools for data transformation, data cleaning, and data visualization. There is also a need to detect outliers, identify relationships between variables, and construct complete interpretive models inside a suitable environment. This is where data preparation and statistical analysis tools like R and Python come in.
R was developed as a user-friendly language for statistics, data analysis, and graphical models. R has a large community of programmers who use and support it online, so there is no need to develop everything alone.
R is particularly suited to data analysis tasks on individual servers that demand standalone analysis or
computing. It’s also excellent for exploratory work and ideal for data science visualization, working in
tandem with visualization packages such as googleVis, ggvis, ggplot2, and rCharts.
On the other hand, R may be too heavy and slow for your system, and many find its syntax difficult. Python was developed as a more readable language for general use, and it is simpler and more flexible to learn. Another key difference is that R exists mostly within the data science ecosystem, whereas Python is widely used in other fields as well. The IPython Notebook system allows users to share notebooks with each other, enabling easier collaboration without installations and dramatically reducing lost time. The easier learning curve also typically means less time before mastery, including writing and testing your own programs and code, even in other fields. The downside of Python for data science is weaker data visualization power.
Python for data science works in many of the same ways as R, and there is little need to learn them both. For some beginner users, Python may be easier to learn due to its simpler syntax. Conversely, for those with more statistical background or heavier statistical analysis demands, R for data science may be the better choice. Decide based on the data problems you will solve, your ability to learn and master the tool, how much data visualization you expect to do, and the current standards in your field or organization.
Data scientists represent data in the form of graphs, charts, and other visualizations. These data visualizations allow users to "see" insights that are invisible in Excel sheets of data. For example, you may want to depict how certain trends in data relate to each other, or how multiple factors coincide.
Data visualization environments are a common mode of deploying the results of data science to a broader audience, for example by using web-based tools that allow exploration of and interaction with the resulting data. To support effective data visualization, a system must have access to the relevant data.
Visualizing the data in a scatter-plot or other graph can reveal patterns and relationships that are
impossible to observe otherwise. It can also suggest further avenues for research, and new business
strategies.
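For instance, a few lines of Python can surface a saturating relationship that a raw spreadsheet would hide; the data below are synthetic.

```python
# A scatter plot making a hidden relationship visible: revenue rises with ad
# spend but saturates. The variables and values are hypothetical.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
ad_spend = rng.uniform(0, 100, 200)
revenue = 50 * np.log1p(ad_spend) + rng.normal(0, 10, 200)

plt.scatter(ad_spend, revenue, alpha=0.6)
plt.xlabel("Ad spend (thousands USD)")
plt.ylabel("Revenue (thousands USD)")
plt.title("Diminishing returns become obvious once plotted")
plt.show()
```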
Data comes from various sources, such as online purchases, multimedia forms, instruments, financial
logs, sensors, text files, and others. Data might be unstructured, semi-structured, or structured.
Unstructured data includes data from blogs, digital audio/video feeds, digital images, emails, mobile
devices, sensors, social networks and tweets, web pages, and online sources. Semi-structured data
includes data from system log files, XML files, and text files. Structured data, which has already been processed in some way, includes OLTP systems, relational databases (RDBMS), transaction data, and other formats.
This is all "big data," and putting it to good use is a pressing job of the 21st century. It's simply not possible to process tremendous amounts of data from disparate sources with simple business intelligence tools, or even data analytics tools. Instead, data science presents businesses with advanced, complex algorithms and other tools for analyzing, cleansing, processing, and extracting meaningful insights from data.
Data science is not one tool, skill, or method. Instead, it is a scientific approach that uses applied
statistical and mathematical theory and computer tools to process big data.
The foundations of data science combine the interdisciplinary strengths of data cleansing, intelligent data capture techniques, and data mining and programming. The result is the data scientist's ability to sift through massive data sets and draw intelligent insights from them.
This is one point that distinguishes the work of the data scientist from the data engineer, although
sometimes the two roles are confused. The data engineer prepares data sets for the data scientist to
work with and draw insights from, but the intelligent analysis work falls to data scientists, not “data
science engineers.”
Big data is the raw material of the field of data science. Characterized by its volume, velocity, and variety (the 3Vs), big data is the input, and data science affords the techniques for turning it into insight.
1. VOLUME
Within the social media space, for example, volume refers to the amount of data generated through websites, portals, and online applications, especially for B2C companies. This is the incredible amount of data generated from different sources like social media, cell phones, sensors, public data releases, photographs, and videos. This data is so large that it cannot be stored and analyzed using traditional techniques.
Consider the following: Facebook has 2 billion users, YouTube 1 billion, Twitter 350 million, and Instagram 700 million. Every day, these users contribute billions of images, posts, videos, tweets, and more. You can now imagine the insanely large amount, or volume, of data generated every minute and every hour.
2. VELOCITY
Velocity refers to the speed at which data is generated, collected, stored, and analyzed, and that speed plays a crucial role. Staying with our social media example: every day, 900 million photos are uploaded to Facebook, 500 million tweets are posted on Twitter, 0.4 million hours of video are uploaded to YouTube, and 3.5 billion searches are performed on Google. This is like a nuclear data explosion. Big data technologies help a company contain this explosion, accept the incoming flow of data, and process it fast enough that it does not create bottlenecks.
3. VARIETY
Variety in big data refers to all the structured and unstructured data that can be generated either by humans or by machines. The most commonly generated data include texts, tweets, pictures, and videos; data like emails, voicemails, hand-written text, ECG readings, and audio recordings are also important elements under variety. Variety is all about the ability to classify incoming data into various categories.
Application of data science
Below are ten applications that build upon the concepts of data science, exploring various domains such as the following.
Fraud and Risk Detection
The earliest applications of data science were in finance. Companies were fed up with bad debts and losses every year. However, they had a lot of data that was collected during the initial paperwork for sanctioning loans, so they decided to bring in data scientists to rescue them from those losses.
Over the years, banking companies learned to divide and conquer data via customer profiling, past expenditures, and other essential variables to analyze the probabilities of risk and default. It also helped them push their banking products based on customers' purchasing power.
Healthcare
The healthcare sector, especially, receives great benefits from data science applications.
1. Medical Image Analysis
Procedures such as detecting tumors, detecting artery stenosis, and organ delineation employ various methods and frameworks, like MapReduce, to find optimal parameters for tasks such as lung texture classification. They apply machine learning methods, support vector machines (SVMs), content-based medical image indexing, and wavelet analysis for solid texture classification.
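As a rough illustration of the SVM part of that pipeline, here is a self-contained Python sketch that classifies synthetic "texture patches" from simple hand-crafted features; real systems would use wavelet features on actual medical images.

```python
# Toy texture classification: smooth vs. granular synthetic patches, reduced to
# simple statistics and classified with a support vector machine.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

def make_patch(granular):
    # Hypothetical stand-in for a texture patch: noise level encodes the class.
    return rng.normal(0, 1.0 if granular else 0.3, (16, 16))

patches = [make_patch(t) for t in (0, 1) * 100]
labels = np.array([0, 1] * 100)

# Hand-crafted features per patch: mean, variance, and vertical gradient energy.
features = np.array([
    [p.mean(), p.var(), np.abs(np.diff(p, axis=0)).mean()] for p in patches
])

scores = cross_val_score(SVC(kernel="rbf"), features, labels, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f}")
```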
2. Genetics & Genomics
Data science applications also enable an advanced level of treatment personalization through research in genetics and genomics. The goal is to understand the impact of DNA on our health and find individual biological connections between genetics, diseases, and drug response. Data science techniques allow the integration of different kinds of data with genomic data in disease research, which provides a deeper understanding of genetic issues in reactions to particular drugs and diseases. As soon as we acquire reliable personal genome data, we will achieve a deeper understanding of human DNA. Advanced genetic risk prediction will be a major step towards more individualized care.
3. Drug Development
The drug discovery process is highly complicated and involves many disciplines. The greatest ideas are often bounded by billions in testing costs and huge expenditures of money and time. On average, it takes twelve years to make an official submission.
Data science applications and machine learning algorithms simplify and shorten this process, adding a perspective to each step, from the initial screening of drug compounds to the prediction of the success rate based on biological factors. Such algorithms can forecast how a compound will act in the body using advanced mathematical modeling and simulations instead of "lab experiments." The idea behind computational drug discovery is to create computer model simulations of a biologically relevant network, simplifying the prediction of future outcomes with high accuracy.
4. Virtual assistance for patients and customer support
Optimization of the clinical process builds upon the idea that, in many cases, it is not actually necessary for patients to visit doctors in person. A mobile application can provide a more effective solution by bringing the doctor to the patient instead.
AI-powered mobile apps can provide basic healthcare support, usually as chatbots. You simply describe your symptoms or ask questions, and then receive key information about your medical condition, derived from a wide network linking symptoms to causes. Apps can remind you to take your medicine on time and, if necessary, schedule an appointment with a doctor.
This approach promotes a healthy lifestyle by encouraging patients to make healthy decisions, saves their
time waiting in line for an appointment, and allows doctors to focus on more critical cases.
Internet Search
Now, this is probably the first thing that strikes your mind when you think of data science applications. When we speak of search, we think of Google, right? But there are many other search engines, like Yahoo, Bing, Ask, and AOL. All of these search engines (including Google) use data science algorithms to deliver the best results for a query in a fraction of a second. Consider the fact that Google processes more than 20 petabytes of data every day.
Had there been no data science, Google wouldn't be the 'Google' we know today.
Targeted Advertising
If you thought search was the biggest of all data science applications, here is a challenger: the entire digital marketing spectrum. From the display banners on various websites to the digital billboards at airports, almost all of them are placed using data science algorithms.
This is the reason digital ads have been able to achieve a much higher CTR (click-through rate) than traditional advertisements. They can be targeted based on a user's past behavior.
This is also why you might see ads for data science training programs while someone else sees an ad for apparel in the same place at the same time.
Website Recommendations
Aren't we all used to the suggestions about similar products on Amazon? They not only help you find relevant products among the billions available but also add a lot to the user experience.
Many companies have fervidly used this engine to promote their products in accordance with users' interests and the relevance of information. Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDb, and many more use this system to improve the user experience. The recommendations are made based on a user's previous search results.
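Underneath, many of these systems start from something like item-to-item similarity. Here is a bare-bones Python sketch on a tiny hypothetical ratings matrix; real recommenders are far larger and more sophisticated.

```python
# "Users who liked this also liked...": item-to-item cosine similarity on a
# hypothetical user-item ratings matrix (rows = users, columns = items).
import numpy as np

items = ["laptop", "mouse", "keyboard", "novel"]
ratings = np.array([
    [5, 4, 5, 0],
    [4, 5, 4, 0],
    [0, 0, 1, 5],
    [5, 3, 4, 1],
], dtype=float)                     # 0 means not purchased/rated

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

target = items.index("laptop")
sims = [(cosine(ratings[:, target], ratings[:, j]), items[j])
        for j in range(len(items)) if j != target]
print("Most similar to 'laptop':", max(sims)[1])   # -> 'keyboard'
```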
Advanced Image Recognition
You upload a photo with friends on Facebook, and you start getting suggestions to tag those friends. This automatic tag suggestion feature uses a face recognition algorithm.
In a recent update, Facebook outlined the additional progress it has made in this area, making specific note of its advances in image recognition accuracy and capacity:
“We’ve witnessed massive advances in image classification (what is in the image?) as well as object detection
(where are the objects?), but this is just the beginning of understanding the most relevant visual content of any
image or video. Recently we’ve been designing techniques that identify and segment each and every object in an
image, a key capability that will enable entirely new applications.”
In addition, Google provides you with the option to search for images by uploading them. It uses image
recognition and provides related search results.
Speech Recognition
Some of the best examples of speech recognition products are Google Voice, Siri, Cortana, and so on. Using the speech-recognition feature, even if you aren't in a position to type a message, your life wouldn't stop: simply speak the message, and it will be converted to text. At times, though, you will notice that speech recognition doesn't perform accurately.
Airline Route Planning
The airline industry across the world is known to bear heavy losses. Except for a few airline service providers, companies are struggling to maintain their occupancy ratios and operating profits. High rises in air-fuel prices and the need to offer heavy discounts to customers have made the situation worse. It wasn't long before airline companies started using data science to identify strategic areas of improvement. Now, using data science, airline companies can:
● Predict flight delays
● Decide which class of airplanes to buy
● Decide whether to fly directly to a destination or take a halt in between
● Effectively drive customer loyalty programs
Gaming
Games are now designed using machine learning algorithms that improve and upgrade themselves as the player moves up to higher levels. In motion gaming, too, your opponent (the computer) analyzes your previous moves and shapes its game accordingly. EA Sports, Zynga, Sony, Nintendo, and Activision Blizzard have taken the gaming experience to the next level using data science.
Augmented Reality
This is the last of the data science applications, and it seems the most exciting for the future: augmented reality. Data science and virtual reality do have a relationship, considering that a VR headset combines computing knowledge, algorithms, and data to provide you with the best viewing experience. A small step in this direction is the trending game Pokémon GO, with its ability to walk around and see Pokémon on walls, streets, and things that aren't really there. The creators of this game used data from Ingress, an earlier app from the same company, to choose the locations of the Pokémon and gyms.
1) Identifying problems and understanding business
Like any other good business lifecycle, the data science lifecycle also starts with ‘why?’
Identifying problems is one of the major steps necessary in the data science process to
find a clear objective around which all the following steps will be formulated. In short, it
is important to understand the business objective early since it will decide the final goal
of your analysis.
This phase should examine the trends of business, analyse case studies of similar
analysis, and study the industry’s domain. The team will assess in-house resources,
infrastructure, total time, and technology needs. Once these aspects are all identified
and evaluated, they will prepare an initial hypothesis to resolve the business
challenges following the current scenario. The phase should –
Clearly state the problem that requires solutions and why it should be resolved
at once
Define the potential value of the business project
Find risks, including ethical aspects involved in the project
Build and communicate a highly integrated, flexible project plan
2) Data collection
Data collection is the next stage in the data science lifecycle to gather raw data from
relevant sources. The data captured can be either in structured or unstructured form.
The data collected might come from website logs, social media, online repositories, streams from online sources via APIs, web scraping, or even Excel files and other sources.
The person performing this task should know the difference between the various data sets available and the organisation's data investment strategy. A major challenge professionals face in this step is tracking where each piece of data comes from and whether it is up to date. It is important to keep track of this information throughout the entire lifecycle of a data science project, as it might help when testing hypotheses or running updated experiments.
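A hedged sketch of one such collection path, pulling records from an HTTP API with Python's requests library; the endpoint URL and the response shape below are hypothetical placeholders.

```python
# Programmatic data collection from an online source via an HTTP API.
# API_URL is a placeholder; substitute a real endpoint, and record the source
# and retrieval date so provenance can be tracked, as discussed above.
import csv
import datetime
import requests

API_URL = "https://api.example.com/v1/orders"   # hypothetical endpoint

response = requests.get(API_URL, params={"since": "2024-01-01"}, timeout=30)
response.raise_for_status()
records = response.json()                       # assumes a JSON list of dicts

with open("orders_raw.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
print(f"Collected {len(records)} records on {datetime.date.today()}")
```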
3) Data processing
In this phase, data scientists analyse the collected data for biases, patterns, ranges, and distributions of values. This is done to determine the suitability of the data and to predict its usage in regression, machine learning, and deep learning algorithms. The phase also involves inspecting the different types of data, including nominal, numerical, and categorical data.
Data visualisation is also done to highlight critical trends and patterns in the data, comprehensible through simple bar and line charts. Simply put, data processing might be the most time-consuming, but it is arguably the most critical, phase in the entire lifecycle of data analytics. The quality of the model depends on this data processing stage.
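A small pandas example of the kinds of checks described above, on a hypothetical data set with an outlier and missing values:

```python
# Inspecting ranges and handling missing or implausible values with pandas.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29, None, 41, 230, 38],          # 230 is an implausible outlier
    "income": [52_000, 61_000, 48_000, None, 75_000, 58_000],
    "segment": ["a", "b", "b", "a", "a", None],
})

print(df.describe())                              # ranges and value distributions
df["age"] = df["age"].where(df["age"] < 120)      # treat impossible ages as missing
df["age"] = df["age"].fillna(df["age"].median())  # impute numeric gaps
df["income"] = df["income"].fillna(df["income"].median())
df["segment"] = df["segment"].fillna("unknown")   # label missing categories
print(df)
```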
4) Data analysis
Data Analysis or Exploratory Data Analysis is another critical step in gaining some ideas
about the solution and factors affecting the data science lifecycle. There are no set
guidelines for this methodology, and it has no shortcuts. The key aspect to remember
here is that your input determines your output. In this section, the data prepared from
the previous stage will be explored further to examine the various features and their
relationships, aiding in better feature selection required for applying it to the model.
Experts use statistical methods such as the mean and median to better understand the data. They also plot the data and assess its distribution patterns using histograms, spectrum analysis, and population distributions. The data is then analysed in whatever way the issues at hand demand.
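A compact EDA pass along these lines in Python, on synthetic data; the column names are hypothetical.

```python
# Summary statistics, feature relationships, and a distribution plot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, 300),
    "monthly_spend": rng.gamma(2.0, 30.0, 300),
})
# Synthetic target: churn is more likely for short-tenure customers.
df["churned"] = (rng.uniform(size=300) < 1 / (1 + df["tenure_months"] / 12)).astype(int)

print(df.describe())                 # mean, median (50%), spread
print(df.corr().round(2))            # relationships between variables
df["monthly_spend"].hist(bins=30)    # distribution pattern via a histogram
plt.xlabel("Monthly spend")
plt.show()
```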
5) Data modelling
Modelling data is one of the major phases of the data process and is often described as the heart of data analysis. A model should use the prepared and analysed data to provide the desired output. The environment needed for executing the data model is decided and created before the specific requirements are met.
In this phase, the team works together to develop datasets for training and testing the model for production purposes. It also involves tasks such as choosing the appropriate model type and determining whether the problem is a classification, regression, or clustering problem. After settling on the model family, you must choose the algorithms to implement. This has to be done carefully, since extracting the necessary insights from the prepared data is extremely important.
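A sketch of this step with scikit-learn: split the prepared data into training and test sets, then compare two candidate algorithms for a classification problem. The data here are synthetic.

```python
# Train/test split and a simple comparison of two model families.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(800, 6))
# Synthetic target with a mildly nonlinear decision boundary.
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, 800) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
for model in (LogisticRegression(), GradientBoostingClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, f"test accuracy: {model.score(X_test, y_test):.2f}")
```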
6) Model deployment
Now we are at the final stage of the data science lifecycle. After a rigorous evaluation process, the model is finally ready to be deployed in the desired format and preferred channel. Remember, a machine learning model has no value until it is deployed to production; hence, machine learning models have to be recorded and versioned before the deployment process. In general, these models are integrated and coupled with products and applications.
The model deployment stage involves creating a delivery mechanism to get the model out into the market among users, or to another system. Machine learning models are also being deployed on devices, gaining adoption and popularity in the field of computing. From a simple model output in a Tableau dashboard to something as complex as scaling to the cloud in front of millions of users, this step differs from project to project.
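One minimal delivery mechanism, sketched with Flask: wrap a trained model in a small HTTP service so other applications can request predictions. This is an illustrative pattern, not a production setup; the model here is a trivial stand-in.

```python
# Serving a model behind a tiny HTTP endpoint.
import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression

# Stand-in for a model loaded from a registry; in practice you would load a
# serialized, versioned artifact rather than train at startup.
model = LogisticRegression().fit(np.array([[0.0], [1.0], [2.0], [3.0]]), [0, 0, 1, 1])

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. {"features": [1.7]}
    proba = model.predict_proba([features])[0, 1]
    return jsonify({"probability": round(float(proba), 3)})

if __name__ == "__main__":
    app.run(port=5000)
```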
Data Scientist:
A data scientist is a professional who works with enormous amounts of data to come up with compelling business insights through the deployment of various tools, techniques, methodologies, algorithms, and so on.
Skills required: To become a data scientist, one should have technical language skills in tools such as R, SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data scientists must also have an understanding of statistics, mathematics, and visualization, along with communication skills.
Microsoft Excel – Excel prevails as the easiest and most popular tool for handling small amounts of data. The maximum number of rows it supports is just a shade over 1 million, and one sheet can handle only up to 16,384 columns at a time. These numbers are simply not enough when the amount of data is big.
Microsoft Access – A popular tool from Microsoft used for data storage. Smaller databases of up to 2 GB can be handled smoothly with this tool, but beyond that, it starts cracking up.
SQL – SQL is one of the most popular database management languages and has been around since the 1970s. It was the primary database solution for a few decades. SQL remains popular, but there is a drawback: it becomes difficult to scale as the database continues to grow.
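SQL in miniature, using Python's built-in sqlite3 module, to show the relational model the bullet describes:

```python
# A structured query over structured data: aggregate sales per region.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.5), ("north", 64.0)])

for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)
```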
We have covered some of the basic tools so far. It is time to unleash the big guns now! If your data ranges from 10 GB all the way up to 1 TB and beyond, you need to move to distributed big data tools such as Hadoop and Spark.
Variety refers to the different types of data that are out there. The data may be one of three types: structured, semi-structured, or unstructured. As examples falling under the umbrella of these different data types, consider database tables and transaction records (structured), log files and XML files (semi-structured), and free-text customer feedback, images, and videos (unstructured). Take a moment to observe these examples and correlate them with your real-world data.
As you might have observed, in the case of structured data there is a certain order and structure, whereas the unstructured examples do not follow any trend or pattern. For example, customer feedback may vary in length, sentiment, and other factors. Moreover, these types of data are huge and diverse.
It can be very challenging to tackle this type of data, so what are the different data
science tools available in the market for managing and handling these different data
types?
The two most common types of databases are SQL and NoSQL. SQL has been the market-dominant option for decades. Some examples of SQL databases are Oracle, MySQL, and SQLite, whereas NoSQL includes popular databases like MongoDB and Cassandra. These NoSQL databases are seeing huge adoption numbers because of their ability to scale and handle dynamic data.
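A hedged sketch of why the document model handles dynamic data well, using the pymongo driver; this assumes a MongoDB server running on localhost, and the database and field names are hypothetical.

```python
# Documents in one collection need not share a schema: two differently shaped
# records stored side by side, with no ALTER TABLE required.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local server
users = client["demo_db"]["users"]

users.insert_one({"name": "Asha", "skills": ["python", "sql"]})
users.insert_one({"name": "Ben", "location": "Pune", "followers": 120})

for doc in users.find({"name": "Asha"}):
    print(doc)
```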
The third and final V represents velocity. This is the speed at which the data is captured, and it covers both real-time and non-real-time data. We'll be talking mainly about real-time data here. A particularly complex example is the sensor data collected by self-driving cars: the car has to dynamically collect and process data regarding its lane, its distance from other objects, and so on, all in real time. Some other examples of real-time data collection are:
● CCTV
● Stock trading
● Fraud detection for credit card transactions
● Network data – social media (Facebook, Twitter, etc.)
More than 1 TB of data is generated during each trade session at the New York Stock Exchange!
Now, let’s head on to some of the commonly used data science tools to handle real-time
data:
Apache Storm – This tool by Apache can be used with almost all the
programming languages. It can process up to 1 Million tuples per second and it is
highly scalable. It is a good tool to consider for high data velocity.
Amazon Kinesis – This tool by Amazon is similar to Apache Kafka, the popular open-source streaming platform, but it comes with a subscription cost. However, it is offered as an out-of-the-box solution, which makes it a very powerful option for organizations.
Apache Flink – Flink is yet another tool by Apache that we can use for real-time
data. Some of the advantages of Flink are high performance, fault tolerance, and
efficient memory management.
Now that we have a solid grasp of the different tools commonly used for working with big data, let's move to the segment where you take advantage of that data by applying data science to it.
If you're setting up a brand new data science project, you'll have a ton of questions in mind. This is true regardless of your level, whether you're a data scientist, a data engineer, or a data analyst.
In this section, we will discuss some of the popular data science tools used across the industry. Data science is a broad term in itself, consisting of a variety of different domains, and each domain has its own business importance and complexity. The data science spectrum captures this nicely: it arranges the domains by their relative complexity and the business value they provide. Let us take up each of the points on this spectrum.
Let's begin with the lower end of the spectrum. This kind of work enables an organization to identify trends and patterns so as to make crucial strategic decisions. The types of analysis range from simple reports to interactive dashboards, and some commonly used tools are:
Excel – It gives a diverse range of options including Pivot tables and charts that
let you do analysis in double-quick time. This is, in short, the Swiss Army Knife of
data science/analytics tools
QlikView – It lets you consolidate, search, visualize, and analyze all your data
sources with just a few clicks. It is an easy and intuitive tool to learn which makes
it so popular.
Tableau – It is amongst the most popular data visualization tools in the market
today. It is capable of handling large amounts of data and even offers Excel-like
calculation functions and parameters. Tableau is well-liked because of its neat
dashboard and story interface.
Moving further up the ladder, the stakes get higher in terms of both complexity and business value! This is the domain where the bread and butter of most data scientists comes from. Some of the types of problems you'll solve here are statistical modeling, forecasting, and machine learning. Commonly used tools include:
Python – This is one of the most dominant languages for data science in the industry today because of its ease, flexibility, and open-source nature. It has gained rapid popularity and acceptance in the ML community.
Julia – This is an up-and-coming language, touted by some as a successor to Python. It is still in a relatively early stage, and it will be interesting to see how it performs in the future. Julia is an open-source programming language used for numerical computing, as well as machine learning and other kinds of data science applications. In a 2012 blog post announcing Julia, its four creators said they set out to design one language that addressed all of their needs. A big goal was to avoid having to write programs in one language and convert them to another for execution. To that end, Julia combines the convenience of a high-level dynamic language with performance comparable to statically typed languages such as C and Java. Users don't have to define data types in programs, but an option allows them to do so. The use of a multiple dispatch approach at runtime lets Julia select which method to run based on the types of all of a function's arguments, which contributes to both its flexibility and its speed.
Jupyter Notebooks – These notebooks are widely used for coding in Python.
While it is predominantly used for Python, it also supports other languages such
as Julia, R, etc.
The tools we have discussed so far are true open-source tools. You don't need to pay for them or buy any extra licenses, and they have thriving, active communities that support them. Next come some popular commercial tools.
SAS
It is a very popular and powerful tool, prevalent in the banking and financial sectors. It has a very high share in private organizations like American Express, JP Morgan, Mu Sigma, the Royal Bank of Scotland, and others. Short for Statistical Analysis System, SAS is a statistical tool developed by the SAS Institute and is one of the oldest data analysis tools available. The latest stable build, v9.4M6, was released back in 2018.
The goal is to turn data into information, and information into insight.
- Carly Fiorina
SAS allows you to mine, alter, manage, and retrieve data from several sources. Paired with SQL, SAS becomes an extremely efficient tool for data access and analysis. SAS has grown into a suite of tools serving several purposes; some of these areas are:
● Data Mining
● Statistical Analysis
● Business Intelligence Applications
SPSS – Short for Statistical Package for Social Sciences, SPSS was acquired by
IBM in 2009. It offers advanced statistical analysis, a vast library of machine
learning algorithms, text analysis, and much more.
In this section, we are going to cover data, the types of data, structured vs. unstructured data, and suitable data stores.
What Is Data?
Data is a set of facts such as descriptions, observations, and numbers used in decision making.
We can classify data as structured, unstructured, or semi-structured data.
1) Structured Data
Structured data is generally tabular data that is represented by columns and rows in a database.
Databases that hold tables in this form are called relational databases.
The mathematical term "relation" refers to a formed set of data held as a table.
In structured data, every row in a table has the same set of columns.
SQL (Structured Query Language) is the programming language used to work with structured data.
2) Semi-structured Data
Semi-structured data is a data type that contains semantic tags but does not conform to the structure associated with typical relational databases. It maintains internal tags and markings that identify separate data elements, which enables information grouping and hierarchies. Both documents and databases can be semi-structured. This type of data only represents about 5-10% of the structured/semi-structured/unstructured data pie, but it has critical business use cases.
1. Markup language XML: This is a semi-structured document language. XML is a set of document encoding rules that defines a human- and machine-readable format. Its value is that its tag-driven structure is highly flexible, and coders can adapt it to universalize data structure, storage, and transport on the Web.
2. Open standard JSON (JavaScript Object Notation): This is another semi-structured data interchange format. JavaScript is implicit in the name, but other C-like programming languages recognize it too. Its structure consists of name/value pairs (or object, hash table, etc.) and an ordered value list (or array, sequence, list). Since the structure is interchangeable among languages, JSON excels at transmitting data between web applications and servers.
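The structure described above, shown concretely with Python's standard json module; the field values are hypothetical:

```python
# Name/value pairs and an ordered list, serialized to and from JSON text.
import json

profile = {
    "name": "Asha",                            # name/value pair
    "title": "Data Analyst",
    "skills": ["sql", "python", "tableau"],    # ordered value list (array)
}
text = json.dumps(profile, indent=2)
print(text)                    # this text is what travels between apps and servers
restored = json.loads(text)    # any JSON-aware language can parse it back
print(restored["skills"][0])
```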
3. NoSQL: Semi-structured data is also an important element of many NoSQL ("not only SQL") databases. NoSQL databases differ from relational databases because they do not separate the organization (schema) from the data. This makes NoSQL a better choice for storing information that does not easily fit into the record-and-table format, such as text with varying lengths. It also allows for easier data exchange between databases. NoSQL databases are common in big data infrastructure and real-time Web applications like LinkedIn. On LinkedIn, hundreds of millions of business users freely share job titles, locations, skills, and more, and LinkedIn captures that massive data in a semi-structured format.
When job-seeking users create a search, LinkedIn matches the query to its massive semi-structured data stores, cross-references the data with hiring trends, and shares the resulting recommendations with job seekers.
3) Unstructured Data
Unstructured data is information that is not organized in a pre-defined manner or does not have a pre-defined data model.
Unstructured information is typically text-heavy but may contain data such as numbers, dates, and facts as well.
Videos, audio, and binary data files might not have any specific structure; they are referred to as unstructured data.
There are some notable differences between structured and unstructured data to be
aware of when dealing with any of the data types. The following table will help compare
the two types of data based on factors such as data sources, data storage, internal
structure, data format, scalability, usage, and more.
| Structured data | Unstructured data |
| --- | --- |
| Used in machine learning and drives machine learning algorithms. | Used in natural language processing and text mining. |
| Stored in tabular formats like Excel sheets or SQL databases. | Stored as audio files, video files, or in NoSQL databases. |
| Stored in data warehouses. | Stored in data lakes. |
| Requires less storage space and is highly scalable. | Requires more storage space and is difficult to scale. |
| Sr. No. | Key | Structured Data | Semi-Structured Data | Unstructured Data |
| --- | --- | --- | --- | --- |
| 1 | Level of organizing | As the name suggests, this type of data is well organized, so the level of organization is highest. | The data is organized up to some extent and the rest is non-organized, so the level of organization is less than that of structured data and higher than that of unstructured data. | The data is fully non-organized, so the level of organization is lowest. |
| 2 | Means of organization | Structured data is organized by means of relational databases. | Semi-structured data is partially organized by means of XML/RDF. | Unstructured data is based on simple character and binary data. |
| 6 | Performance | We can perform structured queries that allow complex joins, so performance is the highest. | Only queries over anonymous nodes are possible, so performance is lower. | Only textual queries are possible, so performance is the lowest. |
3. Not all unstructured data is high quality: Unstructured data can be very uneven when it comes to quality. The lack of consistency occurs because the data is difficult to verify and, therefore, is not always accurate. For example, Facebook status updates, images, and videos all qualify as unstructured data, but that does not automatically make them useful to organizations.
Multimodal Data
Today, the explosion of unstructured data is evolving into a big challenge for industry and researchers. The IoT (Internet of Things) has allowed us to remain connected at all times through different electronic gadgets. This communication network generates huge volumes of data in different formats and data types. When dealing with such contexts, we may need to collect and explore multimodal (different forms) and multimedia (different media) data such as images, music and other sounds, gestures, body posture, and the use of space. Once the sources are identified, the next thing to consider is the kind of data that can be extracted from those sources. Based on the nature of the information collected, the data can be categorized into two types: structured data and unstructured data. One of the well-known applications of such multimedia data is the analysis of brain imaging data sequences, where the sequence can be a series of images from different sensors or a time series from the same subject.
The typical dataset used in this kind of application is a multimodal face dataset, which contains output from different sensors such as EEG, MEG, and fMRI (medical imaging techniques) on the same subject within the same paradigm. In this field, Statistical Parametric Mapping (SPM) is a well-known statistical technique, created by Karl Friston, that examines differences in brain activity recorded during functional neuroimaging experiments.
Standard Datasets
A dataset (or data set) is simply a collection of data.
In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable and each row corresponds to a given record of the data set in question. In the open data discipline, the data set is the unit for measuring the information released in a public open data repository. The simplest and most common format for datasets is a spreadsheet or CSV file: a single file organized as a table of rows and columns. Sometimes a dataset may be a zip file or folder containing multiple data tables with related data. Uploading datasets as Open Access helps both individuals and institutions. The availability of authenticated and standardized data sets supports research reproducibility and fosters innovation and discoverability.