
Chapter 1 Introduction to Data Science

 Introduction to data science, The 3 V’s: Volume, Velocity, Variety


 Why learn Data Science?
 Applications of Data Science
 The Data Science Lifecycle
 Data Scientist’s Toolbox
 Types of Data-Structured, semi-structured, Unstructured Data, Problems with unstructured data
 Data sources
 Open Data, Social Media Data, Multimodal Data, standard datasets
 Data Formats
 Integers, Floats, Text Data, Text Files, Dense Numerical Arrays, Compressed or Archived Data, CSV
Files, JSON Files, XML Files, HTML Files , Tar Files, GZip Files, Zip Files, Image Files: Rasterized,
Vectorized, and/or Compressed

Introduction to data science

What is Data Science?

Data science enables businesses to process huge amounts of structured and unstructured big data to

detect patterns. This in turn allows companies to increase efficiencies, manage costs, identify new

market opportunities, and boost their market advantage.

Data Science Definition

Data science is the practice of mining large data sets of raw data, both structured and unstructured, to

identify patterns and extract actionable insight from them. This is an interdisciplinary field, and the

foundations of data science include statistics, inference, computer science, predictive analytics,

machine learning algorithm development, and new technologies to gain insights from big data.
Why Data Science is Important

By 2020, there will be around 40 zettabytes of data—that's 40 trillion gigabytes. The amount of data

that exists grows exponentially: at any given time, roughly 90 percent of all existing data was generated in the most recent two years, according to sources like IBM and SINTEF.

In fact, internet users generate about 2.5 quintillion bytes of data every day. By 2020, every person on

Earth will be generating about 146,880 GB of data every day, and by 2025, that will be 165 zettabytes

every year.

This means there is a huge amount of work in data science—much left to uncover. According to The

Guardian, in 2012 only about 0.5 percent of all data was analyzed.

Simple data analysis can interpret data from a single source, or a limited amount of data. However,

data science tools are critical to understanding big data and data from multiple sources in a

meaningful way. A look at some of the specific data science applications in business illustrates this point and provides a compelling introduction to data science.

What Can Data Science Be Used For?

Data science applications are frequently used in healthcare, marketing, banking and finance, and

policy work. Here are some common examples of data science services in action in trending data

science fields:
How Data Science is Transforming Health Care

Data science is transforming healthcare as consumers and healthcare providers alike use data that

wearables generate to monitor and prevent health problems and emergencies. In 2018, McKinsey

described a “big data revolution” in healthcare. In fact, according to McKinsey, applying data science

to the US healthcare system could reduce healthcare spending by $300 billion to $450 billion, or 12 to

17 percent of its total cost.


Data Science for Business

Data science and analytics come together when data science is applied in a business setting. Data

science helps businesses better understand the specific needs customers have based on existing

data. For example, with customer age, purchase history, past browsing history, income, and other

demographics, a data scientist can more effectively train models for search and product

recommendation.
Data Science in Finance

Data science is a powerful tool for fraud detection and prevention, honing the ability of financial

institutions to recognize problematic patterns in data faster. Data science can also help reduce non-

performing assets, revealing downward trends sooner.

For example, institutions that provide loans or other money on credit need to limit the probability of

customers defaulting on payments. To do this, they might use data science to create a model that can

perform predictive analytics on customer payment history data. This would allow the institution to

predict whether future payments will occur in a timely way—or at all.
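Below is a minimal sketch of what such a predictive model might look like in Python with scikit-learn. The file name, columns, and features are hypothetical placeholders, not a prescribed implementation.

# Hedged sketch: predicting payment default from (hypothetical) payment-history data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical file and columns; replace with real prepared data.
df = pd.read_csv("payment_history.csv")
X = df[["late_payments", "credit_utilisation", "income"]]
y = df["defaulted"]                       # 1 = defaulted, 0 = paid on time

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
# model.predict_proba(new_customers) would give default probabilities for new applicants.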


How Data Science is Transforming Policy Work

Government policymakers can use data science to better shape policies to meet the needs of their

constituents, combat census undercount using big data and ML, and more. For example,

policymakers might want to use geospatial data science and related data to drive decisions about

when to evacuate an area based on historical weather patterns. The correlative or descriptive analysis

of data sets can help make these kinds of decisions.

Data scientists can collect and analyze data sets from aircraft, ships, satellites, and radars to create

models. These models can help forecast the weather more effectively every day. However, they also

enable scientists to predict natural disasters with greater precision, improve vegetation

management, prevent the next Paradise disaster, and help disaster response decision makers know

when the optimal evacuation time might be.


What is Marketing Data Science?

To understand how data science helps marketing, consider the levels of insight that big data can offer

into consumer behavior. Companies can refine things like pricing and other marketing strategies using

data science.

For example, data science can drive pricing for e-commerce companies. This allows them to refine

their sense of what the market will bear for their product or service, and increase their profits. Data

science also helps businesses more effectively develop and market their products, because it allows

them to select target customers more efficiently.

Data science and data analytics offer insight into purchasing patterns. Businesses can use descriptive

analytics to describe data sets surrounding how consumers buy under different conditions.

They can also deploy correlative analysis to predict when relationships might exist between given

data sets or variables. For example, data might reveal that a subset of consumers that purchase

certain kinds of products are very likely to try products like their new offering.

But data science goes further than this kind of analysis. It can also predict future patterns, identifying

actions that could meaningfully affect overall business strategy. For instance, data scientists can

uncover optimal price points, bids for programmatic advertising or ways to generate new customers in

the future based on trends in existing data.

R vs Python for Data Science

Data scientists need tools for data transformation, data cleaning, and data visualization. There is also

a need to detect outliers, identify relationships between variables, and construct complete interpretive

models inside a suitable environment. This is where data preparation and statistical analysis tools like

R and Python come in.

R was developed as a user-friendly language for statistics, data analysis, and graphical models. R

has a large community of programmers that use and support it online, so there is no need to develop

everything alone.
R is particularly suited to data analysis tasks on individual servers that demand standalone analysis or

computing. It’s also excellent for exploratory work and ideal for data science visualization, working in

tandem with visualization packages such as googleVis, ggvis, ggplot2, and rCharts.

On the other hand, R may be too heavy and slow for your system. It also has difficult syntax, and

comes with a learning curve that can be steep.

Python was developed as a more readable language for general uses, and it is simpler and more

flexible to learn. Another key difference is that R exists mostly within the data science ecosystem,

while Python is used in various verticals.

The IPython Notebook system allows users to share notebooks with each other, enabling easier collaboration without extra installations and dramatically reducing lost time. Python's easier learning curve also typically means a shorter time to mastery in writing and testing your own programs and code, including in other fields. The downside to Python for data science is weaker data visualization power.

Python and R overlap heavily for data science work, so there is rarely a need to learn both. However, for beginners, Python may be easier to learn due to its simpler syntax.

Conversely, for those with more statistical background or more statistical analysis demands, R for

data science may be a better choice. Decide based on the data problems you will solve, your ability to

learn and master the tool, how much data visualization you expect to do, and the current standards in

your specific vertical.


How is Data Visualization Used in Data Science?

Data scientists represent data in the form of graphs, charts and other visualizations. These data

visualizations allow users to “see” insights that are invisible in Excel sheets of data. For example, you

may want to depict how certain trends in data relate to each other, or how multiple factors coincide.

Data visualization environments are a common mode of deploying the results of data science to a

broader audience, for example, by using web-based tools that allow exploration and interaction with

the resulting data. To support effective data visualization, a system must have access to the relevant

data science outputs and have intuitive interaction capabilities.

Visualizing the data in a scatter-plot or other graph can reveal patterns and relationships that are

impossible to observe otherwise. It can also suggest further avenues for research, and new business

strategies.
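As a simple illustration, the following sketch plots two synthetic variables as a scatter plot with matplotlib; the data and variable names are invented for demonstration.

# Scatter plot sketch on synthetic data: a relationship that is hard to spot in a raw table.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ad_spend = rng.uniform(1, 100, 200)              # hypothetical marketing spend per region
sales = 3.5 * ad_spend + rng.normal(0, 25, 200)  # noisy linear relationship

plt.scatter(ad_spend, sales, alpha=0.6)
plt.xlabel("Ad spend (thousands)")
plt.ylabel("Sales (thousands)")
plt.title("Ad spend vs. sales")
plt.show()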

Big Data vs Data Science

Data comes from various sources, such as online purchases, multimedia forms, instruments, financial

logs, sensors, text files, and others. Data might be unstructured, semi-structured, or structured.

Unstructured data includes data from blogs, digital audio/video feeds, digital images, emails, mobile

devices, sensors, social networks and tweets, web pages, and online sources. Semi-structured data

includes data from system log files, XML files, and text files. Structured data, which has already been processed in some way, includes OLTP systems, RDBMS (databases), transaction data, and other formats.
This is all “big data,” and putting it to good use is a pressing job of the 21st century. It's simply not

possible to process tremendous amounts of data from disparate sources with simple business

intelligence tools, or even data analytics tools. Instead, data science presents businesses with

advanced, complex algorithms and other tools for analyzing, cleansing, processing, and extracting

meaningful insights from data.

Data science is not one tool, skill, or method. Instead, it is a scientific approach that uses applied

statistical and mathematical theory and computer tools to process big data.

The foundations of data science combine the interdisciplinary strengths of data cleansing, intelligent

data capture techniques, and data mining and programming. The result is the data scientist’s ability to

capture, maintain, and prepare big data for intelligent analysis.

This is one point that distinguishes the work of the data scientist from the data engineer, although

sometimes the two roles are confused. The data engineer prepares data sets for the data scientist to

work with and draw insights from, but the intelligent analysis work falls to data scientists, not “data

science engineers.”

Big data is the raw material used in the field of data science. Characterized by its volume, velocity, and variety (the 3 Vs), it is analyzed using the techniques that data science provides.

The 3 V’s: Volume, Velocity, Variety

1. VOLUME

Within the social media space, for example, Volume refers to the amount of data generated through websites, portals, and online applications, especially for B2C companies. This is the incredible amount of data generated from different sources such as social media, cell phones, sensors, public data releases, photographs, and videos. This data is so large that it cannot be stored or analyzed using traditional techniques.
Consider the following -Facebook has 2 billion users, Youtube 1 billion users, Twitter 350 million users
and Instagram 700 million users. Every day, these users contribute to billions of images, posts,
videos, tweets etc. You can now imagine the insanely large amount -or Volume- of data that is
generated every minute and every hour.
2.VELOCITY

Speed plays a crucial role here. Velocity refers to the speed at which data is generated, collected, stored, and analyzed. Staying with our social
media example, every day 900 million photos are uploaded on Facebook, 500 million tweets are
posted on Twitter, 0.4 million hours of video are uploaded on Youtube and 3.5 billion searches are
performed in Google. This is like a nuclear data explosion. Big Data helps the company to hold this
explosion, accept the incoming flow of data and at the same time process it fast so that it does not
create bottlenecks.

3. VARIETY

Variety in Big Data refers to all the structured and unstructured data that may be generated either by humans or by machines. The most commonly contributed data include texts, tweets, pictures, and videos. Unstructured data like emails, voicemails, hand-written text, ECG readings, and audio recordings are also important elements under Variety. Variety is all about the ability to classify the incoming data into various categories.
Application of data science

10 applications that build upon the concepts of Data Science, exploring various domains such as the following:

 Fraud and Risk Detection


 Healthcare
 Internet Search
 Targeted Advertising
 Website Recommendations
 Advanced Image Recognition
 Speech Recognition
 Airline Route Planning
 Gaming
 Augmented Reality

Fraud and Risk Detection

The earliest applications of data science were in finance. Companies were fed up with bad debts and losses every year. However, they had a lot of data that was collected during the initial paperwork while sanctioning loans, so they decided to bring in data scientists to rescue them from losses.

Over the years, banking companies learned to divide and conquer data via customer profiling, past expenditures, and other essential variables to analyze the probabilities of risk and default. Moreover, it also helped them push their banking products based on customers' purchasing power.

Healthcare

The healthcare sector, especially, receives great benefits from data science applications.

1. Medical Image Analysis

Procedures such as detecting tumors, artery stenosis, and organ delineation employ various methods and frameworks, such as MapReduce, to find optimal parameters for tasks like lung texture classification. They apply machine learning methods, support vector machines (SVM), content-based medical image indexing, and wavelet analysis for solid texture classification.
2. Genetics & Genomics

Data Science applications also enable an advanced level of treatment personalization through research in
genetics and genomics. The goal is to understand the impact of the DNA on our health and find individual
biological connections between genetics, diseases, and drug response. Data science techniques allow integration
of different kinds of data with genomic data in the disease research, which provides a deeper understanding of
genetic issues in reactions to particular drugs and diseases. As soon as we acquire reliable personal genome data,
we will achieve a deeper understanding of the human DNA. The advanced genetic risk prediction will be a
major step towards more individual care.

3. Drug Development

The drug discovery process is highly complicated and involves many disciplines. The greatest ideas are often constrained by billions of tests and huge expenditures of money and time. On average, it takes twelve years for a drug to reach official submission.

Data science applications and machine learning algorithms simplify and shorten this process, adding a perspective to each step, from the initial screening of drug compounds to the prediction of the success rate based on biological factors. Such algorithms can forecast how a compound will act in the body using advanced mathematical modeling and simulations instead of lab experiments. The idea behind computational drug discovery is to create computer simulations of biologically relevant networks, simplifying the prediction of future outcomes with high accuracy.

4. Virtual assistance for patients and customer support
Optimization of the clinical process builds upon the concept that for many cases it is not actually necessary for
patients to visit doctors in person. A mobile application can give a more effective solution by bringing the
doctor to the patient instead.

The AI-powered mobile apps can provide basic healthcare support, usually as chatbots. You simply describe
your symptoms, or ask questions, and then receive key information about your medical condition derived from a
wide network linking symptoms to causes. Apps can remind you to take your medicine on time, and if
necessary, assign an appointment with a doctor.
This approach promotes a healthy lifestyle by encouraging patients to make healthy decisions, saves their
time waiting in line for an appointment, and allows doctors to focus on more critical cases.

The most popular applications nowadays are Your.MD and Ada.

Internet Search

Now, this is probably the first thing that strikes your mind when you think Data Science Applications.

When we speak of search, we think ‘Google’. Right? But there are many other search engines like Yahoo, Bing,
Ask, AOL, and so on. All these search engines (including Google) make use of data science algorithms to
deliver the best result for our searched query in a fraction of a second. Consider the fact that Google processes more than 20 petabytes of data every day.

Had there been no data science, Google wouldn’t have been the ‘Google’ we know today.

Targeted Advertising

If you thought Search would have been the biggest of all data science applications, here is a challenger – the
entire digital marketing spectrum. Starting from the display banners on various websites to the digital billboards
at the airports – almost all of them are decided by using data science algorithms.

This is the reason why digital ads have been able to get a much higher CTR (Click-Through Rate) than traditional advertisements. They can be targeted based on a user's past behavior.

This is why you might see ads for data science training programs while I see an ad for apparel in the same place at the same time.
Website Recommendations

Aren't we all used to the suggestions about similar products on Amazon? They not only help you find relevant products from the billions available but also add a lot to the user experience.

A lot of companies have fervently used this engine to promote their products in accordance with users' interests and the relevance of information. Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDb, and many more use this system to improve the user experience. The recommendations are made based on a user's previous search results.
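A toy sketch of the idea behind such recommendation engines is shown below, using item-to-item cosine similarity on a made-up user-product matrix; production systems at these companies are far more elaborate.

# Item-based recommendation sketch: find products similar to a given product.
import numpy as np

# Rows = users, columns = products; values = hypothetical purchase/rating counts.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Similarity of every product column to product 0.
similarity_to_p0 = [cosine(ratings[:, 0], ratings[:, j]) for j in range(ratings.shape[1])]
print("Products ranked by similarity to product 0:", np.argsort(similarity_to_p0)[::-1])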
Advanced Image Recognition

You upload a picture with friends on Facebook and start getting suggestions to tag your friends. This automatic tag suggestion feature uses a face recognition algorithm.

In their latest update, Facebook has outlined the additional progress they’ve made in this area, making specific
note of their advances in image recognition accuracy and capacity.

“We’ve witnessed massive advances in image classification (what is in the image?) as well as object detection
(where are the objects?), but this is just the beginning of understanding the most relevant visual content of any
image or video. Recently we’ve been designing techniques that identify and segment each and every object in an
image, a key capability that will enable entirely new applications.”
In addition, Google provides you with the option to search for images by uploading them. It uses image
recognition and provides related search results.

Speech Recognition

Some of the best examples of speech recognition products are Google Voice, Siri, Cortana, etc. Using the speech recognition feature, even if you aren't in a position to type a message, your life won't stop. Simply speak the message and it will be converted to text. However, at times you will notice that speech recognition doesn't perform accurately.

Airline Route Planning

The airline industry across the world is known to bear heavy losses. Except for a few airline service providers, companies are struggling to maintain their occupancy ratios and operating profits. The steep rise in air-fuel prices and the need to offer heavy discounts to customers have made the situation worse. It wasn't long before airline companies started using data science to identify strategic areas of improvement. Now, using data science, airline companies can:

1. Predict flight delay


2. Decide which class of airplanes to buy
3. Whether to directly land at the destination or take a halt in between (For example, A flight can have a
direct route from New Delhi to New York. Alternatively, it can also choose to halt in any country.)
4. Effectively drive customer loyalty programs
Southwest Airlines and Alaska Airlines are among the top companies that have embraced data science to bring changes to their way of working.

Gaming

Games are now designed using machine learning algorithms which improve/upgrade themselves as the player
moves up to a higher level. In motion gaming also, your opponent (computer) analyzes your previous moves
and accordingly shapes up its game. EA Sports, Zynga, Sony, Nintendo, Activision-Blizzard have led gaming
experience to the next level using data science.

Augmented Reality

This is the last of the data science applications, and it seems the most exciting for the future: augmented reality.

Data science and virtual reality do have a relationship, considering that a VR headset combines computing, algorithms, and data to provide you with the best viewing experience. A very small step towards this is the highly trending game Pokemon GO, with its ability to walk around and see Pokemon on walls, streets, and other things that aren't really there. The creators of this game used data from Ingress, the previous app from the same company, to choose the locations of the Pokemon and gyms.
The Data Science Lifecycle

1) Identifying problems and understanding business
Like any other good business lifecycle, the data science lifecycle also starts with ‘why?’
Identifying problems is one of the major steps necessary in the data science process to
find a clear objective around which all the following steps will be formulated. In short, it
is important to understand the business objective early since it will decide the final goal
of your analysis.
This phase should examine the trends of business, analyse case studies of similar
analysis, and study the industry’s domain. The team will assess in-house resources,
infrastructure, total time, and technology needs. Once these aspects are all identified
and evaluated, they will prepare an initial hypothesis to resolve the business
challenges following the current scenario. The phase should –
 Clearly state the problem that requires solutions and why it should be resolved
at once
 Define the potential value of the business project
 Find risks, including ethical aspects involved in the project
 Build and communicate a highly integrated, flexible project plan

2) Data collection
Data collection is the next stage in the data science lifecycle to gather raw data from
relevant sources. The data captured can be either in structured or unstructured form.
The data might be collected from website logs, social media, online repositories, streams from online sources via APIs, web scraping, or even Excel sheets and other files. A minimal sketch of API-based collection follows below.
The person performing the task should know the difference between the various data sets available and the data investment strategy of an organisation. A major challenge faced by professionals in this step is tracking where each piece of data comes from and whether it is up to date. It is important to keep track of this information throughout the entire lifecycle of a data science project, as it might help test hypotheses or run other updated experiments.
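The sketch below shows one common collection pattern: pulling records from an API and saving them as a CSV file. The URL, parameters, and fields are placeholders, not a real endpoint.

# Hedged sketch: collecting data from an online source via an API.
import requests
import pandas as pd

response = requests.get("https://api.example.com/v1/records", params={"limit": 100})
response.raise_for_status()                      # fail loudly on HTTP errors

records = response.json()                        # most APIs return JSON (semi-structured)
df = pd.DataFrame(records)                       # flatten into a tabular form
df.to_csv("collected_records.csv", index=False)  # keep a copy, noting where it came from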
3) Data processing
In this phase, data scientists analyse the collected data for biases, patterns, ranges, and distributions of values. This is done to determine the suitability of the data and to predict its usage in regression, machine learning, and deep learning algorithms. The phase also involves inspection of different types of data, including nominal, numerical, and categorical data.
Data visualisation is also done to highlight the critical trends and patterns in the data, conveyed through simple bar and line charts. Simply put, data processing might be the most time-consuming, but it is arguably the most critical phase in the entire life cycle of data analytics. The quality of the model depends on this data processing stage.

4) Data analysis
Data Analysis or Exploratory Data Analysis is another critical step in gaining some ideas
about the solution and factors affecting the data science lifecycle. There are no set
guidelines for this methodology, and it has no shortcuts. The key aspect to remember
here is that your input determines your output. In this section, the data prepared from
the previous stage will be explored further to examine the various features and their
relationships, aiding in better feature selection required for applying it to the model.
Experts use statistical methods such as the mean and median to better understand the data. In addition, they plot the data and assess its distribution patterns using histograms, spectrum analysis, and population distributions. The data is then analysed according to the issues at hand. A brief EDA sketch follows below.
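A short pandas sketch of this kind of exploration is shown below; the file name and column are illustrative only.

# Exploratory data analysis sketch: summary statistics and a distribution plot.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")         # hypothetical prepared data

print(df.describe())                      # mean, std, quartiles for numeric columns
print(df["income"].median())              # robust measure of central tendency
print(df.corr(numeric_only=True))         # pairwise relationships between variables

df["income"].hist(bins=30)                # distribution pattern of one feature
plt.xlabel("income")
plt.show()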

5) Data modelling
Modelling Data is one of the major phases of data processes and is often mentioned as
the heart of data analysis. A model should use prepared and analysed data to provide
the desired output. The environment needed for executing the data model will be
decided and created before meeting the specific requirements.
In this phase, the team works together to develop datasets for training and testing the
model for production purposes. It also involves tasks such as choosing the appropriate model type and learning whether the problem is a classification, regression, or clustering problem. After analysing the model family, you must choose the algorithms to implement it. This has to be done carefully, since extracting the necessary insights from the prepared data is extremely important. A short modelling sketch follows below.
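The sketch below illustrates this phase for a regression problem: splitting prepared data into training and testing sets, fitting one candidate model family, and evaluating it on held-out data. File and column names are placeholders.

# Modelling-phase sketch: train/test split, fit, evaluate.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("prepared_data.csv")                    # hypothetical prepared dataset
X, y = df.drop(columns=["target"]), df["target"]         # numeric target -> regression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("MAE on held-out data:", mean_absolute_error(y_test, model.predict(X_test)))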

6) Model deployment
Now, we are at the final stage of the lifecycle of data science. After a rigorous
evaluation process, the model is finally prepared to be deployed in the desired format
and preferred channel. Remember, a machine learning model has no value until it is deployed to production. Hence, machine learning models have to be recorded before the deployment process. In general, these models are integrated and coupled with products and applications.
The model deployment stage involves creating the delivery mechanism required to get the model out into the market, among users, or to another system. Machine learning models are also deployed on devices and are gaining adoption and popularity in the field of computing. From a simple model output in a Tableau dashboard to something as complex as scaling to the cloud in front of millions of users, this step differs from project to project.

Data Scientist:

A data scientist is a professional who works with an enormous amount of data to come
up with compelling business insights through the deployment of various tools,
techniques, methodologies, algorithms, etc.

Skills required: To become a data scientist, one should have technical language skills such as R, SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data scientists must also have an understanding of statistics, mathematics, and visualization, along with communication skills.
Tools for Handling Volume

 Microsoft Excel – Excel prevails as the easiest and most popular tool for handling small amounts of data. The maximum number of rows it supports is just a shade over 1 million, and one sheet can handle only up to 16,384 columns at a time. These numbers are simply not enough when the amount of data is big.

 Microsoft Access – It is a popular tool by Microsoft that is used for data storage.
Smaller databases up to 2Gb can be handled smoothly with this tool but beyond
that, it starts cracking up.

 SQL – SQL is one of the most popular data management languages and has been around since the 1970s. It was the primary database solution for a few decades. SQL still remains popular, but there’s a drawback – it becomes difficult to scale as the database continues to grow.
We have covered some of the basic tools so far. It is time to unleash the big guns now! If

your data is greater than 10Gb all the way up to storage greater than 1Tb+, then you

need to implement the tools I’ve mentioned below:

 Hadoop – It is an open-source distributed framework that manages data


processing and storage for big data. You are likely to come across this tool
whenever you build a machine learning project from scratch.

 Hive – It is a data warehouse built on top of Hadoop. Hive provides a SQL-like


interface to query the data stored in various databases and file systems that
integrate with Hadoop.

Tools for Handling Variety

Variety refers to the different types of data that are out there. The data type may be one

of these – Structured and Unstructured data.

Let us go through examples falling under the umbrella of these different data types, and take a moment to correlate them with your real-world data.

As you might have observed in the case of Structured data, there is a certain order and

structure to these data types whereas in the case of unstructured data, the examples

do not follow any trend or pattern. For example, customer feedback may vary in length,

sentiments, and other factors. Moreover, these types of data are huge and diverse.

It can be very challenging to tackle this type of data, so what are the different data

science tools available in the market for managing and handling these different data

types?

The two most common types of databases are SQL and NoSQL. SQL databases were the market-dominant players for a number of years before NoSQL emerged.

Some examples for SQL are Oracle, MySQL, SQLite, whereas NoSQL consists of popular

databases like MongoDB, Cassandra, etc. These NoSQL databases are seeing huge

adoption numbers because of their ability to scale and handle dynamic data.

Tools for Handling Velocity

The third and final V represents the velocity. This is the speed at which the data is

captured. This includes both real-time and non-real-time data. We’ll be talking mainly

about the real-time data here.


We have a lot of examples around us that capture and process real-time data. The most

complex one is the sensor data collected by self-driving cars. Imagine being in a self-

driving car – the car has to dynamically collect and process data regarding its lane,

distance from other vehicles, etc. all at the same time!

Some other examples of real-time data being collected are:

 CCTV
 Stock trading
 Fraud detection for credit card transaction
 Network data – social media (Facebook, Twitter, etc.)

Did you know?

More than 1Tb of data is generated during each trade session at the New York stock

exchange!

Now, let’s head on to some of the commonly used data science tools to handle real-time

data:

 Apache Kafka – Kafka is an open-source tool by Apache. It is used for building real-time data pipelines. Some of the advantages of Kafka are that it is fault-tolerant, really quick, and used in production by a large number of organizations (a short producer sketch follows after this list).

 Apache Storm – This tool by Apache can be used with almost all the
programming languages. It can process up to 1 Million tuples per second and it is
highly scalable. It is a good tool to consider for high data velocity.

 Amazon Kinesis – This tool by Amazon is similar to Kafka but it comes with a
subscription cost. However, it is offered as an out-of-the-box solution which makes
it a very powerful option for organizations.

 Apache Flink – Flink is yet another tool by Apache that we can use for real-time
data. Some of the advantages of Flink are high performance, fault tolerance, and
efficient memory management.
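As a flavour of how such tools are used from Python, here is a minimal producer sketch assuming the third-party kafka-python package and a broker running on localhost; the topic and message are made up.

# Hedged sketch: publishing real-time events to a Kafka topic with kafka-python.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Each transaction or sensor reading becomes one message on a topic.
producer.send("transactions", {"card_id": 1234, "amount": 49.99})
producer.flush()    # ensure buffered messages actually reach the broker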

Now that we have a solid grasp on the different tools commonly being used for working

with Big Data, let’s move to the segment where you can take advantage of the data by

applying advanced machine learning techniques and algorithms.


Widely Used Data Science Tools

If you’re setting up a brand new data science project, you’ll have a ton of questions in

mind. This is true regardless of your level – whether you’re a data scientist, a data

analyst, a project manager, or a senior data science executive.

Some of the questions you’ll face are:

 Which tools should you use in different domains of data science?


 Should I buy licenses for the tools or opt for open-source ones? And so on.

In this section, we will be discussing some of the popular data science tools used in the

industry according to different domains.

Data science is a broad term in itself. It consists of a variety of domains, and each domain has its own business importance and complexity.

The data science spectrum consists of various domains, arranged by their relative complexity and the business value they provide. Let us take up each of the points on this spectrum.

Reporting and Business Intelligence

Let’s begin with the lower end of the spectrum. It enables an organization to identify

trends and patterns so as to make crucial strategic decisions. The types of analysis range

from MIS, data analytics, all the way over to dashboarding.

The commonly used tools in these domains are:

 Excel – It gives a diverse range of options including Pivot tables and charts that
let you do analysis in double-quick time. This is, in short, the Swiss Army Knife of
data science/analytics tools
 QlikView – It lets you consolidate, search, visualize, and analyze all your data
sources with just a few clicks. It is an easy and intuitive tool to learn which makes
it so popular.

 Tableau – It is amongst the most popular data visualization tools in the market
today. It is capable of handling large amounts of data and even offers Excel-like
calculation functions and parameters. Tableau is well-liked because of its neat
dashboard and story interface.

 Microstrategy – It is yet another BI tool that supports dashboards, automated


distributions, and other key data analytics tasks.

 PowerBI – It is a Microsoft offering in the Business Intelligence (BI) space.


PowerBI was built to integrate with Microsoft technologies. So if your organization
has a Sharepoint or SQL database user, you and your team will love working on
this tool.
 Google Analytics – Wondering how Google Analytics made it to this list? Well, digital marketing plays a major role in transforming businesses, and there’s no better tool than this to analyze your digital efforts.

Predictive Analytics and Machine Learning Tools

Moving further up the ladder, the stakes get higher in terms of both complexity and business value. This is the domain where most data scientists earn their bread and butter. Some of the types of problems you’ll solve are statistical modeling,

forecasting, neural networks, and deep learning.

Let us understand the commonly used tools in this domain:

 Python – This is one of the most dominant languages for data science in the industry today because of its ease, flexibility, and open-source nature. It has gained rapid popularity and acceptance in the ML community.

 R – It is another very commonly used and respected language in data science. R


has a thriving and incredibly supportive community and it comes with a plethora
of packages and libraries that support most machine learning tasks.
 Apache Spark – Spark was open-sourced by UC Berkeley in 2010 and has since built one of the largest communities in big data. It is known as the Swiss Army knife of big data analytics, as it offers multiple advantages such as flexibility, speed, computational power, etc.

 Julia – It is an upcoming language and is being touted as the successor to Python. It’s still in its nascent stage, and it will be interesting to see how it performs in the future. Julia

is an open source programming language used for numerical computing, as well as machine learning

and other kinds of data science applications. In a 2012 blog post announcing Julia, its four creators

said they set out to design one language that addressed all of their needs. A big goal was to avoid

having to write programs in one language and convert them to another for execution.

To that end, Julia combines the convenience of a high-level dynamic language with performance

that's comparable to statically typed languages, such as C and Java. Users don't have to define data

types in programs, but an option allows them to do so. The use of a multiple dispatch approach at

runtime also helps to boost execution speed.


 Jupyter Notebooks – These notebooks are widely used for coding in Python.
While it is predominantly used for Python, it also supports other languages such
as Julia, R, etc.

The tools we have discussed so far are true open-source tools. You don’t need to pay for them or buy any extra licenses. They have thriving and active communities that

maintain and release updates on a regular basis.

SAS

It is a very popular and powerful tool, commonly used in the banking and financial sectors. It has a very high share in private organizations like American

Express, JP Morgan, Mu Sigma, Royal Bank of Scotland, etc. Short for Statistical

Analysis System, SAS is a statistical tool developed by SAS Institute. It is one of the
oldest data analysis tools available. The latest stable build v9.4m6 was released back

in November 2018. Key features offered by SAS are:

● Easy to learn with loads of available tutorials

● A well-packed suite of tools

● Simple yet powerful GUI

● Granular analysis of textual content

● Seamless and dedicated technical support

● Visually appealing reports

● Identification of spelling errors and grammar mistakes for a more accurate analysis

The goal is to turn data into information, and information into insight.

- Carly Fiorina

SAS allows you to mine, alter, manage and retrieve data from several areas. Paired

with SQL, SAS becomes an extremely efficient tool for data access and analysis. SAS

has grown into a suite of tools serving several purposes, some of these areas are:

● Data Mining

● Statistical Analysis
● Business Intelligence Applications

● Clinical Trial Analysis

● Econometrics & Time Series Analysis

 SPSS – Short for Statistical Package for Social Sciences, SPSS was acquired by
IBM in 2009. It offers advanced statistical analysis, a vast library of machine
learning algorithms, text analysis, and much more.

 Matlab – Matlab is really underrated in the organizational landscape but it is


widely used in academia and research divisions. It has lost a lot of ground in
recent times to the likes of Python, R, and SAS but universities, especially in the
US, still teach a lot of undergraduate courses using Matlab.
Types of Data-Structured, semi-structured, Unstructured Data, Problems with unstructured data
We can classify data as structured data, semi-structured data, or unstructured data. Structured data resides in predefined formats and models, unstructured data is stored in its natural format until it is extracted for analysis, and semi-structured data is basically a mix of both structured and unstructured data.

This section covers data, the types of data, structured vs. unstructured data, and suitable data stores.

What Is Data?

 Data is a set of facts such as descriptions, observations, and numbers used in decision making.
 We can classify data as structured, unstructured, or semi-structured data.
1) Structured Data

 Structured data is generally tabular data that is represented by columns and rows in a
database.
 Databases that hold tables in this form are called relational databases.
 The mathematical term “relation” refers to a set of data organized and held as a table.
 In structured data, every row in a table has the same set of columns.
 SQL (Structured Query Language) is the programming language used to work with structured data (see the short sketch below).
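A tiny illustration of these points using Python’s built-in sqlite3 module; the table and rows are invented for demonstration.

# Structured (tabular) data: every row has the same predefined columns, queried with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Mumbai")],
)

for row in conn.execute("SELECT name FROM customers WHERE city = ?", ("Pune",)):
    print(row)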

2) Semi-structured Data

 Semi-structured data is information that is not structured data (i.e., not in a relational database) but still has some structure to it.
 Semi-structured data includes documents held in JavaScript Object Notation (JSON) format. It also includes key-value stores and graph databases.

Semi-structured data is a data type that contains semantic tags but does not conform to the structure associated with typical relational databases. It maintains internal tags and markings that identify separate data elements, which enables information grouping and hierarchies. Both documents and databases can be semi-structured. This type of data represents only about 5-10% of the structured/semi-structured/unstructured data pie, but it has critical business use cases.

Email is a very common example of a semi-structured data type. Examples of semi-structured data include:

1. Markup Language (XML): This is a semi-structured document language. XML is a set of document encoding rules that defines a human- and machine-readable format. Its value is that its tag-driven structure is highly flexible, and coders can adapt it to universalize data structure, storage, and transport on the Web.
2. Open Standard JSON (JavaScript Object Notation): This is another semi-structured data interchange format. Java is implicit in the name, but other C-like programming languages recognize it. Its structure consists of name/value pairs (or object, hash table, etc.) and an ordered value list (or array, sequence, list). Since the structure is interchangeable among languages, JSON excels at transmitting data between web applications and servers. (A short parsing sketch for XML and JSON follows this list.)

3. NoSQL: Semi-structured data is also an important element of many NoSQL (Not only SQL) databases. NoSQL databases differ from relational databases because they do not separate the organization (schema) from the data. This makes NoSQL a better choice for storing information that does not easily fit into the record-and-table format, such as text with varying lengths. It also allows for easier data exchange between databases. Such databases are common in big data infrastructure and real-time Web applications like LinkedIn. On LinkedIn, hundreds of millions of business users freely share job titles, locations, skills, and more, and LinkedIn captures this massive data in a semi-structured format. When job-seeking users create a search, LinkedIn matches the query against its massive semi-structured data stores, cross-references the data with hiring trends, and shares the resulting recommendations with job seekers.
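The short sketch below parses small invented examples of the two formats named above (JSON and XML) using only Python’s standard library.

# Semi-structured data carries its own tags/keys, which the parser turns into objects.
import json
import xml.etree.ElementTree as ET

person_json = '{"name": "Asha", "skills": ["SQL", "Python"]}'
person = json.loads(person_json)               # name/value pairs become a dict
print(person["skills"][0])

person_xml = "<person><name>Asha</name><skill>SQL</skill></person>"
root = ET.fromstring(person_xml)               # tags define the structure
print(root.find("name").text)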

3) Unstructured Data
 Unstructured data is information that either is not organized in a pre-defined manner or does not have a pre-defined data model.
 Unstructured information is typically text-heavy but may contain data such as numbers, dates, and facts as well.
 Videos, audio, and binary data files might not have a specific structure. They are classified as unstructured data.

Typical Human-generated Unstructured Data:


 Text Files: Word processing, spreadsheets, presentations, email, logs.
 Email: Email has some internal structure thanks to its metadata, and we sometimes refer to it as semi-structured. However, its message field is unstructured and traditional analytics tools cannot parse it.
 Social Media: Data from Facebook, Twitter, LinkedIn.
 Websites: Data from YouTube, Instagram, photo-sharing sites.
 Mobile Data: Text messages, locations.
 Communications: Chat, IM, phone recordings, collaboration software.
 Media: MP3 files, digital photos, audio and video files.
 Business Applications: MS Office documents, productivity applications.

Typical Machine-generated Unstructured Data:


Machine-generated data is information that is automatically created by a computer, process, application, or other machine without human intervention. Machine-generated data is becoming a major data resource and will continue to grow.
For example, it includes:
1. Satellite Imagery: Weather data, land forms, military movements.
2. Scientific Data: Oil and gas exploration, space exploration, seismic imagery, atmospheric data.
3. Digital Surveillance: Surveillance photos and video.
4. Sensor Data: Traffic, weather, oceanographic sensors.
Structured vs. unstructured data

There are some notable differences between structured and unstructured data to be aware of when dealing with either data type. The following comparison covers factors such as data sources, data storage, internal structure, data format, scalability, usage, and more.

 Structured data is quantitative data that consists of numbers and values; unstructured data is qualitative data that consists of audio, video, sensor readings, descriptions, and more.
 Structured data is used in machine learning and drives machine learning algorithms; unstructured data is used in natural language processing and text mining.
 Structured data is stored in tabular formats like Excel sheets or SQL databases; unstructured data is stored as audio files, video files, or in NoSQL databases.
 Structured data has a pre-defined data model; unstructured data does not have a pre-defined data model.
 Structured data is sourced from online forms, GPS sensors, network logs, web server logs, OLTP systems, and the like; unstructured data is sourced from email messages, word-processing documents, PDF files, and so on.
 Structured data is stored in data warehouses; unstructured data is stored in data lakes.
 Structured data requires less storage space and is highly scalable; unstructured data requires more storage space and is difficult to scale.

Following are the important differences between structured, semi-structured, and unstructured data.

1. Level of organizing: Structured data, as the name suggests, is well organized, so its level of organization is the highest. Semi-structured data is organized only up to some extent, and the rest is not organized, so its level of organization is lower than that of structured data but higher than that of unstructured data. Unstructured data is fully non-organized, so its level of organization is the lowest.

2. Means of data organization: Structured data is organized by means of a relational database. Semi-structured data is partially organized by means of XML/RDF. Unstructured data is based on simple character and binary data.

3. Transaction management: In structured data, transaction management and data concurrency are present, so it is mostly preferred for multitasking processes. In semi-structured data, transactions are not supported by default but can be adapted from a DBMS; data concurrency is not present. In unstructured data, no transaction management and no concurrency are present.

4. Versioning: Structured data is supported by relational databases, so versioning is done over tuples, rows, and tables. In semi-structured data, versioning is done only over tuples or graphs, as only a partial database is supported. In unstructured data, versioning is possible only over the data as a whole, as there is no database support at all.

5. Flexibility and scalability: Structured data is based on a relational database, so it is schema-dependent, less flexible, and less scalable. Semi-structured data is more flexible than structured data but less flexible and scalable than unstructured data. Unstructured data has no dependency on any database, so it is more flexible and scalable than both structured and semi-structured data.

6. Performance: On structured data we can perform structured queries that allow complex joins, so performance is the highest. On semi-structured data, only queries over anonymous nodes are possible, so performance is lower than that of structured data but higher than that of unstructured data. On unstructured data, only textual queries are possible, so performance is lower than that of both structured and semi-structured data.

Problems with Unstructured Data


Unstructured data is not organized in a predefined manner. It generates immense
business value, but most organizations have not been able to yield insights because
there are simply so many challenges involved in analyzing unstructured data.

Following are some common problems associated with unstructured data:


1. Unstructured Data Keeps Expanding: Unstructured data continues to grow at an exponential rate, and experts believe that it will make up over 93% of data by 2022. This large volume poses a huge challenge for analysing this type of data, because the larger the data set, the harder it is to store and analyse the data in a timely and efficient way.

2. Time Consuming: The lack of structure makes compiling and organizing unstructured data a time- and energy-consuming task.

3. Not All Unstructured Data Is High Quality: Unstructured data can be very uneven when it comes to quality. The lack of consistency in quality occurs because the data is difficult to verify and therefore is not always accurate. For example, Facebook status updates, images, and videos all qualify as unstructured data, but that does not make them useful to organizations.

4. Data Cannot Be Analysed with Conventional Systems: Unstructured data cannot be analysed with most current databases, because most data analytics databases are designed for structured data and are not equipped for unstructured data. Therefore, data analytics experts need to find new methods to locate, extract, organise, and store the data. Unstructured data comes in different formats and requires databases that reflect the freeform state of the data. A small text-mining sketch follows below.
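As a small example of imposing structure on free text, the sketch below counts the most frequent words in a few invented customer-feedback strings.

# Turning unstructured text into countable features with the standard library.
import re
from collections import Counter

feedback = [
    "Delivery was late and the package was damaged",
    "Great product, fast delivery!",
    "Late delivery again, very disappointed",
]

words = []
for comment in feedback:
    words.extend(re.findall(r"[a-z']+", comment.lower()))

print(Counter(words).most_common(5))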
DATA SOURCES
A data source in data science is the initial location where the data being used comes from. Data collection is the process of acquiring, collecting, extracting, and storing huge amounts of data, which may be in structured or unstructured form, like text, video, audio, XML files, records, or image files, used in later stages of data analysis.
In the process of data analysis, data collection is the initial step, before starting to analyze the patterns or useful information in the data.
The data to be analyzed must be collected from valid sources. Some of these data sources are open data sources, social media data sources, multimodal data sources, and standard datasets.

Open Data Source


The idea behind open data is that some data should be freely available in the public domain and usable by anyone as they wish, without restrictions from copyright, patents, or other mechanisms of control.
Local and federal governments, Non-Government Organizations (NGOs), and academic communities all lead open data initiatives. For example, the Open Government Data Platform India is a platform supporting the Open Data initiative of the Government of India. The Open Government Data Platform India is also packaged as a product and made available as open source for implementation by countries globally. The National Data Sharing and Accessibility Policy (NDSAP), which came into existence in 2012 with the approval of the union cabinet, is an effort to associate these principles with open data accessibility. The NDSAP is empowered by Section 4(2) of the Right to Information (RTI) Act and makes it the responsibility of every public authority to share their data and information at regular intervals. The NDSAP is applicable to all non-personal, non-sensitive data produced using public funds by the central, state, and local governments and their departments. It covers data in all formats: digital, analog, machine-readable, and human-readable.
The NDSAP uses the principles of open data such as openness, transparency, quality, privacy, and machine readability. As a policy, the NDSAP encourages and facilitates the sharing of government-owned data to achieve two primary goals: transparency and accountability of the government, and innovation and the economic development of the country.

Social Media Data Source

Social media channels are an abundant source of data. Social media are interactive Web 2.0 Internet-based applications, and they are a reflection of the public. Social media are interactive technologies that allow the creation and sharing/exchange of information, ideas, career interests, and other forms of expression via virtual communities and networks. Social media data is very useful for research or marketing purposes. Analysis of social data helps organizations take appropriate decisions on marketing strategies, security policies, and prediction and consumer analysis. This is facilitated by Application Programming Interfaces (APIs), which are provided by social media companies to leverage the vast amounts of data available through various social media channels. For various data-related needs (e.g., retrieving a user's profile picture), one could send API requests to a particular social media service. This is typically a programmatic call that results in that service sending a response in a structured data format, such as XML.

Multimodal Data
Today, the explosion of unstructured data is evolving as a big challenge for industry and researchers. The IoT (Internet of Things) has allowed us to always remain connected with the help of different electronic gadgets. This communication network generates huge amounts of data in different formats and data types. When dealing with such contexts, we may need to collect and explore multimodal (different forms) and multimedia (different media) data such as images, music and other sounds, gestures, body posture, and the use of space. Once the sources are identified, the next thing to consider is the kind of data that can be extracted from those sources. Based on the nature of the information collected from the sources, the data can be categorized into two types: structured data and unstructured data. One of the well-known applications of such multimedia data is the analysis of brain imaging data sequences, where the sequence can be a series of images from different sensors or a time series from the same subject.
The typical dataset used in this kind of application is a multimodal face dataset, which contains output from different sensors such as EEG, MEG, and fMRI (medical imaging techniques) on the same subject within the same paradigm. In this field, Statistical Parametric Mapping (SPM) is a well-known statistical technique, created by Karl Friston, that examines differences in brain activity recorded during functional neuroimaging experiments.

Standard Datasets
A dataset (or data set) is simply a collection of data.
In the case of tabular data (data in the form of a table), a data set corresponds to one or more database tables, where every column of a table represents a particular variable and each row corresponds to a given record of the data set in question. In the open data discipline, the data set is the unit used to measure the information released in a public open data repository. The simplest and most common format for datasets is a spreadsheet or CSV format – a single file organized as a table of rows and columns. Sometimes a dataset may be a zip file or folder containing multiple data tables with related data. Uploading datasets as Open Access helps both individuals and institutions. The availability of authenticated and standardized data sets supports research reproducibility and fosters innovation and discoverability. A small loading sketch follows below.
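As a quick illustration, the sketch below loads the widely used Iris standard dataset; it assumes scikit-learn and pandas are installed.

# Loading a standard dataset: rows are records, columns are variables.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
print(df.head())
print(df["target"].value_counts())    # class labels are bundled with the dataset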
