SAS 101 - Introduction to Data Science

Uploaded by

Dan
INTRODUCTION TO DATA SCIENCE

DATA SCIENCE
Data science is the domain of study that deals with vast volumes of data using modern tools and
techniques to find unseen patterns, derive meaningful information, and make business decisions.
Data science uses complex machine learning algorithms to build predictive models. The data used
for analysis can come from many different sources and be presented in various formats.
Data science covers the processes of data mining, cleansing, analysis,
visualization, and actionable insight generation. A data scientist needs a basic
knowledge of mathematics, computer programming, and statistics to solve complex
data problems efficiently and boost business revenue.
Data science is the mining and analysis of relevant information from data to solve
analytically complicated problems. It is widely used by Artificial Intelligence and
Machine Learning engineers. For example, when you log on to an e-commerce website
and browse categories and products before making a purchase, you generate data
that helps analysts understand your purchasing behavior.
Data science takes the raw and unstructured data already stored in an organization’s
repository and processes it with systematic, programming, and business skills in
creative ways to generate business value.

APPLICATIONS OF DATA SCIENCE


The applications of data science are now vast, and you can see them everywhere in
daily life. Some prominent application areas include:
Healthcare: Healthcare companies are using data science to build sophisticated medical instruments to
detect and cure diseases.
Gaming: Video and computer games are now being created with the help of data science and that has
taken the gaming experience to the next level.
Image Recognition: Identifying patterns is one of the most commonly known applications of data science.
Scanning images and detecting objects in an image is one of the most popular data science applications.
Recommendation Systems: Netflix and Amazon give movie and product recommendations based on what you
like to watch, purchase, or browse on their platforms.
Logistics: Data science is used by logistics companies to optimize routes, ensuring faster delivery of
products and higher operational efficiency.
Fraud Detection: Banking and financial institutions use data science and related algorithms to detect
fraudulent transactions.
Internet Search Engines: When we think of search, we immediately think of Google. However, there are
other search engines, such as Yahoo, DuckDuckGo, Bing, AOL, Ask, and others, that employ data science
algorithms to offer the best results for a search query in a matter of seconds. Given that Google handles
more than 20 petabytes of data per day, it would not be the 'Google' we know today if data science did
not exist.
Speech Recognition: Speech recognition is one of the most commonly known applications of data
science. It is a technology that enables a computer to recognize and transcribe spoken language into text. It
has a wide range of applications, from virtual assistants and voice-controlled devices to automated
customer service systems and transcription services.
Targeted Advertising: If you thought search was the most essential use of data science, consider the
whole digital marketing spectrum. From display banners on various websites to digital billboards at
airports, data science algorithms decide almost everything that is displayed. This is why digital
advertisements have a far higher CTR (click-through rate) than traditional advertising. They can be
customised based on a user's prior behaviour. That is why you may see adverts for Data Science Training
Programs while another person sees an advertisement for clothes in the same region at the same time.
Airline Route Planning: Data science makes it easier for the airline industry to predict flight delays,
which is helping it grow. It also helps airlines decide whether to fly directly to the destination or to
make a stop in between, for example on a flight from Delhi to the United States of America.
Augmented Reality: Last but not least, this application appears to be the most fascinating for the
future. There is a close relationship between data science and augmented and virtual reality: a
virtual reality headset incorporates computer expertise, algorithms, and data to create the best
possible viewing experience.
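
The recommendation systems described in the list above can be illustrated with a minimal sketch. It
assumes a tiny, made-up purchase history and suggests items bought by users with similar purchases. The
user names, the items, and the recommend helper are hypothetical; real platforms such as Netflix or
Amazon use far more elaborate methods.

```python
from collections import Counter

# Hypothetical purchase history: user -> set of items bought.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse"},
    "carol": {"mouse", "keyboard", "monitor"},
    "dave": {"laptop"},
}


def recommend(user, history, top_n=2):
    """Suggest items owned by users who share at least one item with `user`."""
    owned = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(owned & items)  # similarity = number of shared items
        if overlap == 0:
            continue
        for item in items - owned:
            scores[item] += overlap
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("dave", purchases))
```

Each candidate item is weighted by how many purchases the other user shares with the target user, which
is one simple way to express "people with similar tastes".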

USES OF DATA SCIENCE

i. Data science may detect patterns in seemingly unstructured or unconnected data, allowing
conclusions and predictions to be made.

ii. Tech businesses that acquire user data can utilise strategies to transform that data into valuable or
profitable information.

iii. Data science has also made inroads into the transportation industry, for example with driverless
cars, which can help lower the number of accidents. Training data, such as highway speed limits
and busy streets, is supplied to the algorithm and examined using data science approaches.

iv. Data science applications provide a better level of therapeutic customisation through genetics and
genomics research.

HISTORICAL BACKGROUND OF DATA SCIENCE


The history of data goes back to the 1500s, when the Latin-derived word "datum" came into use.
However, work on the subject began in the period from 1940 to 1950. Claude Elwood
Shannon, an American mathematician and engineer, published the paper "A Mathematical
Theory of Communication" in 1948. Although he was not a data scientist, his
information theory formed the basis of machine learning algorithms.
John Wilder Tukey wrote the book Exploratory Data Analysis in 1977, in which he promoted
exploratory data analysis (EDA), a technique for analyzing datasets mainly with
visual methods.
Peter Naur wrote the Concise Survey of Computer Methods in 1974, in which he used
the expression "Data Science" for the first time, and repeatedly.
In 1999, Jacob Zahavi pointed out the need for new tools to deal with the
enormous amounts of data available to organizations, in “Mining Data for Nuggets of
Knowledge”.
In 2001, William Cleveland published the paper “Data Science: An Action Plan for
Expanding the Technical Areas of the Field of Statistics”.
The International Council for Science: Committee on Data for Science and Technology
started publishing the Data Science Journal in 2001, focusing on issues such as
the description of data systems, their publication on the web, applications, and legal
issues.
In 2008, the title "Data Scientist" became a buzzword and eventually
part of the language. Jeff Hammerbacher and DJ Patil, of Facebook and LinkedIn respectively,
are credited with popularizing it. Johan
Oskarsson reintroduced the term NoSQL in 2009 when he organized a discussion on
"open-source, non-relational databases".

BASIC COMPONENTS OF DATA SCIENCE


Data
Data is the most basic element of data science. There are different types of data. The
diagram below shows the different kinds of data.
Data is divided into categorical (qualitative) data and numerical (quantitative) data.
Categorical or qualitative data is based on descriptive information. It has three
types:
• Binomial Data: Variable data with only two options, e.g. good or bad, true or
false.
• Nominal or Unordered Data: Variable data with no inherent order, e.g. red,
green, man.
• Ordinal Data: Variable data with a proper order, e.g. short, medium, long.

Numerical or quantitative data is based on numerical information. It is further divided
into:
• Discrete data: This data is countable, e.g. number of children, whole numbers; and
• Continuous data: This data is measurable, e.g. height, width, length. Continuous
data has two further types: interval and ratio.
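
To make the categorical types above concrete, the sketch below shows one common way each of them might
be turned into numbers before analysis. The example values and the helper names are made up for
illustration; other encodings are possible.

```python
# Hypothetical example values for each kind of categorical data.
binomial = ["good", "bad", "good"]
nominal = ["red", "green", "red"]
ordinal = ["short", "long", "medium"]


def encode_binomial(values, positive):
    """Binomial data: map the two options to 1 and 0."""
    return [1 if v == positive else 0 for v in values]


def encode_nominal(values):
    """Nominal data has no order, so use one indicator column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def encode_ordinal(values, order):
    """Ordinal data: map each level to its rank in the given order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


print(encode_binomial(binomial, positive="good"))
print(encode_nominal(nominal))
print(encode_ordinal(ordinal, order=["short", "medium", "long"]))
```

Nominal values get indicator columns rather than a single number, because a single number would suggest
an order that the data does not have.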

BASIC DATA SCIENCE COMPONENTS


Big Data: Big data consists of huge data sets. These data sets are analyzed and visualized
to reveal trends, human behavior, and interactions.
A good example of big data is the social media site Facebook, where hundreds of terabytes
of data are added daily in the form of text, audio, video, images, etc.

Machine Learning: Machine learning is a part of data science that enables a system to
process data sets autonomously, without human intervention. It applies different
algorithms to massive volumes of data generated from various sources in order to make
predictions, analyze patterns, and give recommendations. Real-life examples of machine
learning include fraud detection and client retention. Machine learning has three types:
• Supervised machine learning: labeled data sets are used; both input and output
variables are used to produce an outcome.
• Unsupervised machine learning: unlabeled data sets are used; only input
variables are available, with no output variable.
• Reinforcement learning: unlike supervised machine learning, it is
about taking appropriate actions in a particular situation to maximize a reward.
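
The difference between the first two types can be seen in a minimal sketch. The supervised part copies
the label of the closest labeled example (a 1-nearest-neighbour rule), while the unsupervised part
groups unlabeled numbers into two clusters without being told what the groups mean. The transaction
amounts, labels, and function names are hypothetical, and reinforcement learning is left out because it
needs an environment to interact with.

```python
from statistics import mean

# --- Supervised: labeled data (input -> known output) -------------------
# Hypothetical training set: (transaction amount, label).
labeled = [(5.0, "normal"), (8.0, "normal"), (95.0, "fraud"), (110.0, "fraud")]


def predict_nearest(amount, training):
    """1-nearest-neighbour: copy the label of the closest training input."""
    _, label = min(training, key=lambda pair: abs(pair[0] - amount))
    return label


# --- Unsupervised: unlabeled data (inputs only) --------------------------
unlabeled = [4.0, 6.0, 7.0, 90.0, 100.0, 105.0]


def two_means(values, iterations=10):
    """Split 1-D values into two groups with a tiny k-means (k = 2)."""
    low, high = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(group_low), mean(group_high)
    return group_low, group_high


print(predict_nearest(12.0, labeled))
print(two_means(unlabeled))
```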

Statistics and Probability: Statistics and probability are essential elements of
data science because they form its mathematical foundation. It
is difficult to do data science without a basic knowledge of statistics and probability.
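
As a small illustration of the kind of statistics and probability involved, the sketch below summarises
a made-up week of order counts with Python's standard statistics module and estimates an empirical
probability. The numbers and the threshold of 14 orders are assumptions chosen only for the example.

```python
from statistics import mean, median, stdev

# Hypothetical daily order counts for one week.
orders = [12, 15, 11, 30, 14, 13, 17]

print(mean(orders))    # average, pulled upwards by the unusual day (30)
print(median(orders))  # middle value, less sensitive to that day
print(stdev(orders))   # sample standard deviation (spread of the data)

# Empirical probability: share of days with more than 14 orders.
busy_days = [n for n in orders if n > 14]
p_busy = len(busy_days) / len(orders)
print(p_busy)
```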

Programming Languages: Programming languages, especially Python and R, play a vital role in
data organization, visualization, and data investigation. Python is a high-level programming
language with free libraries for data analysis, and it is popular among data
scientists.
R is another popular language. Its best feature is data visualization, and it is
often used for social media post analysis. Other languages also support
data science, such as Java 8 (with lambdas) and Scala. SQL is used for structured data and
NoSQL for unstructured data.

HOW DOES DATA SCIENCE WORK?


Data science integrates tools from multiple disciplines to gather a data set, process it
and derive insights from it, extract the required information from the set,
and interpret it for decision-making purposes.
The data science field incorporates statistics, data mining, Artificial Intelligence,
programming, and analytics. Data mining applies algorithms to a complex
data set to reveal patterns, which are then used to extract useful and
relevant information from the set. Statistical measures such as predictive analytics use this
extracted information to gauge events that are likely to happen in the future,
based on what the data shows happened in the past.
Artificial Intelligence is a tool that processes massive amounts of
data that a human would not be able to process in a lifetime. The data analyst
collects and processes the structured data from the AI stage using various algorithms
under analytics. The analyst then interprets, converts, and summarizes the data into
language that the decision-making team can easily understand.

DATA SCIENCE LIFECYCLE


The data science lifecycle consists of five distinct stages, each with its own tasks:
i. Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This stage involves
gathering raw structured and unstructured data.
ii. Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture.
This stage covers taking the raw data and putting it in a form that can be used.
iii. Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization. Data
scientists take the prepared data and examine its patterns, ranges, and biases to determine how useful
it will be in predictive analysis.
iv. Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining, Qualitative
Analysis. Here is the real meat of the lifecycle. This stage involves performing the various analyses
on the data.

v. Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making. In this
final step, analysts prepare the analyses in easily readable forms such as charts, graphs, and reports.
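
The five stages above can be sketched as a chain of small functions. This is only an illustration
under simple assumptions: the raw readings are made up, each stage is reduced to a single step, and the
function names merely mirror the stage names.

```python
from statistics import mean

# 1. Capture: gather raw records (hypothetical sensor readings as text).
raw = ["21.5", "22.0", "", "bad", "23.5", "22.0"]


def maintain(records):
    """2. Maintain: clean the raw data and convert it to a usable form."""
    cleaned = []
    for record in records:
        try:
            cleaned.append(float(record))
        except ValueError:
            continue  # drop empty or malformed entries
    return cleaned


def process(values):
    """3. Process: examine ranges and patterns in the prepared data."""
    return {"count": len(values), "min": min(values), "max": max(values)}


def analyze(values):
    """4. Analyze: run the actual analysis (here, just the average)."""
    return mean(values)


def communicate(summary, average):
    """5. Communicate: present the result in a readable form."""
    return (
        f"{summary['count']} valid readings between {summary['min']} and "
        f"{summary['max']}, average {average:.2f}"
    )


values = maintain(raw)
report = communicate(process(values), analyze(values))
print(report)
```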
MAIN PROCESSES OF DATA SCIENCE
The main processes of data science are as follows:
Data Exploration:
Data exploration is an essential step and the one that consumes the most time: about 70% of the
time is spent on it. Data is the principal ingredient of data science,
and when we receive data, it is only rarely in a correct,
well-structured form.
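
A first pass over newly received data often just counts what is wrong with it. The sketch below does
this for a few made-up records: it counts ages that are missing or not numeric, exact duplicate rows,
and inconsistent spellings of the same city. The records, field names, and the explore helper are
hypothetical.

```python
from collections import Counter

# Hypothetical raw records, as they might arrive from different sources.
rows = [
    {"name": "Ann", "age": "34", "city": "Lagos"},
    {"name": "Ben", "age": "", "city": "lagos "},
    {"name": "Cy", "age": "n/a", "city": "Abuja"},
    {"name": "Ann", "age": "34", "city": "Lagos"},  # exact duplicate
]


def explore(records):
    """Count missing or non-numeric ages, duplicates, and city spellings."""
    bad_age = sum(1 for r in records if not r["age"].isdigit())
    unique = {tuple(sorted(r.items())) for r in records}
    duplicates = len(records) - len(unique)
    # Normalise whitespace and capitalisation before counting cities.
    cities = Counter(r["city"].strip().title() for r in records)
    return {"bad_age": bad_age, "duplicates": duplicates, "cities": dict(cities)}


print(explore(rows))
```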

Modeling:
At this point, our data is prepared and ready to use. This is the second step,
where we actually use machine learning algorithms to fit the data into the
model.
The choice of model depends on the type of data we have and the business
requirement. For instance, the model for recommending an article to a customer will
differ from the model needed to predict the number of articles that will be
sold on a specific day. Once the model is chosen, we fit the data into the model.

Model Testing:
Model testing is the next stage and is critical to the performance of the model. The
model is tested with test data to check its accuracy and other characteristics, and
the required changes are made to the model to get the desired result.
If we do not get the desired accuracy, we can go back to Step II
(modeling), select a different model, and then repeat Step III (model
testing), choosing the model that gives the best result according to the business
requirement.
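
The loop of fitting a model, testing it, and trying another one can be sketched as follows. Two simple
candidate models are fitted on training data and compared on held-out test data using the mean absolute
error. The sales figures, the two candidate models, and the choice of error measure are assumptions made
for the example, not a prescribed method.

```python
from statistics import mean

# Hypothetical data: (day number, articles sold). Purely illustrative.
train = [(1, 12), (2, 14), (3, 16), (4, 18), (5, 20), (6, 22)]
test = [(7, 24), (8, 26)]


def fit_mean(data):
    """Baseline model: always predict the average of the training targets."""
    avg = mean(y for _, y in data)
    return lambda x: avg


def fit_line(data):
    """Least-squares straight line y = slope * x + intercept."""
    x_bar = mean(x for x, _ in data)
    y_bar = mean(y for _, y in data)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in data) / sum(
        (x - x_bar) ** 2 for x, _ in data
    )
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def mean_absolute_error(model, data):
    """Average size of the prediction error on held-out data."""
    return mean(abs(model(x) - y) for x, y in data)


# Fit every candidate on the training data, then score it on the test data.
candidates = {"mean": fit_mean, "line": fit_line}
errors = {
    name: mean_absolute_error(fit(train), test) for name, fit in candidates.items()
}
best = min(errors, key=errors.get)
print(errors, best)
```

Keeping the test data separate from the training data is what makes the comparison meaningful: a model
is judged on examples it has not seen.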

Model Deployment:
Once we obtain the desired result through appropriate testing against the business
requirements, we finalize the model that gives the best result according to the
testing results and deploy it in the production environment.

DATA SCIENCE TOOLS


The main purpose of data science tools is to reduce the need for programming by providing a
user-friendly GUI, so that a person with little knowledge of algorithms can easily use them to
build machine learning models.

Some famous tools are discussed below.


1. RapidMiner: RapidMiner is a tool for the complete life cycle of predictive modeling.
RapidMiner Studio is the visual workflow designer for data science teams.
RapidMiner Server lets teams share and collaborate on every step and part of the
data mining process. Its advanced queueing feature lets
RapidMiner Server set aside resources and dedicate them to teams, use cases, or projects.
2. DataRobot: It is a platform for automated machine learning that can be used by
data scientists, software engineers, IT professionals, and executives. DataRobot has a
Python SDK and APIs. It ensures an easy development process and parallel processing.
3. Apache Hadoop: It is a Java-based open-source framework that can perform
distributed processing of immense data sets across computer clusters. Apache Hadoop
runs in parallel on a cluster, so it can process data
across all the nodes. It has many modules, such as HDFS, Hadoop MapReduce,
Hadoop Common, Hadoop YARN, and Hadoop Ozone. HDFS splits immense data sets and
distributes them across many nodes in a cluster to ensure high availability.
4. MATLAB: MATLAB is available for personal use as well as for students, and provides
solutions for analyzing data, developing algorithms, and creating models. MATLAB is
also used for wireless communications and data analytics. Its greatest strength
is its scalability. Its algorithms can easily be converted to HDL, CUDA, and C/C++
code.
5. KNIME: KNIME is a free and open-source platform that helps data scientists
blend tools and data types. It also lets you use the tools of your choice and
extend them to Apache Spark and big data. KNIME can easily work with various data
sources and various types of platforms.
6. Trifacta: Its main product is Wrangler, which helps with exploring,
transforming, cleaning, and joining desktop files. You simply import your
datasets into Wrangler and the application automatically starts to shape and
structure your data. Its algorithms help you prepare your data by suggesting common
transformations and aggregations. Trifacta Wrangler Pro is its advanced self-service
platform for data preparation, and Trifacta Enterprise is aimed at analyst teams.
7. Alteryx: Alteryx provides an end-to-end analytics platform that allows data
scientists and business analysts to break down data barriers and deliver game-changing
insights that help solve major business problems. Alteryx helps teams discover
data and collaborate across the organization. It can prepare and analyze models, and it
also lets you embed Python, R, and Alteryx models into your processes.
8. Excel: Microsoft Excel can be used as a data science tool, as it is one of the easiest
data analysis tools for non-professionals. You can easily organize, sort, filter,
and summarize data with the help of Microsoft Excel.
9. Tableau: Tableau can be used by anyone thanks to its drag-and-drop functionality.
Some basic versions of this data visualization tool are free of cost. It can work with any
database and supports various data formats, such as XML, CSV, and XLS.
10. Kubernetes: It is an open-source tool for managing clusters of containers. It provides
a combination of features that are very helpful for data scientists. Kubernetes
provides tools for deploying applications, rolling out changes to existing containerized
applications, scaling those applications, and making better use of the
hardware underneath your containers.
11. Cloud Dataflow: Cloud Dataflow is a useful tool for data scientists because it offers a fully
managed environment that scales easily to massive data sets and enables data
science teams to own more of the creation process. It enables transformational use
cases across industries, including:
• Point-of-sale and segmentation analysis in marketing
• Fraud detection in financial services
• Personalized user experience in gaming
• IoT analytics in healthcare, logistics, and engineering
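
Item 3 above mentions Hadoop MapReduce. The idea behind it can be sketched in plain Python as a word
count: each chunk of text is mapped to (word, 1) pairs, the pairs are grouped by word, and each group is
reduced to a total. This is not Hadoop's API and nothing here is distributed; the chunks stand in for
data stored on different nodes, and all names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input, already split into chunks as if stored on separate nodes.
chunks = ["big data big insight", "data science", "big science"]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each chunk is mapped independently, the map step is the part that a cluster can run in parallel
on many nodes.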

In summary, the data science tools are grouped as follows:


i. Data Analysis: SAS, Jupyter, RStudio, MATLAB, Excel, RapidMiner
ii. Data Warehousing: Informatica/Talend, AWS Redshift
iii. Data Visualization: Jupyter, Tableau, Cognos, RAW
iv. Machine Learning: Spark MLlib, Mahout, Azure ML Studio

CAREER PATH (JOB ROLE) IN DATA SCIENCE


Data Architect: A data architect is responsible for building and maintaining an organization’s
database with the assistance of database administrators and analysts. They create database
solutions, evaluate requirements, and prepare design reports.
Data Engineer: A data engineer is responsible for real-time processing of stored or collected
business data so that the data is ready for analysis by data scientists.
Database Administrator: A database administrator uses various software tools to store and
organize conventional data for further examination.
Data Scientist: Data scientists use conventional statistical methods or machine learning
techniques to support strategic business decisions.
Data Analyst: Data analysts perform advanced types of analysis for companies, and they may
also be responsible for tracking web analytics and analyzing A/B tests.
Data Visualizer: They translate data analytics into clear and concise information for business
communication.
Machine Learning Scientist: A machine learning scientist explores new data approaches and
algorithms.
Machine Learning Engineer: A machine learning engineer applies state-of-the-art
computational models and delivers software solutions.
Statistician: Statisticians must have a solid knowledge of statistics and probability. They are
responsible for analyzing and reporting statistical information from a business point of view.
Business Intelligence Analyst: They mainly focus on analyzing market trends.
Business Intelligence Consultant: Business intelligence consultants provide their expertise in
designing, developing, and implementing BI and analytics systems. They also examine
business performance and prepare reports on performance metrics.
Business Intelligence Developer: They are responsible for designing and developing strategies
that help business users quickly find the information they need for better business
decisions.

WHO IS A DATA SCIENTIST?


Data scientists are analytical data professionals who have the technical ability to handle complicated issues
as well as the desire to investigate what questions need to be answered. They're a mix of mathematicians,
computer scientists, and trend forecasters. They work in both the business and IT sectors.

DUTIES OF A DATA SCIENTIST


A data scientist may do the following tasks:
i. Discover patterns and trends in datasets to get insights
ii. Create forecasting algorithms and data models
iii. Improve the quality of data or product offerings by utilising machine learning techniques
iv. Distribute suggestions to other teams and top management
v. In data analysis, use data tools such as R, SAS, Python, or SQL
vi. Stay on top of innovations in the field of data science

FUNCTIONS OF A DATA SCIENTIST


A data scientist analyzes business data to extract meaningful insights. In other words, a data scientist solves
business problems through a series of steps, including:
1. Before tackling the data collection and analysis, the data scientist determines the problem by asking
the right questions and gaining understanding.
2. The data scientist then determines the correct set of variables and data sets.
3. The data scientist gathers structured and unstructured data from many disparate sources—enterprise
data, public data, etc.
4. Once the data is collected, the data scientist processes the raw data and converts it into a format
suitable for analysis. This involves cleaning and validating the data to guarantee uniformity,
completeness, and accuracy.
5. After the data has been rendered into a usable form, it’s fed into the analytic system—ML algorithm
or a statistical model. This is where the data scientists analyze and identify patterns and trends.
6. When the data has been completely rendered, the data scientist interprets the data to find
opportunities and solutions.
7. The data scientists finish the task by preparing the results and insights to share with the appropriate
stakeholders and communicating the results.

WHO OVERSEES THE DATA SCIENCE PROCESS?


1. Business Managers: The business managers are the people in charge of overseeing the data science
training method. Their primary responsibility is to collaborate with the data science team to
characterise the problem and establish an analytical method. A data scientist may oversee the
marketing, finance, or sales department, and report to an executive in charge of the department. Their
goal is to ensure projects are completed on time by collaborating closely with data scientists and IT
managers.
2. IT Managers: Following them are the IT managers. If the member has been with the organisation
for a long time, the responsibilities will undoubtedly be more important than any others. They are
primarily responsible for developing the infrastructure and architecture to enable data science
activities. Data science teams are constantly monitored and resourced accordingly to ensure that they
operate efficiently and safely. They may also be in charge of creating and maintaining IT
environments for data science teams.
3. Data Science Managers: The data science managers make up the final section of the team. They
primarily trace and supervise the working procedures of all data science team members. They also
manage and keep track of the day-to-day activities of the three data science teams. They are team
builders who can blend project planning and monitoring with team growth.
