SAS 101 - Introduction to Data Science
DATA SCIENCE
Data science is the domain of study that deals with vast volumes of data using modern tools and
techniques to find unseen patterns, derive meaningful information, and make business decisions.
Data science uses complex machine learning algorithms to build predictive models. The data used
for analysis can come from many different sources and be presented in various formats.
Data science covers the processes of data mining, cleansing, analysis,
visualization, and actionable insight generation. A data scientist must have basic
knowledge of mathematics, computer programming, and statistics to solve complex
data problems efficiently and help increase business revenue.
Data science is the mining and analysis of relevant information from data to solve
analytically complicated problems. It is among the techniques most widely used by
Artificial Intelligence and Machine Learning engineers. For example, when you log on to
an e-commerce website and browse categories and products before making a purchase, you
are generating data that helps analysts understand your purchasing behavior.
Data science is about taking the raw, unstructured data already stored in an
organization's repository and processing it with systematic, programming, and business
skills in creative ways to generate business value.
i. Data science may detect patterns in seemingly unstructured or unconnected data, allowing
conclusions and predictions to be made.
ii. Tech businesses that acquire user data can utilise strategies to transform that data into valuable or
profitable information.
iii. Data science has also made inroads into the transportation industry, for example with driverless
cars, which can help lower the number of accidents. Training data, such as the speed limit on the
highway and how busy the streets are, is supplied to the algorithm and examined using data
science approaches.
iv. Data Science applications provide a better level of therapeutic customisation through genetics and
genomics research.
Machine Learning: Machine learning is a part of data science that enables a system to
process data sets autonomously, without human intervention. It uses different
algorithms to work on the massive volumes of data generated from various sources, and it
makes predictions, analyzes patterns, and gives recommendations. Real-life examples of
machine learning include fraud detection and client retention. Machine learning has three types.
Supervised machine learning: labeled data sets are used; both input and output
variables are used to produce the outcome.
Unsupervised machine learning: unlabeled data sets are used; only input
variables are used and there is no output variable.
Reinforcement learning: unlike supervised machine learning, it is
about taking appropriate actions in a particular situation to maximize the reward.
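The difference between the first two types can be sketched in a few lines of Python. This is only a minimal illustration, not a production method: the data values, the labels "low" and "high", and the helper names `predict_nearest` and `two_means` are all invented for this example. The supervised part uses a 1-nearest-neighbour rule on labeled data; the unsupervised part groups unlabeled values around two centres (a tiny k-means with k = 2).

```python
# Supervised learning: every training example has an input AND a known label.
# (Values and labels are hypothetical, chosen only for illustration.)
labeled = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
           (8.0, "high"), (8.5, "high"), (9.0, "high")]


def predict_nearest(x, examples):
    """1-nearest-neighbour: return the label of the closest training input."""
    _, nearest_label = min(examples, key=lambda ex: abs(ex[0] - x))
    return nearest_label


# Unsupervised learning: only inputs are available, there are no labels.
unlabeled = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]


def two_means(values, iterations=10):
    """Tiny 1-D k-means with k=2: group the values around two centres."""
    centres = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the nearer of the two centres.
            idx = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[idx].append(v)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return groups


print(predict_nearest(2.4, labeled))  # label learned from the labeled examples
print(two_means(unlabeled))           # groups discovered without any labels
```

The supervised function needs the output variable (the label) to make a prediction, while the unsupervised function only ever sees the input values.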
Statistics and Probability: Statistics and probability are essential elements of
data science because they form its numerical foundation. It
is difficult to do data science without basic knowledge of statistics and probability.
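As a small illustration of how these basics appear in practice, the sketch below summarizes a list of numbers with Python's standard `statistics` module and estimates a probability as a simple fraction. The order counts and the "more than 14 orders" threshold are hypothetical and serve only as an example.

```python
import statistics

# Hypothetical daily order counts, invented for illustration.
orders = [12, 15, 11, 14, 30, 13, 12, 16]

mean_orders = statistics.mean(orders)      # average level
median_orders = statistics.median(orders)  # middle value, less sensitive to the outlier 30
spread = statistics.stdev(orders)          # sample standard deviation

# Empirical probability: the fraction of days with more than 14 orders.
p_busy = sum(1 for n in orders if n > 14) / len(orders)

print(mean_orders, median_orders, round(spread, 2), p_busy)
```

Comparing the mean with the median is a quick way to notice that a single unusual day (30 orders) is pulling the average upward.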
Programming Languages: Programming languages, especially Python and R, play a vital role in
data organization, visualization, and data investigation. Python is a high-level programming
language that provides free libraries for data analysis. It is popular among data
scientists.
R is another popular language. One of its best features is data visualization. It is
used, for example, for social media post analysis. Other languages also provide support
for data science, such as Java 8 (with lambdas) and Scala. SQL is used for structured data and
NoSQL for unstructured data.
v. Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making. In this
final step, analysts prepare the analyses in easily readable forms such as charts, graphs, and reports.
MAIN PROCESSES OF DATA SCIENCE
The main processes of data science are as follows:
Data Exploration:
This is an essential step and the one that consumes the most time: about 70% of the time is
spent on data exploration. Data is the principal element of data science, but when we
receive data it is rarely in a correctly organized structure, so it must first be
examined and prepared.
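A minimal sketch of what this step can look like is shown below. The records, the field names (`customer`, `age`, `spend`), and the helper `to_number` are hypothetical; the point is only to show raw values being converted, missing values being counted, and a first summary being computed from the complete rows.

```python
import statistics

# Hypothetical raw records; "" and None mark missing fields.
raw_records = [
    {"customer": "A", "age": "34", "spend": "120.50"},
    {"customer": "B", "age": "", "spend": "80.00"},
    {"customer": "C", "age": "29", "spend": None},
    {"customer": "D", "age": "41", "spend": "200.00"},
]


def to_number(value):
    """Convert a raw field to float; return None when it is missing or invalid."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


cleaned = [
    {"customer": r["customer"], "age": to_number(r["age"]), "spend": to_number(r["spend"])}
    for r in raw_records
]

# Count missing values per column to see how much cleaning is needed.
missing = {col: sum(1 for r in cleaned if r[col] is None) for col in ("age", "spend")}

# Keep only complete rows for a first summary.
complete = [r for r in cleaned if r["age"] is not None and r["spend"] is not None]
mean_spend = statistics.mean([r["spend"] for r in complete])

print(missing, len(complete), mean_spend)
```

Dropping incomplete rows is only one possible choice; depending on the problem, missing values might instead be filled in or investigated further.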
Modeling:
At this point, our data is arranged and prepared. This is the second step,
where we actually use machine learning algorithms to fit a model to the
data.
The choice of model depends on the kind of data we have and the business
requirement. For instance, the model chosen for recommending an article to a customer will
differ from the model required for predicting the number of articles that will be
sold on a specific day. Once the model is chosen, we fit it to the data.
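As an illustration of fitting a model to data, the sketch below fits a straight line by ordinary least squares to predict the number of articles sold on a given day. The sales figures and the function name `fit_line` are assumptions made for this example; a linear model is just one possible choice and would not suit every data set.

```python
# Hypothetical training data: day index and articles sold, invented for illustration.
days = [1, 2, 3, 4, 5]
sold = [12, 14, 16, 18, 20]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(days, sold)

# Use the fitted model to predict sales for a day that was not in the data.
forecast_day_6 = slope * 6 + intercept
print(slope, intercept, forecast_day_6)
```

A recommendation task, by contrast, would need a different kind of model, since its output is a choice of article and not a quantity.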
Model Testing:
Model testing is the next stage and is critical to the performance of the model. The
model is tested with test data to check its accuracy and other characteristics, and
the required changes are made to the model to get the desired result.
If we do not get the desired accuracy, we can return to the previous step (Step II,
modeling), select a different model, then repeat Step III (model
testing) and pick the model that gives the best result according to the business
requirement.
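The loop between modeling and testing can be sketched as follows. The data, the two candidate models, and the use of mean absolute error as the score are assumptions for illustration only; in practice the metric should follow from the business requirement.

```python
def mean_absolute_error(model, data):
    """Average absolute gap between the model's predictions and the true values."""
    return sum(abs(model(x) - y) for x, y in data) / len(data)


# Hypothetical (day, articles sold) pairs, split into training and test sets.
train = [(1, 12), (2, 14), (3, 16), (4, 18)]
test = [(5, 20), (6, 22)]

# Candidate 1: always predict the mean of the training targets.
train_mean = sum(y for _, y in train) / len(train)


def mean_model(x):
    return train_mean


# Candidate 2: a straight line through the first and last training points.
(x0, y0), (x1, y1) = train[0], train[-1]
line_slope = (y1 - y0) / (x1 - x0)


def line_model(x):
    return y0 + line_slope * (x - x0)


candidates = {"mean": mean_model, "line": line_model}

# Score every candidate on the held-out test data only, then keep the best one.
scores = {name: mean_absolute_error(model, test) for name, model in candidates.items()}
best_name = min(scores, key=scores.get)
print(scores, best_name)
```

Each candidate is built from the training data and judged on test data it has not seen, which is what makes the comparison between models meaningful.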
Model Deployment:
Once testing gives the desired result according to the business
requirements, we finalize the model that gives the best result according to the
testing results and deploy it in the production environment.