Unit 4 Notes
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.
Data Science is about finding patterns in data through analysis and using them to make predictions.
Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare,
and manufacturing.
Data Science can be applied in nearly every part of a business where data is available. Examples
are:
Consumer goods
Stock markets
Industry
Politics
Logistics companies
E-commerce
A Data Scientist must find patterns within the data. Before they can find the patterns, they
must organize the data in a standard format.
Where to Start?
In this tutorial, we will start by presenting what data is and how data can be analyzed.
You will learn how to use statistics and mathematical functions to make predictions.
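As a minimal sketch of what "using statistics and mathematical functions to make predictions" can look like, the example below fits a straight line to a handful of points by least squares and uses it to predict one new value. The data set, the variable names, and the helper functions fit_line and predict are all hypothetical and were invented for this illustration; they are not part of these notes.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Hypothetical observations, made up for illustration only.
hours = [1, 2, 3, 4, 5]
calories = [110, 205, 290, 410, 495]

slope, intercept = fit_line(hours, calories)
forecast = predict(6, slope, intercept)
print(f"slope={slope}, intercept={intercept}, forecast for 6 hours={forecast}")
```

A straight line is only one possible model; it is used here because it is the simplest way to show a pattern found in past data being turned into a prediction.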
What is Data?
One purpose of Data Science is to structure data, making it interpretable and easy to work with.
Data can be categorized into two groups:
Structured data
Unstructured data
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.
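One way to picture "organizing" unstructured data is to pull fixed fields out of free text. The sketch below assumes a few invented free-text notes and uses regular expressions to turn each one into a record with the same four fields. The note wording, the field names, and the function to_record are hypothetical; real unstructured data usually needs far more careful handling.

```python
import re

# Hypothetical free-text notes; the wording and fields are invented for illustration.
raw_notes = [
    "Anna bought 3 apples on 2024-01-05",
    "On 2024-01-06 Ben bought 12 oranges",
    "no purchase recorded here",
]

# An ISO-style date, and a "<Name> bought <number> <item>" phrase.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
PURCHASE = re.compile(
    r"(?P<customer>[A-Z][a-z]+) bought (?P<quantity>\d+) (?P<item>[a-z]+)"
)


def to_record(note):
    """Return a dict with fixed fields, or None if the note cannot be parsed."""
    date = DATE.search(note)
    purchase = PURCHASE.search(note)
    if not (date and purchase):
        return None
    return {
        "date": date.group(),
        "customer": purchase.group("customer"),
        "item": purchase.group("item"),
        "quantity": int(purchase.group("quantity")),
    }


# Keep only the notes that could be turned into a structured record.
records = [r for r in map(to_record, raw_notes) if r is not None]
for record in records:
    print(record)
```

Once every record has the same fields, the data can be sorted, filtered, and aggregated, which is what makes it usable for analysis.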
Structured Data
Data science began in statistics. Part of the evolution of data science was the inclusion of
concepts such as machine learning, artificial intelligence, and the internet of things. With the
flood of new information coming in and businesses seeking new ways to increase profit and
make better decisions, data science started to expand to other fields, including medicine,
engineering, and more.
We could say that data science was born from the idea of merging applied statistics with
computer science. The resulting field of study would use the extraordinary power of modern
computing. Scientists realized they could not only collect data and solve statistical problems but
also use that data to solve real-world problems and make reliable fact-driven predictions.
1962: American mathematician John W. Tukey first articulated the data science dream. In his
now-famous article “The Future of Data Analysis,” he foresaw the inevitable emergence of a
new field nearly two decades before the first personal computers. While Tukey was ahead of his
time, he was not alone in his early appreciation of what would come to be known as “data
science.” Another early figure was Peter Naur, a Danish computer engineer whose book Concise
Survey of Computer Methods offers one of the very first definitions of data science:
“The science of dealing with data, once they have been established, while the relation of the data
to what they represent is delegated to other fields and sciences.”
1977: The theories and predictions of “pre” data scientists like Tukey and Naur became more
concrete with the establishment of The International Association for Statistical Computing
(IASC), whose mission was “to link traditional statistical methodology, modern computer
technology, and the knowledge of domain experts in order to convert data into information and
knowledge.”
1980s and 1990s: Data science began taking more significant strides with the emergence of the
first Knowledge Discovery in Databases (KDD) workshop and the founding of the International
Federation of Classification Societies (IFCS). These two societies were among the first to focus
on educating and training professionals in the theory and methodology of data science (though
that term had not yet been formally adopted).
It was at this point that data science started to garner more attention from leading professionals
hoping to monetize big data and applied statistics.
1994: BusinessWeek published a story on the new phenomenon of “Database Marketing.” It
described the process by which businesses were collecting and leveraging enormous amounts of
data to learn more about their customers, competition, or advertising techniques. The only
problem at the time was that these companies were flooded with more information than they
could possibly manage. Massive amounts of data were sparking the first wave of interest in
establishing specific roles for data management. It began to seem like businesses would need a
new kind of worker to make the data work in their favor.
1990s and early 2000s: Data science emerged as a recognized and
specialized field. Several data science academic journals began to circulate, and data science
proponents like Jeff Wu and William S. Cleveland continued to help develop and expound upon
the necessity and potential of data science.
2000s: Technology made enormous leaps by providing nearly universal access to internet
connectivity, communication, and (of course) data collection.
2005: Big data enters the scene. With tech giants such as Google and Facebook uncovering large
amounts of data, new technologies capable of processing them became necessary. Hadoop rose
to the challenge, and later on Spark and Cassandra made their debuts.
2014: Due to the increasing importance of data, and organizations’ interest in finding patterns
and making better business decisions, demand for data scientists began to see dramatic growth in
different parts of the world.
2015: Machine learning, deep learning, and Artificial Intelligence (AI) officially enter the realm
of data science. These technologies have driven innovations over the past decade, from
personalized shopping and entertainment to self-driving vehicles, along with the insights needed
to bring these real-life applications of AI into our daily lives efficiently.
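2018: New regulations on data, such as the European Union's General Data Protection Regulation
(GDPR), which took effect that year, are perhaps one of the most significant developments in the
evolution of data science.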
2020s: We are seeing additional breakthroughs in AI and machine learning, and an ever-increasing
demand for qualified professionals in Big Data.
Life Cycle Phases of Data Analytics
1. Data Engineer
Job Description
Data engineers format raw data so that it can be analyzed. They collect data that will be used
downstream, manage it, and convert the data so that it can be used by business analysts and
others on the team. Data engineers build systems that make huge volumes of data more available
to an organization.
Responsibilities
Requirements
Data engineers usually have at least an undergraduate degree in a math or computing field. They
need to be familiar with programming languages like Python and Scala as well as database
technologies like SQL. Apache Spark and Hadoop are commonly used tools in this role too.
2. Data Scientist
Job Description
Data scientists employ statistical and analytical skills to process and derive insight from large
datasets. They usually use various programming languages to achieve that goal. These insights
unearthed by data scientists help solve key business challenges.
Responsibilities
Average Salary
The average salary of a data scientist is $74,700 per year.
Requirements
Most data scientists have at least a bachelor’s degree, usually in computer science, engineering,
or a mathematical field like statistics. Languages like Python and R are commonly used in the
field. Data scientists are sometimes required to present data, for which a data visualization tool
like Tableau is used.
3. Data Analyst
Job Description
A data analyst examines the available data and uses statistical methods to solve specific business
problems. Professionals in this field usually work in an interdisciplinary environment and
collaborate with both business and data teams. Data analysts differ from data scientists: data
scientists focus on creating tools and frameworks to gather data, while data analysts unearth
data-based insights.
Responsibilities
Average Salary
The average salary of a data analyst is $62,610.
Requirements
Data analysts need to assess which insights can be obtained from a given dataset. They use
programming languages like Python and R to design data analysis algorithms. Data analysts also
need to present the results of their work to various stakeholders in the company.
4. Data Administrator
Job Description
Data administrators build processes to store, retrieve, and maintain the available data. They
ensure that the data coming from a given source is current and stored in a secure manner. They
also define policies concerning database environments.
Responsibilities
Average Salary
The average salary for data administrator roles is $50,634.
Requirements
Data administrators need to be familiar with an organization’s data lifecycle. They use database
tools like SQL and Oracle. Hadoop is a commonly used tool for data management among
administrators.
5. Data Architect
Job Description
Data architects build and maintain an organization’s databases. They conceptualize database
architectures based on a company’s requirements and build them end to end. Data architects monitor
their databases and execute system migrations whenever needed.
Responsibilities
Average Salary
The average salary of a data architect is $123,000 annually.
Requirements
Data architects need to have a strong understanding of database systems and data mining
procedures. Companies often require data architects to have at least a bachelor’s degree in
computer science or engineering. Good communication skills are also essential to update
executive teams on an organization’s evolving approach to data storage.
6. Machine Learning Engineer
Responsibilities
Average Salary
The average salary of a machine learning engineer is $132,900 per year.
Requirements
A bachelor’s degree in computer science or engineering is required for machine learning
engineer jobs. Professionals in this field need to be well-versed in statistics and machine learning
algorithms. Machine learning engineers are also required to have an understanding of database
architecture and database systems.
7. Machine Learning Scientist
Responsibilities
Average Salary
The average salary of a machine learning scientist is $137,053.
Requirements
Machine learning scientists are often PhDs with a focus on artificial intelligence and neural
networks. They use tools like OpenCV to model machine learning algorithms. The role requires
the ability to work on distributed systems and model deployment.
8. Business Intelligence Developer
Responsibilities
Average Salary
The average salary for business intelligence developer roles is $94,800.
9. Business Analyst
Job Description
Business analysts use data to interpret changing business needs, and measure how changing
processes affect a business. They also communicate between different teams, acting as
intermediaries to translate business goals into concrete objectives.
Responsibilities
Model business processes and measure the impact of various changes using data
Communicate changes and translate requirements for various stakeholders
Assess data analysis proposals and suggest modifications
Average Salary
The average business analyst salary is $79,000.
Requirements
Business analysts need strong analytical skills. They use Python and R to perform analyses that
require data wrangling and data manipulation. Tools like Power BI and Tableau are commonly
used by business analysts to generate reports.
10. Database Administrator
Responsibilities
Requirements
A strong understanding of database technologies such as SQL, PostgreSQL, and Oracle is key
for database administrators. Completing a certification like a Microsoft Certified Database
Administrator (MCDBA) can be beneficial for a career in the field. DBAs need to stay abreast of
developments in their field and recommend new tools or processes.