Unit 4 Notes

The document provides an overview of Data Science, highlighting its importance in various industries for data analysis, predictive modeling, and decision-making. It outlines the roles of different professionals in the field, such as Data Engineers, Data Scientists, and Machine Learning Engineers, along with their responsibilities, average salaries, and required skills. Additionally, it discusses the evolution of Data Science, the data analytics lifecycle, and the significance of structured and unstructured data.

Unit 4: Introduction to Data Science

Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.

What is Data Science?

Data Science is about data gathering, analysis and decision-making.

Data Science is about finding patterns in data through analysis and making future predictions.

By using Data Science, companies are able to make:

 Better decisions (should we choose A or B?)
 Predictive analysis (what will happen next?)
 Pattern discoveries (find patterns or hidden information in the data)

Where is Data Science Needed?

Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare,
and manufacturing.

Examples of where Data Science is needed:

 For route planning: to discover the best routes to ship goods
 To foresee delays for flights, ships, trains, etc. (through predictive analysis)
 To create promotional offers
 To find the best-suited time to deliver goods
 To forecast the next year's revenue for a company
 To analyze the health benefits of training
 To predict who will win elections

Data Science can be applied in nearly every part of a business where data is available. Examples
are:

 Consumer goods
 Stock markets
 Industry
 Politics
 Logistics companies
 E-commerce

How Does a Data Scientist Work?

A Data Scientist requires expertise in several areas:

 Machine Learning
 Statistics
 Programming (Python or R)
 Mathematics
 Databases

A Data Scientist must find patterns within the data. Before they can find the patterns, they
must organize the data in a standard format.

Here is how a Data Scientist works:

1. Ask the right questions - To understand the business problem.
2. Explore and collect data - From databases, web logs, customer feedback, etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and replace them with a
suitable value (e.g. an average value).
6. Normalize data - Scale the values to a practical, comparable range (e.g. 140 cm is shorter
than 1.8 m, yet the number 140 is larger than 1.8, so values must be brought to a common scale).
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a way the "company" can
understand.
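
Steps 3 to 6 above can be sketched in a few lines of Python. This is only an illustrative sketch: the height records, the helper name to_cm, and the rule that non-positive heights count as erroneous are all assumptions made up for this example, not part of the original notes.

```python
from statistics import mean

# Hypothetical raw records: heights in mixed units, with one missing
# value (None) and one obviously erroneous value (-5 cm).
raw_heights = [
    (140, "cm"),
    (1.8, "m"),
    (None, "cm"),
    (-5, "cm"),
    (165, "cm"),
]

def to_cm(value, unit):
    """Convert a height to centimetres so that all values share one unit."""
    if value is None:
        return None
    return value * 100 if unit == "m" else value

# Step 3: transform the data to a standardized format (everything in cm).
heights_cm = [to_cm(value, unit) for value, unit in raw_heights]

# Step 4: clean the data (assumed rule: a height must be positive).
cleaned = [h if h is None or h > 0 else None for h in heights_cm]

# Step 5: replace missing values with the average of the known values.
known = [h for h in cleaned if h is not None]
average = mean(known)
filled = [h if h is not None else average for h in cleaned]

# Step 6: normalize to the range 0..1 (min-max scaling).
low, high = min(filled), max(filled)
normalized = [(h - low) / (high - low) for h in filled]
```

After the unit conversion, 1.8 m becomes 180 cm and is correctly treated as larger than 140 cm. Replacing missing values with the average is only one possible choice; a median or a domain-specific default may suit other data better.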

Where to Start?

In this tutorial, we will start by presenting what data is and how data can be analyzed.

You will learn how to use statistics and mathematical functions to make predictions.
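
As a small taste of what such a prediction can look like, the sketch below fits a straight line to past values with the least-squares method and extends it one year ahead. The revenue figures and the function name fit_line are invented for illustration, and a straight line is assumed to be a reasonable model for this toy data.

```python
from statistics import mean

# Hypothetical yearly revenue (in millions); the numbers are made up.
years = [2019, 2020, 2021, 2022, 2023]
revenue = [10.0, 12.5, 13.5, 16.5, 17.5]

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_bar - slope * x_bar
    return slope, intercept

slope, intercept = fit_line(years, revenue)

# Extend the fitted line one year past the data to get a forecast.
forecast_2024 = slope * 2024 + intercept  # roughly 19.7 for this toy data
```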

What is Data?

Data is a collection of information.

One purpose of Data Science is to structure data, making it interpretable and easy to work with.

Data can be categorized into two groups:

 Structured data
 Unstructured data

Unstructured Data

Unstructured data is not organized. We must organize the data for analysis purposes.

Structured Data

Structured data is organized and easier to work with.
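
To illustrate the difference, the following sketch turns a few free-text lines (unstructured) into records with named fields (structured). The feedback lines, the regular expression, and the function name structure are assumptions chosen for this example; real unstructured data is usually far messier.

```python
import re

# Hypothetical free-text customer feedback (unstructured data).
raw_lines = [
    "2024-01-05 Anna rated the delivery 4 out of 5",
    "2024-01-06 Ben rated the delivery 2 out of 5",
    "no rating given, customer hung up",
]

# The pattern describes the structure we expect to find in a line.
PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2}) (\w+) rated the delivery (\d) out of 5")

def structure(lines):
    """Turn matching lines into dictionaries with named fields; skip the rest."""
    rows = []
    for line in lines:
        match = PATTERN.fullmatch(line)
        if match:
            date, customer, rating = match.groups()
            rows.append({"date": date, "customer": customer, "rating": int(rating)})
    return rows

records = structure(raw_lines)

# Once the data is structured, analysis becomes a one-liner.
average_rating = sum(r["rating"] for r in records) / len(records)
```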


Evolution of Data Science: Growth & Innovation

The term “data science” (and the practice itself) has evolved over the years. In recent years,
its popularity has grown considerably due to innovations in data collection, technology, and mass
production of data worldwide. Gone are the days when those who worked with data had to rely
on expensive programs and mainframes. The proliferation of programming languages like
Python and procedures to collect, analyze, and interpret data paved the way for data science to
become the popular field it is today.

Data science began in statistics. Part of the evolution of data science was the inclusion of
concepts such as machine learning, artificial intelligence, and the internet of things. With the
flood of new information coming in and businesses seeking new ways to increase profit and
make better decisions, data science started to expand to other fields, including medicine,
engineering, and more.

Origins, Predictions, Beginnings

We could say that data science was born from the idea of merging applied statistics with
computer science. The resulting field of study would use the extraordinary power of modern
computing. Scientists realized they could not only collect data and solve statistical problems but
also use that data to solve real-world problems and make reliable fact-driven predictions.

1962: American mathematician John W. Tukey first articulated the data science dream. In his
now-famous article “The Future of Data Analysis,” he foresaw the inevitable emergence of a
new field nearly two decades before the first personal computers. While Tukey was ahead of his
time, he was not alone in his early appreciation of what would come to be known as “data
science.” Another early figure was Peter Naur, a Danish computer engineer whose 1974 book Concise
Survey of Computer Methods offers one of the very first definitions of data science:

“The science of dealing with data, once they have been established, while the relation of the data
to what they represent is delegated to other fields and sciences.”

1977: The theories and predictions of “pre” data scientists like Tukey and Naur became more
concrete with the establishment of The International Association for Statistical Computing
(IASC), whose mission was “to link traditional statistical methodology, modern computer
technology, and the knowledge of domain experts in order to convert data into information and
knowledge.”

1980s and 1990s: Data science began taking more significant strides with the emergence of the
first Knowledge Discovery in Databases (KDD) workshop and the founding of the International
Federation of Classification Societies (IFCS). These two communities were among the first to focus
on educating and training professionals in the theory and methodology of data science (though
that term had not yet been formally adopted).

It was at this point that data science started to garner more attention from leading professionals
hoping to monetize big data and applied statistics.

1994: BusinessWeek published a story on the new phenomenon of “Database Marketing.” It
described the process by which businesses were collecting and leveraging enormous amounts of
data to learn more about their customers, competition, or advertising techniques. The only
problem at the time was that these companies were flooded with more information than they
could possibly manage. Massive amounts of data were sparking the first wave of interest in
establishing specific roles for data management. It began to seem like businesses would need a
new kind of worker to make the data work in their favor.

1990s and early 2000s: By this time, data science had clearly emerged as a recognized and
specialized field. Several data science academic journals began to circulate, and data science
proponents like Jeff Wu and William S. Cleveland continued to develop and expound upon
the necessity and potential of data science.

2000s: Technology made enormous leaps by providing nearly universal access to internet
connectivity, communication, and (of course) data collection.

2005: Big data enters the scene. With tech giants such as Google and Facebook collecting large
amounts of data, new technologies capable of processing it became necessary. Hadoop rose
to the challenge, and later on Spark and Cassandra made their debuts.

2014: Due to the increasing importance of data, and organizations’ interest in finding patterns
and making better business decisions, demand for data scientists began to see dramatic growth in
different parts of the world.

2015: Machine learning, deep learning, and Artificial Intelligence (AI) officially enter the realm
of data science. These technologies have driven innovations over the past decade, from
personalized shopping and entertainment to self-driving vehicles, along with the insights needed to
bring these real-life applications of AI into our daily lives.

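2018: New data protection regulations, most prominently the European Union's General Data
Protection Regulation (GDPR), which took effect in 2018, are perhaps one of the biggest
developments in the evolution of data science.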

2020s: We are seeing additional breakthroughs in AI and machine learning, and an ever-increasing
demand for qualified professionals in Big Data.

Life Cycle Phases of Data Analytics

Data Analytics Lifecycle:

The data analytics lifecycle is designed for Big Data problems and data science projects. The
cycle is iterative, to reflect how real projects proceed. To address the distinct requirements of
performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and
tasks involved in acquiring, processing, analyzing, and repurposing data.
Phase 1: Discovery –
 The data science team learns about and investigates the problem.
 The team develops context and understanding.
 The team identifies the data sources needed and available for the project.
 The team formulates initial hypotheses that can later be tested with data.
Phase 2: Data Preparation –
 This phase covers the steps to explore, preprocess, and condition data prior to modeling and analysis.
 It requires an analytic sandbox; the team extracts, loads, and transforms (ELT) data to
get it into the sandbox.
 Data preparation tasks are likely to be performed multiple times and not in a predefined
order.
 Tools commonly used in this phase include Hadoop, Alpine Miner, and OpenRefine.
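
Getting data into a sandbox and transforming it there can be sketched as follows. This is a minimal illustration, not the tooling named above: an in-memory SQLite database stands in for the analytic sandbox, and the table names, columns, and order rows are all hypothetical.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system:
# amounts are stored as text, and one row is incomplete.
raw_orders = [
    ("A-1", "north", "120.50"),
    ("A-2", "south", "80.00"),
    ("A-3", "north", None),
]

# An in-memory database stands in for the analytic sandbox.
sandbox = sqlite3.connect(":memory:")

# Load: copy the raw data into the sandbox unchanged.
sandbox.execute("CREATE TABLE raw_orders (order_id TEXT, region TEXT, amount TEXT)")
sandbox.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: inside the sandbox, drop incomplete rows and cast amounts to numbers.
sandbox.execute(
    """
    CREATE TABLE clean_orders AS
    SELECT order_id, region, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
    """
)

# The conditioned data is now ready for exploration, e.g. totals per region.
totals = dict(
    sandbox.execute(
        "SELECT region, SUM(amount) FROM clean_orders GROUP BY region ORDER BY region"
    ).fetchall()
)
```

Keeping the untouched raw table next to the cleaned one makes it cheap to repeat the preparation with different rules, which matches the observation that these tasks are often performed several times.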
Phase 3: Model Planning –
 The team explores the data to learn about relationships between variables and subsequently selects
the key variables and the most suitable models.
 Tools commonly used in this phase include MATLAB and STATISTICA.
Phase 4: Model Building –
 The team develops datasets for training, testing, and production purposes.
 The team builds and executes models based on the work done in the model planning phase.
 The team also considers whether its existing tools will suffice for running the models or whether it
needs a more robust environment for executing them.
 Free or open-source tools – R and PL/R, Octave, WEKA.
 Commercial tools – MATLAB, STATISTICA.
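
The planning and building phases can be sketched on a toy scale in plain Python. Everything in this sketch is an assumption made for illustration: the generated data, the 80/20 split, the choice of a straight-line model, and the helper names correlation and fit_line. Real projects would normally use the dedicated tools listed above.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical data: weekly training hours, an unrelated column, and a
# fitness score that (by construction) depends only on the training hours.
rows = []
for _ in range(50):
    hours = random.uniform(0, 10)
    unrelated = random.uniform(0, 10)
    score = 50 + 3 * hours + random.gauss(0, 1)
    rows.append((hours, unrelated, score))

def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

# Model planning: see which candidate variable relates most strongly to the target.
scores = [r[2] for r in rows]
corr_hours = correlation([r[0] for r in rows], scores)
corr_unrelated = correlation([r[1] for r in rows], scores)

# Model building: split the data into training and testing sets (80/20).
random.shuffle(rows)
cut = int(len(rows) * 0.8)
train, test = rows[:cut], rows[cut:]

# Build the model on the training set, using the key variable only.
slope, intercept = fit_line([r[0] for r in train], [r[2] for r in train])

# Execute the model on the held-out test set and measure its error.
mean_abs_error = mean(abs((slope * r[0] + intercept) - r[2]) for r in test)
```

Measuring the error on rows the model has never seen gives a fairer picture of how it is likely to behave in production than measuring it on the training rows.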
Phase 5: Communicate Results –
 After executing the model, the team needs to compare the outcomes of modeling with the criteria
established for success and failure.
 The team considers how best to articulate findings and outcomes to the various team members and
stakeholders, taking into account caveats and assumptions.
 The team should identify key findings, quantify the business value, and develop a narrative to
summarize and convey the findings to stakeholders.
Phase 6: Operationalize –
 The team communicates the benefits of the project more broadly and sets up a pilot project to deploy
the work in a controlled way before broadening it to the full enterprise of users.
 This approach enables the team to learn about the performance and related constraints of the model
in a production environment on a small scale, and to make adjustments before full deployment.
 The team delivers final reports, briefings, and code.
 Free or open-source tools – Octave, WEKA, SQL, MADlib.

Data Science Roles & What They Mean

1. Data Engineer
Job Description
Data engineers format raw data so that it can be analyzed. They collect data that will be used
downstream, manage it, and convert the data so that it can be used by business analysts and
others on the team. Data engineers build systems that make huge volumes of data more available
to an organization.

Responsibilities

 Source data and create datasets based on organization goals
 Design algorithms to convert raw data into usable information
 Create the architecture for data pipelines to data warehouses and databases
 Ensure adherence to data governance policies

Average Salary
The average salary of a data engineer is $112,202 per year.

Requirements
Data engineers usually have at least an undergraduate degree in a math or computing field. They
need to be familiar with programming languages like Python and Scala as well as database
technologies like SQL. Apache Spark and Hadoop are commonly used tools in this role too.

2. Data Scientist
Job Description
Data scientists employ statistical and analytical skills to process and derive insight from large
datasets. They usually use various programming languages to achieve that goal. These insights
unearthed by data scientists help solve key business challenges.

Responsibilities

 Frame questions based on the goals of the business
 Conduct data investigations and exploratory analyses to answer those questions
 Integrate and process data from various sources
 Choose models and algorithms to guide the data analysis process

Average Salary
The average salary of a data scientist is $74,700 per year.

Requirements
Most data scientists have at least a bachelor’s degree, usually in computer science, engineering,
or a mathematical field like statistics. Languages like Python and R are commonly used in the
field. Data scientists are sometimes required to present data, for which a data visualization tool
like Tableau is used.

3. Data Analyst
Job Description
A data analyst examines the available data and uses statistical methods to solve specific business
problems. Professionals in this field usually work in an interdisciplinary environment and
collaborate with both business and data teams. Data analysts differ from data scientists:
data scientists focus on creating tools and frameworks to gather data, while data analysts unearth
data-based insights.

Responsibilities

 Analyze data to unearth patterns and derive meaning
 Develop and maintain databases and data warehouses
 Prepare reports presenting insights obtained from the data analysis process
 Work with management, engineers, and other team members to identify opportunities to
improve data processes

Average Salary
The average salary of a data analyst is $62,610.

Requirements
Data analysts need to assess which insights can be obtained from a given dataset. They use
programming languages like Python and R to design data analysis algorithms. Data analysts also
need to present the results of their work to various stakeholders in the company.

4. Data Administrator
Job Description
Data administrators build processes to store, retrieve, and maintain the available data. They
ensure that the data coming from a given source is current and stored in a secure manner. They
also define policies concerning database environments.

Responsibilities

 Monitor and maintain an organization’s data pipeline
 Filter out data that is corrupted or irrelevant
 Write and update data governance policies
 Collaborate with various stakeholders to improve data storage and retrieval efficiency

Average Salary
The average salary for data administrator roles is $50,634.

Requirements
Data administrators need to be familiar with an organization’s data lifecycle. They use database
tools like SQL and Oracle. Hadoop is a commonly used tool for data management among
administrators.

5. Data Architect
Job Description
Data architects build and maintain an organization’s databases. They conceptualize database
architectures based on a company’s requirements and build them end to end. Data architects monitor
their databases and execute system migrations whenever needed.

Responsibilities

 Ideate and build database solutions for an organization
 Study database implementation procedures to meet internal and external regulations
 Prepare database architecture reports for executive team members
 Oversee data migration from legacy systems to new database technologies

Average Salary
The average salary of a data architect is $123,000 annually.

Requirements
Data architects need to have a strong understanding of database systems and data mining
procedures. Companies often require data architects to have at least a bachelor’s degree in
computer science or engineering. Good communication skills are also essential to update
executive teams on an organization’s evolving approach to data storage.

6. Machine Learning Engineer
Job Description
Machine learning engineers use artificial intelligence to automate data analysis processes. This
includes processes such as predictive modeling, data mining and pattern recognition. It is a role
that combines approaches from engineering, math and artificial intelligence.

Responsibilities

 Design machine learning systems to automate data-related processes
 Analyze statistical data and optimize machine learning algorithms
 Choose machine learning libraries and tools to simplify workflow
 Identify datasets to train new machine learning models

Average Salary
The average salary of a machine learning engineer is $132,900 per year.

Requirements
A bachelor’s degree in computer science or engineering is required for machine learning
engineer jobs. Professionals in this field need to be well-versed in statistics and machine learning
algorithms. Machine learning engineers are also required to have an understanding of database
architecture and database systems.

7. Machine Learning Scientist
Job Description
Machine learning scientists have research-focused roles. They research the algorithms and
models that a company plans to implement as part of its data analysis process. While machine
learning engineers primarily implement algorithms, machine learning scientists gauge their
efficiency, applicability and security.

Responsibilities

 Identify candidate algorithms to solve various data-related business problems
 Study different algorithms and identify key characteristics
 Test and implement algorithms for data analysis
 Present their findings to various stakeholders

Average Salary
The average salary of a machine learning scientist is $137,053.

Requirements
Machine learning scientists are often PhDs with a focus on artificial intelligence and neural
networks. They use tools like OpenCV to model machine learning algorithms. The role requires
the ability to work on distributed systems and model deployment.

8. Business Intelligence Developer
Job Description
A business intelligence developer analyzes data to produce insights for their organization.
Business intelligence developers also generate reports with accessible insights to help make
business decisions.

Responsibilities

 Translate organizational needs into technical specifications for data teams
 Analyze markets, products, and product-market interactions to source data points for
datasets
 Build systems for business performance monitoring and generate reports for executive
teams
 Perform quality assurance checks on business intelligence systems

Average Salary
The average salary for business intelligence developer roles is $94,800.

9. Business Analyst
Job Description
Business analysts use data to interpret changing business needs and measure how changing
processes affect a business. They also communicate between different teams, acting as
intermediaries to translate business goals into concrete objectives.

Responsibilities

 Model business processes and measure the impact of various changes using data
 Communicate changes and translate requirements for various stakeholders
 Assess data analysis proposals and suggest modifications

Average Salary
The average business analyst salary is $79,000.

Requirements
Business analysts need strong analytical skills. They use Python and R to perform analyses that
require data wrangling and data manipulation. Tools like Power BI and Tableau are commonly
used by business analysts to generate reports.

10. Database Administrator
Job Description
A database administrator (DBA) oversees a company’s database, which is important because a
company needs to have reliable access to accurate data at all times. DBAs ensure that databases
are functioning correctly and create processes for data backups.

Responsibilities

 Monitor database systems and identify any performance or security issues
 Establish authorizations for different stakeholders and guard against unauthorized access
 Design database architecture and front-end to simplify access for other team members
 Ensure that database functionality complies with the company’s data governance
policies

Average Salary
The average salary of a database administrator is $73,800.

Requirements
A strong understanding of database technologies such as SQL, PostgreSQL, and Oracle is key
for database administrators. Completing a certification like a Microsoft Certified Database
Administrator (MCDBA) can be beneficial for a career in the field. DBAs need to stay abreast of
developments in their field and recommend new tools or processes.

Applications of Data Science


1. IDENTIFYING CANCER TUMORS

2. TRACKING MENSTRUAL CYCLES

3. PERSONALIZING TREATMENT PLANS

4. CLEANING CLINICAL TRIAL DATA

5. MODELING TRAFFIC PATTERNS
