
edX - Introduction to Data Science

IBM - DS0101EN

What is Data Science?

Data Science is the knowledge of data. Nowadays we receive tons of data, so Data Science is the tool that allows us to understand the world and make predictions about the future. In short, Data Science is a very powerful tool that allows us to make the best decisions based on all the knowledge of the data.

What is Big Data?

The term Big Data refers to data sets that are so massive, so quickly built, and so varied that they defy traditional analysis methods.
Advances in data analysis mean that organizations now have the power to analyze these vast data sets, and new knowledge and insights are becoming available to everyone.

What does it take to be a good data scientist?

According to Professor Haider, the three important qualities to possess in order to succeed as a data scientist are to be curious, extremely argumentative, and judgmental.

What is Regression?

Regression analysis makes it possible to quantify, and draw inferences about, the relationship between a variable called the dependent variable (the response variable) and other variables called independent variables (explanatory variables).
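
As a hands-on illustration (not from the course notes), here is a minimal sketch of a simple linear regression in Python with scikit-learn; the data and variable names are invented:

```python
# A minimal sketch of simple linear regression using scikit-learn.
# The data here is invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Independent (explanatory) variable: years of experience
X = np.array([[1], [2], [3], [4], [5]])
# Dependent (response) variable: salary in thousands
y = np.array([30, 35, 41, 48, 52])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # estimated slope and intercept
print(model.predict([[6]]))           # prediction for a new observation
```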

The V's of Big Data

1. Volume
Is the scale of the data, or the increase in the amount of data stored.
Drivers of volume are the increase in data sources, higher-resolution sensors, and scalable infrastructure.

2. Variety
Is the diversity of the data; it reflects that data comes from different sources, machines, people, and processes. Data can be:
Structured - Data fits neatly into rows and columns, as in relational databases.
Unstructured - Data not organized in a pre-defined way, like Tweets, blog posts, pictures and videos.
Drivers are mobile technologies, social media, wearable technologies, geo technologies, and video.

3. Veracity
Is the quality and origin of data, and its conformity to facts and accuracy. Drivers
include cost and the need for traceability.

4. Value
Isn't just profit. It may have medical or social benefits, as well as customer, employee,
or personal satisfaction. The main reason that people invest time to understand Big
Data is to derive value from it.

What is Apache Hadoop?

Is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.
It provides a software framework for distributed storage and processing of big data
using the MapReduce programming model.
A MapReduce program is composed of a map procedure, which performs filtering and
sorting (such as sorting students by first name into queues, one queue for each name),
and a reduce method, which performs a summary operation.
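
To make the map/reduce idea concrete, here is a toy, single-process Python sketch of the student-queue example above; it is not actual Hadoop code, just the same pattern in miniature (in Hadoop, the map and reduce steps run distributed across many machines):

```python
# A toy, single-process illustration of the MapReduce idea
# (not actual Hadoop code): count students per first name.
from collections import defaultdict

students = ["Ana", "Bruno", "Ana", "Carla", "Bruno", "Ana"]

# Map: emit a (key, value) pair for each record
mapped = [(name, 1) for name in students]

# Shuffle/sort: group values by key (one "queue" per name)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: perform a summary operation on each group
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'Ana': 3, 'Bruno': 2, 'Carla': 1}
```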

Data mining

Data mining is the process of automatically searching and analyzing data, discovering
previously unrevealed patterns.
It involves preprocessing the data to prepare it and transforming it into an appropriate
format.
Once this is done, insights and patterns are mined and extracted using various tools
and techniques ranging from simple data visualization tools to machine learning and
statistical models.

1. Establishing Data Mining Goals
The first step in data mining requires you to set up goals for the exercise. The cost-
benefit trade-off is always instrumental in determining the goals and scope of the data
mining exercise. The level of accuracy expected from the results also influences the
costs. High levels of accuracy from data mining would cost more and vice versa.
Furthermore, beyond a certain level of accuracy, you do not gain much from the
exercise, given the diminishing returns. Thus, the cost-benefit trade-offs for the desired
level of accuracy are important considerations for data mining goals.

2. Selecting Data
The output of a data-mining exercise largely depends upon the quality of data being
used. At times, data are readily available for further processing.
On the other hand, data may not be readily available for data mining. In such cases,
you must identify other sources of data or even plan new data collection initiatives,
including surveys.
Therefore, identifying the right kind of data needed for data mining that could answer
the questions at reasonable costs is critical.

3. Preprocessing Data
Preprocessing data is an important step in data mining. Often raw data are messy,
containing erroneous or irrelevant data. In addition, even with relevant data, information
is sometimes missing. In the preprocessing stage, you identify the irrelevant attributes
of data and expunge such attributes from further consideration. At the same time,
identifying the erroneous aspects of the data set and flagging them as such is
necessary.

Data should be subject to checks to ensure integrity. Lastly, you must develop a formal
method of dealing with missing data and determine whether the data are missing
randomly or systematically.
If the data were missing randomly, a simple set of solutions would suffice. However,
when data are missing in a systematic way, you must determine the impact of missing
data on the results.
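
A minimal sketch of what preprocessing might look like in practice, using pandas; the column names and validity rules are invented for illustration:

```python
# A minimal preprocessing sketch with pandas; columns and rules
# are invented for illustration.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, -3, 41, np.nan, 37],
    "income": [40000, 52000, np.nan, 61000, 58000],
    "notes":  ["ok", "ok", "ok", "ok", "ok"],  # irrelevant attribute
})

df = df.drop(columns=["notes"])        # expunge irrelevant attributes
df.loc[df["age"] < 0, "age"] = np.nan  # flag erroneous values as missing
print(df.isna().sum())                 # inspect the missing-data pattern
```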

4. Transforming Data
After the relevant attributes of data have been retained, the next step is to determine
the appropriate format in which data must be stored.
An important consideration in data mining is to reduce the number of attributes needed
to explain the phenomena. This may require transforming the data with data reduction
algorithms.
In addition, variables may need to be transformed to help explain the phenomenon
being studied.
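
As one possible illustration of a data reduction algorithm, here is a short sketch using PCA from scikit-learn on synthetic data (the notes do not name a specific algorithm, so PCA is an assumption here):

```python
# A sketch of data reduction with PCA (scikit-learn), shrinking
# many attributes down to a few components; the data is synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 observations, 10 attributes

pca = PCA(n_components=3)             # keep 3 components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance explained per component
```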

5. Storing Data
The transformed data must be stored in a format that makes it conducive for data mining. The data must be stored in a format that gives unrestricted and immediate read/write privileges to the data scientist.
Data safety and privacy should be a prime concern when storing data.
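
As a small, hypothetical illustration (the notes do not prescribe a storage format), transformed data could be written to a columnar file such as Parquet for fast, typed reads during mining:

```python
# A sketch of storing transformed data in a columnar format
# (Parquet via pandas; requires pyarrow). The path is hypothetical.
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
df.to_parquet("transformed_data.parquet")          # compact, typed storage
df_back = pd.read_parquet("transformed_data.parquet")  # immediate read access
```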

6. Mining Data
After data is appropriately processed, transformed, and stored, it is subject to data
mining. This step covers data analysis methods, including parametric and non-
parametric methods, and machine-learning algorithms. A good starting point for data
mining is data visualization. Multidimensional views of the data using the advanced
graphing capabilities of data mining software are very helpful in developing a
preliminary understanding of the trends hidden in the data set.
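
A possible starting-point sketch, assuming the stored data sits in a pandas DataFrame (names invented): plot the raw relationship and look at summary statistics side by side:

```python
# A first-look visualization sketch with pandas/matplotlib;
# df stands in for the stored, transformed data set.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2.1, 3.9, 6.2, 8.1, 9.8]})

df.plot.scatter(x="x", y="y")  # a first look at the hidden trend
plt.show()

print(df.describe())           # summary statistics alongside the plot
```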

7. Evaluating Mining Results
After results have been extracted from data mining, you do a formal evaluation of the
results. Formal evaluation could include testing the predictive capabilities of the models
on observed data to see how effective and efficient the algorithms have been in
reproducing data. This is known as an "in-sample forecast". In addition, the results are
shared with the key stakeholders for feedback, which is then incorporated in the later
iterations of data mining to improve the process.
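
A minimal sketch of an in-sample forecast, reusing the toy regression idea from earlier: the fitted model is tested on the very data it was trained on to see how well it reproduces it:

```python
# A sketch of an "in-sample forecast": evaluate the fitted model
# on the same observed data it was trained on (toy data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)   # reproduce the observed data
print(r2_score(y, y_hat))  # how effectively the model fits in-sample
```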

What is Machine Learning?

Machine learning is a subset of AI that uses computer algorithms to analyze data and
make intelligent decisions based on what it has learned, without being explicitly
programmed. Machine learning algorithms are trained with large sets of data and they
learn from examples.
Machine learning is what enables machines to solve problems on their own and make
accurate predictions using the provided data.
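
As a toy illustration of learning from examples (the model and data are invented, not from the course), a scikit-learn decision tree trained on labeled points can then decide on unseen input without being explicitly programmed with rules:

```python
# Learning from examples: a decision tree infers its own rules
# from labeled data, then classifies a point it has never seen.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]  # training examples (features)
y = [0, 1, 0, 1]                      # labels the algorithm learns from

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[0.9, 0.2]]))      # decision for unseen data
```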

What is Deep Learning?

Deep learning is a specialized subset of Machine Learning that uses layered neural
networks to simulate human decision-making.
Deep learning algorithms can label and categorize information and identify patterns. It
is what enables AI systems to continuously learn on the job, and improve the quality
and accuracy of results by determining whether decisions were correct.
A neural network in AI is a collection of small computing units called neurons that take
incoming data and learn to make decisions over time.
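
A toy sketch, assuming nothing beyond numpy, of a single "neuron" that takes incoming data and learns to make decisions over time by nudging its weights after each example:

```python
# A toy numpy "neuron": it weighs incoming data and adjusts its
# weights after every example (invented for illustration).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, 0.0])   # learn a simple OR-like rule

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(1000):                 # learn to decide over time
    for xi, yi in zip(X, y):
        pred = sigmoid(w @ xi + b)
        error = yi - pred
        w += lr * error * xi          # adjust incoming weights
        b += lr * error

print(np.round(sigmoid(X @ w + b)))  # [1. 1. 1. 0.]
```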

Structure of a Report

It is recommended that the deliverable follows a prescribed format including the cover
page, table of contents, executive summary, detailed contents, acknowledgments,
references, and appendices (if needed).

1. Cover Page
Should include the title of the report, names of authors, their affiliations, and contacts,
the name of the institutional publisher (if any), and the date of publication.

2. Table of Contents (ToC)
Is like a map needed for a trip never taken before. You need to have a sense of the
journey before embarking on it. A map provides a visual proxy for the actual travel with
details about the landmarks that you will pass by in your trip. The ToC with main
headings and lists of tables and figures offers a glimpse of what lies ahead in the
document.

3. Abstract or Executive Summary
Nothing is more powerful than explaining the crux of your arguments in three
paragraphs or less. Of course, for larger documents running a few hundred pages, the
executive summary could be longer. An "introductory section" is always helpful in
setting up the problem. This is where you formally introduce your research questions
and hypothesis.

4. Methodology Section
Is where you introduce the research methods and data sources you used for the
analysis. If you have collected new data, explain the data collection exercise in some
detail. You will refer to the literature review to bolster your choice for variables, data,
and methods and how they will help you answer your research questions.

5. Results Section
Is where you present your empirical findings. Starting with descriptive statistics and
illustrative graphics, you will move toward formally testing your hypothesis.

6. Discussion Section
Is where you rely on the power of narrative to enable numbers to communicate your
thesis to your readers. You refer the reader to the research question and the
knowledge gaps you identified earlier. You highlight how your findings provide the
ultimate missing piece to the puzzle.

7. Conclusion Section
You generalize your specific findings and take on a rather marketing approach to
promote your findings so that the reader does not remain stuck in the caveats that you
have voluntarily outlined earlier. You might also identify future possible developments
in research and applications that could result from your research.

8. Housekeeping
List of references, the acknowledgment section (acknowledging the support of those
who have enabled your work is always good), and "appendices", if needed.

Final Exam

Introduce yourself and explain why data science is of interest to you.
A: I am a mechanical engineer, and I consider myself a person who seeks to understand the "whys" of everything around me.
I recently started to study more about data science, and my fascination has grown more and more over time.
Data science allows us to make decisions and get to know the world based on real data. What I find absolutely fascinating is the fact that we can improve the lives of many people through data analysis, an analysis done in a logical way, which makes it fair and equal.
It's not just an analysis based on averages and what is called "normal", but a deep analysis that includes outliers and deviations and allows you to see the real big picture.

Give an example of an industry where data science is being used, and how it is being
used.
A: The healthcare system is undoubtedly an area where data science has been
developed and which will have a lot to grow and evolve in the coming years. Data
science analytics can provide practical insights and aid in the decision-making.
Algorithms developed using data science and machine learning are helping doctors to
deliver even better and faster diagnostic outcomes.
Today, radiologists and cardiologists are increasingly relying on these algorithms to
detect coronary artery disease from non-contrast chest CT scans.
I think this is one of the best uses of data: to serve people and their health, and along the way to make the process more effective, faster, and more democratic, in the sense that it can reach a greater number of people.

List 2 characteristics a data scientist may possess and share why these characteristics
are important to their success.
A: The two characteristics I chose are curiosity and the ability to be a good storyteller.
Curiosity is important because it is the characteristic that allows us to continuously
explore, ask questions and pursue new challenges.
It is curiosity that makes the data scientist wonder why something behaves a certain
way, that's what made me look for this course.
Storytelling is very important for all professional areas. In the particular case of Data
Science, it is extremely important because we have to adjust our narrative to the
audience that is listening to us.

As the analysis of a data scientist is based on data in a rational way, it is important to
convey this message in a way that the audience understands and identifies with what is
being presented.
To summarize, I would like to leave a sentence from a neurologist who, like me, is Portuguese:
“Data makes people think, emotions make them act.” — Antonio Damásio.

List two components of the eight main components of a report outlined in the course,
and explain why you think they are important to include in a data science report.
A: Executive summary and the conclusion section are two of the eight main
components of a report.
The executive summary is extremely important because it lets the reader know the key points of what will be the essence of our study. A well-written summary is one that, without going into too much detail, makes the reader more and more curious and willing to know more about the subject.
The conclusion section is the closing statement; it's used to promote our findings. In short, it is where all the dots are connected. It should demonstrate the importance of our ideas, solutions, and opportunities for future research and applications.

