Diploma in Data Analysis
Introduction to Data Analysis
Lesson 1: Summary Notes
DATA ANALYSIS
Contents
Lesson Objectives
Introduction
Introduction to data
Conclusion
References
Lesson Objectives
• Objective 1: Introduction to data analysis
• Objective 2: Introduction to data
• Objective 3: Importing and cleaning data
Lesson Introduction
The future is rapidly evolving into one that is incredibly data driven, where decisions are made based upon data analysis. It is said that the amount of data in the world doubles every two years as more information becomes available. One of the major challenges we face with all this data is how to extract useful insights from it, and how to use it to make better, more informed, and more accurate decisions. One of the ways that we can better utilise this data is through data analysis. Data analysis helps make sense of this massive amount of data by extracting useful insights and interpreting them in a meaningful way. At the end of the day, everyone can benefit from learning more about analysing data!
Statistician John Tukey defined data analysis in 1961 as: “Procedures for analysing data, techniques for interpreting
the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or
more accurate, and all the machinery and results of (mathematical) statistics which apply to analysing data” (Tukey
and Cleveland, 1986).
Data analysis can be looked at as a journey, where with each step you get to know the data a little bit better.
Here, we might be asked by a business to help with an issue they face, or we might have a question about a problem,
and we want to use data to solve that problem.
• We want to understand what the stakeholder needs are in this step of the journey.
• What needs to happen before you can dive into the data?
• We need to understand what stakeholders want from data.
• We need to understand the question that the stakeholders are posing to us. Ways of doing this would be to ask
specific questions, for example: Which channel should we focus on more to raise revenue?
Remember: It is your responsibility to manage expectations around what you are capable of, what is possible with the data available, and how long the analysis will take. It should not take months to analyse the data, but it is important to give yourself enough time to avoid mistakes. Set realistic expectations given the data available and the time needed to analyse it.
Prioritise
Now that we are aware of the problem, we need to decide which of these questions are the most important to answer.
• Identify what the questions are that you need to answer, what are your goals? I always like to create a list of my
tasks and then assess the importance and urgency of each.
• How would the stakeholder measure the value of each question they need to answer?
• Are there any tasks that are quick wins? (i.e. something that is easily ‘ticked off the list’ and will need minimal effort.)
Open communication about the process with the stakeholder is key, so be open to any changes they might want.
Once you have decided upon which question to tackle first, you need to decide on which tools will help you achieve
your goals.
There are many different data analysis tools available today. It is always better to master one before moving on to another, but by understanding the complexity of the data analysis you want to undertake, you can choose the tool best suited to your needs.
During this course, you will learn more about data analysis through the tools Excel, R and Tableau.
Excel
Excel is the traditional tool for analysing data and is where we will start our journey.
In this course, I assume that you have some prior experience with Excel. If you have never been exposed to this tool, head to Shaw Academy’s Excel course first to get better acquainted with it.
Excel is a great tool for analysing data, especially if you want employees of various technical skills to analyse data, but
it is just one of the many tools that are available in the data analysis toolkit. We don’t have to rest all our expectations
on Excel alone.
R
R is the next tool in our toolkit in this course. R is an integrated suite of software facilities for data manipulation,
calculation, and graphical display.
What’s great about R is it is Open Source Software, meaning it is freely available for anyone to use (but more about this
later). R is one of my favourite tools to analyse data as it has some of the best data manipulation, data visualisation
and result reporting capabilities. We will cover R in more detail in module 2 and 3, so stay tuned for the exciting
journey that lies ahead!
Tableau
Tableau is the last tool in our toolkit. It is free for students, and we will end our journey by exploring it in more detail in module 4.
The next step in the data analyst journey is to ask yourself: Where does the data come from? Are we using all possible data sources available to draw insights from?
If you are combining several different data sources, it might be good to think about setting up an ETL service (once
again, more on this later).
Quality
When thinking about where the data is sourced from, we need to consider its quality. In the case of data from medical tests, for example, we could investigate the rates of false positive and false negative results, which tell us how reliable the tests are.
Relevance
This step also considers when the data was recorded: is your data still relevant today, or is it outdated?
We could consider some basic data cleaning in this step, by adding in more fields (e.g. unique variables) to create new
cumulative variables to investigate. Checking the quality of your data is exceptionally important, because herein lies
the foundation for analysis. If the data is incorrect, you will draw conclusions that are incorrect.
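As a minimal sketch of what such a quality check might look like in code (Python is used here purely for illustration; the records and field names are invented for the example):

```python
# Illustrative data-quality check: count missing and out-of-range values
# before trusting a dataset. These records and field names are hypothetical.

records = [
    {"patient_id": "A1", "test_result": "positive", "age": 34},
    {"patient_id": "A2", "test_result": "", "age": 29},          # missing result
    {"patient_id": "A3", "test_result": "negative", "age": -5},  # invalid age
]

# Count records with an empty test result.
missing_results = sum(1 for r in records if not r["test_result"])

# Count records whose age falls outside a plausible range.
invalid_ages = sum(1 for r in records if not (0 <= r["age"] <= 120))

print(f"missing results: {missing_results}, invalid ages: {invalid_ages}")
```

A high count of missing or out-of-range values is a signal to fix the source data before drawing any conclusions from it.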
What is interesting about data analysis as we practise it today is that, before computers, the US census of 1880 took over seven years to process the collected data. Fortunately, a machine capable of systematically processing data recorded on punch cards was then invented. This cut down the processing time so dramatically that the 1890 census took only 18 months to analyse.
Another turning point was reached when relational databases came into being in the 1980s which allowed us to
analyse data on demand through Structured Query Language (SQL). This sped up the process of using data to draw
insights for everyday use and gradually more computational techniques led us to what we know as data analysis today
(i.e. being able to use live data to draw conclusions).
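To make “analysing data on demand through SQL” concrete, here is a minimal illustrative sketch using Python’s built-in sqlite3 module (the sales table and its values are invented for the example):

```python
import sqlite3

# Create an in-memory relational database: a table of invented sales figures.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (channel TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("email", 1200.0), ("social", 800.0), ("email", 300.0)],
)

# Ask the data a question on demand with SQL: total revenue per channel.
rows = conn.execute(
    "SELECT channel, SUM(revenue) FROM sales GROUP BY channel ORDER BY channel"
).fetchall()
print(rows)  # [('email', 1500.0), ('social', 800.0)]
conn.close()
```

This is exactly the shift the relational model enabled: instead of a long batch process, you pose a question and get an answer immediately.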
Data analysis will always have its roots in statistical analysis, but as computational techniques evolve, the two become more integrated. This requires a data analyst to understand the statistical techniques involved, but also to be able to utilise computational techniques to extract insights in, hopefully, far less than 18 months!
Changing technologies ensure that data analysis is ever changing and a lifelong journey of learning.
Data analysis is for everyone! It’s a skill that you can use to better understand data in your everyday life, but it’s also a skill that a business can use to draw insights from the data it has available and make better, data-driven decisions.
Data analysis has become synonymous with problem solving. It can impact the way a business serves its customers.
Because of the growing skills gap, analytical skills such as data analysis have become integral not just to technology companies, but to diverse industries such as insurance, marketing, product management, customer experience and many more.
For businesses to stay competitive, it has become essential to analyse data and find meaningful insights for better decision making. Choosing to follow a data analysis path places you at the forefront of the decision-making process in a company.
Pursuing a career in data analysis allows you to choose between a variety of industries and the high demand for the
skill means that this is a valuable role. Analytics is everywhere which means that new opportunities in this sector are
constantly cropping up. It’s a hugely exciting time to be a part of this industry and start a career in analytics. There is
no doubt that analytics will continue to be a huge part of enterprises in the years to come, so without delay, let’s get
you started on the road to analysing some data!
Introduction to data
What is data?
• Data are observations or measurements (unprocessed or processed) represented as text, numbers, or multimedia.
• e.g. The height of Mount Everest is ‘data’. This piece of information informs the mountaineer how they can prepare
to ascend the mountain.
What is a dataset?
A dataset is a structured collection of data generally associated with a unique body of work. A database is an organised
collection of data stored as multiple datasets, that are generally stored and accessed electronically from a computer
system that allows the data to be easily accessed, manipulated, and updated. We manipulate and investigate data
from the dataset that is stored in the database.
But where does all this data come from? Data is information that is collected through observation and can be
qualitative or quantitative.
As mentioned previously, we use data in a vast array of industries to help make better decisions. Scientific research, business, finance, government and non-profits all use data; in fact, any organisation you can think of uses data. As a data analyst, you will analyse data and report findings back to the organisations that collected it.
Data storage
Data analysts use databases to access data. Remember that a database is a collection of information that is organised so that it can be easily accessed, managed and updated.
Types of databases
Relational database
A relational database is a collection of information that organises data points with defined relationships for easy access. In the relational database model, the data structures (data tables, indexes and views) remain separate from the physical storage, allowing administrators to edit the physical data storage without affecting the logical data structure.
Distributed database
A distributed database is a collection of multiple interconnected databases, which are spread physically across various
locations that communicate via a computer network.
Object-oriented database
An object-oriented database is a database that subscribes to a model with information represented by objects.
Graph database
A graph database is a database designed to treat the relationships between data as equally important to the data
itself. It is intended to hold data without constricting it to a pre-defined model.
In the 1980s, Richard Stallman started the free software movement, the forerunner of what is known today as the Open Source movement.
• Open data is in line with open movements such as open source and open hardware, which aim to make these tools and data free and easy to access.
• This is important because data grows exponentially every day. The hypothesis is that if there are restrictions,
businesses and governments will not be able to become more data driven in their approach.
World Bank Open Data is a vital source of open data, serving as a repository of some of the world’s most comprehensive data about what is happening in different countries. It also provides access to other datasets, which are listed in its data catalogue.
• 3000+ datasets
• Allows you to download in different formats
Kaggle
• Variety of datasets
• Encourages publishers to share data in an accessible format
• Encourages cross collaboration with other data analysts, scientists, and engineers
• Promotes competitions to solve challenges
• Users publish code snippets
Titanic dataset
On April 15, 1912, the Titanic, the largest passenger liner of her day, collided with an iceberg during her maiden voyage. When she sank, 1502 of the 2224 passengers and crew were killed. This sensational tragedy shocked the international
community and led to better safety regulations for ships. One of the reasons that the shipwreck resulted in such loss
of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck
involved in surviving the sinking, some groups of people were more likely to survive than others. Kaggle contains an
open source dataset that holds information about the survival of Titanic passengers. Prior to your lesson you must
head over to Kaggle and download the dataset to follow along with me for the final topic of importing and cleaning
this dataset with the help of Power Query in Excel.
Importing data

Manually
For this method you will manually copy-paste data into an Excel spreadsheet. The problem with this process is that
it is slow, repetitive and error prone.
Macros and VBA

Macros and VBA (Visual Basic for Applications) help us to automate the importing process. Our Excel course offers more on this if you are interested. This method requires some programming knowledge, and the macros need some ongoing maintenance.
Power Query
We will use the Power Query Extract, Transform, Load (ETL) tool available in Excel to import our dataset.
Power Query is a Business Intelligence tool available in Excel which helps us to manipulate data. It can connect to different data sources and combine and transform them. With Power Query, you can reuse queries (i.e. set up a query once and refresh it when new data becomes available). No coding knowledge is required, but you can write your own queries in M, Power Query’s formula language, if you want to.
Think of Power Query as your Extract, Transform and Load (ETL) tool in Excel. It allows you to:
• Extract: Use Power Query to discover and connect to a variety of data sources.
• Transform: Transform the extracted data by, for example, combining or refining it.
• Load: Share the transformed data.
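Power Query itself is point-and-click, so no code is needed, but the same extract-transform-load pattern can be sketched in code as an analogy. This illustrative Python sketch uses a few invented Titanic-style rows (the column names follow the Kaggle dataset; the values are made up for the example):

```python
import csv
import io

# Extract: read raw CSV text. In practice this would come from a file or
# URL; the rows here are invented Titanic-style data.
raw = """PassengerId,Name,Age,Survived
1,Smith,22,0
2,Jones,,1
3,Brown,35,1
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: convert types and mark missing ages rather than guessing them.
for row in rows:
    row["Age"] = int(row["Age"]) if row["Age"] else None
    row["Survived"] = row["Survived"] == "1"

# Load: here we just collect the cleaned result; Power Query would load
# the transformed rows into an Excel worksheet instead.
survivors = [r["Name"] for r in rows if r["Survived"]]
print(survivors)  # ['Jones', 'Brown']
```

The point of the sketch is the three distinct stages: a reusable query repeats exactly these steps each time the source data is refreshed.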
Conclusion
Data truly is everywhere, and understanding it will not only help you make sense of the world we live in but will also open a variety of opportunities for you and your career. Remember that the best way to master data analysis is to practise what we did during the lesson. If you are interested in other fields, like finance or medicine, you can download open source datasets and play around with them.
In lesson 2 we will start exploring data in further detail. We will learn more about different data types, different ways
data can be graphically represented and learn more about some basic descriptive statistics. Throughout lesson 2, we
will practise what we learn on the Titanic dataset and dive deeper into the sea of data!
References
• Tukey, J. and Cleveland, W., 1986. The Collected Works of John W. Tukey. Belmont, Calif: Wadsworth
Advanced Books & Software.
• https://www.datapine.com/blog/data-analysis-questions/
• https://www.kaggle.com/c/titanic
• https://blog.luz.vc/en/excel/how-to-enable-install-power-query-excel/
• https://powerspreadsheets.com/excel-power-query-tutorial/
• https://www.freecodecamp.org/news/why-should-you-learn-data-analysis/
• https://www.sas.com/en_nz/insights/articles/analytics/5-reasons-why-everybody-should-learn-data-analytics.html
• https://www.r-project.org/about.html
• https://searchsqlserver.techtarget.com/definition/database
• https://www.usgs.gov/faqs/what-are-differences-between-data-a-dataset-and-a-database?qt-news_science_products=0#qt-news_science_products
• https://www.flydata.com/blog/a-brief-history-of-data-analysis/