Unit I (Notes 2)
A quick note here: data analysis and data science are not the same. Although
they belong to the same family, data science is typically more advanced (a lot
more programming, creating new algorithms, building predictive models, etc.).
1. Define the question or goal behind the analysis: what are you trying to
discover?
4. Manipulate data using Excel or Google Sheets. This may include plotting the
data out, creating pivot tables, and so on.
5. Analyze and interpret the data using statistical tools (e.g., finding correlations,
trends, and outliers).
Is data analytics hard? Well, the great thing about data analysis is that it’s more
of an entry-level role, meaning you can jump right in with basic knowledge after
you take some data analysis courses for beginners and sharpen a few key skills.
(Of course, it certainly won’t hurt if you already have experience with coding,
math, or statistics!)
Becoming a data analyst can also open the door to lucrative careers like data
science and data engineering (just to name a few) as you gain more experience
on the job.
What is the key objective of data analysis? That depends on what type of data
analysis skills you’re using. Here are five kinds of data analytics.
Diagnostic analysis: Takes the insights found from both descriptive and
exploratory analytics and investigates further to find the causes.
Predictive analysis: This type is often used more by data scientists, rather than
data analysts. It uses data, statistics, and machine learning algorithms and
techniques to figure out the likelihood of future outcomes based on data.
Examples include sales forecasting and risk assessment.
Prescriptive analysis: Takes insights found from all types of data analysis
(descriptive, exploratory, diagnostic, predictive) to determine the best course of
action.
Next, what are the methods data analysts use to accomplish these various
objectives? Here’s a quick introduction to data analytics methods.
Cluster analysis: Organizes data into groups, or clusters, that share common
characteristics.
Regression analysis: A set of statistical processes that allows you to examine
the relationship between two or more variables.
Factor analysis: Condenses several variables into just a few to make data
analysis easier.
Data mining: The process of finding trends, patterns, and correlations in large
data sets.
Data Requirements Specification
The data required for analysis is based on a question or an experiment. Based on the
requirements of those directing the analysis, the data necessary as inputs to the
analysis is identified (e.g., Population of people). Specific variables regarding a
population (e.g., Age and Income) may be specified and obtained. Data may be
numerical or categorical.
Data Collection
Data Collection is the process of gathering information on targeted variables identified
as data requirements. The emphasis is on ensuring accurate and honest collection
of data. Data Collection ensures that data gathered is accurate such that the related
decisions are valid. Data Collection provides both a baseline to measure and a target
to improve.
Data is collected from various sources, ranging from organizational databases to the
information in web pages. The data thus obtained may not be structured and may
contain irrelevant information. Hence, the collected data must be subjected
to Data Processing and Data Cleaning.
Data Processing
The data that is collected must be processed or organized for analysis. This includes
structuring the data as required for the relevant Analysis Tools. For example, the data
might have to be placed into rows and columns in a table within a Spreadsheet or
Statistical Application. A Data Model might have to be created.
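As a minimal sketch of this step, the snippet below uses Python's standard csv module to place raw text into rows and columns. The sample records and the column names (name, age, income) are made up for illustration and do not come from these notes.

```python
import csv
import io

# Hypothetical raw data, as it might arrive from a file or a web page.
raw_text = """name,age,income
Asha,34,52000
Ravi,29,48000
Meena,41,61000
"""

# DictReader turns each line into a row keyed by the column names in the header.
reader = csv.DictReader(io.StringIO(raw_text))
rows = list(reader)

# Convert the numeric columns from text to numbers so they can be analyzed.
for row in rows:
    row["age"] = int(row["age"])
    row["income"] = float(row["income"])

for row in rows:
    print(row["name"], row["age"], row["income"])
```

Each element of rows is now one record, and each key is one column, which is the shape most analysis tools expect.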
Data Cleaning
The processed and organized data may be incomplete, contain duplicates, or contain
errors. Data Cleaning is the process of preventing and correcting these errors. There
are several types of Data Cleaning that depend on the type of data. For example,
while cleaning the financial data, certain totals might be compared against reliable
published numbers or defined thresholds. Likewise, quantitative methods can
be used to detect outliers, which may then be excluded from the analysis.
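The cleaning steps described above can be sketched as follows. This is one possible approach, not a prescribed method: the records are invented, and the interquartile range (IQR) rule with a factor of 1.5 is an assumed choice of outlier test.

```python
import statistics

# Hypothetical records: (customer_id, monthly_spend). None marks a missing value.
records = [
    (1, 120.0), (2, 135.0), (2, 135.0),   # customer 2 appears twice
    (3, None),                             # incomplete record
    (4, 128.0), (5, 131.0), (6, 126.0),
    (7, 980.0),                            # suspiciously large value
]

# 1. Drop incomplete records.
complete = [r for r in records if r[1] is not None]

# 2. Remove exact duplicates while keeping the original order.
deduplicated = list(dict.fromkeys(complete))

# 3. Flag outliers with the IQR rule (assumed factor: 1.5).
values = [spend for _, spend in deduplicated]
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [r for r in deduplicated if not low <= r[1] <= high]
cleaned = [r for r in deduplicated if low <= r[1] <= high]

print("Outliers:", outliers)
print("Cleaned:", cleaned)
```

Whether a flagged value is really an error is a judgment call; the code only marks candidates for review.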
Data Analysis
Data that is processed, organized and cleaned would be ready for the analysis.
Various data analysis techniques are available to understand, interpret, and derive
conclusions based on the requirements. Data Visualization may also be used to
examine the data in graphical format, to obtain additional insight regarding the
messages within the data.
Statistical Data Models such as Correlation and Regression Analysis can be used to
identify the relations among the data variables. These models, which are descriptive of
the data, help simplify the analysis and communicate results.
The process might require additional Data Cleaning or additional Data Collection, and
hence these activities are iterative in nature.
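To make correlation and regression concrete, here is a small sketch written with plain Python. The function names pearson_correlation and fit_line are hypothetical helpers, and the study-hours data is invented for illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

def fit_line(xs, ys):
    """Return the least-squares (slope, intercept) for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r = pearson_correlation(hours, scores)
slope, intercept = fit_line(hours, scores)
print("correlation:", round(r, 3))
print("fitted line: score =", round(slope, 2), "* hours +", round(intercept, 2))
```

A correlation close to 1 suggests a strong positive linear relationship, and the fitted line summarizes it in a form that is easy to report.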
Communication
The results of the data analysis are to be reported in a format as required by the users
to support their decisions and further action. The feedback from the users might result
in additional analysis.
The data analysts can choose data visualization techniques, such as tables and
charts, which help in communicating the message clearly and efficiently to the users.
The analysis tools provide facilities to highlight the required information with color codes
and formatting in tables and charts.
PYTHON Introduction:
It is used for:
Python Features
Python is a dynamic, high-level, free, open-source, interpreted programming
language. It supports object-oriented programming as well as procedure-
oriented programming. In Python, we don’t need to declare the type of a variable
because it is a dynamically typed language. For example, after x = 10, x refers to
an int, but the same name can later be assigned a value of any other type, such as a string.
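A short sketch of dynamic typing: the same name is assigned values of different types, and type() reports the type of whatever value the name currently refers to. The specific values are illustrative.

```python
x = 10
print(type(x))   # <class 'int'>

x = "ten"        # the same name now refers to a string
print(type(x))   # <class 'str'>

x = 10.0         # and now to a floating-point number
print(type(x))   # <class 'float'>
```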
Features in Python
There are many features in Python, some of which are discussed below –
1. Easy to code: Python is a high-level programming language. It is easy to
learn compared with languages such as C, C#, JavaScript, and Java. The basics
can be picked up in a few hours or days, and the language is
developer-friendly.
2. Free and Open Source: Python is freely available from the official
website, where you can download it. Since it is open source, the source code is
also available to the public, so you can download it, use it, and
share it.
3. Object-Oriented Language: One of the key features of Python is
object-oriented programming. Python supports object-oriented concepts such as
classes, objects, and encapsulation.
4. GUI Programming Support: Graphical user interfaces can be made using a
module such as PyQt5, PyQt4, wxPython, or Tkinter. PyQt5 is a popular
option for creating graphical apps with Python.
5. High-Level Language: Python is a high-level language. When we write
programs in python, we do not need to remember the system architecture, nor
do we need to manage the memory.
6. Extensible feature: Python is an extensible language. Parts of a Python
program can be written in C or C++, compiled, and then used from Python
code.
7. Python is a Portable language: Python code written on Windows can usually
run unchanged on other platforms such as Linux, Unix, and Mac, as long as it
does not rely on platform-specific features.
8. Python is an Integrated language: Python is also an integrated language
because we can easily integrate Python with other languages such as C and C++.
9. Interpreted Language: Python is an interpreted language because Python
code is executed line by line. Unlike languages such as C, C++, and Java, there
is no separate compile step to run, which makes it easier to debug our code. The
source code of Python is converted into an intermediate form called bytecode.
10. Large Standard Library: Python has a large standard library which
provides a rich set of modules and functions, so you do not have to write your own
code for every single thing. There are libraries in Python for tasks such
as regular expressions, unit testing, and opening web browsers.
11. Dynamically Typed Language: The type of a variable is decided
at run time, not in advance. Because of this feature, we don’t need to specify the
type of a variable.
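Two of the features listed above, object-oriented programming and the standard library, can be sketched in a few lines. The Student class and the sample text are hypothetical; re is the standard-library module for regular expressions.

```python
import re

class Student:
    """A small class that bundles data (name, marks) with behavior."""

    def __init__(self, name, marks):
        self.name = name
        self._marks = list(marks)   # leading underscore: internal by convention

    def average(self):
        """Return the mean of the stored marks."""
        return sum(self._marks) / len(self._marks)

student = Student("Asha", [72, 85, 90])
print(student.name, round(student.average(), 2))

# Standard library: find every run of digits in a piece of text.
numbers = re.findall(r"\d+", "Order 66 shipped 3 items on day 12")
print(numbers)
```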
We write the Python code in any text editor and save the file with the
“.py” extension. Now, how will this code run? A program such as
“python” or “python3” must be installed on your system, and its job is
to run the Python code. This type of program is called an interpreter.
In interactive use, the Python interpreter reads a command, evaluates it,
prints the result, and then loops back to read the next command. For this
reason the interactive interpreter is known as a REPL (Read,
Evaluate, Print, Loop).
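The read, evaluate, print, loop cycle can be imitated with a toy function. This is only a sketch: mini_repl is a hypothetical name, it reads from a list instead of the keyboard, and it uses eval, which handles expressions only and should never be given untrusted input.

```python
def mini_repl(lines):
    """Read each expression, evaluate it, print the result, and loop."""
    results = []
    for line in lines:          # Read (from a list instead of the keyboard)
        value = eval(line)      # Evaluate
        print(">>>", line)
        print(value)            # Print
        results.append(value)
    return results              # Loop ends when the input is exhausted

outputs = mini_repl(["1 + 1", "'Hello, ' + 'World!'"])
```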
The Python interpreter can be used in two modes:
1. Interactive mode
2. Script mode
1. Interactive mode:
❖ Interactive mode, as the name suggests, allows us to interact with the
interpreter directly.
❖ When we type a Python statement, the interpreter displays the
result(s) immediately.
Advantages:
❖ Python, in interactive mode, is good enough to learn, experiment or
explore.
❖ Working in interactive mode is convenient for beginners and for testing
small pieces of code.
Drawback:
❖ We cannot save the statements and have to retype all the statements once
again to re-run them.
In interactive mode, you type Python programs and the interpreter displays
the result:
>>> 1 + 1
2
The chevron, >>>, is the prompt the interpreter uses to indicate that it is
ready for you to enter code. If you type 1 + 1, the interpreter replies 2.
>>> print('Hello, World!')
Hello, World!
This is an example of a call to the print function (in Python 3, print is a
function, not a statement). It displays a result on the screen. In
this case, the result is the words Hello, World!
2. Script mode:
❖ In script mode, we type a Python program in a file and then use the interpreter
to execute the contents of the file.
❖ Scripts can be saved to disk for future use. Python scripts have
the extension .py, meaning that the filename ends with .py.
❖ Save the code as filename.py and run the interpreter in script mode to
execute the script.
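A minimal sketch of a script follows. The file name greet.py and the function greet are hypothetical. Assuming the interpreter is installed under the name python, the script would be run from a terminal with the command python greet.py.

```python
# Contents of a file saved as greet.py (hypothetical name)

def greet(name):
    """Return a greeting for the given name."""
    return f"Hello, {name}!"

if __name__ == "__main__":
    # This part runs only when the file is executed as a script.
    print(greet("World"))
```

Unlike statements typed in interactive mode, the file can be run again at any time without retyping it.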
Python shell to display output with syntax highlighting.
Example
x = 5
print(type(x))
Scalar Types
Sequence Type
List Items
List items are indexed, the first item has index [0], the second item has
index [1] etc.
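Example
Create a List:
# list contents assumed; they mirror the tuple and set examples in these notes
thislist = ["apple", "banana", "cherry"]
print(thislist)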
Ordered
When we say that lists are ordered, it means that the items have a
defined order, and that order will not change.
If you add new items to a list, the new items will be placed at the end
of the list.
Changeable
The list is changeable, meaning that we can change, add, and remove
items in a list after it has been created.
Allow Duplicates
Since lists are indexed, lists can have items with the same value:
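The list properties described above (indexed, ordered, changeable, duplicates allowed) can be seen in a short sketch. The variable name fruits and its contents are illustrative.

```python
fruits = ["apple", "banana", "cherry"]

print(fruits[0])          # indexed: the first item has index 0

fruits.append("apple")    # new items go to the end; duplicates are allowed
fruits[1] = "mango"       # changeable: replace an existing item
print(fruits)

fruits.remove("cherry")   # changeable: remove an item
print(fruits)
print(fruits.count("apple"))   # how many times "apple" appears
```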
Example
• Create a Tuple:
thistuple = ("apple", "banana", "cherry")
print(thistuple)
• Tuple items are indexed, the first item has index [0], the second item
has index [1] etc.
Mapping Type
Dictionary: A dictionary (dict) object is a collection of data in key:value
pair form. In Python 3.6 and earlier, dictionaries are unordered; from Python 3.7
onward they keep insertion order.
A collection of such pairs is enclosed in curly brackets.
• Dictionaries are written with curly brackets, and have keys and
values:
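A small sketch of a dictionary follows. The keys and values are invented for illustration.

```python
# Keys and values are written as key: value pairs inside curly brackets.
student = {"name": "Asha", "age": 21, "course": "Data Analysis"}

print(student["name"])        # look up a value by its key

student["age"] = 22           # change the value stored under an existing key
student["city"] = "Pune"      # add a new key:value pair

for key, value in student.items():
    print(key, "->", value)
```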
Set Types
• Create a Set:
thisset = {"apple", "banana", "cherry"}
print(thisset)
• Set items are unordered and do not allow duplicate values. An item cannot
be changed in place, but items can be added to or removed from a set.
• frozenset: A frozenset is an immutable version of a set. Its elements are
taken from another iterable when it is created.
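The behavior of set and frozenset can be sketched as follows. The names fruit_set and frozen_fruits and their contents are illustrative.

```python
# Duplicate values are stored only once.
fruit_set = {"apple", "banana", "cherry", "apple"}
print(len(fruit_set))

# Items can be added to a set after it is created.
fruit_set.add("mango")

# A frozenset is built from another iterable and cannot be changed afterwards.
frozen_fruits = frozenset(["apple", "banana"])
try:
    frozen_fruits.add("cherry")
except AttributeError as error:
    print("frozenset cannot be changed:", error)

# Both types support set operations such as the subset test.
print(frozen_fruits <= fruit_set)
```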
Mutable and Immutable Types
Data objects of the above types are stored in a computer's memory
for processing. Some of these values can be modified during
processing, but the contents of others can't be altered once they are
created in memory.
Numbers, strings, and tuples are immutable, which means
their contents can't be altered after creation. Lists, dictionaries, and
sets are mutable.
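The difference between mutable and immutable types shows up as soon as we try to change an object in place. The variable names and values in this sketch are illustrative.

```python
# A list is mutable: an item can be replaced in place.
numbers_list = [1, 2, 3]
numbers_list[0] = 99
print(numbers_list)

# A tuple is immutable: the same operation raises a TypeError.
numbers_tuple = (1, 2, 3)
try:
    numbers_tuple[0] = 99
except TypeError as error:
    print("tuple cannot be changed:", error)

# A string is immutable: methods return a new string instead.
text = "data"
upper_text = text.upper()
print(text, upper_text)
```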
Python Variables
Python is not “statically typed”. We do not need to declare variables before
using them or declare their type. A variable is created the moment we first
assign a value to it. A variable is a name given to a memory location. It is the
basic unit of storage in a program.
• The value stored in a variable can be changed during program
execution.
• A variable is only a name given to a memory location; all the
operations done on the variable affect that memory location.
Rules for creating variables in Python:
• A variable name must start with a letter or the underscore
character.
• A variable name cannot start with a number.
• A variable name can only contain alphanumeric characters and
underscores (A-Z, a-z, 0-9, and _).
• Variable names are case-sensitive (name, Name and NAME are
three different variables).
• The reserved words (keywords) cannot be used for naming a
variable.
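The rules above can be checked in code. The helper is_valid_variable_name is a hypothetical function built on two standard features, str.isidentifier() and the keyword module. Note that isidentifier() also accepts non-ASCII letters, so it is slightly more permissive than the rules as stated here.

```python
import keyword

def is_valid_variable_name(name):
    """Return True if name is a legal identifier and not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = ["age", "_total", "salary2", "2nd_place", "my-name", "class"]
for candidate in candidates:
    print(candidate, is_valid_variable_name(candidate))
```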
Example:
# An integer assignment
age = 45
# A floating point
salary = 1456.8
# A string
name = "John"
print(age)
print(salary)
print(name)
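# Multiple assignment; the values below are illustrative assumptions,
# because the original assignment line is missing from these notes
a, b, c = 1, 2.5, "John"
print(a)
print(b)
print(c)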
Python Identifiers
A Python identifier is a name used to identify a variable, function, class, module or
other object. An identifier starts with a letter A to Z or a to z or an underscore (_)
followed by zero or more letters, underscores and digits (0 to 9).
Python does not allow punctuation characters such as @, $, and % within identifiers.
Python is a case sensitive programming language.
Thus, Manpower and manpower are two different identifiers in Python.
Here are naming conventions for Python identifiers −
• Class names start with an uppercase letter. All other identifiers start with a
lowercase letter.
• Starting an identifier with a single leading underscore indicates, by convention,
that the identifier is private.
• Starting an identifier with two leading underscores indicates a strongly private
identifier.
• If the identifier also ends with two trailing underscores, the identifier is a
language-defined special name.
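The underscore conventions above can be illustrated with a small class. The Account class and its attributes are hypothetical. The sketch shows that a single underscore is only a convention, while two leading underscores make Python store the attribute under a changed name (name mangling).

```python
class Account:
    """Class names start with an uppercase letter."""

    def __init__(self, owner, balance):   # __init__ is a language-defined special name
        self.owner = owner                # ordinary public attribute
        self._balance = balance           # single underscore: private by convention
        self.__pin = "0000"               # two underscores: stored as _Account__pin

account = Account("Asha", 100)

print(account._balance)                   # still reachable; the underscore is a hint
print(hasattr(account, "__pin"))          # the plain name is not found from outside
print(hasattr(account, "_Account__pin"))  # the mangled name exists
```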
Python Statement
Multi-line statement
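# Explicit line continuation with a backslash
a = 1 + 2 + 3 + \
    4 + 5 + 6 + \
    7 + 8 + 9
# Implicit line continuation inside parentheses
a = (1 + 2 + 3 +
     4 + 5 + 6 +
     7 + 8 + 9)
# Implicit line continuation inside square brackets
colors = ['red',
          'blue',
          'green']
# Multiple statements on one line, separated by semicolons
a = 1; b = 2; c = 3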