Data Analytics With Python
MACHINE LEARNING AND AI WITH PYTHON
Welcome To The World Of Data
Data Science
Data Analytics
Data Analysis
AI
ML
And
DL
What is Data Science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning and big data.
What Is Data Analysis
Data analysis refers to the
process of examining,
transforming and arranging a
given data set in specific ways
in order to study its individual
parts and extract useful
information.
As part of any organization, there is a very high chance that you will have a database from which to get the data for analysis.
You cannot make an accurate analysis on such a raw dataset, so you need to clean it first, and this cleaning will typically take 60% to 70% of your project time.
It is all about making predictions from your data and taking the right decisions; here you will use the most exciting tool of all, machine learning.
Finally, it is necessary to present your analysis to your client as well as to upper management.
Why Python?
● The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.
● Designed to be easy to learn and master ○ Clean, clear syntax ○ Very few keywords
● Highly portable
● Extensible
Programming Style
● Python modules and programs are differentiated only by the way they are called.
● .py files executed directly are programs (often referred to as scripts)
Anaconda Installation
Individual Edition
Click Here
print() function
● The print function in Python is a function that outputs to your console
window whatever you say you want to print out.
● At first blush, it might appear that the print function is rather useless for
programming, but it is actually one of the most widely used functions in all
of python. The reason for this is that it makes for a great debugging tool.
Refer this example:
● Click here
● Click here
Escape Sequences
Click here
end=" "
● The end=' ' argument tells print() to end its output with a space instead of the default newline, so the next print() call continues on the same line.
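A minimal sketch (not from the original slides) tying print(), escape sequences, and the end/sep parameters together:

# print() writes its arguments to the console
print("Hello, world")

# escape sequences: \n = newline, \t = tab, \" = literal quote
print("Line 1\nLine 2\tTabbed \"quoted\"")

# end=" " replaces the default newline, so the next print continues on the same line
print("Data", end=" ")
print("Analytics")              # prints: Data Analytics

# sep controls the string placed between multiple values
print(2022, 12, 31, sep="-")    # prints: 2022-12-31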
● Types of comments : single-line comments start with #; longer explanations can be written as triple-quoted strings (docstrings).
E.g. number = 20
age = 21
10name = "python"    # invalid: an identifier cannot start with a digit
NAME = "python"
print(name)          # NameError: names are case-sensitive, so NAME and name are different
● Click here
https://github.com/TopsCode/Python/blob/master/Module1/1.1%20Programing%20Style/1.1.6%20Variable.py
● Click here
https://github.com/TopsCode/Python/blob/master/Module1/1.1%20Programing%20Style/1.1.7%20sum%20of%20two%20numbers(variable).py
1. Mutable:
Float
Floats represent real numbers and are written with a decimal point.
Strings
Python does not have a character data type, a single character is simply a
string with a length of 1. Square brackets can be used to access elements of
the string.
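A short illustrative snippet (the variable name is made up) showing that a single character is just a string of length 1 and that square brackets index into a string:

word = "python"
print(word[0])       # 'p'   -> a string of length 1; there is no separate char type
print(word[-1])      # 'n'   -> negative indices count from the end
print(word[1:4])     # 'yth' -> slicing returns a substring
print(len(word[0]))  # 1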
Operators in Python
● To perform specific operations we need to use some symbols, and those symbols are called operators.
Example :
A + B
Here, + is an operator and A + B is an expression.
Arithmetic Operators
Assignment Operators
Logical Operators
Comparison Operators
Identity Operators
Membership Operators
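A hedged sketch (the variables a and b are illustrative) showing one operator from each of the categories listed above:

a, b = 10, 3
print(a + b)            # arithmetic operator   -> 13
a += 5                  # assignment operator   -> a is now 15
print(a == 15)          # comparison operator   -> True
print(a > b and b > 0)  # logical operator      -> True
print(a is b)           # identity operator     -> False (different objects)
print(3 in [1, 2, 3])   # membership operator   -> True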
Collections
● List
● Tuple
● Dictionaries
● Set
● The most versatile is the list, which can be written as a list of comma-separated values.
● Lists might contain items of different types, but usually the items all have the same type.
2. Accessing List
● Like strings (and all other built-in sequence types), lists can be indexed and sliced.
Example: fruits[0]
Example: fruits[-3:-1]
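A small sketch (the fruits list is assumed, following the slide examples) of indexing and slicing a list:

fruits = ["Mango", "Banana", "Oranges", "Apple", "Grapes"]
print(fruits[0])      # 'Mango'              -> first item
print(fruits[-3:-1])  # ['Oranges', 'Apple'] -> slice from third-last up to (not including) the last
fruits[1] = "Kiwi"    # lists are mutable, so items can be replaced in place
print(fruits)
print("Mango" in fruits)   # the "in" operator checks membership -> True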
3. Operations
● “in” operator :- This operator is used to check if an element is
present in the list or not.
● Click here
● Click Here
● Click here
● The differences between tuples and lists are: tuples cannot be changed, unlike lists, and tuples use parentheses whereas lists use square brackets.
● E.g. fruits = ("Mango", "Banana", "Oranges", 23, 44)
● E.g. numbers = (11, 22, 33, 44)
● E.g. fruits = "Mango", "Banana", "Oranges"
Introduction
● Unlike lists, tuples are immutable.
This means that elements of a tuple cannot be changed once it has been
assigned.
● But if the element is itself a mutable data type like a list, its nested items can be changed.
● The index of -1 refers to the last item, -2 to the second last item and so on.
Click here
Click here
Click here
Click here
Click here
Click here
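A minimal sketch of the tuple behaviour described above; the variable names are illustrative:

fruits = ("Mango", "Banana", "Oranges", 23, 44)
print(fruits[0])      # 'Mango'
print(fruits[-1])     # 44 -> index -1 is the last item
# fruits[0] = "Kiwi"  # would raise TypeError: tuples are immutable
nested = ("fixed", [1, 2, 3])
nested[1].append(4)   # allowed: only the inner (mutable) list changes, not the tuple itself
print(nested)         # ('fixed', [1, 2, 3, 4])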
Set
A set is an unordered collection of unique elements.
To create a set:
set1=set((1,2,3,4,3,4,5,6,5,5,4))
print(set1)
Output: {1, 2, 3, 4, 5, 6}
Dictionaries
1. Introduction
2. Accessing values in dictionaries
3. Working with dictionaries
4. Properties
5. Functions
Introduction
● Dictionaries are sometimes found in other languages as “associative
memories” or “associative arrays”.
● Tuples can be used as keys if they contain only strings, numbers, or tuples;
if a tuple contains any mutable object either directly or indirectly, it cannot
be used as a key.
1. Introduction
● You can’t use lists as keys, since lists can be modified in place using index
assignments, slice assignments, or methods like append() and extend().
● The main operations on a dictionary are storing a value with some key and
extracting the value given the key.
● Like lists they can be easily changed, can be shrunk and grown ad libitum
at run time. They shrink and grow without the necessity of making copies.
Dictionaries can be contained in lists and vice versa.
Introduction
● But the main difference is that items in dictionaries are accessed via keys
and not via their position.
2. Accessing Values
● To access dictionary elements, we can use the familiar square brackets
along with the key to obtain its value.
● We can also create a dictionary using the built-in class dict() (constructor).
Click here
Click here
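A short sketch (the student dictionary is illustrative) of accessing values by key and building a dictionary with the dict() constructor:

student = {"name": "Asha", "age": 21}
print(student["name"])              # access by key -> 'Asha'
student["age"] = 22                 # store a value under an existing key
student["city"] = "Surat"           # add a new key/value pair
other = dict(name="Ravi", age=23)   # building a dictionary with the dict() constructor
print(other.get("marks", 0))        # .get() returns a default instead of raising KeyError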
4. Properties
● Properties of Dictionaries
○ Dictionary values can be any arbitrary Python object, either standard objects or user-defined objects.
○ Dictionary keys must be immutable, which means you can use strings, numbers or tuples as dictionary keys, but a mutable type such as a list is not allowed.
● If..Elif..else Statement
● Nested if Statement
If Statements
● It is similar to that of other languages.
● Syntax :
if condition:
statements
If .. else statement
● It is similar to that of other languages.
● It is frequently the case that you want one thing to happen when a condition is true, and something else to happen when it is false.
● Syntax :
if condition:
    statements
else:
    statements
If..elif..else statement
● It is similar to that of other languages.
● The elif is short for else if. It allows us to check for multiple expressions.
● If the condition for if is False, it checks the condition of the next elif block
and so on.
if condition:
    statements
elif condition:
    statements
Nested if….else statement
● There may be a situation when you want to check for another condition
after a condition resolves to true.
● Syntax :
if condition:
    statements
    if condition:
        statements
    else:
        statement(s)
Refer this Example :
1.2.1 if statement
Click here
Click here
Click here
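A minimal sketch of the if / elif / else and nested if forms described above (the marks value is illustrative):

marks = 72
if marks >= 80:
    print("Distinction")
elif marks >= 60:
    print("First class")
    if marks >= 70:            # nested if inside the elif branch
        print("Close to distinction")
else:
    print("Needs improvement")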
○ For Loop
○ While Loop
For Loops
● For loop has the ability to iterate over the items of any sequence, such as a
list or a string.
● Syntax :
for iterating_var in sequence:
    statements(s)
● Then, the first item in the sequence is assigned to the iterating variable
iterating_var.
● Next, the statements block is executed.
● Each item in the list is assigned to iterating_var, and the statement(s) block
is executed until the entire sequence is exhausted
Click here
Click here
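A minimal for-loop sketch; each item of the sequence is assigned to the iterating variable in turn:

fruits = ["Mango", "Banana", "Oranges"]
for fruit in fruits:          # fruit is the iterating variable
    print(fruit)
for ch in "data":             # strings are sequences too
    print(ch, end=" ")        # prints: d a t a
print()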
Nested Loops
Nested loop means a loop statement inside another loop
statement.
Syntax :
for iterating_var in sequence:
    for iterating_var in sequence:
        statements(s)
    statements(s)
Refer this Example :
1.3.5 nested for loop: Click here
● To loop through a set of code a specified number of times, we can use the
range() function,
Click here
Click here
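A short sketch combining range() with a nested loop (a small multiplication table, chosen only for illustration):

for i in range(1, 4):           # outer loop: 1, 2, 3
    for j in range(1, 4):       # inner loop runs fully for every value of i
        print(i * j, end=" ")
    print()                     # newline after each row
# output:
# 1 2 3
# 2 4 6
# 3 6 9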
While Loop
● A while loop statement in Python programming language repeatedly
executes a target statement as long as a given condition is true.
● Syntax :
while expression:
statement(s)
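A minimal while-loop sketch; the counter variable is illustrative:

count = 1
while count <= 5:     # loop body runs as long as the condition is true
    print(count)
    count += 1        # without this update the loop would never end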
Control Statements
● Loop control statements change execution from its normal sequence.
● When execution leaves a scope, all automatic objects that were created in
that scope are destroyed.
1. Break
2. Continue
3. Pass
Break Statement
● It brings control out of the loop and transfers execution to the
statement immediately following the loop.
● Syntax : break
Click Here
Continue Statements
● It skips the remaining statements in the current iteration and returns control to the beginning of the loop for the next iteration.
● Syntax : continue
Click here
Pass Statements
● The pass statement does nothing.
● Syntax : pass
● Refer this Example :
Click here
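A small sketch showing break, continue and pass in one loop (the numbers are arbitrary):

for n in range(1, 10):
    if n == 7:
        break          # leave the loop entirely when n reaches 7
    if n % 2 == 0:
        continue       # skip the rest of this iteration for even numbers
    if n == 5:
        pass           # placeholder: does nothing, execution just falls through
    print(n)           # prints 1, 3, 5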
FUNCTIONS
Function Definition
● A function is a block of organized, reusable code that is used to
perform a single, related action.
○ Python gives us many built-in functions like print(), etc. but we can
also create our own functions.
● Syntax :
def functionname( parameters ):
    function_suite
    return [expression]
Defining a Function
● The keyword "def" introduces a function definition.
● A defined function is called as functionname() or functionname(argument).
Click here
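A minimal sketch of defining and calling a function with def and return (the function name is illustrative):

def add(a, b):
    """Return the sum of a and b."""
    return a + b

result = add(10, 5)    # calling the function with arguments
print(result)          # 15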
Function Arguments
● It is possible to define functions with a variable number of arguments.
○ Default argument values
○ Keyword arguments
Function Arguments
● Default arguments values
○ The most useful form is to specify a default value for one or more
arguments.
○ This creates a function that can be called with fewer arguments than it
is defined to allow.
● Note : The default value is evaluated only once. This makes a difference
when the default is a mutable object such as a list, dictionary, or instances of
most classes.
Keyword Arguments
return sep.join(args)
Click here
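A hedged reconstruction of the kind of example the "return sep.join(args)" fragment above comes from, plus default and keyword arguments; the function names are assumptions:

def concat(*args, sep="/"):
    # *args collects any number of positional arguments; sep is a keyword argument with a default
    return sep.join(args)

print(concat("usr", "local", "bin"))         # usr/local/bin
print(concat("2022", "12", "31", sep="-"))   # 2022-12-31

def greet(name, msg="Good morning"):         # msg has a default value
    print(msg + ", " + name)

greet("Asha")                                # called with fewer arguments than defined
greet(msg="Good evening", name="Ravi")       # keyword arguments can come in any order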
Scope of Variables
Global variables
○ Defining a variable at the module level makes it a global variable; you don't need the global keyword to read it.
○ The global keyword is needed only if you want to reassign the global
variables in the function/method.
Local variables
○ If a variable is assigned a value anywhere within the function’s body, it’s
assumed to be a local unless explicitly declared as global.
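A small sketch of local vs global scope and the global keyword:

counter = 0                 # module-level (global) variable

def increment():
    global counter          # needed only because we reassign the global
    counter += 1

def local_demo():
    counter = 100           # this assignment creates a new local variable
    print(counter)          # 100; the global counter is untouched

increment()
print(counter)              # 1
local_demo()                # 100
print(counter)              # still 1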
● The file name is the module name with the suffix .py appended.
● Within a module, the module's name (as a string) is available as the value of the global variable __name__.
● The imported module names are placed in the importing module’s global
symbol table.
Eg : import fibo
fibo.fib(10)
Importing Module
● There is a variant of the import statement that imports names from a
module directly into the importing module’s symbol table.
For example:
from fibo import fib
fib(500)
● These functions are divided into some categories like number-theoretic and representation functions, power and logarithmic functions, trigonometric functions, angular conversion, hyperbolic functions, and special functions.
● Constants
Click here
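A quick sketch of importing the standard math module and using functions from a few of the categories listed above:

import math

print(math.sqrt(16))          # power function        -> 4.0
print(math.log(100, 10))      # logarithmic function  -> 2.0
print(math.sin(math.pi / 2))  # trigonometric function-> 1.0
print(math.degrees(math.pi))  # angular conversion    -> 180.0
print(math.pi, math.e)        # constants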
Packages
● Packages are a way of structuring Python’s module namespace by using
“dotted module names”.
Click here
Input-Output
Reading from keyboard
● To read data from the keyboard, input() is used.
Writing to screen
● print(value, ..., sep=' ', end='\n') — sep: string inserted between values, default a space; end: string appended after the last value, default a newline.
Files and Exceptions Handling
Opening and Closing file
● To open a file, the built-in open() function is used.
● Syntax : open(fileName, mode)
● mode: 'r' (only for reading), 'w' (only for writing), 'a' (for append), 'r+' (for read and write).
● Normally, files are opened in text mode, which means you read and write strings from and to the file, encoded in a specific encoding.
● If the end of the file has been reached, f.read() will return an empty string
(’ ’).
● f.readline() reads a single line from the file; a newline character (\n) is left
at the end of the string, and is only omitted on the last line of the file if the
file doesn’t end in a newline.
● For reading lines from a file, you can loop over the file object.
Reading and writing files
● f.write(string) : writes the contents of string to the file, returning the
number of characters written.
● f.tell() : It returns an integer giving the file object’s current position in the
file represented as number of bytes from the beginning of the file when in
binary mode and an opaque number when in text mode.
Click here
Click here
Click here
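A self-contained sketch of writing, reading and closing a file; the file name is illustrative, and the with statement closes the file automatically:

with open("demo.txt", "w") as f:      # 'w' mode creates/overwrites the file
    n = f.write("first line\nsecond line\n")
    print(n)                          # number of characters written

with open("demo.txt", "r") as f:      # 'r' mode for reading
    print(f.readline(), end="")       # reads one line; the newline is kept
    print(f.tell())                   # current position in the file
    for line in f:                    # loop over the remaining lines
        print(line, end="")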
Exception Handling
Exception
Exception handling
Click here
Click here
Try …..finally clause
● So far, the try statement has always been paired with except clauses, but there is another way to use it as well: with a finally clause.
Click Here
User defined Exception
● Python also allows you to create your own exceptions by deriving classes
from the standard built-in exceptions.
class MyNewError(Exception):
pass
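A compact sketch of try / except / else / finally together with raising a user-defined exception (the class is repeated from the slide above so the snippet runs on its own):

def divide(a, b):
    try:
        result = a / b
    except ZeroDivisionError:
        print("cannot divide by zero")
    else:
        print("result =", result)
    finally:
        print("this always runs")      # cleanup code goes here

divide(10, 2)
divide(10, 0)

class MyNewError(Exception):
    pass

try:
    raise MyNewError("something custom went wrong")
except MyNewError as err:
    print("caught:", err)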
● Attributes
● Inheritance
● Overloading
● Overriding
Class And Object
● Python classes provide all the standard features of Object Oriented
Programming: the class inheritance mechanism allows multiple base classes,
a derived class can override any methods of its base class or classes, and a
method can call the method of a base class with the same name.
class ClassName:
    statement 1
    statement 2
    ...
Class And Object
● The statements inside a class definition will usually be function definitions,
but other statements are also allowed.
● In particular, function definitions bind the name of the new function in the class's namespace.
Member methods in class
● The class_suite consists of all the component statements defining class
members, data attributes and functions.
● The class attributes are data members (class variables and instance
variables) and methods, accessed via dot notation.
● Eg. displayDetails()
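A minimal class sketch built around the displayDetails() example named above; the attribute names are illustrative:

class Student:
    school = "TOPS"                    # class variable, shared by all instances

    def __init__(self, name, marks):
        self.name = name               # instance variables
        self.marks = marks

    def displayDetails(self):          # member method, accessed via dot notation
        print(self.name, self.marks, Student.school)

s1 = Student("Asha", 82)               # instantiation: the class is used like a function
s1.displayDetails()                    # attribute/method reference with obj.name syntax
print(s1.name)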
Object
● Class objects support two kinds of operations: attribute references and
instantiation.
● Attribute references use the standard syntax used for all attribute
references in Python: obj.name
● Valid attribute names are all the names that were in the class’s namespace
when the class object was created..
Object
● Class objects support two kinds of operations: attribute references and
instantiation.
● Just pretend that the class object is a parameterless function that returns a
new instance of the class.
● A class can inherit attributes and behaviour methods from another class,
called the superclass.
● A class which inherits from a superclass is called a subclass, also called heir
class or child class.
● Method overloading
● Overloading is the ability to define the same method, with the same
name but with a different number of arguments and types.
● It's the ability of one function to perform different tasks, depending on
the number of parameters or the types of the parameters.
● Python operators work for built-in classes.
● But same operator behaves differently with different types.
Overloading
● For example, the + operator will perform arithmetic addition on two numbers, merge two lists, and concatenate two strings.
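A hedged sketch (the class names are invented) showing inheritance, method overriding, and overloading the + operator with __add__:

class Shape:
    def area(self):
        return 0

class Square(Shape):                    # Square inherits from the superclass Shape
    def __init__(self, side):
        self.side = side

    def area(self):                     # overriding the base-class method
        return self.side * self.side

    def __add__(self, other):           # operator overloading: defines Square + Square
        return self.area() + other.area()

a, b = Square(3), Square(4)
print(a.area())               # 9   -> overridden method
print(a + b)                  # 25  -> + now means "sum of areas" for Square objects
print(isinstance(a, Shape))   # True: a subclass instance is also a Shape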
➢ MySQL
➢ Microsoft Access
➢ Oracle
➢ PostgreSQL
➢ SQLite
➢ Mongo DB
➢ IBM DB2, etc.
What is the need of DBMS?
Database systems are basically developed for large amount of data. When
dealing with huge amount of data, there are two things that require
optimization: Storage of data and retrieval of data.
Storage:
● According to the principles of database systems, the data is stored in such a way that it occupies a lot less space, because the redundant data (duplicate data) has been removed before storage.
Examples of entities:
Attribute
Attributes are the properties which define the entity type. For example, Roll_No, Name, DOB, Age, Address, Mobile_No are the attributes which define the entity type Student. In an ER diagram, an attribute is represented by an oval.
1.Key Attribute:
● The attribute which uniquely identifies each entity in the entity set is
called key attribute.
● For example, Roll_No will be unique for each student. In an ER diagram, the key attribute is represented by an oval with underlying lines.
3. Multivalued Attribute –
An attribute that can hold more than one value for a given entity. For example, Phone_No (can be more than one for a given student). In an ER diagram, a multivalued attribute is represented by a double oval.
4. Derived Attribute –
An attribute which can be derived from other attributes of the entity type is
known as derived attribute. e.g.; Age (can be derived from DOB). In ER
diagram, derived attribute is represented by dashed oval.
The complete entity type Student with its
attributes can be represented as:
Relationship Type and Relationship Set:
A relationship type represents the association between entity types. For
example, ‘Enrolled in’ is a relationship type that exists between entity type
Student and Course. In ER diagram, relationship type is represented by a
diamond and connecting the entities with lines.
1. Unary Relationship –
When there is only ONE entity set participating in a relation, the relationship is called a unary relationship. For example, one person is married to only one person.
2.Binary Relationship–
When there are TWO entities set participating in a relation, the relationship
is called as binary relationship. For example, Student is enrolled in Course.
3. N-ary Relationship –
When there are n entity sets participating in a relation, the relationship is called an n-ary relationship.
Cardinality:
One to one – When each entity in each entity set can take part only once in the relationship, the cardinality is one to one. Let us assume that a male can marry one female and a female can marry one male, so the relationship will be one to one.
Many to one – When entities in one entity set can take part only once in the relationship set and entities in the other entity set can take part more than once, the cardinality is many to one.
Many to many – When entities in all entity sets can take part more than once in the relationship, the cardinality is many to many. Let us assume that a student can take more than one course and one course can be taken by many students, so the relationship will be many to many.
Participation Constraint:Participation Constraint is applied on the entity participating in the relationship
set.
Total Participation –Each entity in the entity set must participate in the relationship. If each student must
enroll in a course, the participation of student will be total. Total participation is shown by double line in ER
diagram.
Partial Participation –The entity in the entity set may or may NOT participate in the relationship. If some
courses are not enrolled by any of the student, the participation of course will be partial. The diagram
depicts the ‘Enrolled in’ relationship set with Student Entity set having total participation and Course Entity
set having partial participation
Weak Entity Type and Identifying Relationship:
● An entity type has a key attribute which uniquely identifies each entity in the entity set.
● But there exists some entity type for which key attribute can’t be defined. These are called Weak
Entity type.
● For example, a company may store the information of dependants (parents, children, spouse) of an employee. But the dependants don't have existence without the employee. So Dependant will be a weak entity type and Employee will be the identifying entity type for Dependant.
Relational Algebra
Relational Algebra is a procedural query language, which takes relations as input and generates a relation as output. Relational algebra mainly provides the theoretical foundation for relational databases and SQL.
Unary Relational Operations / Binary Relational Operations
● UNION (∪)
● INTERSECTION (∩)
● DIFFERENCE (−)
● CARTESIAN PRODUCT (×)
What is SQL?
● SQL is Structured Query Language, which is a computer language for storing, manipulating and
retrieving data stored in relational database.
● SQL is a language of database, it includes database creation, deletion, fetching rows and modifying
rows etc.
● SQL is the standard language for Relational Database Systems. All relational database management systems like MySQL, MS Access, Oracle, Sybase, Informix, PostgreSQL and SQL Server use SQL as the standard database language.
● Also, they are using different dialects, such as:
● MS SQL Server using T-SQL, ANSI SQL
● Oracle using PL/SQL
● MS Access version of SQL is called JET SQL (native format) etc
Objectives
“The large majority of today's business applications revolve around relational databases and the SQL
programming language (Structured Query Language). Few businesses could function without these
technologies…”
Why SQL?
● Allows users to access data in relational database management systems.
● Allows users to define the data in database and manipulate that data.
● Allows to embed within other languages using SQL modules, libraries & pre-compilers.
● A primary key is a column of table which uniquely identifies each tuple (row) in that table.
● Primary key enforces integrity constraints to the table.
● Only one primary key is allowed to use in a table.
● The primary key does not accept duplicate or NULL values.
● The primary key value in a table changes very rarely so it is chosen with care where the changes can
occur in a seldom manner.
● A primary key of one table can be referenced by foreign key of another table.
Keys
Unique Key:
● A unique key constraint also identifies an individual tuple (row) uniquely in a relation or table.
● A table can have more than one unique key unlike primary key.
● Unique key constraints can accept only one NULL value for column.
● Unique constraints are also referenced by the foreign key of another table.
● It can be used when someone wants to enforce unique constraints on a column and a group of
columns which is not a primary key.
Keys
Foreign Key:
● When, "one" table's primary key field is added to a related "many" table in order to create the
common field which relates the two tables, it is called a foreign key in the "many" table.
● For example, the salary of "Jhon" is stored in the "Salary" table, but his employee info is stored in the "Employee" table; his "employee id" is stored with each salary record. The relation is established via the foreign key column “Employee_ID_Ref”, which refers to the “Employee_ID” field in the Employee table.
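A runnable sketch using Python's built-in sqlite3 module; the table and column names follow the Employee/Salary example above, but the data is invented:

import sqlite3

con = sqlite3.connect(":memory:")          # throwaway in-memory database
con.execute("PRAGMA foreign_keys = ON")
con.execute("""CREATE TABLE Employee (
    Employee_ID INTEGER PRIMARY KEY,       -- primary key: unique, not NULL
    Name        TEXT)""")
con.execute("""CREATE TABLE Salary (
    Salary_ID       INTEGER PRIMARY KEY,
    Employee_ID_Ref INTEGER,               -- foreign key referencing Employee
    Amount          REAL,
    FOREIGN KEY (Employee_ID_Ref) REFERENCES Employee(Employee_ID))""")
con.execute("INSERT INTO Employee VALUES (1, 'Jhon')")
con.execute("INSERT INTO Salary VALUES (1, 1, 50000)")
for row in con.execute("SELECT Name, Amount FROM Salary JOIN Employee ON Employee_ID = Employee_ID_Ref"):
    print(row)                             # ('Jhon', 50000.0)
con.close()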
Database Normalization ...
● Normalization is the process of minimizing redundancy (duplicity) from a relation or set of relations.
● Redundancy in relation may cause insertion, deletion and updation anomalies. So, it helps to
minimize the redundancy in relations.
To convert the above relation to 2NF, we need to split the table into two tables such as:
Table 1: STUD_NO, COURSE_NO
Table 2: COURSE_NO, COURSE_FEE
Third Normal Form
● A relation is in third normal form if there is no transitive dependency for non-prime attributes and it is in second normal form. A relation is in 3NF if at least one of the following conditions holds in every non-trivial functional dependency X -> Y:
○ X is a super key.
○ Y is a prime attribute (each element of Y is part of some candidate key).
● Transitive dependency –If A->B and B->C are two FDs then A->C is called transitive dependency.
In the relation STUDENT given in Table 4, FD set: {STUD_NO -> STUD_NAME, STUD_NO -> STUD_STATE, STUD_STATE -> STUD_COUNTRY, STUD_NO -> STUD_AGE}; Candidate Key: {STUD_NO}.
For this relation, STUD_NO -> STUD_STATE and STUD_STATE -> STUD_COUNTRY are true, so STUD_COUNTRY is transitively dependent on STUD_NO. This violates the third normal form. To convert it to third normal form, we decompose the relation STUDENT (STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_COUNTRY, STUD_AGE) as:
STUDENT (STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_AGE)
STATE_COUNTRY (STATE, COUNTRY)
Boyce Codd normal form (BCNF)
● It is an advanced version of 3NF, which is why it is also referred to as 3.5NF. BCNF is stricter than 3NF.
● A table complies with BCNF if it is in 3NF and for every functional dependency X->Y, X should be the super
key of the table.
Example: Suppose there is a company wherein employees work in more than one department. They store the
data like this:
● Functional dependencies :
emp_id-> emp_nationality
emp_dept->{dept_type, dept_no_of_emp}
● To make the table comply with BCNF we can break the table in three tables like this:
SQL Process
● When you are executing an SQL command for any RDBMS, the system determines the best way to
carry out your request and SQL engine figures out how to interpret the task.
JOIN
● A SQL Join statement is used to combine data or rows from two or more tables based on a common field between them.
● RIGHT JOIN: returns all rows from the right table, even if there are no matches in the left table.
● FULL JOIN: returns rows when there is a match in one of the tables.
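A small sqlite3 sketch contrasting an inner join with a left join; the Students/Marks tables are invented for illustration:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Marks    (student_id INTEGER, marks INTEGER);
INSERT INTO Students VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO Marks    VALUES (1, 82);
""")
# INNER JOIN: only rows with a match in both tables
print(con.execute("""SELECT name, marks FROM Students
                     JOIN Marks ON Students.id = Marks.student_id""").fetchall())
# LEFT JOIN: every row from the left table, NULL (None) where there is no match
print(con.execute("""SELECT name, marks FROM Students
                     LEFT JOIN Marks ON Students.id = Marks.student_id""").fetchall())
con.close()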
1. Aggregate Function
2. Scalar Function
Aggregate Function
These functions operate on the values of a column and return a single value.
● Example:
○ SELECT AVG(AGE) AS AvgAge FROM Students;
● Example:
○ SELECT COUNT(*) AS NumStudents FROM Students;
● Example:
○ SELECT FIRST(MARKS) AS MarksFirst FROM Students;
● Example:
○ SELECT LAST(MARKS) AS MarksLast FROM Students;
Student Table
● Output:
5. MAX()
● Syntax:
○ SELECT MAX(column_name) FROM table_name;
● Example:
○ SELECT MAX(MARKS) AS MaxMarks FROM Students;
● Example:
○ SELECT MIN(MARKS) AS MinMarks FROM Students;
MinMarks
● Output: 50 Student Table
7. SUM()
● Syntax:
○ SELECT SUM(column_name) FROM table_name;
● Example:
○ SELECT SUM(MARKS) AS TotalMarks FROM Students;
● Example:
○ SELECT UCASE(NAME) FROM Students;
● Example:
○ SELECT LCASE(NAME) FROM Students;
● Example:
○ SELECT MID(NAME,1,4) FROM Students;
● Example:
○ SELECT LENGTH(NAME) FROM Students;
● Example:
○ SELECT ROUND(MARKS,0) FROM Students;
Student Table
● Output:
6. NOW()
● Syntax:
○ SELECT NOW() FROM table_name;
● Example:
○ SELECT NAME, NOW() AS DateTime FROM Students
● Example:
○ SELECT NAME, FORMAT(Now(),'YYYY-MM-DD') AS Date FROM Students;
● Output:
Student Table
PROCEDURE
● A stored procedure is a prepared SQL code that you can save, so the code can be reused
over and over again.
● So if you have an SQL query that you write over and over again, save it as a stored
procedure, and then just call it to execute it.
● You can also pass parameters to a stored procedure, so that the stored procedure can act
based on the parameter value(s) that is passed.
TRIGGER
● Syntax:
create trigger [trigger_name]
[before | after]
{insert | update | delete}
on [table_name]
[for each row]
[trigger_body]
Explanation of syntax:
create trigger [trigger_name]:Creates or replaces an existing trigger with the trigger_name.
on [table_name]:This specifies the name of the table associated with the trigger.
[for each row]: This specifies a row-level trigger, i.e., the trigger will be executed for each row being
affected.
● After Trigger:
○ https://github.com/TopsCode/Data_AnalyticsWithPython_ML_AI/blob/master/SQL/Trigger/
afterTrigger
Transaction
● A transaction is a logical unit of work of database processing that includes one or more database
access operations.
● A transaction can be defined as an action or series of actions that is carried out by a single user or
application program to perform operations for accessing the contents of the database.
● The operations can include retrieval, (Read), insertion (Write), deletion and modification.
● Each transaction begins with a specific task and ends when all the tasks in the group successfully
complete. If any of the tasks fail, the transaction fails. Therefore, a transaction has only two
results:success or failure.
● In order to maintain consistency in a database, before and after transaction, certain properties are
followed. These are called ACID properties(Atomicity, Consistency, Isolation, Durability) .
ACID PROPERTY
3. Isolation:
● In a database system where more than one transaction are being executed simultaneously and in
parallel, the property of isolation states that all the transactions will be carried out and executed as if
it is the only transaction in the system.
● No transaction will affect the existence of any other transaction.
● For example, in an application that transfers funds from one account to another, the isolation
property ensures that another transaction sees the transferred funds in one account or the other, but
not in both, nor in neither.
Transaction Control
The following commands are used to control transactions.
1. COMMIT
● The COMMIT command saves all the transactions to the database since the last COMMIT or ROLLBACK command.
○ COMMIT;
2. ROLLBACK
● The ROLLBACK command is the transactional command used to undo transactions that have not
already been saved to the database.
● This command can only be used to undo transactions since the last COMMIT or ROLLBACK command
was issued.
○ ROLLBACK;
3. SAVEPOINT
● A SAVEPOINT is a point in a transaction when you can roll the transaction back to a certain point
without rolling back the entire transaction.
○ SAVEPOINT SAVEPOINT_NAME;
● This command serves only in the creation of a SAVEPOINT among all the transactional statements.
The ROLLBACK command is used to undo a group of transactions.
○ ROLLBACK TO SAVEPOINT_NAME;
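A sketch of COMMIT and ROLLBACK driven from Python, using sqlite3 (the Accounts table and values are invented):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Accounts (name TEXT, balance REAL)")
con.execute("INSERT INTO Accounts VALUES ('A', 1000), ('B', 1000)")
con.commit()                                   # COMMIT: make the inserts permanent

try:
    con.execute("UPDATE Accounts SET balance = balance - 200 WHERE name = 'A'")
    # simulate a failure before the matching credit to account B is applied
    raise RuntimeError("simulated failure")
except Exception:
    con.rollback()                             # ROLLBACK: undo the uncommitted debit

print(con.execute("SELECT * FROM Accounts").fetchall())   # balances unchanged
con.close()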
Cursor
● It is a temporary work area in system memory used while a statement is executed.
● A Cursor in SQL is an arrangement of rows together with a pointer that identifies the current row.
● It is a database object used to retrieve information from a result set one row at a time.
● It is helpful when we need to process the records of a table one at a time, in other words one row at any given moment. The set of rows the cursor holds is known as the active set.
Main components of Cursors
Each cursor contains the followings 5 parts,
● Declare Cursor:In this part we declare variables and return a set of values.
○ DECLARE cursor_nameCURSOR FOR SELECT_statement;
● Close: This is an exit part of the cursor and used to close a cursor.
○ CLOSE cursor_name;
Syntax:
DECLARE variables;
records;
create a cursor;
BEGIN
OPEN cursor;
FETCH cursor;
process the records;
CLOSE cursor;
END;
Database Backup and Recovery
● A database backup is stored data, that is, a copy of the data.
● If the original data is lost, it can be reconstructed using the backup.
● It is a copy of files storing database information to some other location, such as disk, some offline
storage like magnetic tape.
1. Physical Backup
● Physical backups are the foundation of the recovery mechanism in the database.
● A physical backup provides the minute details about the transactions and modifications to the database.
2. Logical Backup
● Logical Backup contains logical data which is extracted from a database.
● It includes backup of logical data like views, procedures, functions, tables, etc.
● It is a useful supplement to physical backups in many circumstances but not a sufficient protection
against data loss without physical backups, because logical backup provides only structural
information.
Importance of Backups
● Planning and testing backup helps against failure of media, operating system, software and any other
kind of failures that cause a serious data crash.
● A physical backup extracts data from physical storage (usually from disk to tape). An operating system backup is an example of a physical backup.
● A logical backup extracts data using SQL from the database and stores it in a binary file.
● Logical backup is used to restore the database objects into the database. So the logical backup utilities
allow DBA (Database Administrator) to back up and recover selected objects within the database.
Causes of Failure
1.System Crash
● System crash occurs when there is a hardware or software failure or external factors like a power
failure.
● The data in secondary memory is not affected when the system crashes, because the database maintains its integrity; checkpoints prevent the loss of data from secondary memory.
2. Transaction Failure
● A transaction failure affects only a few tables or processes, because of logical errors in the code.
● This failure occurs when there are system errors like deadlock or unavailability of system resources to
execute the transaction.
3. Network Failure
● A network failure occurs when the communication network connecting a client-server configuration or a distributed database system fails.
4. Disk Failure
● Disk Failure occurs when there are issues with hard disks like formation of bad sectors, disk head
crash, unavailability of disk etc.
5. Media Failure
● Media failure is the most dangerous failure because, it takes more time to recover than any other kind
of failures.
● A disk controller or disk head crash is a typical example of media failure.
Recovery
● Recovery is the process of restoring a database to the correct state in the event of a failure.
● It ensures that the database is reliable and remains in consistent state in case of a failure.
1. Rolling Back applies rollback segments to the datafiles. It is stored in transaction tables.
Excel
Modules:
1. Introductions to Excel
2. Excel Functions
3. Excel Charts
Example : click here for formula and function practical
Example : click here for count practical
Example : click here for countif practical
Example : click here for countifs practical
Example : click here for sum practical
Example : click here for sumif practical
Example : click here for average practical
Example : click here for averageif practical
Example : click here for averageifs practical
Example : click here for iferror practical
Example : click here for vlookup practical
Example : click here for hlookup practical
Practice
Descriptive statistics deals with the processing of data without attempting to draw any inferences from it.
The data are presented in the form of tables and graphs. The characteristics of the data are described in
simple terms. Events that are dealt with include everyday happenings such as accidents, prices of goods,
business, incomes, epidemics, sports data, population data.
Inferential statistics is a scientific discipline that uses mathematical tools to make forecasts and projections
by analyzing the given data. This is of use to people employed in such fields as engineering, economics,
biology, the social sciences, business, agriculture and communications.
Introduction to Population and Sample
A population often consists of a large group of specifically defined elements. For example, the population of a
specific country means all the people living within the boundaries of that country.
Usually, it is not possible or practical to measure data for every element of the population under study. We
randomly select a small group of elements from the population and call it a sample. Inferences about the
population are then made on the basis of several samples.
Example:
● A company is thinking about buying 50,000 electric batteries from a manufacturer. It will buy the batteries if no more than 1% of the batteries are defective. It is not possible to test each battery in the population of 50,000 batteries since it takes time and costs money. Instead, it will select a few samples of 500 batteries each and test them for defects. The results of these tests will then be used to estimate the percentage of defective batteries in the population.
Quantitative Data and Qualitative Data
Data is quantitative if the observations or measurements made on a given variable of a sample or
population have numerical values.
Data is qualitative if words, groups and categories represent the observations or measurements.
Quantitative data is discrete if the corresponding data values take discrete values and it is continuous if the
data values take continuous values.
Parameter vs statistic – both are similar, yet different measures. The first one
describes the whole population, while the second describes a part of the
population.
What is Parameter?
It is a measure of a characteristic of an entire population (a mass of all units under consideration that share
common characteristics) based on all the elements within that population. For example, all people living in
one city, all-male teenagers in the world, all elements in a shopping trolley, or all students in a classroom.
If you ask all employees in a factory what kind of lunch they prefer and half of them say pasta, you get a
parameter here – 50% of the employees like pasta for lunch. On the other hand, it’s impossible to count
how many men in the whole world like pasta for lunch, since you can’t ask all of them about their choice. In
that case, you’d probably survey just a representative sample (a portion) of them and extrapolate the
answer to the entire population of men. This brings us to the other measure called statistic.
It’s a measure of characteristic saying something about a fraction (a sample) of the population under study.
A sample in statistics is a part or portion of a population. The goal is to estimate a certain population
parameter. You can draw multiple samples from a given population, and the statistic (the result) acquired
from different samples will vary, depending on the samples. So, using data about a sample or portion allows
you to estimate the characteristics of an entire population.
Parameter vs Statistics
Can you tell the difference between statistics and parameters now?
● A parameter is a fixed measure describing the whole population (population being a group of people, things,
animals, phenomena that share common characteristics.) A statistic is a characteristic of a sample, a portion
of the target population.
● A parameter is fixed, unknown numerical value, while the statistic is a known number and a variable which
depends on the portion of the population.
● Sample statistic and population parameters have different statistical notations:
Population parameters: proportion P; mean µ (Greek letter mu); variance σ²; population size N; standard deviation σ (Greek letter sigma); standard error of the mean σx̄; coefficient of variation σ/µ; standardized variate (z) = (X−µ)/σ; standard error of proportion σp.
Sample statistics: mean x̄ (x-bar); proportion p̂ (p-hat); standard deviation s; variance s²; sample size n; standard error of the mean sx̄; standard error of proportion sp; coefficient of variation s/x̄; standardized variate (z) = (x−x̄)/s.
Example of Parameters
● 20% of U.S. senators voted for a specific measure. Since there are only 100 senators, you can count how each of them voted.
Example of Statistic
● 50% of people living in the U.S. agree with the latest health care proposal. Researchers can't ask hundreds of millions of people if they agree, so they take a sample, or part of the population, and extrapolate to the rest.
What Are The Differences Between Population Parameters and Sample Statistics?
The average weight of adult men in the U.S. is a parameter with an exact value – but, we don’t know it.
Standard deviation and population mean are two common parameters.
A statistic is a characteristic of a sample, a portion of the population. You get sample statistics when you collect a sample and calculate the standard deviation and the mean. You can use sample statistics to make certain conclusions about an entire population thanks to inferential statistics. But you need particular sampling techniques to draw valid conclusions. Using these techniques ensures that samples deliver unbiased estimates, which are correct on average. Biased estimates, by contrast, are systematically too low or too high, so you want to avoid them.
To estimate population parameters in inferential statistics, you use sample statistics. For instance, if you
collect a random sample of female teenagers in the U.S. and measure their weights, you can calculate the
sample mean. You can use the sample mean as an unbiased estimate of the population mean.
Introduction of Descriptive Statistics
Descriptive statistics is the term given to the analysis of data that helps describe, show or summarize data in
a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics do not,
however, allow us to make conclusions beyond the data we have analysed or reach conclusions regarding
any hypotheses we might have made. They are simply a way to describe our data.
Descriptive statistics are very important because if we simply presented our raw data it would be hard to
visualize what the data was showing, especially if there was a lot of it. Descriptive statistics therefore
enables us to present the data in a more meaningful way, which allows simpler interpretation of the data.
For example, if we had the results of 100 pieces of students' coursework, we may be interested in the overall
performance of those students. We would also be interested in the distribution or spread of the marks.
Descriptive statistics allow us to do this. How to properly describe data through statistics and graphs is an
important topic and discussed in other Laerd Statistics guides. Typically, there are two general types of
statistic that are used to describe data
Introduction of Descriptive Statistics
● Measures of central tendency: these are ways of describing the central position of a frequency
distribution for a group of data. In this case, the frequency distribution is simply the distribution and
pattern of marks scored by the 100 students from the lowest to the highest. We can describe this
central position using a number of statistics, including the mode, median, and mean.
● Measures of spread: these are ways of summarizing a group of data by describing how spread out
the scores are. For example, the mean score of our 100 students may be 65 out of 100. However, not
all students will have scored 65 marks. Rather, their scores will be spread out. Some will be lower
and others higher. Measures of spread help us to summarize how spread out these scores are. To
describe this spread, a number of statistics are available to us, including the range, quartiles,
absolute deviation, variance and standard deviation.
● When we use descriptive statistics it is useful to summarize our group of data using a combination of
tabulated description (i.e., tables), graphical description (i.e., graphs and charts) and statistical
commentary (i.e., a discussion of the results).
Measures of Central Tendency
A measure of central tendency (also referred to as measures of centre or central location) is a summary
measure that attempts to describe a whole set of data with a single value that represents the middle or
centre of its distribution.
Each of these measures describes a different indication of the typical or central value in the distribution.
Mode
The mode is the most frequently occurring value in a distribution.
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data.
[Frequency table: 54 occurs three times; 55 and 56 once each; 57, 58 and 60 twice each; total frequency 11.]
It is also possible for there to be more than one mode for the same distribution of data, (bi-modal, or multi-
modal). The presence of more than one mode can limit the ability of the mode in describing the centre or
typical value of the distribution because a single value to describe the centre cannot be identified.
In some cases, particularly where the data are continuous, the distribution may have no mode at all (i.e. if
all values are different).
In cases such as these, it may be better to consider using the median or mean, or group the data in to
appropriate intervals, and find the modal class.
Median
The median is the middle value in distribution when the values are arranged in ascending or descending
order.
The median divides the distribution in half (there are 50% of observations on either side of the median
value). In a distribution with an odd number of observations, the median value is the middle value.
Median
Looking at the retirement age distribution (which has 11 observations), the median is the middle value,
which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the mean of the two
middle values. In the following distribution, the two middle values are 56 and 57, therefore the median
equals 56.5 years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Advantage of the median:
The median is less affected by outliers and skewed data than the mean, and is usually the preferred
measure of central tendency when the distribution is not symmetrical.
Mean
The mean is the sum of the value of each observation divided by the number of observations. For the retirement age data
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
the mean is 623/11 = 56.6 years.
As the mean includes every value in the distribution, the mean is influenced by outliers and skewed distributions.
What else do I need to know about the mean?
The population mean is indicated by the Greek symbol µ (pronounced ‘mu’). When the mean is
calculated on a distribution from a sample it is indicated by the symbol x̅ (pronounced X-bar).
Measures of Spread
Measures of spread describe how similar or varied the set of observed values are for a particular variable
(data item). Measures of spread include the range, quartiles and the interquartile range, variance and
standard deviation.
The lower quartile (Q1) is the point between the lowest 25% of values and the highest 75% of values. It is also
called the 25th percentile.
The second quartile (Q2) is the middle of the data set. It is also called the 50th percentile, or the median.
The upper quartile (Q3) is the point between the lowest 75% and highest 25% of values. It is also called the 75th
percentile.
Example
Interquartile Range(IQR)
The interquartile range (IQR) is the difference between the upper (Q3) and lower (Q1) quartiles,
and describes the middle 50% of values when ordered from lowest to highest. The IQR is often
seen as a better measure of spread than the range as it is not affected by outliers.
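A short sketch computing the measures described above for the retirement-age data, using only the standard library's statistics module (numpy or pandas would work equally well):

import statistics as st

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(st.mode(ages))                  # 54   -> most frequently occurring value
print(st.median(ages))                # 57   -> middle value of the ordered data
print(round(st.mean(ages), 1))        # 56.6
q1, q2, q3 = st.quantiles(ages, n=4)  # lower quartile, median, upper quartile
print(q1, q3, q3 - q1)                # interquartile range = Q3 - Q1
print(round(st.stdev(ages), 2))       # sample standard deviation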
Introduction of Probability
How likely something is to happen.
Many events can't be predicted with total certainty. The best we can say is how likely they are to happen,
using the idea of probability.
Tossing a Coin
When a coin is tossed, there are two possible outcomes:
● Heads(H) or
● Tails(T)
● There are 5 marbles in a bag: 4 are blue, and 1 is red. What is the probability that a blue
marble gets picked?
Probability Tree
The tree diagram helps to organize and visualize the different possible outcomes. Branches and ends of the
tree are two main positions. Probability of each branch is written on the branch, whereas the ends are
containing the final outcome. Tree diagram is used to figure out when to multiply and when to add. You can
see below a tree diagram for the coin:
Types of Probability
There are two major types of probabilities:
● Theoretical Probability
● Experimental Probability
Theoretical Probability
Theoretical Probability is what is expected to happen based on
mathematics.
Example:
A coin is tossed; the theoretical probability of getting a head is 1/2 = 0.5.
Experimental Probability
Experimental Probability is found by repeating an experiment and observing
the outcomes.
Example:
A coin is tossed 10 times: a head is recorded 7 times and a tail 3 times, so the experimental probability of a head is 7/10 = 0.7.
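A tiny simulation sketch contrasting theoretical and experimental probability (and covering the marble example above); the number of trials is arbitrary:

import random

# theoretical probability of a head, and of drawing a blue marble from {4 blue, 1 red}
print(1 / 2)      # 0.5
print(4 / 5)      # 0.8

# experimental probability: toss a fair coin 10 times and count the heads
tosses = [random.choice(["H", "T"]) for _ in range(10)]
print(tosses.count("H") / 10)    # e.g. 0.7 if 7 heads were recorded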
Probability Distribution
A probability distribution is a statistical function that describes all the possible values and likelihoods
that a random variable can take within a given range. This range will be bounded between the
minimum and maximum possible values, but precisely where the possible value is likely to be plotted
on the probability distribution depends on a number of factors. These factors include the
distribution's mean (average), standard deviation, skewness.
Types of Distributions:
1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal Distribution
5. Poisson Distribution
6. Exponential Distribution
Bernoulli Distribution
All you cricket junkies out there! At the beginning of any cricket match, how do you decide who is going to bat or bowl? A toss! It all depends on whether you win or lose the toss, right? Let's say if the toss results in a head, you win. Else, you lose. There's no midway.
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a
single trial. So the random variable X which has a Bernoulli distribution can take value 1 with the
probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.
the occurrence of a head denotes success, and the occurrence of a tail denotes failure.
Probability of getting a head = 0.5 = Probability of getting a tail since there are only two possible
outcomes.
Bernoulli Distribution
The probability mass function is given by: P(X = x) = p^x (1 − p)^(1−x), where x ∈ {0, 1}.
It can also be written as P(X = 1) = p and P(X = 0) = 1 − p.
The probabilities of success and failure need not be equally likely, like the result of a fight between
me and Undertaker. He is pretty much certain to win. So in this case probability of my success is 0.15
while my failure is 0.85
Bernoulli Distribution
Here, the probability of success(p) is not same as the probability of failure.
So, the chart below shows the Bernoulli Distribution of our fight.
Bernoulli Distribution
Here, the probability of success = 0.15 and probability of failure = 0.85. The expected value is exactly
what it sounds. If I punch you, I may expect you to punch me back. Basically expected value of any
distribution is the mean of the distribution. The expected value of a random variable X from a
Bernoulli distribution is found as follows:
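The expectation referred to above works out to the success probability itself; in LaTeX form (a standard result, not shown on the slide):

E(X) = \sum_{x \in \{0,1\}} x \, P(X = x) = 1 \cdot p + 0 \cdot (1 - p) = p, \qquad \operatorname{Var}(X) = p(1 - p)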
There are only two possible outcomes. Head denoting success and tail denoting failure. Therefore,
probability of getting a head = 0.5 and the probability of failure can be easily computed as: q = 1- p =
0.5.
A distribution where only two outcomes are possible, such as success or failure, gain or loss, win or
lose and where the probability of success and failure is same for all the trials is called a Binomial
Distribution.
The outcomes need not be equally likely. Remember the example of a fight between me and
Undertaker? So, if the probability of success in an experiment is 0.2 then the probability of failure
can be easily computed as q = 1 – 0.2 = 0.8.
Binomial Distribution
Each trial is independent since the outcome of the previous toss doesn’t determine or affect the
outcome of the current toss. An experiment with only two possible outcomes repeated n number of
times is called binomial. The parameters of a binomial distribution are n and p where n is the total
number of trials and p is the probability of success in each trial.
On the basis of the above explanation, the properties of a Binomial Distribution are
For example, you ask people outside a polling station who they voted for until you find someone that voted for the independent candidate in a local election. The geometric distribution would represent the number of people you had to poll before you found someone who voted independent. You would need to get a certain number of failures before you got your first success.
Geometric Distribution
If you had to ask 3 people, then X=3; if you had to ask 4 people, then X=4
and so on. In other words, there would be X-1 failures before you get your
success.
If X=n, it means you succeeded on the nth try and failed for n-1 tries. The
probability of failing on your first try is 1-p. For example, if p = 0.2 then your
probability of success is .2 and your probability of failure is 1 – 0.2 = 0.8.
Independence (i.e. that the outcome of one trial does not affect the next) means that you can multiply the probabilities together. So the probability of failing on your second try is (1-p)(1-p), and the probability of failing on the first n-1 tries is (1-p)^(n-1). If you succeeded on your 4th try, n = 4, n - 1 = 3, so the probability of failing up to that point is (1-p)(1-p)(1-p) = (1-p)^3.
Geometric Distribution
Example:-
If your probability of success is 0.2, what is the probability you meet an independent voter on your
third try?
Inserting 0.2 as p and with X = 3, the probability density function becomes:
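Plugging in the numbers from the example (a worked completion, since the slide's formula is not reproduced here):

P(X = 3) = (1 - p)^{3-1} \cdot p = (0.8)^{2} \times 0.2 = 0.128

So there is about a 12.8% chance of first meeting an independent voter on the third try.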
● There are two possible outcomes for each trial (success or failure).
● The trials are independent.
● The probability of success is the same for each trial.
Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are
equally likely and that is the basis of a uniform distribution. Unlike Bernoulli Distribution, all the n
number of possible outcomes of a uniform distribution are equally likely.
Exponential distribution is widely used for survival analysis. From the expected life of a machine to
the expected life of a human, exponential distribution successfully delivers the result.
Exponential Distribution
A random variable X is said to have an exponential distribution with PDF:
f(x) = λ e^(−λx) for x ≥ 0, and f(x) = 0 otherwise.
For survival analysis, λ is called the failure rate of a device at any time t, given that it has survived up
to t.
P{X ≤ x} = 1 − e^(−λx), corresponds to the area under the density curve to the left of x.
P{X > x} = e^(−λx), corresponds to the area under the density curve to the right of x.
P{x1 < X ≤ x2} = e^(−λx1) − e^(−λx2), corresponds to the area under the density curve between x1 and x2.
Normal Distribution
Normal distribution represents the behavior of most of the situations in the universe (That is why it’s
called a “normal” distribution. I guess!). The large sum of (small) random variables often turns out to
be normally distributed, contributing to its widespread application. Any distribution is known as
Normal distribution if it has the following characteristics:
A normal distribution is highly different from Binomial Distribution. However, if the number of trials
approaches infinity then the shapes will be quite similar.
Normal Distribution
The PDF of a random variable X following a normal distribution is given by:
The mean and variance of a random variable X which is said to be normally distributed is given by:
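The formula referred to above is the standard normal-density expression; in LaTeX form (a standard result, since the slide's equation image is not reproduced here):

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad E(X) = \mu, \quad \operatorname{Var}(X) = \sigma^2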
For example: Suppose we are interested in the exam marks of all the students in India. But it is not
feasible to measure the exam marks of all the students in India. So now we will measure the marks of a
smaller sample of students, for example 1000 students. This sample will now represent the large
population of Indian students. We would consider this sample for our statistical study for studying the
population from which it’s deduced.
Hypothesis testing
We evaluate two mutually exclusive statements about a population using sample data.
Steps:
Descriptive analytics
Descriptive analytics summarizes historical data to answer the question, "What happened?" — for example, taking figures such as net income, dividends, and total capital, and turning those data points into an easy-to-understand percentage that can be used to compare one company's performance to others.
Diagnostic analytics
Diagnostic analytics is a form of advanced analytics that examines
data or content to answer the question, “Why did it happen?”
Predictive analytics
● Prediction, forecasting, etc.
Prescriptive analytics
● Prescriptive analytics makes use of machine learning to help
businesses decide a course of action based on a computer
program’s predictions.
Reinforcement learning: the agent receives a positive reward for a good action and a negative reward for a bad action, given a state in the environment, and it tries to maximize the total reward points.
Machine Learning Regression
Regression Algorithms
● Linear Regression
● Ridge Regression
● Lasso Regression
● Polynomial Regression
Linear Regression
Simple Linear Regression:
y = b0 + b1*x
b0 = constant (intercept)
b1 = coefficient: determines how a unit change in x will make a change in y.
DATA
train - test Split
>>>model=LinearRegression()
>>>model.fit(x_train, y_train)
Mean Square Error:
Measures the average of the squares of the errors (the differences between predicted and actual values).
r2_score=1 means:
● prediction = actual value
Simple Linear Regression
Practical
Practical link : Click_here
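A self-contained sketch of the train/test split, fitting, and the two metrics mentioned above, using scikit-learn on synthetic data (the data itself is invented):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))               # single feature
y = 3.0 * x.ravel() + 5.0 + rng.normal(0, 1, 100)   # y = b1*x + b0 + noise

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

print(model.intercept_, model.coef_)          # estimates of b0 and b1
print(mean_squared_error(y_test, y_pred))     # average of squared errors
print(r2_score(y_test, y_pred))               # 1.0 would mean prediction == actual value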
Gradient Descent
Gradient Descent is an optimization algorithm that finds a local minimum of a loss/error function.
● i - ith Sample
● ŷ - Predicted Value
● y - Actual Value
Gradients
predicted value = mx + b
Learning Rate (alpha)
Learning rate controls how quickly or slowly a model learns a problem.
Updating Parameters
> m = m - alpha*gradient_m
> b = b - alpha*gradient_b
Practical For Gradient
Descent
Practical link : click_here
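A bare-bones gradient-descent sketch for the line y = m*x + b, minimizing mean squared error; the learning rate and iteration count are arbitrary choices:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                      # data generated from a known line

m, b = 0.0, 0.0                        # initial parameters
alpha = 0.05                           # learning rate
n = len(x)

for _ in range(2000):
    y_hat = m * x + b                                 # predicted values
    grad_m = (-2.0 / n) * np.sum(x * (y - y_hat))     # dMSE/dm
    grad_b = (-2.0 / n) * np.sum(y - y_hat)           # dMSE/db
    m = m - alpha * grad_m             # update rule from the slide
    b = b - alpha * grad_b

print(round(m, 3), round(b, 3))        # should approach 2.0 and 1.0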
Multiple Linear Regression:
y = dependent variable
b0 = constant
b1, b2 … bn = coefficients
Necessary Assumption
◦ Linearity
◦ Homoscedasticity
◦ Multivariate Normality
◦ Independence of Error
◦ Lack of Multicollinearity
Practical for Multiple Linear
Regression
Practical link : Click _here
Ridge Regression
Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity; it is also used when the fitted line is too steep (overfitted) and we want to shrink the coefficients using a regularization technique.
Lasso Regression
Lasso Regression does the same job as Ridge Regression, but here it can also be used for feature selection, as its coefficient values can shrink all the way to zero, whereas ridge coefficients move toward zero but never reach it.
This particular type of regression is well suited for models showing high levels of multicollinearity.
In linear regression the output is continuous; in classification the output is a category.
Classification Algorithms
● Logistic Regression
● k-Nearest Neighbors
● Decision Tree Classifier
● Naive bayes Classifier
● SVM Classifier
Logistic Regression
k-Nearest Neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
Now you can simply fit this classifier, just like a normal classifier, using the clf.fit() method.
Practical For KNN Classifier
Practical : click_here
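A short end-to-end sketch of the classifier above on scikit-learn's built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)                 # fit just like any other classifier
print(clf.predict(X_test[:5]))            # predicted class labels
print(clf.score(X_test, y_test))          # mean accuracy on the test set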
Decision Tree
Important Terminology related to Decision
Trees
1. Root Node: It represents the entire population or sample and
this further gets divided into two or more homogeneous sets.
1. Gini Index
2. Entropy
3. Information Gain
Gini Index
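The Gini formula is shown as an image on the slide; its standard form, for reference, is:
\mathrm{Gini} = 1 - \sum_{i=1}^{c} p_i^2
where p_i is the proportion of samples belonging to class i in the node; a Gini of 0 means the node is pure.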
Entropy:
Entropy is a measure of the randomness in the information being
processed. The higher the entropy, the harder it is to draw any
conclusions from that information.
The core algorithm for building decision trees called ID3 by J. R. Quinlan
which employs a top-down, greedy search through the space of possible
branches with no backtracking.
We will take a moment here to give entropy in the case of a binary event (like a coin toss,
where the output can be either of two events, heads or tails) a mathematical form:
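The formula itself appears as an image on the slide; the standard binary entropy is:
E = -p \log_2(p) - q \log_2(q)
where p is the probability of one outcome (e.g. heads) and q = 1 - p is the probability of the other.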
Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is
classified.
Naive Bayes Classifier
The Naive Bayesian classifier is based on Bayes’ theorem with the
independence assumptions between predictors.
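Bayes' theorem, which the classifier applies together with the naive independence assumption, is:
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}
where c is a class and x is the observed feature vector; under the naive assumption, P(x \mid c) factorizes into a product of per-feature likelihoods.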
A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes
the margin between the two classes. The vectors (cases) that define the hyperplane are the support
vectors (the subset of training points that appear in the decision function).
High dimensionality means that the dataset has a large number of features.
Principal Component
The first principal component explains the largest amount of variance in the data.
Classification accuracy: a variance-based PCA framework does not consider the
differentiating characteristics of the classes. Also, the information that distinguishes one
class from another might lie in the low-variance components and may be discarded.
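A short sklearn sketch of PCA; the Iris data and the choice of 2 components are illustrative only:
# PCA sketch: reduce the Iris features (4 columns) down to 2 principal components
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance captured by each component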
Practical For PCA
Supervised:
Click_here
LDA (Linear Discriminant Analysis): a supervised dimensionality-reduction technique that projects the data to maximize class separability.
Clustering
Clustering vs Classification
● Observations (or data points) in a classification task have labels. Each observation is assigned to a class based on those labels.
● Observations (or data points) in clustering do not have labels. We expect the model to find
structures in the dataset so that similar observations can be grouped into clusters.
- K-means Clustering
- Hierarchical Clustering
K-mean Clustering
K-means clustering aims to partition data into k clusters in a way that data points in the same
cluster are similar and data points in the different clusters are farther apart.
There are many methods to measure the distance; Euclidean distance (Minkowski distance with p = 2) is one of the most commonly used.
K-means clustering tries to minimize distances within a cluster and maximize the distance
between different clusters. K-means algorithm is not capable of determining the number of
clusters.
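A minimal sklearn sketch; the number of clusters k must be chosen by the user (3 here), and the synthetic data is purely illustrative:
# K-means sketch on synthetic 2-D data with 3 natural groups
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # cluster index assigned to each observation

print(kmeans.cluster_centers_)       # coordinates of the 3 centroids
print(kmeans.inertia_)               # within-cluster sum of squared distances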
Hierarchical Clustering
Hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the
following two steps: (1) identify the two clusters that are closest together, and (2) merge the two most similar
clusters. This is repeated until all the clusters have been merged together.
Hierarchical clustering typically works by sequentially
merging similar clusters, as shown above. This is known as
agglomerative hierarchical clustering. In theory, it can also
be done by initially grouping all the observations into one
cluster, and then successively splitting these clusters. This is
known as Divisive hierarchical clustering. Divisive
clustering is rarely done in practice.
Agglomerative clustering
Agglomerative clustering is kind of a bottom-up
approach. Each data point is assumed to be a separate cluster
at first. Then the similar clusters are iteratively combined.
If the distance between two clusters is above the threshold, these clusters will not be merged.
The figure above is called a dendrogram, which is a diagram representing this
tree-based approach. In hierarchical clustering, dendrograms are used to
visualize the relationships among clusters.
Math Intuition ( What Is Linkage )
Here’s one way to calculate similarity – Take the distance between the centroids of these
clusters. The points having the least distance are referred to as similar points and we can
merge them. We can refer to this as a distance-based algorithm as well (since we are
calculating the distances between the clusters).
In hierarchical clustering, we have a concept called a proximity matrix. This stores the
distances between each pair of points.
During both the types of hierarchical clustering, the distance between two sub-clusters needs to be computed.
The different types of linkages describe the different approaches to measure the distance between two sub-
clusters of data points.
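A short SciPy sketch that builds the linkage and plots the dendrogram; the 'ward' linkage and the marks data are illustrative assumptions, not the course's exact example:
# Hierarchical clustering sketch: build the linkage matrix and plot a dendrogram
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# illustrative 1-D data, e.g. students' marks
marks = np.array([7, 10, 28, 20, 35]).reshape(-1, 1)

Z = linkage(marks, method='ward')    # 'single', 'complete', 'average' are other linkages

dendrogram(Z)
plt.xlabel('Students')
plt.ylabel('Distance')
plt.show()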
Perform Hierarchical Clustering
Problem: a teacher wants to separate students according to their marks.
Initialization
Creating a Proximity Matrix
√((10 − 7)²) = √9 = 3
Next, we will look at the smallest distance in the proximity matrix and
merge the points with the smallest distance. We then update the proximity
matrix:
Let’s look at the updated clusters and accordingly update the proximity
matrix:
Updated Matrix
We will repeat step 2 until only a
single cluster is left
The greater the length of the vertical lines in the dendrogram, the greater the
distance between those clusters.
Now we can set a threshold distance and draw a horizontal line at that height; the number of
vertical lines it cuts gives the number of clusters.
1. K-Fold
a. Split dataset into k consecutive folds (without shuffling by
default).
2. LOOCV
a. Provides train/test indices to split data in train/test sets. Each
sample is used once as a test set (singleton) while the remaining
samples form the training set.
K-FOLD: class sklearn.model_selection.KFold(n_splits=5)
LOOCV: class sklearn.model_selection.LeaveOneOut
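A compact sketch of both splitters used with cross_val_score; the Iris dataset and the logistic-regression estimator are illustrative choices:
# Cross-validation sketch: K-Fold vs Leave-One-Out
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
loocv_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print("K-Fold mean accuracy:", kfold_scores.mean())
print("LOOCV mean accuracy:", loocv_scores.mean())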
PRACTICAL
class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, cv=None)
param_grid = [
class sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, scoring=None, cv=None)
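A minimal grid-search sketch; the estimator and the parameter grid are illustrative choices, not the course's exact practical:
# Hyperparameter tuning sketch with GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

print(search.best_params_)   # best combination found
print(search.best_score_)    # its cross-validated accuracy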
PRACTICAL
Practical link :click_here
Ensemble Learning
Ensemble learning helps improve machine learning results by combining several models. Ensemble
methods are meta-algorithms that combine several machine learning techniques into one predictive
model in order to decrease variance (bagging) or bias (boosting).
Bagging
■ Random Forest
Boosting
■ AdaBoost
■ XGBoost
Bagging
Boosting
Adaptive boosting or AdaBoost is one of the simplest boosting algorithms. Usually, decision
trees are used for modelling. Multiple sequential models are created, each correcting the errors
from the last model. AdaBoost assigns weights to the observations which are incorrectly
predicted and the subsequent model works to predict these values correctly.
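A brief sklearn sketch of one bagging model and one boosting model; the data and the numbers of estimators are illustrative:
# Ensemble sketch: Random Forest (bagging) and AdaBoost (boosting)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

print("Random Forest:", accuracy_score(y_test, rf.predict(X_test)))
print("AdaBoost:", accuracy_score(y_test, ada.predict(X_test)))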
PRACTICAL
Practical link : click_here
Time - Series Forecasting
DATA:
TIME SERIES FOR APARTMENT PRICE IN VESU
STATIONARITY CHECK
- DICKEY-FULLER TEST
- H0 -> DATA IS NOT STATIONARY
A widely used forecasting model is known as ARIMA, which stands for Autoregressive Integrated Moving Average. ARIMA
models are denoted with the notation ARIMA(p, d, q), where the three parameters are the order of the
autoregressive part (p), the degree of differencing (d), and the order of the moving-average part (q).
1.Log
2.Difference
3.Square root, etc
Note: after transforming, you lose the original scale of the data, so be sure
that you have a way to transform the predictions back to something comparable to the original
values (an inverse transformation).
The ARIMA forecast for a stationary time series is nothing but a linear equation (like
linear regression).
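A small statsmodels sketch covering the Dickey-Fuller check and an ARIMA fit; the series and the (p, d, q) order are purely illustrative:
# Time-series sketch: stationarity test, then an ARIMA fit and short forecast
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

# assumed example series (e.g. monthly apartment prices)
series = pd.Series(np.cumsum(np.random.randn(120)) + 100)

# Dickey-Fuller test: H0 = the series is not stationary
adf_stat, p_value, *_ = adfuller(series)
print("ADF p-value:", p_value)     # p < 0.05 would let us reject H0

model = ARIMA(series, order=(1, 1, 1))    # (p, d, q) chosen only for illustration
fitted = model.fit()
print(fitted.forecast(steps=5))           # next 5 predicted values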
● In this type of Machine Learning, there is an agent which tries to learn what action it
should take for a given state in the environment in order to maximize the cumulative
Reward.
● In short Learning through Experience.
Actions: Actions are the Agent’s methods which allow it to interact and change its
environment, and thus transfer between states. Every action performed by the
Agent yields a reward from the environment. The decision of which action to
choose is made by the policy.
Agent: The learning and acting part of a Reinforcement Learning problem, which
tries to maximize the rewards it is given by the Environment.
Environment: Everything which isn’t the Agent; everything the Agent can
interact with, either directly or indirectly. The environment changes as the Agent
performs actions; every such change is considered a state-transition. Every action
the Agent performs yields a reward received by the Agent.
New choices are explored to maximize rewards while exploiting the already explored choices.
Step 1: At each round n, we consider two numbers for each machine:
1. The number of times each machine has been selected up to round n
2. The sum of rewards collected by each machine up to round n
Step 2: At each round, we compute the average reward and the confidence interval
of the machine i up to n rounds as follows:
Average UCB
Confidence Interval
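The two quantities appear as images on the slides; one commonly taught form (an assumption about the exact constants used) is:
\bar{r}_i(n) = \frac{R_i(n)}{N_i(n)}, \qquad \Delta_i(n) = \sqrt{\frac{3 \ln n}{2\, N_i(n)}}, \qquad \mathrm{UCB}_i(n) = \bar{r}_i(n) + \Delta_i(n)
where N_i(n) is the number of times machine i has been selected and R_i(n) the sum of its rewards up to round n; the machine with the highest UCB is played next.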
Practical
Click_here
What is Q?
The Q-table follows the shape of [state, action] and we initialize its values to zero.
We then update and store our q-values after each episode. This q-table becomes
a reference table for our agent to select the best action based on the q-value.
1. Agent starts in a state (s1), takes an action (a1) and receives a reward (r1)
2. Agent chooses its next action either by referencing the Q-table or at random (epsilon, ε)
3. Update q-values
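The update in step 3 is the standard Q-learning rule (shown as an image on the slide); for reference:
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
where \alpha is the learning rate and \gamma is the discount factor.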
Click_here
Deep learning consists of artificial neural networks that are modeled on similar
networks present in the human brain. As data travels through this artificial
mesh, each layer processes an aspect of the data, filters outliers, spots familiar
entities, and produces the final output.
The human brain functions in a similar fashion — but only at a highly advanced
level. The human brain is a far more complex web of diverse neurons where
each node performs a separate task. Our understanding of things is far superior:
if we are taught that lions are dangerous, we can deduce that bears
are too.
The single-layer perceptron is the first proposed neural model. The local memory of the neuron
consists of a vector of weights. The computation of a single-layer perceptron is performed by summing
the input vector, each element multiplied by the corresponding element of the vector of weights. This
weighted sum is then passed to an activation function, which produces the output.
● The weights are initialized with random values at the beginning of the training.
● For each element of the training set, the error is calculated with the difference between desired output and
the actual output. The error calculated is used to adjust the weights.
● The process is repeated until the error made on the entire training set is less than the specified threshold,
or until the maximum number of iterations is reached.
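A compact NumPy sketch of that training loop on a toy linearly separable problem (the AND gate), purely as an illustration:
# Single-layer perceptron training sketch (AND gate)
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

rng = np.random.default_rng(0)
weights = rng.random(2)          # random initial weights
bias = rng.random()
alpha = 0.1                      # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        # step activation on the weighted sum
        output = 1 if np.dot(xi, weights) + bias > 0 else 0
        error = target - output
        # adjust weights and bias by the error
        weights += alpha * error * xi
        bias += alpha * error

print([1 if np.dot(xi, weights) + bias > 0 else 0 for xi in X])   # expect [0, 0, 0, 1]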
These are simple computational units that have weighted input signals
and produce an output signal using an activation function.
Like linear regression, each neuron also has a bias which can be thought of
as an input that always has the value 1.0 and it too must be weighted.
For example, a neuron may have two inputs in which case it requires three
weights. One for each input and one for the bias.
More recently the rectifier activation function has been shown to provide better results.
A row of neurons is called a layer and one network can have multiple layers.
The architecture of the neurons in the network is often called the network
topology.
Layers after the input layer are called hidden layers because they are
not directly exposed to the input. The simplest network structure is to
have a single neuron in the hidden layer that directly outputs the
value.
● A regression problem may have a single output neuron and the neuron may have no
activation function.
● A binary classification problem may have a single output neuron and use a sigmoid
activation function to output a value between 0 and 1 to represent the probability of
predicting a value for the class 1. This can be turned into a crisp class value by using a
threshold of 0.5, snapping values below the threshold to 0 and values at or above it to 1.
● A multi-class classification problem may have multiple neurons in the output layer, one
for each class
● In this case a softmax activation function may be used to output a probability of the
network predicting each of the class values. Selecting the output with the highest
probability can be used to produce a crisp class classification value.
ReLU: f(Z) = max(0, Z)
If Z is negative, the output is 0; otherwise the output is the input value Z itself.
Leaky ReLU: f(Z) = max(0.01 * Z, Z)
The learning rate controls how quickly the model is adapted to the problem. Smaller learning rates require more
training epochs given the smaller changes made to the weights each update, whereas larger learning rates
result in rapid changes and require fewer training epochs.
Think of a batch as a for-loop iterating over one or more samples and making predictions. At the end
of the batch, the predictions are compared to the expected output variables and an error is
calculated. From this error, the update algorithm is used to improve the model, e.g. move down
along the error gradient
When all training samples are used to create one batch, the learning algorithm is called batch
gradient descent. When the batch is the size of one sample, the learning algorithm is called
stochastic gradient descent. When the batch size is more than one sample and less than the size of
the training dataset, the learning algorithm is called mini-batch gradient descent.
One epoch means that each sample in the training dataset has had an opportunity to update the internal model
parameters. An epoch is comprised of one or more batches; an epoch that has one batch corresponds to the batch
gradient descent learning algorithm.
It is common to create line plots that show epochs along the x-axis as time and the error or skill of the model on
the y-axis. These plots are sometimes called learning curves.
with neural networks, we seek to minimize the error. As such, the objective function is often referred
to as a cost function or a loss function and the value calculated by the loss function is referred to as
simply “loss.”
The cost or loss function has an important job in that it must faithfully distill all aspects of the model down
into a single number in such a way that improvements in that number are a sign of a better model.
The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we
simply use the cross-entropy between the data distribution and the model distribution. The choice of
how to represent the output then determines the form of the cross-entropy function
Back-propagation is the essence of neural net training. It is the method of fine-tuning the weights of a neural
net based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights
allows you to reduce error rates and to make the model reliable by increasing its generalization.
It is a standard method of training artificial neural networks. This method helps to calculate the gradient of the
loss function with respect to all the weights in the network.
Optimizers are algorithms or methods used to change the attributes of your neural
network such as weights and learning rate in order to reduce the losses.
How you should change your weights or learning rates of your neural network to
reduce the losses is defined by the optimizers you use. Optimization algorithms or
strategies are responsible for reducing the losses and to provide the most accurate
results possible.
Basic gradient descent update: W = W - Alpha * dL/dW
Here we vary Alpha (the learning rate) over training.
Adaptive learning rate (AdaGrad-style): Alpha_t = Alpha / sqrt(Eta_t + sigma)
where sigma is any small positive value and Eta_t = sum over t of (dL/dW_t)^2, i.e. the sum of squared gradients from step 1 to t.
The weight update stays W = W - Alpha_t * dL/dW.
Exponentially weighted variant (RMSProp-style): Alpha_t = Alpha / sqrt(Wavg + sigma)
Wavg_t = Beta * Wavg_(t-1) + (1 - Beta) * (dL/dW)^2
Generally Beta = 0.95.
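A compact sketch tying layers, activations, loss, optimizer, batches, and epochs together; it assumes TensorFlow/Keras is available (the course practicals may use a different setup), and the data is random placeholder data:
# Keras sketch: hidden ReLU layer, softmax output, cross-entropy loss, RMSProp-style optimizer
import numpy as np
from tensorflow import keras

# illustrative data: 4 features, 3 classes
X = np.random.rand(150, 4)
y = np.random.randint(0, 3, size=150)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(4,)),   # hidden layer
    keras.layers.Dense(3, activation="softmax"),                   # one output neuron per class
])

model.compile(optimizer="rmsprop",                      # adaptive learning-rate optimizer
              loss="sparse_categorical_crossentropy",   # cross-entropy between data and model
              metrics=["accuracy"])

model.fit(X, y, epochs=10, batch_size=16)               # mini-batch gradient descent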
NN_From_Skretch : = click_here
Hand_DIgit_Recognition = click_here
Data Viz is the representation of data or information in graphs, charts or other visual
formats. It deals with the graphic representation of the data.
It connects thousands or millions of lines of numbers with a clear visual image so that
the data can be analyzed effectively.
Importance of Viz
This is important because it allows trends and patterns to be more easily seen.
With the rise of big data upon us, we need to be able to interpret increasingly larger
batches of data.
Machine learning makes it easier to conduct analyses such as predictive analysis, which
can then serve as helpful visualizations to present.
Data Viz is not only for Data Scientists and Data Analysts; this skill can be
used in almost every field to analyse data.
Need of Visualization
We need data visualization because a visual summary of information makes it easier to
identify patterns and trends than looking through thousands of rows on a spreadsheet.
It’s the way the human brain works.
Since the purpose of data analysis is to gain insights, data is much more valuable when
it is visualized.
Even if a data analyst can pull insights from data without visualization, it will be more
difficult to communicate the meaning without it. Charts and graphs make
communicating data findings easier.
Use cases of data Viz
Some are..
1. Visme
2. Tableau
3. Power Bi
4. Infogram
5. Whatagraph
6. Sisense
7. DataBox
And Many More……………….
Tableau
Why tableau
1. Quick And Interactive Visualization
B = Canvas: logical layer - The canvas opens with the logical layer, where you can create
relationships between logical tables
C = Canvas: physical layer – Double-click a table in the logical layer to go to the physical
layer of the canvas, where you can add joins and unions between tables.
D = Data grid – Displays first 1,000 rows of the data contained in the Tableau data source.
B = Drag fields to the cards and shelves in the workspace to add data to your view
C = Use the toolbar to access commands and analysis and navigation tools
D = This is the canvas in the workspace where you create a visualisation (also referred to as a
"viz").
E = Click this icon to go to the Start page, where you can connect to data. For more
information
F = Side Bar - In a worksheet, the side bar area contains the Data pane and the Analytics
pane.
G = Click this tab to go to the Data Source page and view your data.
I = Sheet tabs - Tabs represent each sheet in your workbook. This can include worksheets,
dashboards and stories.
Data Types
There are primarily seven data types used in Tableau. Tableau automatically detects the
data types of various fields as soon as new data gets uploaded from source to Tableau and
assigns it to the fields. You can also modify these data types after uploading your data into
Tableau .
1. String values
2. Number (Integer) values
3. Date values
4. Date & Time values
5. Boolean values
6. Geographic values
7. Cluster or mixed values
Connecting To Data Source
Once we establish a successful connection with a data source, we can access all its data,
bring some part of it in Tableau’s repository (extract) and use it for our analysis.
Tableau offers a myriad of data sources such as local text files, MS Excel, PDFs, JSON or
databases and servers like Tableau Server, MySQL Server, Microsoft SQL Server, etc.
Categorically, there are two types of data sources that you can connect to in Tableau;
To a file and To a server.
Connecting To A File
Tableau offers a variety of options to connect and get data from a file in your
system.
The connection to a file section has file options such as MS Excel, MS Access,
JSON, text file, PDF file, spatial file, etc.
In addition to this, with the help of the More option, you can access the data
files residing in your system and connect them with Tableau.
Connecting To A Server
The connection to a server section has countless options for an online data source. Here you
will find connectors to different kinds of online data sources such as,
Generally we would need to bury our heads in Excel to do this kind of cleaning.
Tableau provides better functionality for the same task: we can use Tableau Prep for
cleaning the data.
Tableau Prep is designed to reduce the struggle of common yet complex tasks—such as
joins, unions, pivots, and aggregations—with a drag-and-drop visual experience. No
scripting required.
Joins
In general, there are four types of joins that you can use in Tableau: inner, left, right, and full outer.
Inner :
When you use an inner join to combine tables, the result is a table that contains values that have
matches in both tables.
Left :
When you use a left join to combine tables, the result is a table that contains all values from the left
table and corresponding matches from the right table.
When a value in the left table doesn't have a corresponding match in the right table, you see a null
value in the data grid.
Right :
When you use a right join to combine tables, the result is a table that contains all values from the
right table and corresponding matches from the left table.
When a value in the right table doesn't have a corresponding match in the left table, you see a null
value in the data grid.
Full Outer :
When you use a full outer join to combine tables, the result is a table that contains all values from
both tables.
When a value from either table doesn't have a match with the other table, you see a null value in the
data grid.
Union :
Though union is not a type of join, union is another method for combining two or more tables by
appending rows of data from one table to another. Ideally, the tables that you union have the same
number of fields, and those fields have matching names and data types.
Filters
Filters are a smart way to collate and segregate data based on its
dimensions and sets to reduce the overall data frequency for
faster processing.
1. Extract Filter
As understood by its name, the extract filters are used to filter the data while extracting it from the
various sources. Such filters can help in lowering the number of Tableau queries to the data source.
2. Data Source Filter
Used mainly to restrict sensitive data from the data viewers, the data source filters
are similar to the extract filters in minimizing the data feeds for faster processing.
The data source filter in Tableau applies the filter condition directly to the source data
and quickly brings the data that qualifies the condition into the Tableau workbook.
3. Context Filter
A context filter is a discrete filter on its own, creating datasets based on the
original datasheet and the presets chosen for compiling the data. Since all
the types of filters in tableau get applied to all rows in the datasheet,
irrespective of any other filters, the context filter would ensure that it is first to
get processed.
4. Dimension Filter
Now that you’ve chosen the dimension, you can keep the highlighted values or
remove them from the selected dimension (removed values are shown with a
strikethrough). You can click All or None to select or deselect every value when
working with multiple values.
5. Measure Filter
In this filter, you can apply the various operations like Sum, Avg, Median,
Standard Deviation, and other aggregate functions. In the next stage, you
would be presented with four choices: Range, At least, At most, and Special
for your values. Every time you drag the data you want to filter, you can do
that in a specific setting.
6. Table Filter
The last filter to process is the table calculation filter, which gets executed after the
data view has been rendered. With this filter, you can quickly look into the data in the
view without filtering the underlying hidden data.
Charts And Graphs
We can Create many charts in tableau depending upon our requirement
Bar Chart
Line Chart
Pie Chart
Maps
Density Maps
Scatter Plot
Gantt Plot
Bubble Chart
Tree Map
● To segment data
● To convert the data type of a field, such as converting a string to a
date.
● To aggregate data
● To filter results
● To calculate ratios
How to perform calculations in Tableau:
IF([sales] != 0 , [discount]/[sales],0)
This formula checks if sales is not equal to zero. If true, it returns the discount
ratio (Discount/Sales); if false, it returns zero.
Each story point can be based on a different view or dashboard, or the entire story can be based on the
same visualization seen at different stages, with different filters and annotations.
1. Click the New Story tab.
2. In the lower-left corner of the screen, choose a size for
your story. Choose from one of the predefined sizes, or
set a custom size, in pixels
3. By default, your story gets its title from the sheet name.
To edit it, right-click the sheet tab, and choose Rename
Sheet.
4. To start building your story, double-click a sheet on the
left to add it to a story point.
5. Click Add a caption to summarize the story point.
6. To further highlight the main idea of this story point,
you can change a filter or sort on a field in the view.
Then save your changes by clicking Update on the story
toolbar above the navigator box:
Dashboard And Reports
Reports VS Dashboard
A report is a more detailed collection of tables, charts, and graphs and it is used for a
much more detailed, full analysis while a dashboard is used for monitoring what is
going on. The behavior of the pieces that make up dashboards and reports is similar,
but their makeup itself is different.
The report can provide a more detailed view of the information that is presented on a
dashboard.
Practical For report And Dashboard