42 DWM CombinedManual
Education Certificate
Seal of the Institute
Content Page
List of Practicals and Progressive Assessment Sheet
Sr. No. | Title of the practical | Date of performance | Date of submission | Assessment marks (50) | Dated sign. of teacher | Remarks (if any)
p76L371-12l020.20LARISSL-2ol8 zip
FOR EDUCATIONAL USE
undaram
HP-UX Itanium e11311-12020-HfUx-TAGu-1f8.2ip
pH6AL311-1lo2o-HfUx-TH6L-2 2i
IBM AIX on POWER Systems pl?69L372-n\020-ATKSL-5L- 1of8 2ip
p769L31t nl020-ATX6-5L-2f 2i
1. Install the database according to an Oracle database installation guide
applicable to your installation requirements.
Conclusion
By following the procedure, we successfully installed the Oracle database server and client.
Practical No. 2
Aim: Import source data structure in Oracle.
Theory
The table is the basic data structure used in a relational database. A table is a collection of rows.
Each row in a table contains one or more columns.
A view is an Oracle data structure constructed with a SQL statement. The SQL statement is stored in
the database. When we use a view in a query, the stored query is executed and the base table data is
returned to the user. Views do not contain data, but represent ways to look at the base table data in
the way the query specifies.
A view is built on a collection of base tables, which can be either actual tables in an Oracle database
or other views.
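The relationship between a base table and a view can be sketched in a few lines. This is only an illustration: it uses Python's built-in sqlite3 module instead of Oracle, and the table name employees, the view name sales_staff and all rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used only for illustration (the practical targets Oracle).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Base table: this is where the rows are actually stored.
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)")
cur.executemany(
    "INSERT INTO employees (name, dept, salary) VALUES (?, ?, ?)",
    [("Asha", "SALES", 30000.0), ("Ravi", "IT", 45000.0), ("Meena", "SALES", 35000.0)],
)

# View: only the SELECT statement is stored, not a copy of the data.
cur.execute("CREATE VIEW sales_staff AS SELECT name, salary FROM employees WHERE dept = 'SALES'")

# Querying the view runs the stored query against the base table.
view_rows = cur.execute("SELECT name, salary FROM sales_staff ORDER BY name").fetchall()

# A later change to the base table is visible through the view.
cur.execute("INSERT INTO employees (name, dept, salary) VALUES (?, ?, ?)", ("Kiran", "SALES", 28000.0))
view_rows_after = cur.execute("SELECT name FROM sales_staff ORDER BY name").fetchall()
conn.close()
```

Because the view holds no data of its own, the row inserted after the view was created still shows up when the view is queried again.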
password
Enter the host and service name.
3. Click on Test Connection.
A successful message will be displayed if the connection is accepted.
Conclusion
We have learnt how to import source data structure in Oracle.
Practical No. 3
Aim: Develop Target Data Structures in Oracle.
Theory
Relational databases
These databases are categorized by a set of tables where data gets fit into a pre-defined category.
The table consists of rows and columns, where a column has an entry for data for a specific category
and rows contain instances of that data defined according to the category. The Structured Query
Language (SQL) is the standard user interface, and simple operations can be applied over the tables,
which makes these databases easier to extend, join two databases with a common relation, and modify
all existing applications.
Dimensional databases
A dimensional database is a database that uses a dimensional data model to organize data. This model
uses fact tables and dimension tables in a star or snowflake schema.
USe
Click on Ne
Conclusion
We have successfully developed target data structures in Oracle.
Practical No. 4
Aim: Install the data mining tool WEKA. Study the GUI Explorer of WEKA.
Theory
WEKA is an acronym that stands for Waikato Environment for Knowledge Analysis. WEKA is open source
software that provides tools for data preprocessing, implementations of several machine learning
algorithms, and visualization tools, so that we can develop machine learning techniques and apply them
to real-world data mining problems.
GUI Explorer
On the top of the Explorer we will see these tabs:
1) Preprocess
2) Classify
3) Cluster
4) Associate
5) Select attributes
6) Visualize
1) Preprocess Tab
Initially, as you open the Explorer, only the Preprocess tab is enabled. The first step in machine
learning is to preprocess the data. Thus, in the Preprocess option, you will select the data file,
process it, and make it fit for applying the various machine learning algorithms.
2) Classify Tab
The Classify tab provides you several machine learning algorithms for the classification of your data.
3) Cluster Tab
Under the Cluster tab, there are several clustering algorithms provided, such as SimpleKMeans and so on.
4) Associate Tab
Under the Associate tab, you would find FPGrowth and Apriori.
5) Select Attributes Tab
Select Attributes allows you to make feature selections based on several algorithms such as Principal
Components, etc.
6) Visualize Tab
The Visualize option allows you to visualize your processed data for analysis.
Conclusion
We have successfully installed the WEKA data mining tool and learnt about the GUI Explorer.
Practical No. 5
Aim: Develop an application for OLAP and its roll-up, drill-down operations.
Introduction
An OLAP cube is a data structure that overcomes the limitations of relational databases by providing
rapid analysis of data. It is a multidimensional cube that is built using OLAP databases. OLAP cubes
can display and sum large amounts of data while also providing users with searchable access to any
data points, so that the data can be rolled up, sliced and diced as needed to handle the widest variety
of questions that are relevant to a user's area of interest. An OLAP cube connects to a data source to
read and process raw data, to perform aggregations and calculations for its associated measures. The
data source for all Service Manager OLAP cubes is the data marts, which includes the data marts for
both Operations Manager and Configuration Manager.
[Figure: OLAP cube with product lines (Outdoor Products, GO Sport Line, Environmental Line) and regions (Far East, North America)]
OLAP Operations:
OLAP provides a user-friendly environment for interactive data analysis. One of the most popular
front-end applications for OLAP is a PC spreadsheet program.
Roll-up (drill-up)
ROLLUP is used in tasks involving subtotals. It creates subtotals at any level of aggregation needed,
from the most detailed up to a grand total, i.e. climbing up a concept hierarchy for a dimension such
as time or geography.
QUERY SYNTAX
SELECT ... GROUP BY ROLLUP (COLUMNS)
Example
SELECT TIME, LOCATION, PRODUCT, SUM(REVENUE) AS PROFIT
FROM SALES GROUP BY ROLLUP (TIME, LOCATION, PRODUCT)
Drill-down can be performed either by:
1. Stepping down a concept hierarchy for a dimension
2. Introducing a new dimension
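The subtotal levels that a roll-up produces can be sketched in plain Python. This is a minimal illustration, not the database's own implementation: the rows in sales and the helper name rollup are hypothetical, and None stands in for the NULL that SQL shows in a rolled-up column.

```python
from collections import defaultdict

# Hypothetical fact rows: (time, location, product, revenue).
sales = [
    ("2023", "Europe", "Tents", 100),
    ("2023", "Europe", "Boots", 50),
    ("2023", "Asia", "Tents", 70),
    ("2024", "Europe", "Tents", 120),
]

def rollup(rows, n_dims):
    """Emulate GROUP BY ROLLUP: one subtotal per prefix of the dimension list."""
    totals = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # level = number of leading dimensions kept; level 0 is the grand total.
        for level in range(n_dims, -1, -1):
            key = dims[:level] + (None,) * (n_dims - level)
            totals[key] += measure
    return dict(totals)

result = rollup(sales, 3)
```

Reading result from the key ("2023", "Europe", "Tents") to ("2023", "Europe", None) to ("2023", None, None) to (None, None, None) is a roll-up; reading it in the opposite direction is a drill-down along the same hierarchy.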
Conclusion
OLAP operations roll-up and drill-down are performed.
Practical No. 6
Aim: Develop an application for OLAP and its operations: slice and dice.
Slicing
A slice in a multidimensional array is a column of data corresponding to a single value for one or
more members of the dimension. It helps the user to visualize and gather the information specific to a
dimension. When you think of slicing, think of it as a specialized filter for a particular value in a
dimension. For instance, if a user wanted to know the total number of OPN products sold across all the
other locations (Europe, Far East, North America, South America), the user would perform a horizontal
slice (shown in fig.).
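[Figure: horizontal slice of the OLAP cube along the location dimension, with product lines (Outdoor Products, GO Sport Line, Environmental Line) and regions (Far East, North America)]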
QUERY SYNTAX
Selection conditions on some attributes using <WHERE clause>, <GROUP BY>, and aggregation on some
attribute.
EXAMPLE
SELECT PRODUCTS, SUM(REVENUE) FROM SALES WHERE
PRODUCTS = 'DPV' GROUP BY PRODUCTS
Dicing
Dicing is similar to slicing, but it works a little differently. When one thinks of slicing, filtering
is done to focus on a particular attribute. Dicing, on the other hand, is more of a zoom feature that
selects a subset over all the dimensions.
[Figure: dice of the OLAP cube showing N. America and S. America]
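The difference between the two operations can be sketched with plain Python filters. The rows in sales and the helper names slice_cube and dice_cube are hypothetical, chosen only to mirror the product and location dimensions used in this practical.

```python
# Hypothetical fact rows with three dimensions and one measure.
sales = [
    {"product": "Tents", "location": "Europe", "year": 2023, "revenue": 100},
    {"product": "Tents", "location": "N. America", "year": 2023, "revenue": 80},
    {"product": "Boots", "location": "Europe", "year": 2023, "revenue": 50},
    {"product": "Tents", "location": "S. America", "year": 2024, "revenue": 60},
    {"product": "Boots", "location": "N. America", "year": 2024, "revenue": 40},
]

def slice_cube(rows, dimension, value):
    """Slice: fix ONE dimension to a single value."""
    return [r for r in rows if r[dimension] == value]

def dice_cube(rows, **allowed):
    """Dice: keep rows whose value lies in the allowed set for EVERY listed dimension."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

# Slice on product, then aggregate the measure (like WHERE ... GROUP BY ... SUM).
tents = slice_cube(sales, "product", "Tents")
tents_revenue = sum(r["revenue"] for r in tents)

# Dice on two dimensions at once to zoom into a sub-cube.
americas_2024 = dice_cube(sales, location={"N. America", "S. America"}, year={2024})
americas_2024_revenue = sum(r["revenue"] for r in americas_2024)
```

A slice reduces the cube by one dimension, while a dice returns a smaller cube that still has every dimension.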
Practical No. 7
Aim: Implement data cleaning techniques I (Data Preprocessing:
finding and replacing missing values in sample datasets)
Introduction
The data cleaning techniques include data preprocessing and data transformation.
Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often noisy, missing (incomplete) or inconsistent, and it may
contain many errors.
Data Cleaning
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It
involves handling of missing data, noisy data, etc.
Smoothing by bin boundaries
binl S3,3,5
Bin2 2 21.25.25
Bin3 2S,26, 26,3
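Smoothing by bin boundaries can be sketched as follows. The function name and the sample values are assumptions made for illustration (they are not the values of the bins above): the data is sorted, split into equal-size bins, and each value is replaced by whichever bin boundary (minimum or maximum of its bin) is closer.

```python
def smooth_by_bin_boundaries(values, bin_size):
    """Sort, partition into equal-size bins, snap each value to the nearer bin boundary."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        low, high = bin_values[0], bin_values[-1]
        # Ties go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in bin_values])
    return smoothed

# Illustrative data only.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = smooth_by_bin_boundaries(data, 4)
```

With this sample, the first bin 4, 8, 9, 15 becomes 4, 4, 4, 15 because 8 and 9 are closer to the lower boundary 4 than to the upper boundary 15.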
Conclusion
Here we learnt about the values before and after smoothing by bin boundaries, and also learnt about
data pre-processing, data cleaning and how to smooth data by the bin boundaries.
Practical No. 8
Aim: Implement data cleaning techniques II (Data transformation:
transforming data from one format to another format)
Introduction:
Removing duplicate data
This involves the following ways:
1. Normalization
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.
3. Discretization
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
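Normalization and discretization can be sketched in a few lines of Python. The function names and the ages list are hypothetical; min-max scaling and equal-width binning are used here as one common choice for each step, not as the only possible method.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so that min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

def equal_width_discretize(values, n_bins):
    """Replace each numeric value by the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value is clamped into the last interval.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [20, 30, 40, 60]                          # illustrative data only
normalized = min_max_normalize(ages)             # range 0.0 to 1.0
symmetric = min_max_normalize(ages, -1.0, 1.0)   # range -1.0 to 1.0
levels = equal_width_discretize(ages, 2)         # interval levels 0 and 1
```

The interval indices in levels could then be mapped to conceptual labels such as "young" and "senior".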
Conclusion:
Hence, here we learnt about data transformation and data discretization, and also learnt about basic
steps and customized operations with proper examples.
You can launch Weka from the C:\Program Files directory, by selecting its icon on your desktop, or from
the Windows task bar: ‘Start’ → ‘Programs’ → ‘Weka 3-4’. When the ‘WEKA GUI Chooser’ window appears on
the screen, you can select one of the four options at the bottom of the window:
1. Simple CLI provides a simple command-line interface and allows direct execution of
Weka commands.
2. Explorer is an environment for exploring data.
For the exercises in this tutorial you will use ‘Explorer’. Click on the ‘Explorer’ button in the ‘WEKA
GUI Chooser’ window.
Rahul Shinde Roll No.42
3. Preprocessing Data
At the very top of the window, just below the title bar, there is a row of tabs. Only the first
tab, ‘Preprocess’, is active at the moment because there is no dataset open. The first three
buttons at the top of the preprocess section enable you to load data into WEKA. Data can be imported
from a file in various formats (ARFF, CSV, C4.5, binary); it can also be read from a URL or from an
SQL database (using JDBC) [4]. The easiest and most common way of getting data into WEKA
is to store it as an Attribute-Relation File Format (ARFF) file.
You’ve already been given “weather.arff” file for this exercise; therefore, you can skip section 3.1 that
will guide you through the file conversion.
File Conversion
We assume that all your data is stored in a Microsoft Excel spreadsheet “weather.xls”.
WEKA expects the data file to be in Attribute-Relation File Format (ARFF). Before you apply
the algorithm to your data, you need to convert your data into a comma-separated file and then into
ARFF format (a file with the .arff extension) [1]. To save your data in comma-separated format, select
the ‘Save As…’ menu item from the Excel ‘File’ pull-down menu. In the ensuing dialog box select ‘CSV
(Comma Delimited)’ from the file type pop-up menu, enter a name for the file, and click the ‘Save’
button. Ignore all messages that appear by clicking ‘OK’. Open this file with Microsoft Word. Your
screen will look like the screen below.
The rows of the original spreadsheet are converted into lines of text where the elements are separated
from each other by commas. In this file you need to change the first line, which holds the attribute
names, into the header structure that makes up the beginning of an ARFF file. Add a @relation tag
with the dataset’s name, an @attribute tag with the attribute information, and a @data tag as
shown below.
Choose ‘Save As…’ from the ‘File‘ menu and specify ‘Text Only with Line Breaks’ as the file
type. Enter a file name and click ‘Save’ button. Rename the file to the file with extension .arff to
indicate that it is in ARFF format.
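The manual conversion described above can also be sketched as a small script. This is a simplified illustration, not WEKA's own converter: the function name csv_to_arff and the three-row sample are hypothetical, a column is declared numeric only if every cell parses as a number, and missing values or quoted strings are not handled.

```python
import csv
import io

def csv_to_arff(csv_text, relation):
    """Build ARFF text: @relation line, one @attribute per column, then @data rows."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        column = [row[col] for row in data]
        try:
            for cell in column:
                float(cell)
            attr_type = "numeric"
        except ValueError:
            # Nominal attribute: list the distinct values in order of first appearance.
            attr_type = "{" + ", ".join(dict.fromkeys(column)) + "}"
        lines.append("@attribute " + name + " " + attr_type)
    lines += ["", "@data"] + [",".join(row) for row in data]
    return "\n".join(lines) + "\n"

# Hypothetical three-row excerpt in comma-separated form.
sample = """outlook,temperature,play
sunny,85,no
overcast,83,yes
rainy,70,yes
"""
arff = csv_to_arff(sample, "weather")
```

Writing arff to a file with the .arff extension gives a file with the same header structure as the hand-edited version.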
Clicking on the ‘Open file…’ button brings up a dialog box allowing you to browse for the data file on
the local file system; choose the “weather.arff” file.
Some databases have the ability to save data in CSV format. In this case, you can select CSV
file from the local filesystem. If you would like to convert this file into ARFF format, you can click
on ‘Save’ button. WEKA automatically creates ARFF file from your CSV file.
A file can be opened from a website. Suppose, that “weather.arff” is on the following
website:
The URL of the web site in our example is http://gaia.ecs.csus.edu/~aksenovs/. It means that the
file is stored in this directory, just as in the case with your local file system. To open this file, click
on ‘Open URL…’ button, it brings up a dialog box requesting to enter source URL.
Enter the URL of the web site followed by the file name, in this example the URL is
http://gaia.ecs.csus.edu/~aksenovs/weather.arff, where weather.arff is the name of the file you
are trying to load from the website.
Data can also be read from an SQL database using JDBC. Click on ‘Open DB…’ button,
‘GenericObjectEditor’ appears on the screen.
To read data from a database, click on ‘Open’ button and select the database from a filesystem.
Preprocessing window
At the bottom of the window there is a ‘Status’ box. The ‘Status’ box displays messages that keep you
informed about what is going on. For example, when you first open the ‘Explorer’, the message
says, “Welcome to the Weka Explorer”. While you are loading the “weather.arff” file, the ‘Status’ box
displays the message “Reading from file…”. Once the file is loaded, the message in the ‘Status’ box
changes to “OK”. Right-clicking anywhere in the ‘Status’ box brings up a menu with two options:
1. Available Memory displays, in the log and in the ‘Status’ box, the amount of
memory available to WEKA in bytes.
2. Run garbage collector forces the Java garbage collector to search for memory
that is no longer used and to free it up for new tasks.
To the right of ‘Status box’ there is a ‘Log’ button that opens up the log. The log records every action
in WEKA and keeps a record of what has happened. Each line of text in the log contains time of entry.
For example, if the file you tried to open is not loaded, the log will have record of the problem that
occurred during opening.
To the right of the ‘Log’ button there is an image of a bird. The bird is the WEKA status icon.
The number next to the ‘X’ symbol indicates the number of concurrently running processes. Once a file
has been loaded, the bird sits down, which means that there are no processes running: the number next
to ‘X’ is zero and the system is idle. Later, in the classification problem, look at the bird while the
result is being generated: it gets up and starts moving, which indicates that a process has started.
The number next to ‘X’ becomes 1, which means that there is one process running, in this case the
calculation.
If the bird is standing and not moving for a long time, it means that something has gone wrong.
In this case you should restart WEKA Explorer.
Loading data
Let's load the data and look at what happens in the ‘Preprocess’ window.
The most common and easiest way of loading data into WEKA is from an ARFF file, using the ‘Open
file…’ button (section 3.2). Click on the ‘Open file…’ button and choose the “weather.arff” file from
your local filesystem. Note that the data can be loaded from a CSV file as well, because some databases
can export data only in CSV format.
Once the data is loaded, WEKA recognizes attributes that are shown in the ‘Attribute’ window. Left
panel of ‘Preprocess’ window shows the list of recognized attributes:
No. is a number that identifies the order of the attribute as they are in data file, Selection tick boxes
allow you to select the attributes for working relation, Name is a name of an attribute as it was
declared in the data file.
The ‘Current relation’ box above ‘Attribute’ box displays the base relation (table) name and the current
working relation (which are initially the same) - “weather”, the number of instances - 14 and the number
of attributes - 5.
During the scan of the data, WEKA computes some basic statistics on each attribute. The following
statistics are shown in ‘Selected attribute’ box on the right panel of ‘Preprocess’ window:
An attribute can be deleted from the ‘Attributes’ window. Highlight an attribute you would like to
delete and hit Delete button on your keyboard.
By clicking on an attribute, you can see the basic statistics on that attribute. The frequency of
each attribute value is shown for categorical attributes. Min, max, mean and standard deviation
(StdDev) are shown for continuous attributes.
Outlook is nominal. Therefore, you can see the following frequency statistics for this attribute in the
‘Selected attribute’ window:
Missing = 0 means that the attribute is specified for all instances (no missing values), Distinct = 3
means that Outlook has three different values: sunny, overcast, rainy, and Unique = 0 means that
no value of Outlook occurs in exactly one instance (every value appears at least twice).
Just below these values there is a table displaying count of instances of the attribute Outlook. As you
can see, there are three values: sunny with 5 instances, overcast with 4 instances, and rainy with 5
instances. These numbers match the numbers of instances in the base relation and table
“weather.xls”.
Temperature is numeric; therefore, the ‘Selected attribute’ window shows statistics describing the
distribution of its values: Minimum, Maximum, Mean and Standard Deviation.
Missing = 0 means that the attribute is specified for all instances (no missing values), Distinct = 12
means that Temperature has twelve different values, and
Unique = 10 means that ten of those values occur in exactly one instance each.
Minimum = 64 is the lowest temperature and Maximum = 85 is the highest temperature; the mean and
standard deviation are listed alongside them.
Compare the result with the attribute table “weather.xls”; the numbers in WEKA match the numbers
in the table.
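The Missing, Distinct and Unique figures can be reproduced with a short script. The two columns below are assumed to be the values of the standard 14-instance weather dataset, and the helper name summarize is hypothetical. Whether WEKA reports the sample or the population standard deviation is not stated in this tutorial, so the sample form used here is an assumption.

```python
from collections import Counter
from statistics import mean, stdev

# Assumed contents of the outlook and temperature columns of weather.arff (14 instances).
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

def summarize(values):
    """Missing = absent values, Distinct = different values, Unique = values seen exactly once."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "unique": sum(1 for c in counts.values() if c == 1),
    }

outlook_stats = summarize(outlook)
temperature_stats = summarize(temperature)
temperature_range = (min(temperature), max(temperature))
temperature_mean = mean(temperature)
temperature_stdev = stdev(temperature)  # sample standard deviation (assumption)
```

Only 72 and 75 appear twice in the temperature column, which is why twelve distinct values yield ten unique ones.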
You can select a class in the ‘Class’ pull-down box. The last attribute in the ‘Attributes’ window is
the default class selected in the ‘Class’ pull-down box.
You can Visualize the attributes based on selected class. One way is to visualize selected
attribute based on class selected in the ‘Class’ pull-down window, or visualize all attributes by
clicking on ‘Visualize All’ button.
Setting Filters
Pre-processing tools in WEKA are called “filters”. WEKA contains filters for discretization,
normalization, resampling, attribute selection, transformation and combination of attributes. Some
techniques, such as association rule mining, can only be performed on categorical data. This requires
performing discretization on numeric or continuous attributes. For the classification example you do
not need to transform the data. For practice, suppose you need to perform a test on categorical data.
There are two attributes that need to be converted: ‘temperature’ and ‘humidity’. You will keep all of
the values for these attributes in the data; discretizing means removing the keyword “numeric” as the
type of the ‘temperature’ attribute and replacing it with a set of “nominal” values. You can do this by
applying a filter.
In ‘Filters’ window, click on the ‘Choose’ button.
This will show a pull-down menu with a list of available filters. Select ‘Supervised’ → ‘Attribute’ →
‘Discretize’ and click on the ‘Apply’ button. The filter will convert Numeric values into Nominal.
When a filter is chosen, the fields in the window change to reflect the available options.
As you can see, there is no change in the attribute Outlook. Select the attribute Temperature and look
at the ‘Selected attribute’ box: the ‘Type’ field shows that the attribute type has changed from
Numeric to Nominal. The list has changed as well: instead of statistical values there is a count of
instances, and the count is 14, which means that there are 14 instances of the attribute Temperature.
Note that when you right-click on the filter, a ‘GenericObjectEditor’ dialog box comes up on your
screen. The box lets you choose the filter configuration options. The same box can be used for
classifiers, clusterers and association rules.
Clicking on ‘More’ button brings up an ‘Information’ window describing what the different options
can do.
At the bottom of the editor window there are four buttons. ‘Open’ and ‘Save’ buttons allow you to
save object configurations for future use. ‘Cancel’ button allows you to exit without saving
changes. Once you have made changes, click ‘OK’ to apply them.
4. Building “Classifiers”
Classifiers in WEKA are the models for predicting nominal or numeric quantities. The
learning schemes available in WEKA include decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons, logistic regression, and Bayes' nets.
“Meta”-classifiers include bagging, boosting, stacking, error-correcting output codes, and locally
weighted learning.
Once you have your data set loaded, all the tabs are available to you. Click on the ‘Classify’ tab.
Now you can start analyzing the data using the provided algorithms. In this exercise you will
analyze the data with C4.5 algorithm using J48, WEKA’s implementation of decision tree learner.
The sample data used in this exercise is the weather data from the file “weather.arff”. Since C4.5
algorithm can handle numeric attributes, in contrast to the ID3 algorithm from which C4.5 has
evolved, there is no need to discretize any of the attributes. Before you start this exercise, make
sure you do not have filters set in the ‘Preprocess’ window. Filter exercise in section 3.6 was just
a practice.
Choosing a Classifier
Click on the ‘Choose’ button in the ‘Classifier’ box just below the tabs and select the C4.5
classifier: WEKA → Classifiers → Trees → J48.
Before you run the classification algorithm, you need to set test options. Set test options in
the ‘Test options’ box. The test options that are available to you are [2]:
1. Use training set. Evaluates the classifier on how well it predicts the class of the
instances it was trained on.
2. Supplied test set. Evaluates the classifier on how well it predicts the class of a set of
instances loaded from a file. Clicking on the ‘Set…’ button brings up a dialog allowing
you to choose the file to test on.
3. Cross-validation. Evaluates the classifier by cross-validation, using the number of folds
that are entered in the ‘Folds’ text field.
4. Percentage split. Evaluates the classifier on how well it predicts a certain percentage of
the data, which is held out for testing. The amount of data held out depends on the value
entered in the ‘%’ field.
In this exercise you will evaluate the classifier with a percentage split: 66% of the data is used for
training, and the classifier is evaluated on how well it predicts the remaining 34%. Check the
‘Percentage split’ radio button and keep the default of 66%. Click on the ‘More options…’ button.
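The two splitting strategies, percentage split and cross-validation, can be sketched with index lists. This is a minimal illustration and makes no claim to reproduce WEKA's exact shuffling or rounding; the function names are hypothetical and the seed parameter only plays the role of the random seed mentioned in the options.

```python
import random

def percentage_split(n_instances, train_percent=66, seed=1):
    """Shuffle the instance indices, then cut them into a training part and a test part."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_train = round(n_instances * train_percent / 100)
    return indices[:n_train], indices[n_train:]

def k_fold_indices(n_instances, folds=10, seed=1):
    """Yield (train, test) index lists; every instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    for k in range(folds):
        test = indices[k::folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# 14 instances, as in the weather data.
train_idx, test_idx = percentage_split(14)
folds = list(k_fold_indices(14, folds=7))
```

With 14 instances and a 66% split, this sketch trains on 9 instances and holds out 5 for testing.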
Identify what is included into the output. In the ‘Classifier evaluation options’ make sure that the
following options are checked [2]:
1. Output model. The output is the classification model on the full training set, so that it
can be viewed, visualized, etc.
2. Output per-class stats. The precision/recall and true/false statistics for each class
are output.
3. Output confusion matrix. The confusion matrix of the classifier’s predictions is included
in the output.
4. Store predictions for visualization. The classifier’s predictions are remembered so
that they can be visualized.
5. Set ‘Random seed for Xval / % Split’ to 1. This specifies the random seed used when
randomizing the data before it is divided up for evaluation purposes.
The remaining options, which you do not use in this exercise but which are available to you, are:
Once the options have been specified, you can run the classification algorithm. Click on the ‘Start’
button to start the learning process. You can stop the learning process at any time by clicking on the
‘Stop’ button.
When training is complete, the ‘Classifier output’ area on the right panel of the ‘Classify’
window is filled with text describing the results of training and testing. A new entry appears in the
‘Result list’ box on the left panel of the ‘Classify’ window.
Analyzing Results
humidity
windy
play
Test mode: split 66% train, remainder test
This is the test mode you selected: split = 66%.
Detailed Accuracy By Class demonstrates a more detailed per-class breakdown of the classifier’s
prediction accuracy.
=== Confusion Matrix ===
 a b   <-- classified as
 2 1 | a = yes
 2 0 | b = no
From the confusion matrix you can see that one instance of class ‘yes’ has been assigned to class
‘no’, and two instances of class ‘no’ have been assigned to class ‘yes’.
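The per-class figures can be derived from the confusion matrix by hand or with a short script. The sketch below uses the 2x2 matrix shown above (rows are actual classes, columns are predicted classes); the function names are hypothetical.

```python
# Rows = actual class, columns = predicted class, in the order (yes, no).
labels = ["yes", "no"]
matrix = [[2, 1],
          [2, 0]]

def accuracy(m):
    """Correctly classified instances (the diagonal) divided by all instances."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total

def precision_recall(m, k):
    """Precision and recall for the class at index k."""
    true_pos = m[k][k]
    predicted = sum(row[k] for row in m)   # column sum: everything predicted as class k
    actual = sum(m[k])                     # row sum: everything that really is class k
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

acc = accuracy(matrix)
yes_precision, yes_recall = precision_recall(matrix, 0)
no_precision, no_recall = precision_recall(matrix, 1)
```

Only the two ‘yes’ instances on the diagonal are classified correctly, so the accuracy on the five test instances is 2/5.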
Visualization of Results
WEKA lets you see a graphical representation of the classification tree. Right-click on the entry
in the ‘Result list’ for which you would like to visualize a tree. It invokes a menu containing the
following items:
Select the item ‘Visualize tree’; a new window comes up on the screen displaying the tree.
WEKA also lets you visualize classification errors. Right-click on the entry in the ‘Result list’ again
and select ‘Visualize classifier errors’ from the menu:
On the ‘Weka Classifier Visualize’ window, beneath the X-axis selector there is a drop-down list,
‘Colour’, for choosing the color scheme. This allows you to choose the color of points based on the
attribute selected. Below the plot area, there is a legend that describes what values the colors
correspond to. In your example, red represents ‘no’, while blue represents ‘yes’. For better visibility
you should change the color of the label ‘yes’. Left-click on ‘yes’ in the ‘Class colour’ box and select
a lighter color from the color palette.
To the right of the plot area there is a series of horizontal strips. Each strip represents an
attribute, and the dots within it show the distribution of values of the attribute. You can choose which
axes are used in the main graph by clicking on these strips (left-click changes the X-axis, right-click
changes the Y-axis).
Change the X-axis to the ‘Outlook’ attribute and the Y-axis to ‘Play’. The instances are spread out in
the plot area and concentration points are not visible. Keep sliding ‘Jitter’, a random displacement
given to all points in the plot, to the right until you can spot concentration points.
On the plot you can see the results of classification. Correctly classified instances are represented
as crosses, and incorrectly classified ones as squares. In this example, in the lower left corner
you can see a blue cross indicating a correctly classified instance: if Outlook = ‘sunny’ → play = ‘yes’
1. Fire up WEKA to get the GUI Chooser panel. Select Explorer from the four
choices on the right side.
2. We are on Preprocess now. Click the Open file button to bring up a standard
dialog through which you can select a file. Choose the customer_labThree.cvs
file.
3. To perform classification with Weka, the last attribute in the dataset is taken
as the class label and it should be nominal. Since the last attribute of data set
customer_labThree.cvs is of numeric type (1/0), we should convert it to nominal
type in the next step.
4. Unsupervised attribute filter – NumericToNominal is chosen to perform this
conversion. Since we would like to convert the last attribute only, change the
attributeIndices to last.
5. After applying the filter, the last attribute becomes nominal and it is taken
as the class label for the dataset; the data set is now visualized in two colors.
6. If the class attribute is not the last attribute, you could set it in edit window.
7. You should also convert the types of other attributes. Attributes region,
townsize, agecat, jobcat, empcat, card2tenurecat, and internet are all nominal
values; however, they are treated as numeric by Weka. Attributes
gender, union, equip, wireless, called, callwait, forward, confer, and ebill are binary
values, and they are treated as numeric as well. The NumericToNominal filter should be
applied to convert them. You could also normalize attribute educat to [0, 1] since
education categories are rankings.
Attribute Selection - Since not all attributes are relevant to the classification
job, you should perform attribute selection before training the classifier.
8. You could remove irrelevant attributes by hand. For example, the first attribute
custId should be removed. Select it and click Remove button to remove it.
9. You also could run automatic attribute selection. We have introduced two
methods of evaluating attributes individually – InfoGainAttributeEval and
ChiSquaredAttributeEval. The default attribute selection method of Weka is
CfsSubsetEval, which evaluates subsets of attributes.
11. Run feature selection the second time with CfsSubsetEval and BestFirst search
method. Compare results of two feature selection methods.
12. If you decide to reduce the dataset by removing unimportant attributes, you
could choose to save the reduced dataset by right-clicking the Result list. Save the
file as customer.arff.
13. Open the saved processed data file customer.arff and then click the Classify tab at
the top of the window. Click the Choose button under Classifier. The drop-down list of
all classifiers is shown. Choose NaiveBayes from the bayes folder.
14. Left-click the Classifier field and choose Show Property from the drop-down list.
The property window of NaiveBayes opens. If you do not want to use a normal
distribution for numeric data, set useKernelEstimator to true. You could also
perform supervised discretization on numeric data by setting
useSupervisedDiscretization to true. Click the OK button to save all the settings.
15. To partition the training data set and test data set, choose 10-fold
cross-validation.
16. Click the Start button on the left of the window and the algorithm begins to run.
The output is shown in the right window.
[Figure: NaiveBayes output, annotated with the parameters of normal distributions for numeric attributes, the frequency counts of nominal values, and the accuracy]
K-Nearest-Neighbor: lazy/IBK
18. We would like to build a Decision Tree model on the same given training data
set. Take all default values of the parameters.
19. To visualize the decision tree we built, right-click the Result list item for J48.
20. All trained classification models can be saved by right-clicking the Result list items.
Practical No. 1
Aim: Perform association technique on customer dataset
(Implementing the Apriori algorithm on a customer dataset)

Itemset    Sup-Count
I1         6
I2         7
I3         6
I4
I5         2

Step-2: K = 2
(a) Generate candidate set C2 using L1 (this is called the join step). The condition for joining
L(k-1) and L(k-1) is that they should have (K-2) elements in common.
(b) Now find the support count of these itemsets by searching in the dataset.

Itemset    Sup-Count
I1, I2     4
I1, I3
I1, I4
I1, I5     2
I2, I3     4
I2, I4
I2, I5     2
I3, I4
I3, I5
I4, I5     0

(c) Compare the candidate (C2) support count with the minimum support count.
This gives us itemset L2.

Itemset    Sup-Count
I1, I2     4
I1, I3
I1, I5     2
I2, I3     4
I2, I4
I2, I5     2

Step-3
(a) Find the support count of these remaining itemsets by searching in the dataset (C3).

Itemset       Sup-Count
I1, I2, I3    2
I1, I2, I5    2

[I2^I3] => [I1]  // Confidence = Sup(I1^I2^I3) / Sup(I2^I3) = 2/4 * 100 = 50%
[I1] => [I2^I3]  // Confidence = Sup(I1^I2^I3) / Sup(I1) = 2/6 * 100 = 33%
[I2] => [I1^I3]  // Confidence = Sup(I1^I2^I3) / Sup(I2) = 2/7 * 100 = 28%
[I3] => [I1^I2]  // Confidence = Sup(I1^I2^I3) / Sup(I3) = 2/6 * 100 = 33%
Here the minimum confidence is 60%.
Now, we find strong association rules with the help of the second itemset {I1, I2, I5}.
Itemset: {I1, I2, I5}  // from L3
Rules can be:
[I1^I2] => [I5]  // Confidence = Sup(I1^I2^I5) / Sup(I1^I2) = 2/4 * 100 = 50%
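The join, prune and count steps worked through above can be sketched in plain Python. The nine transactions below are an assumption: they are not listed in this practical and were chosen only because they are consistent with the support counts that remain legible in the worked example. The function names are hypothetical, and the minimum support count of 2 is likewise assumed.

```python
from itertools import combinations

# Assumed transaction database (not given in the practical itself).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def support_count(itemset, db):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

def apriori_frequent(db, min_count):
    """Return {frequent itemset: support count}, built level by level."""
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i]), db) >= min_count]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support_count(s, db)
        k += 1
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = [c for c in candidates if support_count(c, db) >= min_count]
    return frequent

def confidence(antecedent, consequent, db):
    """Confidence(A => B) = Sup(A and B) / Sup(A)."""
    return support_count(antecedent | consequent, db) / support_count(antecedent, db)

frequent = apriori_frequent(transactions, min_count=2)
l3 = sorted(sorted(s) for s in frequent if len(s) == 3)
conf = confidence(frozenset({"I1", "I2"}), frozenset({"I5"}), transactions)
```

Under these assumptions, l3 holds the same two three-item sets as the C3 table, and conf reproduces the 2/4 confidence of the rule [I1^I2] => [I5].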
Practical No. 13
Aim: Perform association technique on customer dataset using the Apriori algorithm
Support
Support refers to the default popularity of an item and can be calculated by dividing the
number of transactions containing a particular item by the total number of
transactions. Suppose we want to find support for item B. This can be calculated as:
Support(B) = (Transactions containing B) / (Total Transactions)
For instance, if out of 1000 transactions, 100 transactions contain Ketchup, then the
support for item Ketchup can be calculated as:
Support(Ketchup) = (Transactions containing Ketchup) / (Total Transactions)
Support(Ketchup) = 100/1000 = 10%
Confidence
Confidence refers to the likelihood that an item B is also bought if item A is bought. It
can be calculated by dividing the number of transactions where A and B are bought
together by the total number of transactions where A is bought. Mathematically,
it can be represented as:
Confidence(A→B) = (Transactions containing both A and B) / (Transactions containing A)
Coming back to our problem, we had 50 transactions where Burger and Ketchup were
bought together, while burgers are bought in 150 transactions. The
likelihood of buying ketchup when a burger is bought can then be represented as the
confidence of Burger → Ketchup and can be mathematically written as:
Confidence(Burger→Ketchup) = (Transactions containing both Burger and Ketchup) / (Transactions containing Burger)
Confidence(Burger→Ketchup) = 50/150 = 33.3%
You may notice that this is similar to what you'd see in the Naïve Bayes algorithm;
however, the two algorithms are meant for different types of problems.
Lift
Rahul Shinde Roll No.42
Lift(A -> B) refers to the increase in the ratio of the sale of B when A is sold. Lift(A -> B)
can be calculated by dividing Confidence(A -> B) by Support(B). Mathematically it can
be represented as:
Lift(A→B) = (Confidence(A→B)) / (Support(B))
Coming back to the Burger and Ketchup example, Lift(Burger→Ketchup) = 33.3% / 10% = 3.33.
Lift basically tells us that the likelihood of buying ketchup when a burger is bought is 3.33
times the likelihood of just buying the ketchup. A Lift of 1 means there is
no association between products A and B. A Lift greater than 1 means products A and
B are more likely to be bought together. Finally, a Lift of less than 1 refers to the case
where two products are unlikely to be bought together.
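The three measures can be checked with a few lines of Python. The sketch below simply plugs in the Burger and Ketchup counts from the text; the function names are hypothetical helpers written for this illustration.

```python
def support(count_item, total):
    """Share of all transactions that contain the item."""
    return count_item / total


def confidence(count_both, count_antecedent):
    """Share of antecedent transactions that also contain the consequent."""
    return count_both / count_antecedent


def lift(confidence_a_to_b, support_b):
    """How much more likely B is when A is bought, relative to B overall."""
    return confidence_a_to_b / support_b


total_transactions = 1000   # from the Support example
ketchup = 100               # transactions containing Ketchup
burger = 150                # transactions containing Burger
burger_and_ketchup = 50     # transactions containing both

support_ketchup = support(ketchup, total_transactions)
confidence_bk = confidence(burger_and_ketchup, burger)
lift_bk = lift(confidence_bk, support_ketchup)

print(f"Support(Ketchup) = {support_ketchup:.1%}")
print(f"Confidence(Burger -> Ketchup) = {confidence_bk:.1%}")
print(f"Lift(Burger -> Ketchup) = {lift_bk:.2f}")
```

The printed values are 10.0%, 33.3% and 3.33, the same figures used in the explanation above.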
The dataset records the different products bought in 7500 transactions over the course of
a week at a French retail store. The dataset can be downloaded from the following link:
https://drive.google.com/file/d/1y5DYn0dGoSbC22xowBq2d4po6h1JxcTQ/view?usp=sharing
Another interesting point is that we do not need to write the script to calculate
support, confidence, and lift for all the possible combinations of items. We will use an
off-the-shelf library where all of the code has already been implemented.
The library I'm referring to is apyori. I suggest you download and install the library
in the default path for your Python libraries before proceeding.
Note: All the scripts in this article have been executed using the Spyder IDE for Python.
Follow these steps to implement Apriori algorithm in Python:
Import the Libraries
The first step, as always, is to import the required libraries. Execute the following script
to do so:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori
In the script above we import the pandas, numpy, and pyplot libraries and the apriori function.
A snippet of the dataset is shown in the above screenshot. If you carefully look at the
data, we can see that the header is actually the first transaction. Each row
corresponds to a transaction and each column corresponds to an item purchased in
that specific transaction. A NaN tells us that the transaction has no item in that
column.
In this dataset there is no header row. But by default, the pd.read_csv function treats the
first row as the header. To get rid of this problem, add the header=None option
to the pd.read_csv function, as shown below:
store_data = pd.read_csv('D:\\Datasets\\store_data.csv', header=None)
In this updated output you will see that the first line is now treated as a record
instead of a header, as shown below:
Now we will use the Apriori algorithm to find out which items are commonly sold
together, so that the store owner can place the related items together or
advertise them together in order to increase profit.
Data Preprocessing
The Apriori library we are going to use requires our dataset to be in the form of a list
of lists, where the whole dataset is a big list and each transaction in the dataset is an
inner list within the outer big list. Currently we have data in the form of a pandas
dataframe. To convert our pandas dataframe into a list of lists, execute the following
script:
records = []
for i in range(0, 7501):
    records.append([str(store_data.values[i, j]) for j in range(0, 20)])
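One side effect of the loop above is that every missing cell is converted with str() and ends up in the transaction as the string 'nan'. A possible alternative, sketched below under the assumption that the file has no header row and one transaction per line, reads the rows with the standard csv module and skips empty cells. The sample rows and the helper name `load_transactions` are made up for illustration, and rules mined from data prepared this way may differ from the numbers reported in this practical.

```python
import csv
import io

# Hypothetical sample in the same shape as store_data.csv: no header row,
# one transaction per line, a variable number of items per line.
sample_csv = """shrimp,almonds,avocado
burgers,meatballs,eggs
chutney
turkey,avocado
"""


def load_transactions(handle):
    """Return a list of lists, one inner list per transaction, skipping empty cells."""
    records = []
    for row in csv.reader(handle):
        items = [cell.strip() for cell in row if cell.strip()]
        if items:
            records.append(items)
    return records


records = load_transactions(io.StringIO(sample_csv))
print(len(records))
print(records[0])
```

For a real file, the same function could be called on a handle opened with `open(path, newline="")`.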
Applying Apriori
The next step is to apply the Apriori algorithm on the dataset. To do so, we can use
the apriori function that we imported from the apyori library.
The apriori function requires some parameter values to work. The first parameter is the
list of lists that you want to extract rules from. The second parameter is
min_support, which selects the itemsets whose support is at least the specified
value. Next, the min_confidence parameter keeps only the rules whose confidence is at
least the specified threshold. Similarly, the min_lift parameter
specifies the minimum lift value for the shortlisted rules. Finally,
the min_length parameter specifies the minimum number of items that you want in
your rules.
Let's suppose that we want rules for only those items that are purchased at least 5
times a day, or 7 x 5 = 35 times in one week, since our dataset is for a one-week time
period. The support for those items can be calculated as 35/7500 ≈ 0.0047; the script
below uses the slightly lower value 0.0045. The
minimum confidence for the rules is 20% or 0.2. Similarly, we specify the value for lift
as 3, and finally min_length is 2 since we want at least two products in our rules. These
values are mostly arbitrarily chosen, so you can play with them and
see what difference it makes in the rules you get back.
Execute the following script:
association_rules = apriori(records, min_support=0.0045, min_confidence=0.2,
                            min_lift=3, min_length=2)
association_results = list(association_rules)
In the last line here we convert the rules found by the apriori function into a list since it
is easier to view the results in this form.
Checking the length of this list with len(association_results) should return 48. Each item
corresponds to one rule.
Let's print the first item in the association_results list to see the first rule. Execute the
following script:
print(association_results[0])
lift=4.84395061728395)])
The first item in the list is a list itself containing three items. The first item of the list
shows the grocery items in the rule.
For instance, from the first item, we can see that light cream and chicken are
commonly bought together. One possible explanation is that people who purchase light cream
are careful about what they eat and hence are more likely to buy chicken, i.e. white
meat, instead of red meat, i.e. beef. Or this could mean that light cream is commonly
used in recipes for chicken.
The support value for the first rule is 0.0045. This number is calculated by dividing
the number of transactions containing both light cream and chicken by the total number of
transactions. The confidence level for the rule is 0.2905, which shows that out of all
the transactions that contain light cream, 29.05% also contain
chicken. Finally, the lift of 4.84 tells us that chicken is 4.84 times more likely to be
bought by the customers who buy light cream compared to the default likelihood of
the sale of chicken.
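As a cross-check, the three numbers of the first rule can be turned back into transaction counts using the definitions of support, confidence and lift. The sketch below assumes a total of 7501 transactions, which is the number of rows read by the preprocessing loop (range(0, 7501)); the variable names are chosen for this illustration only.

```python
TOTAL = 7501  # assumption: rows read by the preprocessing loop, range(0, 7501)

# Values printed for the rule light cream -> chicken.
support_rule = 0.004532728969470737
confidence_rule = 0.29059829059829057
lift_rule = 4.84395061728395

# support = both / TOTAL
both = support_rule * TOTAL
# confidence = both / light_cream
light_cream = both / confidence_rule
# lift = confidence / support(chicken), with support(chicken) = chicken / TOTAL
chicken = confidence_rule / lift_rule * TOTAL

print(round(both), round(light_cream), round(chicken))
```

Under that assumption the values correspond to 34 transactions containing both items, 117 containing light cream and 450 containing chicken.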
The following script displays the rule, the support, the confidence, and the lift for each
rule more clearly:
for item in association_results:
    # First index of the inner list
    # contains the base item and the add item
    # (a frozenset, so the item order is not guaranteed)
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    # Second index of the inner list
    print("Support: " + str(item[1]))
    # Third index of the inner list; its 0th element
    # holds the confidence and the lift
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")
If you execute the above script, you will see all the rules returned by the apriori function.
The first four rules returned look like this:
Rule: light cream -> chicken
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
=====================================
Rule: mushroom cream sauce -> escalope
Support: 0.005732568990801126
Confidence: 0.3006993006993007
Lift: 3.790832696715049
=====================================
Rule: escalope -> pasta
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
=====================================
Rule: ground beef -> herb & pepper
Support: 0.015997866951073192
Confidence: 0.3234501347708895
Lift: 3.2919938411349285
=====================================
We have already discussed the first rule. Let's now discuss the second rule. The
second rule states that mushroom cream sauce and escalope are frequently bought together.
The support for this rule is 0.0057. The confidence for this rule is
0.3006, which means that out of all the transactions containing mushroom cream sauce,
30.06% also contain escalope. Finally, the lift of 3.79 shows that
escalope is 3.79 times more likely to be bought by the customers that buy mushroom
cream sauce, compared to its default sale.
Conclusion
Association rule mining algorithms such as Apriori are very useful for finding simple
associations between our data items. They are easy to implement and easy to explain.
However, for more advanced insights, such as those used by Google or
Amazon, more complex algorithms, such as recommender systems, are used.
Still, this method is a very simple way to get basic
associations if that's all your use case needs.
Practical No. 14
Theory :
Introduction
The K-nearest neighbors (KNN) algorithm is a supervised ML algorithm which
can be used for both classification and regression problems.
However, in industry it is mainly used for classification. The
following two properties define KNN well −
➢ Lazy learning algorithm − KNN is a lazy learning algorithm because it does
not have a specialized training phase and uses all the training data
at classification time.
➢ Non-parametric learning algorithm − KNN is also a non-parametric
learning algorithm because it doesn’t assume anything about the
underlying data.
• 3.1 − Calculate the distance between the test data and each row of the training
data using one of the following methods: Euclidean, Manhattan or
Hamming distance. The most commonly used method to calculate distance
is Euclidean.
• 3.2 − Now, based on the distance value, sort them in ascending order.
• 3.3 − Next, it will choose the top K rows from the sorted array.
• 3.4 − Now, it will assign a class to the test point based on most frequent
class of these rows.
Step 4 − End
Example
The following is an example to understand the concept of K and working of KNN
algorithm −
Suppose we have a dataset which can be plotted as follows −
Now, we need to classify the new data point, shown as a black dot (at point 60,60), into the
blue or red class. We are assuming K = 3, i.e. it would find the three nearest data points. It
is shown in the next diagram −
We can see in the above diagram the three nearest neighbors of the data point
with the black dot. Among those three, two lie in the red class, hence the black
dot will also be assigned to the red class.
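Steps 3.1 to 3.4 and the K = 3 example above can be sketched in plain Python. The training points below are hypothetical (the original plot is not available here) and were picked so that two of the three nearest neighbors of (60, 60) are red; `knn_classify` is an illustrative helper, not a library function.

```python
import math
from collections import Counter

# Hypothetical training points (x, y, label); not taken from the original plot.
training = [
    (55, 58, "red"), (62, 64, "red"), (70, 52, "red"),
    (58, 66, "blue"), (30, 35, "blue"), (40, 80, "blue"),
]


def knn_classify(point, rows, k):
    # 3.1: Euclidean distance from the test point to every training row.
    distances = [(math.dist(point, (x, y)), label) for x, y, label in rows]
    # 3.2: sort by distance in ascending order.
    distances.sort(key=lambda pair: pair[0])
    # 3.3: keep the top K rows.
    nearest = distances[:k]
    # 3.4: assign the most frequent class among those rows.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify((60, 60), training, k=3))
```

With these points the three nearest neighbors are two red points and one blue point, so the sketch prints red. Choosing an odd K for a two-class problem avoids tied votes.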
Implementation in Python
As we know, the K-nearest neighbors (KNN) algorithm can be used for both
classification and regression. The following are the recipes in Python to use
KNN as a classifier as well as a regressor −
KNN as Classifier
First, start with importing the necessary Python packages −
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read the dataset into a pandas dataframe as follows −
dataset = pd.read_csv(path, names=headernames)
dataset.head()
Data preprocessing will be done with the help of the following script lines.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Next, we will divide the data into a train and test split. The following code will split the
dataset into 60% training data and 40% testing data −
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
Next, data scaling will be done as follows −
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
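To make the scaling step concrete, the sketch below standardizes one column by hand: the mean and the population standard deviation are learned from the training values only and then applied to both the training and the test values, which is the same fit-then-transform order used above. The numbers and the helper names `fit_scaler` and `transform` are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Standardize values with statistics learned from the training data."""
    return [(value - mean) / std for value in column]


train = [5.1, 4.9, 6.3, 5.8]  # hypothetical sepal lengths (training part)
test = [5.0, 6.7]             # hypothetical sepal lengths (test part)

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1; the test column generally does not, because it reuses the training statistics. Scaling matters for KNN because the distance calculation would otherwise be dominated by the features with the largest ranges.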
Next, train the model with the help of KNeighborsClassifier class of sklearn as
follows −
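A minimal sketch of what this training and evaluation step could look like is given below. It makes several assumptions that are not stated in the text: the copy of iris bundled with scikit-learn stands in for the downloaded file, n_neighbors is set to 5 (the scikit-learn default), and random_state is fixed so that the run is repeatable. The printed numbers will therefore generally differ from the output reproduced in this practical.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data replaces the CSV read from the weblink.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=0
)

# Scale with statistics learned from the training part only.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Assumption: K = 5; the text does not state the value that was used.
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```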
Output
Confusion Matrix:
[[21 0 0]
[ 0 16 0]
[ 0 7 16]]
Classification Report:
                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00        21
Iris-versicolor       0.70      1.00      0.82        16
 Iris-virginica       1.00      0.70      0.82        23
      micro avg       0.88      0.88      0.88        60
      macro avg       0.90      0.90      0.88        60
   weighted avg       0.92      0.88      0.88        60
Accuracy: 0.8833333333333333
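The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes them in plain Python, reading the matrix with rows as true classes and columns as predicted classes (the scikit-learn convention); the variable names are chosen for this illustration.

```python
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
# Confusion matrix from the output above: rows = true class, columns = predicted class.
matrix = [
    [21, 0, 0],
    [0, 16, 0],
    [0, 7, 16],
]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(3)) / total

report = {}
for i, name in enumerate(labels):
    predicted = sum(matrix[r][i] for r in range(3))  # column sum: predicted as this class
    actual = sum(matrix[i])                          # row sum: true members (support)
    precision = matrix[i][i] / predicted
    recall = matrix[i][i] / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2), actual)

for name, row in report.items():
    print(name, row)
print("Accuracy:", accuracy)
```

For example, 23 test points were predicted as Iris-versicolor but only 16 of them truly belong to that class, which gives the precision 16/23 ≈ 0.70, and 53 of the 60 test points lie on the diagonal, which gives the accuracy 0.8833.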
KNN as Regressor
First, start with importing necessary Python packages −
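A possible regressor recipe is sketched below. Everything specific in it is an assumption made for illustration: the bundled iris data is used, the two sepal measurements serve as features, petal length is the target, and n_neighbors is set to 10. A KNN regressor predicts the mean target value of the K nearest training rows instead of taking a majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsRegressor

# Assumption: bundled iris data; features and target are chosen for illustration.
X_all, _ = load_iris(return_X_y=True)
X = X_all[:, :2]   # sepal length and sepal width
y = X_all[:, 2]    # petal length, a continuous target

# Assumption: K = 10.
knnr = KNeighborsRegressor(n_neighbors=10)
knnr.fit(X, y)

# Mean squared error on the same data the model was fitted on.
mse = np.mean((y - knnr.predict(X)) ** 2)
print("The MSE is:", mse)
```

Because the error is measured on the training data itself, it is an optimistic estimate; a held-out test split, as in the classifier recipe, would give a fairer figure.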
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan
approval, that is, whether the individual has characteristics similar to those of
defaulters.
Politics
With the help of KNN algorithms, we can classify a potential voter into various
classes like “Will Vote”, “Will Not Vote”, “Will Vote for Party ‘Congress’”, “Will Vote
for Party ‘BJP’”.
Other areas in which KNN algorithm can be used are Speech Recognition,
Handwriting Detection, Image Recognition and Video Recognition.