Machine Learning
2. Classification Workflow
2.1. Overview
Overview of the Classification Workflow
Handwriting Data
The handwritten letters were stored as individual text files. Each file is comma-delimited
and contains four columns: a timestamp, the horizontal location of the pen, the vertical
location of the pen, and the pressure of the pen. The timestamp is the number of
milliseconds elapsed since the beginning of the data collection. The other variables are in
normalized units (0 to 1). For the pen locations, 0 represents the bottom and left edges of
the writing surface and 1 represents the top and right edges.
You can use the readtable function to import tabular data from a spreadsheet or text file
and store the result as a table.
data = readtable("myfile.xlsx");
This imports the data from the spreadsheet myfile.xlsx and stores it in a table called
data.
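As a rough sketch of the import step, the following writes a tiny comma-delimited file with
the four columns described above and reads it back with readtable. The file name
sample_letter.txt, the column names Time, X, Y, and P, and the values are made up for
illustration; they are not taken from the course data.

```matlab
% Hypothetical sample: 3 pen readings (time in ms, positions and pressure in 0..1)
T = table([0;10;20],[0.10;0.15;0.22],[0.50;0.55;0.61],[0.30;0.40;0.40], ...
    'VariableNames',{'Time','X','Y','P'});
fname = fullfile(tempdir,'sample_letter.txt');
writetable(T,fname)            % text files are written comma-delimited by default
data = readtable(fname);       % import the file as a table
x = data.X;                    % dot notation pulls out one variable
```

The same readtable call applies to a spreadsheet; only the file name changes.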
2.2. Import Data
You can use dot notation to refer to any individual variable within a table.
x = mytable.Xdata;
y = mytable.Ydata;
This extracts the variable Xdata from the table mytable and stores the result in a new
variable called x. Similarly, the variable Ydata is extracted into y.
The default axis limits distort the aspect ratio of the letter. You can use the axis
command to force the axes to preserve the aspect ratio of the data.
Task
Use axis equal to correct the aspect ratio of the plot.
Use the interactive figure tools or the xlim and ylim functions to zoom in on the data points
near the origin. Do you think it is possible to make a reasonably accurate model that uses
these two features to distinguish these three letters?
xlim([0 4])
ylim([0 1.5])
2.5. Build a model
What is a model?
Making Predictions
Having built a model from the data, you can use it to classify new observations. This only
requires calculating the features of the new observations and determining which region of
the predictor space they are in.
Background
The predict function determines the predicted class of new observations.
predClass = predict(model,newdata)
The inputs are the trained model and a table of observations, with the same predictor
variables that were used to train the model. The output is a categorical array of the
predicted class for each observation in newdata.
The file featuredata.mat contains a table testdata that has the same variables as
features. However, the observations in testdata are not included in features.
Note that testdata contains observations for which the correct class is known (stored in
the Character variable). This gives you a way to test your model by comparing the classes
predicted by the model with the true classes. The predict function ignores the Character
variable when making predictions from the model.
Task
Use the predict function with the trained model knnmodel to classify the letters in the
table testdata. Store the predictions in a variable called predictions.
predictions = predict(knnmodel,testdata);
Algorithm Options
By default, fitcknn fits a kNN model with k = 1. That is, the model uses only the single
closest known example to classify a given observation. This makes the model sensitive to
outliers in the training data, such as those highlighted in the image above. New
observations near the outliers are likely to be misclassified.
You can make the model less sensitive to specific observations in the training data by
increasing the value of k (that is, by using the most common class of several neighbors).
Often this improves the model's performance in general. However, how a model performs on
any particular test set depends on the specific observations in that set.
Background
You can specify the value of k in a kNN model by setting the "NumNeighbors" property when
calling fitcknn.
mdl = fitcknn(data,"ResponseVariable", ...
    "NumNeighbors",10);
Task
Repeat the commands from the previous two tasks, but use the "NumNeighbors" option to
change the number of neighbors in the model to 5.
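The effect of k can be seen on a small made-up data set. This sketch assumes Statistics and
Machine Learning Toolbox is available; the single predictor, the class names A and B, and
the values are invented, with one B observation placed inside the A cluster to act as an
outlier.

```matlab
% One predictor; the B observation at 1.05 is an outlier inside the A cluster
X = [0.9; 1.0; 1.1; 1.2; 1.05; 2.9; 3.0; 3.1];
Y = categorical(["A";"A";"A";"A";"B";"B";"B";"B"]);
mdl1 = fitcknn(X,Y);                      % default k = 1
mdl3 = fitcknn(X,Y,"NumNeighbors",3);     % majority vote of 3 neighbors
xnew = 1.04;                              % new observation next to the outlier
p1 = predict(mdl1,xnew);                  % nearest neighbor is the outlier
p3 = predict(mdl3,xnew);                  % neighbors: 1.05 (B), 1.0 (A), 1.1 (A)
```

With k = 1 the outlier alone decides the class; with k = 3 it is outvoted by the two
nearest A observations.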
2.6. Evaluate the Model
How good is the kNN model? The table testdata includes the known class for the test
observations. You can compare the known classes with the kNN model's predictions to see how
well the model performs on new data.
Task
Use the == operator to compare predictions with the known classes (stored in the Character
variable of the table testdata). Store the result in a variable called iscorrect.
iscorrect = predictions == testdata.Character
iscorrect = 1 1 1 1 0 0 1 1 1 1 (logical column vector, shown here as a row)
Task
Calculate the proportion of correct predictions by dividing the number of correct
predictions by the total number of predictions. Store the result in a variable called
accuracy. You can use the sum function to determine the number of correct predictions and
the numel function to determine the total number of predictions.
accuracy = sum(iscorrect)/numel(predictions)
0.800
Rather than accuracy (the proportion of correct predictions), a commonly used metric for
evaluating a model is the misclassification rate (the proportion of incorrect predictions).
Task
Use the ~= operator to determine the misclassification rate. Store the result in a variable
called misclassrate.
iswrong = predictions ~= testdata.Character
iswrong = 0 0 0 0 1 1 0 0 0 0 (logical column vector, shown here as a row)
misclassrate = sum(iswrong)/numel(predictions)
misclassrate = 0.200
It can be useful to investigate the features of commonly confused classes. Try using the
logical array of misclassifications to index into testdata and predictions to obtain the
data for the misclassified observations. Where do these observations live in feature
space? From this, can you tell why these observations were misclassified?
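The indexing suggested above might look like the following sketch. The table contents and
the predicted classes are made up, and the variable names iswrong, missed, and wrongpred are
chosen here for illustration.

```matlab
% Made-up true and predicted classes for five test observations
testdata = table([1.2;0.8;2.5;3.1;2.9],categorical(["J";"J";"M";"V";"V"]), ...
    'VariableNames',{'AspectRatio','Character'});
predictions = categorical(["J";"M";"M";"V";"J"]);
iswrong = predictions ~= testdata.Character;      % logical array of errors
misclassrate = sum(iswrong)/numel(predictions);   % 2 of 5 are wrong
accuracy = 1 - misclassrate;                      % the two rates sum to 1
missed = testdata(iswrong,:);                     % rows that were misclassified
wrongpred = predictions(iswrong);                 % what the model said instead
```

Looking at missed next to wrongpred shows, for each error, the feature values and the class
the model chose.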
2.7. Review
Making a Model for 13 Letters
You now have a simple two-feature model that works well for three specific letters (J, M,
and V). Could this model also work for the whole alphabet? In this interaction, you will
create and test the same kNN model as before, but for 13 letters (half of the English
alphabet).
Background
The MAT-file featuredata13letters.mat contains a table (features) of the same features as
before. However, the data now includes samples of 13 different letters.
Task
Use the gscatter function to plot the observations in features, with aspect ratio on the
horizontal axis and duration on the vertical axis, colored by class (stored in the
Character variable).
The axis limits will not be assessed, but you may want to experiment with the limits to
zoom in on the bulk of the observations.
gscatter(features.AspectRatio,features.Duration,features.Character)
Use the fitcknn function to fit a model to the data. Set the "NumNeighbors" property to 5.
Store the model in a variable called knnmodel. Use the model to predict the classes for the
observations stored in testdata. Store the predictions in a variable called predictions.
knnmodel = fitcknn(features,"Character","NumNeighbors",5)
predictions = predict(knnmodel,testdata)
You can use the read function to import the data from a file in the datastore.
data = read(ds);
Using the read function the first time imports the data from the first file. Using it a
second time imports the data from the second file, and so on.
Task
Import the data from the first file into a table called data.
data = read(letterds)
Task
Visualize the data by plotting the X variable of data on the horizontal axis and the Y
variable on the vertical axis.
plot(data.X,data.Y)
Calling the read function again imports the data from the next file in the datastore.
Task
Import and plot the data from the second file.
data = read(letterds)
plot(data.X,data.Y)
The readall function imports the data from all the files in the datastore into a single
variable.
Task
Use the readall function to import the data from all the files into a table called data.
Visualize the data by plotting Y against X.
Background
To use a function as an input to another function, create a function handle by adding the
@ symbol to the beginning of the function name.
transform(ds,@myfun)
A function handle is a reference to a function. Without the @ symbol, MATLAB interprets the
function name as a call to that function.
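A small illustration of the difference between a handle and a call, using the built-in mean
function in place of a custom preprocessing function (the variable names are chosen here for
illustration):

```matlab
h = @mean;                 % handle to mean; nothing is computed here
m = h([1 2 3 4]);          % calling through the handle gives 2.5
% Passing the handle lets another function decide when to call mean
colmeans = cellfun(@mean,{[1 2 3],[10 20]});   % one mean per cell: [2 15]
isHandle = isa(h,'function_handle');
```

The transform function uses a handle in the same way: it stores the reference and calls the
function later, each time data is read.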
Task
Use the transform function to create a transformed datastore called preprocds. This
datastore should apply the scale function to the data referenced by letterds.
preprocds = transform(letterds,@scale)
The scale function should now be applied automatically whenever data is read from the
preprocds datastore.
Task
Use the readall function to import all the data. Check that the preprocessing function was
applied to each file by plotting the Y variable as a function of Time.
data = readall(preprocds)
plot(data.Time,data.Y)
Normalizing Data
The location of a letter is not important for classifying it. What matters is the shape.
A common preprocessing step for many machine learning problems is to normalize the data.
Typical normalizations include shifting by the mean (so that the mean of the shifted data
is 0) or shifting and scaling the data into a fixed range (such as [-1, 1]). In the case of
the handwritten letters, shifting both the x and y data to have 0 mean ensures that all the
letters are centered around the same point.
Task
Modify the scale function to subtract the mean position from both components:
data.X = data.X - mean(data.X);
data.Y = data.Y - mean(data.Y);
Note that this will introduce a problem that will make the plot appear blank. You will fix
this in the next task.
function data = scale(data)
data.Time = (data.Time - data.Time(1))/1000;
data.X = 1.5*data.X;
data.X = data.X - mean(data.X);
data.Y = data.Y - mean(data.Y);
end
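One plausible cause of the blank plot, offered here as an assumption rather than something
this section confirms: if a signal contains missing samples (NaN), mean returns NaN, and
subtracting NaN turns every value into NaN, which plot cannot draw. The sketch below uses a
made-up vector to show the effect and the "omitnan" flag as one way around it.

```matlab
x = [0.2; 0.4; NaN; 0.6];               % made-up positions with one missing sample
centeredBad = x - mean(x);              % mean(x) is NaN, so every element becomes NaN
centeredOk = x - mean(x,"omitnan");     % mean of 0.2, 0.4, 0.6 is 0.4
allLost = all(isnan(centeredBad));      % true: nothing left to plot
```

With "omitnan", the valid samples are centered on 0 and only the originally missing sample
stays NaN.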
4. Engineering Features
Calculate features from raw signals.
4.1 Types of Signal
Machine learning algorithms need data in a specific format: a certain number of
observations, each with several features. When you build a predictive model, these features
are the predictor variables: the inputs the model uses to determine the output.
4.2 Calculating Summary Statistics
Statistical Functions
Measures of Central Tendency
Function   Description
mean       Arithmetic mean
median     Median (middle) value
mode       Most frequent value
trimmean   Trimmed mean (mean, excluding outliers)
geomean    Geometric mean
harmean    Harmonic mean
Measures of Spread
Function   Description
range      Range of values (largest – smallest)
std        Standard deviation
var        Variance
mad        Mean absolute deviation
iqr        Interquartile range (75th percentile minus 25th percentile)
Measures of Shape
Function   Description
skewness   Skewness (third central moment)
kurtosis   Kurtosis (fourth central moment)
Descriptive Statistics
All of the handwriting samples have been shifted so that they have zero mean in both
horizontal and vertical position. What other statistics could provide information about the
shape of the letters? Different letters will have different distributions of points.
Statistical measures that describe the shape of these distributions could be useful
features.
Background
The MAT-file sampleletters.mat contains tables b1, b2, d1, d2, m1, m2, v1, and v2 that
contain the data for some specific examples selected from the full set of handwriting data.
TASK
Use the range function to calculate the aspect ratio of the letter b1 by dividing the range of
values of Y by the range of values of X. Assign the result to a variable called aratiob.
aratiob = range(b1.Y)/range(b1.X);
aratiob = 2.0952
The letters are preprocessed to have a mean of 0 (in both X and Y). The median is less
sensitive to outliers than the mean. Comparing the mean with the median can give an idea of
how asymmetric a distribution is.
TASK
Use the median function to calculate the median of b1.X and b1.Y. Store the results in
variables called medxb and medyb, respectively. Remember to use the "omitnan" flag.
medxb = median(b1.X,"omitnan")
medyb = median(b1.Y,"omitnan")
medxb = -0.0538
medyb = -0.0336
The spread of the values can be measured with the mean absolute deviation (MAD), the
standard deviation, and the variance. Each of these calculates the average of some measure
of the deviation from the mean.
TASK
Use the mad function to calculate the mean absolute deviation of b1.X and b1.Y. Store the
results in variables called devxb and devyb, respectively. Note that mad ignores NaNs by
default.
devxb = mad(b1.X)
devyb = mad(b1.Y)
devxb = 0.1519
devyb = 0.4195
TASK
Calculate the same statistics for some other sample letters:
The aspect ratio of v1, stored in aratiov
The median of d1.X, stored in medxd
The median of d1.Y, stored in medyd
The mean absolute deviation of m1.X, stored in devxm
The mean absolute deviation of m1.Y, stored in devym
aratiov = range(v1.Y)/range(v1.X)
medxd = median(d1.X,"omitnan")
medyd = median(d1.Y,"omitnan")
devxm = mad(m1.X)
devym = mad(m1.Y)
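To see what these statistics return on a signal small enough to check by hand, here is a
sketch on made-up, mean-centered positions with one missing sample. It assumes Statistics
and Machine Learning Toolbox for range and mad; the values are not from sampleletters.mat.

```matlab
% Made-up, mean-centered pen positions with one missing sample
x = [-0.3; -0.1; NaN; 0.1; 0.3];
y = [-0.6; -0.2; NaN; 0.2; 0.6];
aratio = range(y)/range(x);       % height over width: 1.2/0.6
medx = median(x,"omitnan");       % median needs the flag to skip NaN
devx = mad(x);                    % mean absolute deviation; NaN ignored by default
```

Here the letter is twice as tall as it is wide, the median of x is 0 (a symmetric
distribution), and the mean absolute deviation of x is 0.2.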
Task 2
Local minima and maxima are defined by computing the prominence of each value in the
signal. The prominence is a measure of how a value compares to the other values around it.
You can obtain the prominence value of each point in a signal by obtaining a second output
from islocalmin or islocalmax.
[idx,p] = islocalmin(x);
TASK
Determine the prominence values for islocalmin(m1.X). Store the result in a variable called
prom. Plot the prominence as a function of the Time variable of m1.
plot(m1.Time,m1.X)
hold on
plot(m1.Time(idxmin),m1.X(idxmin),"o")
plot(m1.Time(idxmax),m1.X(idxmax),"s")
hold off
[idx,prom] = islocalmin(m1.X)
plot(m1.Time,prom)
By default, islocalmin and islocalmax find points with any prominence value above 0. This
means that a maximum is defined as any point that is larger than the two values on either
side of it. For noisy signals you might want to consider only minima and maxima that have
a prominence value above a given threshold.
idx = islocalmin(x,"MinProminence",threshvalue)
When choosing a threshold value, note that prominence values can range from 0 to
range(x).
TASK
Recalculate idxmin and idxmax for m1.X using a minimum prominence threshold of 0.1.
Copy the plotting code from task 1 to visualize the result.
idxmin = islocalmin(m1.X,"MinProminence",0.1)
idxmax = islocalmax(m1.X,"MinProminence",0.1)
You can pass idxmin to the nnz or sum functions to count the number of minima. Try
calculating the number of local minima and maxima in different signals. Could this be a
useful feature for distinguishing between letters?
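A sketch of counting minima with and without a prominence threshold, on a made-up signal
with one deep dip and two shallow ones (the signal and the threshold of 0.5 are chosen for
illustration):

```matlab
x = [1 0.9 1 0 1 0.95 1];                     % dips at elements 2, 4, and 6
[idx,prom] = islocalmin(x);                   % by default every dip counts
nAll = nnz(idx);                              % number of local minima found
idxDeep = islocalmin(x,"MinProminence",0.5);  % keep only prominent minima
nDeep = nnz(idxDeep);                         % only the dip at element 4 remains
```

The shallow dips have small prominence values, so the threshold removes them and the count
drops from 3 to 1.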
4.4 Computing Derivatives
Approximating Velocity
An important aspect of detecting letters written on a tablet is that there is useful
information in the rhythm and flow of how the letters are written. To describe the shape of
the signals through time, it can be useful to know the velocity of the pen or,
equivalently, the slope of the graph of position through time.
The raw data recorded from the tablet has only position (not velocity) through time, so
velocity must be calculated from the raw data. With discrete data points, this means
estimating the velocity by using a finite difference approximation:
v = Δx/Δt
Background
The diff function calculates the difference between successive elements of an array. That
is, if y = diff(x), then y(1) = x(2) - x(1), y(2) = x(3) - x(2), and so on. Note that y
will be one element shorter than x.
TASK
Use the diff function to find the differences between the elements of m2.X. Store the result
in a variable called dX. Similarly, find the differences between the elements of m2.Time
and store the result in a variable called dT.
load sampleletters.mat
plot(m2.Time,m2.X)
grid
dX = diff(m2.X);
dT = diff(m2.Time);
Task 2
TASK
Calculate the approximate derivative of m2.X by dividing dX by dT. Remember to use the
array division operator. Store the result in a variable called dXdT.
dXdT = dX./dT
Task 3
Recall that the output from the diff function is one element shorter than the input.
TASK
Plot dXdT as a function of m2.Time, excluding the final value. Recall that you can use the
end keyword to refer to the last element in an array.
plot(m2.Time(1:end-1),dXdT)
Task 4
TASK
Calculate the approximate derivative of m2.Y. Store the result in a variable called dYdT.
Calculate the maximum values of both dXdT and dYdT. Store the results in variables called
maxdx and maxdy, respectively.
You might want to leave off the semicolons so that you can see the values of maxdx and
maxdy.
dY = diff(m2.Y);
dYdT = dY./dT;
maxdx = max(dXdT)
maxdx = 4.2971
maxdy = max(dYdT)
maxdy = Inf
Task 5
Due to limits on the resolution of the data collection procedure, the data contains some
repeated values. If the position and the time are both repeated, then the differences are both
0, resulting in a derivative of 0/0 = NaN. However, if the position values are very slightly
different, then the derivative will be Inf (nonzero divided by 0).
Note that max ignores NaN but not Inf because Inf is larger than any finite value. However,
for this application, both NaN and Inf can be ignored, as they represent repeated data.
You can use the standardizeMissing function to convert a set of values to NaN (or the
appropriate missing value for nonnumeric data types).
xclean = standardizeMissing(x,0);
Here, xclean will be the same as x (including any NaNs), but will have NaN wherever x
had the value 0.
TASK
Use the standardizeMissing function to modify dYdT so that all values of Inf are now NaN.
dYdT = standardizeMissing(dYdT,Inf)
maxdy = max(dYdT)
Try calculating the derivatives of different sample letters. Note that a negative value
divided by zero will result in -Inf. You can pass a vector of values to standardizeMissing to
deal with multiple missing values at once.
xclean = standardizeMissing(x,[-Inf 0 Inf]);
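The whole chain (finite differences, division, cleanup) can be checked on a made-up signal
with repeated time stamps. The values below are invented so that one step gives 0/0 and
another gives a nonzero value divided by 0.

```matlab
t = [0; 1; 2; 2; 3; 3];                       % two repeated time stamps
x = [0; 2; 3; 3; 5; 6];                       % position repeats once, changes once
dxdt = diff(x)./diff(t);                      % [2; 1; NaN; 2; Inf]
dxdt = standardizeMissing(dxdt,[-Inf Inf]);   % treat infinite slopes as missing
maxv = max(dxdt);                             % max skips NaN, so the result is finite
```

After the cleanup, both problem elements are NaN and the maximum velocity is 2.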
4.5 Calculating Correlations
Measuring Similarity
The pair of signals on the left have a significantly different shape to the pair of signals on
the right. However, the relationship between the two signals in each pair is similar in both
cases: in the blue regions, the upper signal is increasing while the lower signal is
decreasing, and vice versa in the yellow regions. Correlation attempts to measure this
similarity, regardless of the shape of the signal.
Background
For the first half of the letter V, the horizontal and vertical positions have a strong negative
linear correlation: when the horizontal position increases, the vertical position decreases
proportionally. Similarly, for the second half, the positions have a strong positive
correlation: when the horizontal position increases, the vertical position also increases
proportionally.
The corr function calculates the linear correlation between variables.
C = corr(x,y);
TASK
Use the corr function to calculate the linear correlation between v2.X and v2.Y. Store the
result in a variable called C.
load sampleletters.mat
plot(v2.X,v2.Y,"o-")
C = corr(v2.X,v2.Y)
C = NaN
Task 2
Because both variables contain missing data, C is NaN. You can use the "Rows" option to
specify how to avoid missing values.
C = corr(x,y,"Rows","complete");
TASK
Recalculate the correlation between v2.X and v2.Y, this time with the "Rows" option set to
"complete". Store the result in C.
C = corr(v2.X,v2.Y,"Rows","complete")
C = 0.6493
Task 3
The correlation coefficient is always between -1 and +1.
A coefficient of -1 indicates a perfect negative linear correlation
A coefficient of +1 indicates a perfect positive linear correlation
A coefficient of 0 indicates no linear correlation
In this case, there is only a moderate correlation because the calculation has been
performed on the entire signal. It may be more informative to consider the two halves of the
signal separately.
TASK
Use concatenation ([ ]) to make a matrix M with four columns: the first half (elements 1 to
11) of v2.X, the first half of v2.Y, the second half (elements 12 to 22) of v2.X, the second
half of v2.Y.
M = [v2.X(1:11) v2.Y(1:11) v2.X(12:22) v2.Y(12:22)]
Task 4
To calculate the correlation between each pair of several variables, you can pass a matrix to
the corr function, where each variable is a column of the matrix.
M = [x y z];
C = corr(M);
TASK
Use the corr function to calculate the correlations between the columns of M. Store the
result in a variable called Cmat. Don't forget to ignore missing values.
Cmat = corr(M,"rows","complete")
The output Cmat is a 4-by-4 matrix of the coefficients of correlation between each pairwise
combination of the columns of M. That is, Cmat(j,k) is the correlation of M(:,j) and M(:,k).
The matrix is symmetric because the correlation between x and y is the same as the
correlation between y and x. The diagonal elements are always equal to 1, because a
variable is always perfectly correlated with itself.
Which variables are highly correlated? Remember that the first two columns of M are the
horizontal and vertical positions of the first half of the signal; the last two columns are the
second half of the signal. Are the correlation coefficients what you expect? Try calculating
the same correlations for v2 and for some other sample letters. Could the correlation
coefficients be useful features to distinguish between letters?
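The behavior described for the letter V can be reproduced on an idealized, made-up V shape
with no missing values. This sketch assumes Statistics and Machine Learning Toolbox for
corr; the signal is not taken from sampleletters.mat.

```matlab
x = (1:10)';                          % horizontal position increases steadily
y = abs(x - 5.5);                     % vertical position goes down, then up
Cwhole = corr(x,y);                   % whole signal: no overall linear trend
M = [x(1:5) y(1:5) x(6:10) y(6:10)];  % the two halves as separate columns
Cmat = corr(M);                       % 4-by-4 matrix of pairwise correlations
Cfirst = Cmat(1,2);                   % first half: perfect negative correlation
Csecond = Cmat(3,4);                  % second half: perfect positive correlation
```

For this ideal shape the whole-signal correlation is 0, while the halves give -1 and +1. A
real sample is less symmetric, which is consistent with the moderate value seen for v2.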
4.6 Automating Feature Extraction: (1/2) Creating a Feature Extraction Function
Custom Preprocessing Functions
Once you have determined the features you want to extract, you will need to apply the
appropriate calculations to every sample in your data set. The first step to automating this
procedure is to make a custom function that takes the data as input and returns an array of
features as output.
Background
Currently the script calculates six features for a given letter (stored in the variable letter).
The six features are stored in six separate variables.
You can use the table function to combine separate variables into a table.
T = table(x,y,z);
TASK
Use the table function to make a table from the features stored in the variables aratio,
numXmin, numYmax, avgdX, avgdY, and corrXY. Store the result in a variable called feat.
load sampleletters.mat
letter = b1;
aratio = range(letter.Y)/range(letter.X)
idxmin = islocalmin(letter.X,"MinProminence",0.1);
numXmin = nnz(idxmin)
idxmax = islocalmax(letter.Y,"MinProminence",0.1);
numYmax = nnz(idxmax)
dT = diff(letter.Time);
dXdT = diff(letter.X)./dT;
dYdT = diff(letter.Y)./dT;
avgdX = mean(dXdT,"omitnan")
avgdY = mean(dYdT,"omitnan")
corrXY = corr(letter.X,letter.Y,"rows","complete")
featurenames = ["AspectRatio","NumMinX","NumMinY","AvgU","AvgV","CorrXY"];
feat = table(aratio,numXmin,numYmax,avgdX,avgdY,corrXY)
Task 2
By default, the table function names the table variables after the workspace variables
passed to it (or Var1, Var2, and so on when an input has no name). To make a table with
more useful names, use the 'VariableNames' option.
T = table(x,y,z,'VariableNames',["X","Y","Z"]);
Typically you can use either single or double quotes to specify option names. However,
because strings can represent data for your table, you need to use single quotes when
specifying the 'VariableNames' option.
TASK
Recreate the table of features, feat, but with the table variable names stored in the array
featurenames.
feat = table(aratio,numXmin,numYmax,avgdX,avgdY,corrXY, ...
    'VariableNames',featurenames)
Task 3
TASK
At the end of the script, add a local function called extract that takes a single variable,
letter, as input and returns a table of features, feat, as output. Copy the code from the
beginning of the script and from task 2 to make the body of the function. Test your function
by calling it with b2 as input. Store the result in a variable called featB2.
function feat = extract(letter)
aratio = range(letter.Y)/range(letter.X);
idxmin = islocalmin(letter.X,"MinProminence",0.1);
numXmin = nnz(idxmin);
idxmax = islocalmax(letter.Y,"MinProminence",0.1);
numYmax = nnz(idxmax);
dT = diff(letter.Time);
dXdT = diff(letter.X)./dT;
dYdT = diff(letter.Y)./dT;
avgdX = mean(dXdT,"omitnan");
avgdY = mean(dYdT,"omitnan");
corrXY = corr(letter.X,letter.Y,"rows","complete");
featurenames = ["AspectRatio","NumMinX","NumMinY","AvgU","AvgV","CorrXY"];
feat = table(aratio,numXmin,numYmax,avgdX,avgdY,corrXY, ...
    'VariableNames',featurenames);
end
featB2 = extract(b2)
You can now call extract on any of the sample letters. Because the resulting tables always
have the same size and variable names, you can vertically concatenate them into one table
of features.
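A minimal sketch of that concatenation, using two hand-made one-row feature tables in place
of the output of extract (the two feature names and the values are picked for illustration):

```matlab
featurenames = ["AspectRatio","CorrXY"];
featA = table(2.10,0.65,'VariableNames',featurenames);   % one row per letter sample
featB = table(1.40,-0.20,'VariableNames',featurenames);
allfeat = [featA; featB];          % stacks rows; works because the names match
```

Each additional sample adds one row, so the result has one observation per letter and one
column per feature.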
(2/2) Extracting Features from Multiple Data Files
Transformed Datastores
To automate your feature extraction, you want your datastore to apply your extraction
function whenever the data is read. As with preprocessing, you can do this with a
transformed datastore.
Task 1
Background
From the raw data, you will typically need to apply both preprocessing and feature
extraction functions. You can apply the transform function repeatedly to add any number of
transformations to the datastore of raw data.
The script currently applies the scale function to the files in the datastore letterds. The
transformed datastore is stored in the variable preprocds.
TASK
Use the transform function to apply the extract function to the datastore preprocds. Store
the result in a variable called featds.
letterds = datastore("*.txt");
preprocds = transform(letterds,@scale)
featds = transform(preprocds,@extract)
Task 2
TASK
Use the readall function to read, preprocess, and extract features from all the data files.
Store the result in a variable called data.
There are 12 files and the extract function calculates six features for each. Hence, data
should be a 12-by-6 table.
Visualize the imported data by making a scatter plot of AspectRatio on the x-axis and
CorrXY on the y-axis.
data = readall(featds)
scatter(data.AspectRatio,data.CorrXY)
Task 3
The letters that the data represents are given in the data file names, which are of the form
usernnn_X_n.txt. Note that the letter name appears between underscore characters (_X_).
You can use the extractBetween function to extract text that occurs between given strings.
extractedtxt = extractBetween(txt,"abc","xyz")
If txt is the string array ["hello abc 123 xyz","abcxyz","xyzabchelloxyzabc"], then
extractedtxt will be [" 123 ","","hello"].
TASK
Use the extractBetween function to obtain the known letter names from the file names by
looking for text between two underscore characters (_). Store the result in a variable called
knownchar. Recall that the file names are stored in the Files property of the datastore
letterds.
knownchar = extractBetween(letterds.Files,"_","_")
Task 4
For classification problems, you typically want to represent the known label as a categorical
variable. You can use the categorical function to convert an array to categorical type.
xcat = categorical(x)
By default, the unique values in x will be used to define the set of categories.
TASK
Use the categorical function to make knownchar categorical.
knownchar = categorical(knownchar)
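The two steps can be tried without a datastore by using a string array of file names. The
names below are hypothetical examples that follow the usernnn_X_n.txt pattern; they stand in
for letterds.Files.

```matlab
% Hypothetical file names following the usernnn_X_n.txt pattern
files = ["user003_B_1.txt"; "user003_M_2.txt"; "user010_B_2.txt"];
knownchar = extractBetween(files,"_","_");    % text between the two underscores
knownchar = categorical(knownchar);           % unique values become the categories
ncat = numel(categories(knownchar));          % B and M
```

Three files give three labels but only two categories, because B appears twice.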
Task 5
It is convenient to have the known classes associated with the training data. Recall that you
can create new variables in a table by assigning to a variable using dot notation.
T.newvar = workspacevar
TASK
Add knownchar to the table data as a new variable called Character.
Use the gscatter function to make a grouped scatter plot of AspectRatio on the x-axis and
CorrXY on the y-axis, grouped by Character.
data.Character = knownchar
gscatter(data.AspectRatio,data.CorrXY,data.Character)
Try modifying extract to change the features being calculated from the data. Check that you
can rerun the script to obtain a new version of the table data.
5 Classification Models
5.1 Training and Testing Data
A simple model is preferable to a complex one, because a complex model can overfit and then
perform somewhat worse when classifying new data.
5.2 Machine Learning Models
Train a model from the data.
"Machine" because a machine (a computer) follows a recipe. "Learning" because the resulting
model depends on the training data used. The computer has learned from the data.
histogram(traindata.Character)
Task 2
A boxplot is a simple way to visualize multiple distributions.
boxplot(x,c)
This creates a plot where the boxes represent the distribution of the values of x for each of
the classes in c. If the values of x are typically significantly different for one class than
another, then x is a feature that can distinguish between those classes. The more features
you have that can distinguish different classes, the more likely you are to be able to build an
accurate classification model from the full data set.
Task
Use the boxplot function to make a boxplot of the values of the MADX feature (mean
absolute deviation of the horizontal position) for each letter. The known letter classes are
stored in the variable called Character.
predLetter = predict(knnmodel,testdata);
Task 2
In this case, the correct classes for the test data are known. They are stored in the Character
variable of the table testdata.
TASK
Use the ~= operator to determine the misclassification rate (the number of incorrect
predictions divided by the total number of predictions). Store the result in a variable called
misclassrate.
misclassrate = sum(predLetter ~= testdata.Character)/numel(predLetter)
Task 3
The response classes are not always equally distributed in either the training or test data.
Loss is a fairer measure of misclassification that incorporates the probability of each class
(based on the distribution in the data).
loss(model,testdata)
TASK
Use the loss function to determine the test data loss for the kNN model knnmodel. Store the
result in a variable called testloss.
testloss = loss(knnmodel,testdata)
You can calculate the loss on any data set where the correct class is known. Try to
determine the loss of the original training data (traindata). This is known as the
resubstitution loss (the loss when the training data is “resubstituted” into the model). You
can calculate resubstitution loss directly with resubLoss(knnmodel).
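The three measures can be compared side by side on a small made-up data set. This sketch
assumes Statistics and Machine Learning Toolbox; the feature values are invented, and the
response variable is named explicitly in the call to loss, which is one of its documented
forms.

```matlab
% Made-up, well-separated training data with two predictors
traindata = table([1.0;1.1;0.9;3.0;3.1;2.9],[0.2;0.3;0.25;0.8;0.9;0.85], ...
    categorical(["J";"J";"J";"M";"M";"M"]), ...
    'VariableNames',{'AspectRatio','Duration','Character'});
% Test set with unequal class counts; the last M lies closer to the J cluster
testdata = table([1.05;2.95;2.0],[0.22;0.88;0.3],categorical(["J";"M";"M"]), ...
    'VariableNames',{'AspectRatio','Duration','Character'});
knnmodel = fitcknn(traindata,"Character");
predLetter = predict(knnmodel,testdata);
misclassrate = sum(predLetter ~= testdata.Character)/numel(predLetter);
testloss = loss(knnmodel,testdata,"Character");   % weighted by class probability
trainloss = resubLoss(knnmodel);                  % loss on the training data
```

With k = 1, every training point is its own nearest neighbor, so the resubstitution loss is
0 even though the test set contains an error. This is one reason to evaluate on separate
test data.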
5.5 Investigating Misclassification: (1/2) Identifying Common Misclassifications
The Confusion Matrix
For any response class X, you can divide a machine learning model's predictions into four
groups:
True positives (green) – predicted to be X and was actually X
True negatives (blue) – predicted to be not X and was actually not X
False positives (yellow) – predicted to be X but was actually not X
False negatives (orange) – predicted to be not X but was actually X
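The four groups can be counted directly with logical operators. This is a small sketch for the class "U" using made-up labels (all values and variable names here are assumptions for illustration).

```matlab
% Hypothetical true and predicted labels for six observations.
trueLetter = categorical(["U";"V";"U";"N";"M";"U"]);
predLetter = categorical(["U";"U";"N";"N";"M";"V"]);

isU = trueLetter == "U";     % actually U
saidU = predLetter == "U";   % predicted to be U

TP = sum(isU & saidU)        % predicted U, actually U
TN = sum(~isU & ~saidU)      % predicted not U, actually not U
FP = sum(~isU & saidU)       % predicted U, actually not U
FN = sum(isU & ~saidU)       % predicted not U, actually U
```

Every observation falls into exactly one group, so the four counts add up to the number of observations.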
Task 1
Background
When making a confusion chart, you can add information about the false negative and false
positive rate for each class by adding row or column summaries, respectively.
confusionchart(...,"RowSummary","row-normalized")
TASK
Recreate the confusion chart with normalized row summary information.
confusionchart(testdata.Character,predLetter,"RowSummary","row-normalized")
False Negatives
The row summary shows the false negative rate for each class. (With 26 letters, you will
need to enlarge the plot to make the values visible.) This shows which letters the kNN
model has the most difficulty identifying (i.e., the letters the model most often thinks are
something else). This model has particular difficulty with the letter U, most often mistaking
it for M, N, or V.
Some confusions seem reasonable, such as U/V or H/N. Others are more surprising, such as
U/K. Having identified misclassifications of interest, you will probably want to look at
some of the specific data samples to understand what is causing the misclassification.
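The false negative rates shown in a row summary can also be computed numerically. The sketch below assumes the Statistics and Machine Learning Toolbox function confusionmat and uses made-up labels. The variable names are assumptions.

```matlab
% Hypothetical true and predicted labels for six observations.
trueLetter = categorical(["U";"V";"U";"N";"M";"U"]);
predLetter = categorical(["U";"U";"N";"N";"M";"V"]);

% Rows of C are true classes, columns are predicted classes.
% order lists the class that corresponds to each row and column.
[C,order] = confusionmat(trueLetter,predLetter);

% False negative rate per class: the fraction of each row that is off the diagonal.
fnr = 1 - diag(C)./sum(C,2);

% Look up the rate for the letter U.
fnrU = fnr(order == "U")
```

In this hypothetical data, two of the three true U samples are predicted as another letter, so fnrU is 2/3.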
Task 2
Background
You can use relational and logical operators (such as ==, ~=, &, and |) to identify
observations to study further.
TASK
Use relational and logical operators to create a logical array called falseneg that identifies
instances of the test data where the letter U was classified as something else. That is,
elements where the true class (testdata.Character) is "U" and the predicted class
(predLetter) was not "U".
falseneg = (testdata.Character == "U")&(predLetter~="U")
Task 3
Recall that the Files property of a datastore contains the file names of the original data.
Hence, when you import the data and extract the features, you can keep a record of which
data file is associated with each observation. The string array testfiles contains the file
names for the test data.
TASK
Use the logical array falseneg as an index into testfiles to determine the file names of the
observations of the letter U that were incorrectly classified as another letter. Store the
result in a variable called fnfiles.
Similarly, use falseneg as an index into predLetter to determine the associated predicted
letters. Store the result in a variable called fnpred.
fnfiles = testfiles(falseneg)
fnpred = predLetter(falseneg)
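To illustrate how one logical array can index several arrays of the same length, here is a sketch with made-up labels and file names (the file names and label values are hypothetical).

```matlab
% Hypothetical labels and matching file names for six test observations.
trueLetter = categorical(["U";"V";"U";"N";"M";"U"]);
predLetter = categorical(["U";"U";"N";"N";"M";"V"]);
testfiles = ["u1.txt";"v1.txt";"u2.txt";"n1.txt";"m1.txt";"u3.txt"];

% True where the letter really is U but the model predicted something else.
falseneg = (trueLetter == "U") & (predLetter ~= "U");

% The same logical index selects the matching file names and predictions.
fnfiles = testfiles(falseneg)
fnpred = predLetter(falseneg)
```

Because the elements of testfiles, trueLetter, and predLetter are in the same order, position k of fnfiles and position k of fnpred describe the same observation.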
Task 4
The fourth element of fnpred is N, which is a common misclassification for the letter U.
What does this particular sample look like?
TASK
Use the readtable function to import the data in the fourth element of fnfiles into a table called badU.
Visualize the letter by plotting Y against X.
badU = readtable(fnfiles(4))
plot(badU.X,badU.Y)
Think about the pen position through time for this U and for a typical N. Is it reasonable
that they could be confused with each other?
Try looking at some of the other misclassifications. You can add a title to the plot to show
what the model predicted:
title("Prediction: "+string(fnpred(4)))
5.5 Investigating Misclassifications: (2/2) Investigating Features
Task 1
Having identified examples of interest, you will typically want to examine the
corresponding features.
TASK
Use logical indexing to extract the training data for just the letters N and U. Store the result
in a table called UorN.
Similarly, extract the test data where the letter U was misclassified (i.e., the false negatives
for U). Store the result in a table called fnU.
idx = (traindata.Character == "N") | (traindata.Character == "U");
UorN = traindata(idx,:)
fnU = testdata(falseneg,:)
Task 2
Categorical variables maintain the full list of possible classes, even when only a subset are
present in the data. When examining a subset, it can be useful to redefine the set of possible
classes to only those that are in the data. The removecats function removes unused
categories.
cmin = removecats(cfull)
TASK
Use the removecats function to remove unused categories from UorN.Character. Assign the
result back to UorN.Character.
UorN.Character = removecats(UorN.Character);
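A small sketch of the effect of removecats, using a made-up categorical array (the variable names letters and subset are assumptions):

```matlab
% Hypothetical categorical array with three categories: N, U, V.
letters = categorical(["N";"U";"V";"U";"N"]);

% Keep only the N and U observations.
idx = (letters == "N") | (letters == "U");
subset = letters(idx);

% The subset still remembers V as a possible category.
before = categories(subset)

% Remove categories that no longer appear in the data.
subset = removecats(subset);
after = categories(subset)
```

Dropping the unused category matters for grouped plots and summaries, which would otherwise reserve space for classes that have no observations.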
Task 3
You can use curly braces ({ }) to extract data from a table into an array of a single type.
datamatrix = datatable{1:10,4:6}
This extracts the first 10 elements of variables 4, 5, and 6. If these variables are numeric,
datamatrix will be a 10-by-3 double array.
TASK
Extract the numeric feature data from UorN and fnU into matrices called UorNfeat and
fnUfeat, respectively.
Note that the last variable in both tables is the response. All other variables are the features.
UorNfeat = UorN{:,1:end-1}
fnUfeat = fnU{:,1:end-1}
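The difference between parentheses and curly braces can be seen on a tiny table. This is a sketch with made-up values. The variable name AspectRatio and all of the numbers are hypothetical.

```matlab
% Hypothetical table: two numeric features followed by the response.
T = table([0.1;0.2;0.3],[1.5;2.5;3.5],categorical(["N";"U";"N"]), ...
    'VariableNames',{'AspectRatio','MADX','Character'});

% Parentheses return a smaller table.
sub = T(1:2,1:2);

% Curly braces return the contents as an array.
% A colon selects every row; 1:end-1 selects every variable except the last.
feat = T{:,1:end-1};

class(sub)    % table
class(feat)   % double
size(feat)    % 3 rows, 2 columns
```

Indexing with a colon and end avoids hard-coding the number of rows or features, so the same line keeps working if the table changes size.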
Task 4
A parallel coordinates plot shows the value of the features (or “coordinates”) for each
observation as a line.
parallelcoords(data)
To compare the feature values of different classes, use the "Group" option.
parallelcoords(data,"Group",classes)
TASK
Use the parallelcoords function to plot the features in the training data (UorNfeat), grouped
by letter (UorN.Character).
parallelcoords(UorNfeat,"Group",UorN.Character)
Task 5
Because a parallel coordinates plot is just a line plot, you can add individual observations
using the regular plot function.
TASK
Use the plot function to add the values of the features for the fourth misclassified U to the
plot as a black line. (The features for the misclassified letters are stored in the matrix
fnUfeat).
hold on
plot(fnUfeat(4,:),"k")
hold off
Use the zoom tool to explore the plot. Note that N and U have similar values for many
features. Are there any features that help distinguish these letters from each other? A kNN
model uses the distance between observations, where the distance is calculated over all the
features. Does this explain why N and U are hard to distinguish, even if there are some
features that separate them?
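One way to see the effect of calculating distance over all the features is a small numeric sketch. The values below are made up: the fourth feature separates the two hypothetical training samples, while the first three do not carry class information. The Euclidean distance used here is an assumption for illustration.

```matlab
% Hypothetical training samples (rows) with four features (columns).
% Only the fourth feature distinguishes N (about 0.9) from U (about 0.1).
train = [0.30 0.70 0.40 0.90;    % an N
         0.80 0.20 0.90 0.10];   % a U
labels = categorical(["N";"U"]);

% A new U: its fourth feature looks like a U, but its other
% features happen to lie close to those of the N sample.
sample = [0.32 0.68 0.42 0.15];

% Euclidean distance from the sample to each training sample, over all features.
d = sqrt(sum((train - sample).^2,2))

% The nearest neighbor determines the prediction of a 1-nearest-neighbor model.
[~,nearest] = min(d);
predicted = labels(nearest)
```

The three uninformative features together outweigh the one informative feature, so the nearest neighbor is the N sample even though the fourth feature clearly indicates U.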
When plotting multiple observations by groups, it can be helpful to view the median and a
range for each group, rather than every individual observation. You can use the "Quantile"
option to do this.
parallelcoords(...,"Quantile",0.2)
6. Conclusion
Learn next steps and give feedback on the course.
6.1 Additional Resources: (1/2) More machine learning applications
If the output you are trying to predict is a numeric value, such as the price of a house, you
are using regression rather than classification.
Supervised Learning: Trains models using examples for which the correct output is
known.
Deep Learning: A specific machine learning technique that uses neural networks to
extract features and make predictions.
Unsupervised Learning: Used when you want to see whether there is any structure or pattern in the data.
Reinforcement Learning: You define a reward and let the machine try different
strategies to see how much reward it obtains.