Introduction to SharpLearning
This guide is an introduction on how to use SharpLearning for learning, evaluating and saving a machine learning model.
This guide will use the wine quality data set, which is also included in SharpLearning.Examples. The dataset can be used for both classification and regression. In this case, we will use the data to create a regression model for scoring the quality of white wine. The full code examples from this guide can be found in SharpLearning.Examples.
The guide will cover the following topics:
- Importing/reading data from csv.
- Splitting data into training/test for evaluating models.
- Learning a machine learning model.
- Using variable importance to gain insights about the model and data.
- Saving/loading the model for use in another application.
Notation
- Learner - Machine learning algorithm.
- Model - Machine learning model.
- Hyperparameters - The parameters used to regulate the complexity of a machine learning model.
- Target(s) - The value(s) we are trying to model, also known as the dependent variable. In some libraries this is called (y).
- Observation(s) - Feature matrix, also known as the independent variables, contains all the information we have to describe the targets. In some libraries this is called (x).
In SharpLearning, csv data can be read using the CsvParser located in the namespace SharpLearning.InputOutput.Csv. Below, the CsvParser is created using a StreamReader to read from the filesystem.
// Setup the CsvParser
var parser = new CsvParser(() => new StreamReader("winequality-white.csv"), separator: ';');
// the column name in the wine quality data set we want to model.
var targetName = "quality";
// read the "quality" column, this is the targets for our learner.
var targets = parser.EnumerateRows(targetName)
.ToF64Vector();
// read the feature matrix, all columns except "quality",
// this is the observations for our learner.
var observations = parser.EnumerateRows(c => c != targetName)
.ToF64Matrix();
The methods ToF64Vector and ToF64Matrix convert from CsvRows to double format. ToF64Vector returns a double[] and ToF64Matrix returns an F64Matrix. There are corresponding methods to convert to string[] and StringMatrix in case further transformations have to be done before converting to double format.
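A typical reason for working in string format first is a column containing values that cannot be parsed directly as numbers. The sketch below is self-contained and does not use SharpLearning; the "NA" placeholder and the replacement value 0.0 are made up for illustration:

```csharp
using System;
using System.Globalization;
using System.Linq;

class StringCleanupSketch
{
    static void Main()
    {
        // Hypothetical raw column values, as they would look in string format.
        var raw = new[] { "6.3", "5.1", "NA" };

        // Replace the placeholder before parsing to double,
        // using the invariant culture to get '.' as decimal separator.
        var parsed = raw
            .Select(s => s == "NA" ? 0.0 : double.Parse(s, CultureInfo.InvariantCulture))
            .ToArray();

        Console.WriteLine(parsed.Length); // 3
    }
}
```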
In SharpLearning, splitting data into training/test is done using the TrainingTestIndexSplitters. There are various versions of these, corresponding to how the data should be distributed between the training and test set:
- NoShuffleTrainingTestIndexSplitter - Keeps the data in the original order before splitting.
- RandomTrainingTestIndexSplitter - Randomly shuffles the data before splitting. Usually used for regression.
- StratifiedTrainingTestIndexSplitter - Ensures that the distribution of unique target values is similar between training and test set. Usually used for classification.
Since we want to learn a regression model from the wine quality data set, we will be using the RandomTrainingTestIndexSplitter. Here we specify that we are going to use 70% of the data for the training set, which leaves 30% of the data for the test set.
// 30% of the data is used for the test set.
var splitter = new RandomTrainingTestIndexSplitter<double>(trainingPercentage: 0.7, seed: 24);
var trainingTestSplit = splitter.SplitSet(observations, targets);
var trainSet = trainingTestSplit.TrainingSet;
var testSet = trainingTestSplit.TestSet;
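Conceptually, the random splitter shuffles the row indices and assigns the first 70% to the training set. The following is a minimal sketch of that idea, not SharpLearning's actual implementation:

```csharp
using System;
using System.Linq;

class SplitSketch
{
    static void Main()
    {
        // Shuffle the indices 0..9 with a fixed seed for reproducibility.
        var random = new Random(24);
        var indices = Enumerable.Range(0, 10)
            .OrderBy(_ => random.Next())
            .ToArray();

        // First 70% of the shuffled indices for training, the rest for test.
        var trainingCount = (int)(indices.Length * 0.7);
        var trainingIndices = indices.Take(trainingCount).ToArray();
        var testIndices = indices.Skip(trainingCount).ToArray();

        Console.WriteLine($"{trainingIndices.Length} training, {testIndices.Length} test"); // 7 training, 3 test
    }
}
```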
Now that we have read the data and have a training and test set available, we can create a machine learning model and measure how well it performs on the test set. The test set error is our estimate of how well the model generalizes to new data.
Before we can evaluate the model, we have to decide how we want to measure the performance of the model. In SharpLearning, there are several different metrics available in SharpLearning.Metrics. A standard metric for evaluating a regression model is the mean square error. Since we are creating a regression model, this is the metric we are going to use.
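To make the metric concrete: the mean squared error is the average of the squared differences between targets and predictions. A small self-contained sketch with made-up numbers, not using SharpLearning:

```csharp
using System;
using System.Linq;

class MseSketch
{
    static void Main()
    {
        // Made-up targets and predictions, just to show what the metric computes.
        var targets = new[] { 6.0, 5.0, 7.0 };
        var predictions = new[] { 5.5, 5.0, 6.0 };

        // Mean squared error: the average of the squared differences.
        var mse = targets.Zip(predictions, (t, p) => (t - p) * (t - p))
            .Average();

        Console.WriteLine(mse); // (0.25 + 0.0 + 1.0) / 3 = 0.41666...
    }
}
```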
For this problem we are going to use a RegressionRandomForestLearner with 100 trees. There are many hyperparameters to adjust on a RandomForest, but the default parameters usually provide good results and do not require further tuning:
// Create the learner and learn the model.
var learner = new RegressionRandomForestLearner(trees: 100);
var model = learner.Learn(trainSet.Observations, trainSet.Targets);
// predict the training and test set.
var trainPredictions = model.Predict(trainSet.Observations);
var testPredictions = model.Predict(testSet.Observations);
// create the metric
var metric = new MeanSquaredErrorRegressionMetric();
// measure the error on training and test set.
var trainError = metric.Error(trainSet.Targets, trainPredictions);
var testError = metric.Error(testSet.Targets, testPredictions);
We measure the error on both the training and test set:
Algorithm | Train Error | Test Error |
---|---|---|
RegressionRandomForestLearner(trees: 100) | 0.0518 | 0.4037 |
As can be seen, the training error is a lot lower than the test error. Since the test error is our estimate of how well the model generalizes to new data, this is the measure to use when reporting the performance of the model.
A RandomForest is a large and complex model consisting of many decision trees. However, it is still possible to get insights from the model using variable importance. Variable importance describes the relative importance of the individual features used in the model. This provides information about which features in the data set are most important according to the model.
In SharpLearning, most models are able to provide variable importances:
// the variable importance requires the featureNameToIndex
// from the data set. This mapping describes the relation
// from column name to index in the feature matrix.
var featureNameToIndex = parser.EnumerateRows(c => c != targetName)
.First().ColumnNameToIndex;
// Get the variable importance from the model.
var importances = model.GetVariableImportance(featureNameToIndex);
Below, the variable importances from the random forest model can be seen:
FeatureName | Importance |
---|---|
alcohol | 100.00 |
density | 53.52 |
chlorides | 31.37 |
volatile acidity | 25.55 |
free sulfur dioxide | 18.13 |
total sulfur dioxide | 13.76 |
citric acid | 11.16 |
residual sugar | 5.93 |
pH | 4.98 |
fixed acidity | 3.58 |
sulphates | 2.49 |
According to the RegressionForestModel we just learned, "alcohol", "density" and "chlorides" are the most important features when predicting the quality of a white wine.
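The importances in the table appear to be relative, scaled so that the most important feature gets the value 100. A sketch of that scaling with made-up raw values (an assumption about the table's scaling, not SharpLearning's internals):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ImportanceScalingSketch
{
    static void Main()
    {
        // Made-up raw importance values, just to illustrate the scaling.
        var raw = new Dictionary<string, double>
        {
            ["alcohol"] = 0.45,
            ["density"] = 0.24,
            ["chlorides"] = 0.14,
        };

        // Scale so the largest value becomes 100.
        var max = raw.Values.Max();
        var scaled = raw.ToDictionary(kv => kv.Key, kv => kv.Value / max * 100.0);

        Console.WriteLine(scaled["alcohol"]); // 100
    }
}
```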
In SharpLearning, all models have a Save and a Load method. These methods can be used to save/load the models for use in another application.
Saving the RegressionForestModel we just learned can be done easily as shown below:
// default format is xml.
model.Save(() => new StreamWriter(@"C:\randomforest.xml"));
Loading the model again can be done using the static Load method on the RegressionForestModel:
// default format is xml.
var loadedModel = RegressionForestModel.Load(() => new StreamReader(@"C:\randomforest.xml"));
The static Save/Load methods use xml to store the models.
In SharpLearning, it is also possible to save models using one of the serializers provided in SharpLearning.InputOutput.Serialization. Using a serializer makes it possible to choose between xml and binary serialization. It also provides the option of serializing a model as the IPredictorModel interface, which makes it easier to replace the model type later, for instance changing from a RegressionForestModel to a NeuralNetModel.
Below, the RegressionForestModel is saved and loaded using the GenericXmlDataContractSerializer. The model is serialized/deserialized as an IPredictorModel, which enables using the same code for other model types:
//Save/load model as xml, in the file system use new StreamWriter(filePath);
var xmlSerializer = new GenericXmlDataContractSerializer();
xmlSerializer.Serialize<IPredictorModel<double>>(model,
() => new StreamWriter(@"C:\randomforest.xml"));
var loadedModelXml = xmlSerializer
.Deserialize<IPredictorModel<double>>(() => new StreamReader(@"C:\randomforest.xml"));
The same can be done using the GenericBinarySerializer, again serializing/deserializing the model as an IPredictorModel:
//Save/load model as Base64 binary, in the file system use new StreamWriter(filePath);
var binarySerializer = new GenericBinarySerializer();
binarySerializer.Serialize<IPredictorModel<double>>(model,
() => new StreamWriter(@"C:\randomforest.bin"));
var loadedModelBinary = binarySerializer
.Deserialize<IPredictorModel<double>>(() => new StreamReader(@"C:\randomforest.bin"));
The GenericXmlDataContractSerializer does quicker serialization/deserialization, but the serialized files will be larger compared to those produced by the GenericBinarySerializer.
In this introduction, we learned some of the basic concepts of SharpLearning:
- Importing/reading data from csv.
- Splitting data into training/test for evaluating models.
- Learning a machine learning model.
- Using variable importance to gain insights about the model and data.
- Saving/loading the model for use in another application.