Imagine there’s a sequence of operations you need to perform on a dataset, and this dataset is very large. There is absolutely no way the entire dataset could fit in memory. Instead you need to store the dataset on disk. In fact each intermediate calculation step needs to be persisted to disk, and read back to perform the next step in the sequence of operations. Of course all of this reading, processing and writing needs to be parallelised to be as fast as possible. Data should be stored in shards. Multiple shards could be read from or written to simultaneously, and disk IO and CPU processing should be done in separate threads. All in all it gets fairly complicated…
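To make the moving parts concrete, here is a minimal sketch of the idea using only the JDK. It is not how BatchNode is implemented and does not use the ojAlgo API; the class name ShardedStoreSketch, the shard file naming and the round-robin distribution are all assumptions made for illustration. Values are spread over several shard files on disk, and the shards are then read back concurrently, one task per shard.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: plain JDK, not the ojAlgo BatchNode API.
final class ShardedStoreSketch {

    /** Writes the values round-robin into shardCount files under dir. */
    static List<Path> writeShards(Path dir, int shardCount, int[] values) {
        List<Path> shards = new ArrayList<>();
        List<DataOutputStream> outs = new ArrayList<>();
        try {
            Files.createDirectories(dir);
            for (int s = 0; s < shardCount; s++) {
                Path shard = dir.resolve("shard-" + s + ".bin");
                shards.add(shard);
                outs.add(new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(shard))));
            }
            for (int i = 0; i < values.length; i++) {
                outs.get(i % shardCount).writeInt(values[i]);
            }
            for (DataOutputStream out : outs) {
                out.close(); // flushes the buffer and releases the file
            }
            return shards;
        } catch (IOException cause) {
            throw new UncheckedIOException(cause);
        }
    }

    /** Reads every shard in its own task and adds up the partial sums. */
    static long sumAllShards(List<Path> shards, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (Path shard : shards) {
                Callable<Long> task = () -> sumShard(shard);
                partials.add(pool.submit(task));
            }
            long total = 0L;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException cause) {
            throw new IllegalStateException(cause);
        } finally {
            pool.shutdown();
        }
    }

    /** Sums the ints stored in one shard file. */
    private static long sumShard(Path shard) throws IOException {
        long count = Files.size(shard) / Integer.BYTES;
        long sum = 0L;
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(Files.newInputStream(shard)))) {
            for (long i = 0; i < count; i++) {
                sum += in.readInt();
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        int[] values = new int[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i + 1;
        }
        Path dir = Files.createTempDirectory("shards");
        List<Path> shards = writeShards(dir, 4, values);
        System.out.println(sumAllShards(shards, 4)); // 500500
    }
}
```

Even this toy version needs explicit thread pool handling, stream lifecycle management and exception plumbing, and it still does not separate disk IO from CPU work or chain several processing steps.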
BatchNode encapsulates and simplifies much of the complexity in doing all that. Specifically, a BatchNode represents (temporary) state that needs to be persisted before it’s possible to continue with the next step. Most likely there will be a series of such nodes created before the calculations are done. For each node the key thing you need to specify is what to persist – how to (de)serialise or (un)marshal the data. Then there’s a set of configurations to tweak for best performance, but all of those have workable defaults.
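As an illustration of what "how to (de)serialise" can mean in practice, the sketch below writes a labelled image to a binary stream and reads it back using the JDK's DataOutput and DataInput. The LabelledImage type and the LabelledImageCodec class are hypothetical names invented for this example; the interface ojAlgo actually expects for this purpose may look different.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only: not the ojAlgo API.
final class LabelledImageCodec {

    /** Hypothetical item type: a class label plus the raw pixel bytes. */
    static final class LabelledImage {
        final int label;
        final byte[] pixels;

        LabelledImage(int label, byte[] pixels) {
            this.label = label;
            this.pixels = pixels;
        }
    }

    /** Writes label, pixel count and pixels, in that order. */
    static void serialise(LabelledImage item, DataOutput out) throws IOException {
        out.writeInt(item.label);
        out.writeInt(item.pixels.length);
        out.write(item.pixels);
    }

    /** Reads the fields back in exactly the order they were written. */
    static LabelledImage deserialise(DataInput in) throws IOException {
        int label = in.readInt();
        byte[] pixels = new byte[in.readInt()];
        in.readFully(pixels);
        return new LabelledImage(label, pixels);
    }

    /** Serialises to an in-memory buffer and deserialises again. */
    static LabelledImage roundTrip(LabelledImage item) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                serialise(item, out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return deserialise(in);
            }
        } catch (IOException cause) {
            throw new UncheckedIOException(cause);
        }
    }

    public static void main(String[] args) {
        LabelledImage copy = roundTrip(new LabelledImage(9, new byte[28 * 28]));
        System.out.println(copy.label + " / " + copy.pixels.length); // 9 / 784
    }
}
```

The only real contract is that reading consumes the fields in the same order and with the same types as writing produced them.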
It is probably best to just look at the example code below. In the example, BatchNode is used to preprocess data and train a neural network. Other examples have shown how to do this without the use of BatchNode, and the size of the dataset is certainly not big enough to warrant its use. The example uses it anyway, only to demonstrate the basics of how to use it.
Example Code
Console Output
class FashionMNISTWithBatchNode
ojAlgo
2022-05-12

Parsed IDX training data files: 1.333656766s
Initial training data: 2.077556784s
Scaled training data: 2.890675091s
Duplicated training data: 78.529841326s
Randomised training data: 124.459885072s
There are 768000 T-shirt/top instances in the scaled/duplicated/randomised traing set.
There are 768000 Trouser instances in the scaled/duplicated/randomised traing set.
There are 768000 Pullover instances in the scaled/duplicated/randomised traing set.
There are 768000 Dress instances in the scaled/duplicated/randomised traing set.
There are 768000 Coat instances in the scaled/duplicated/randomised traing set.
There are 768000 Sandal instances in the scaled/duplicated/randomised traing set.
There are 768000 Shirt instances in the scaled/duplicated/randomised traing set.
There are 768000 Sneaker instances in the scaled/duplicated/randomised traing set.
There are 768000 Bag instances in the scaled/duplicated/randomised traing set.
There are 768000 Ankle boot instances in the scaled/duplicated/randomised traing set.
Sample set Size=10, Mean=768000.0, Var=0.0, StdDev=0.0, Min=768000.0, Max=768000.0
Training data verified: 147.036370166s
Done 1000000 training iterations: 248.466925026s
Done 2000000 training iterations: 350.38887831s
Done 3000000 training iterations: 457.541165275s
Done 4000000 training iterations: 569.745276533s
Done 5000000 training iterations: 692.703884745s
Done 6000000 training iterations: 811.113295485s
Done 7000000 training iterations: 929.132814585s
Training done: 1017.177961434s
Parsed IDX test data files: 1020.346941825s
Image 0: Ankle boot <=> Ankle boot
[ASCII-art rendering of the 28×28 image]
Image 1: Pullover <=> Pullover
[ASCII-art rendering of the 28×28 image]
Image 2: Trouser <=> Trouser
[ASCII-art rendering of the 28×28 image]
Image 3: Trouser <=> Trouser
[ASCII-art rendering of the 28×28 image]
Image 4: Shirt <=> Shirt
[ASCII-art rendering of the 28×28 image]
Image 5: Trouser <=> Trouser
[ASCII-art rendering of the 28×28 image]
Image 6: Coat <=> Coat
[ASCII-art rendering of the 28×28 image]
Image 7: Shirt <=> Shirt
[ASCII-art rendering of the 28×28 image]
Image 8: Sandal <=> Sandal
[ASCII-art rendering of the 28×28 image]
Image 9: Sneaker <=> Sneaker
[ASCII-art rendering of the 28×28 image]
Done: 1022.108845606s or 17.035148605216666min
===========================================================
Error rate: 0.1089
The Fashion MNIST dataset
https://github.com/zalandoresearch/fashion-mnist
The Fashion MNIST dataset is a drop-in alternative to the original MNIST dataset. The idea is to enable testing a model developed for the MNIST dataset on something harder. The images are still grayscale 28×28 pixels, but instead of handwritten digits the images show “photographs” of clothing and accessories. Just as before there are 10 categories of “fashion”. Here are some example images:
On the Fashion-MNIST GitHub page they publish benchmark results comparing 129 classifiers on both the original MNIST and the Fashion-MNIST datasets. The best accuracy reported there is 97.8% on the original dataset and 89.7% on the fashion dataset. The ojAlgo neural network got 98.1% and 89.1% respectively. (Results are not exactly the same each time you train the network, but those are the numbers published in the blog posts.)
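The accuracy figure follows directly from the error rate printed at the end of the console output: accuracy is one minus the error rate. A small sketch of that arithmetic, with AccuracySketch being a name made up for this example:

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class AccuracySketch {

    /** Accuracy is the share of test images that were classified correctly. */
    static double accuracy(double errorRate) {
        return 1.0 - errorRate;
    }

    /** Formats a fraction as a percentage with one decimal. */
    static String asPercent(double fraction) {
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * fraction);
    }

    public static void main(String[] args) {
        System.out.println(asPercent(accuracy(0.1089))); // 89.1%
    }
}
```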
The neural networks used in the ojAlgo examples are very simple – just 1 hidden layer with 200 nodes – and in this case the network was only trained for 17 minutes (on my laptop).
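To give a feel for how small that network is, the sketch below counts its trainable parameters. It assumes fully connected layers with one bias per node, 28×28 = 784 inputs (one per pixel) and 10 outputs (one per category); the bias assumption and the NetworkSizeSketch class are mine, not taken from the example code.

```java
// Hypothetical helper, for illustration only.
final class NetworkSizeSketch {

    /** Weights plus one bias per output node of a fully connected layer. */
    static int denseLayerParameters(int inputs, int outputs) {
        return inputs * outputs + outputs;
    }

    /** Sums the parameters of all consecutive layer pairs. */
    static int totalParameters(int... layerSizes) {
        int total = 0;
        for (int i = 1; i < layerSizes.length; i++) {
            total += denseLayerParameters(layerSizes[i - 1], layerSizes[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        // 784 inputs -> 200 hidden nodes -> 10 outputs
        System.out.println(totalParameters(28 * 28, 200, 10)); // 159010
    }
}
```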