wasshuber

wasshuber

Programming Machine Learning: a tip and a gotcha

Tip: If you are on a slow or old machine like me, or if you want to run many different examples to explore the design space you can speed up the calculations by removing a border from the MNIST image data. Every image has a 1-pixel white border. Removing this border reduces the number of input variables by 108 or more than 13%. In fact, you can drop even a 3-pixel border without any impact that I can notice. Dropping more is also possible, but then the expected max accuracy will also start to drop. But it is quite remarkable that even using only the innermost 8x8 image fragment one can easily get above 80% accuracy.

Gotcha: I have run the one hidden layer with 100 nodes scenario with the original test set of 10,000 examples. I did not split it into the 5,000 for validation and 5,000 for testing. I was surprised that the maximum accuracy I could achieve was only 97.8%, not the 98.6% stated in the book. However, this is purely an effect of the training set. When I did the splitting into validation and testing set with 5,000 for testing I got the 98.6% accuracy with the same network weights. This was surprising to me, that there is that big a change in accuracy due to the size of the test set.

Most Liked

wasshuber

wasshuber

Another tip that seems to be helping speed up training: I do a batch-size ramp. I start with batches of about 2-3 times the class size (for MNIST class size is 10). For example, I start with batch size of 20. I double the batch size with each epoch until I reach the final batch size of my choice and then continue with this batch size until the end.

The advantage here is that at the beginning when the weights are far away from their optimum, it is not necessary to have a particularly good estimator for the gradient, thus small batch sizes are fine and faster. But as we are approaching the optimum larger batch sizes are helpful to get an accurate gradient.

This reduces the importance of setting a proper batch size. One can take a larger batch size without negatively impacting the final accuracy of the model. Large batch size can sometimes mean that one gets stuck in a local minimum and the final accuracy of the model suffers. Ramping the batch size combines the advantages of small and large batch sizes.

wasshuber

wasshuber

If you like numerical issues then I will describe a problem I chassed for 3 days. During implementing dropout regularization I encountered an issue with the implementation of softmax that cost me three days delay. In your book the implementation of softmax is fine but basic. Meaning it does not protect against over- or underflow issues with the exponentials. What some do, for example, is to subtract the maximum value first before the exponential is applied. Mathematically this is equivalent because it is simply a multiplication of a constant factor of the numerator and denominator in the softmax formula. Nothing changes. Online I even found Python code for it that was something like

e = np.exp(x - np.max(x))

The problem with this code is subtle but numerically it is stupid. What happens is the following. np.max(x) returns the maximum from the entire matrix, meaning the maximum in the entire mini-batch. But we only need the maximum for each input (image) and not across several inputs. Numerically this causes problems because in some cases it can push the argument of the exponential so far to negative values that they all underflow and all exponentials return zero. The solution for this is to implement it such that the maximum subtracted is only the row maximum not the maximum across the entire mini-batch. Something like

e = np.exp(x - np.max(x,axis=1).reshape(-1,1))

This numerical issue manifested itself in the following way. Initially, the network was training perfectly fine. It reached about the accuracy it should reach. Then the accuracy started to drop, first slowly but then very quickly, and over the course of a few epochs the entire network blew up with all weights increasing until everything was saturated. Nothing could stop it. I tried clipping the gradients and limiting the weights norms, etc. The issue was the above-mentioned bad implementation of the softmax function.

wasshuber

wasshuber

I discovered this myself by experimenting with all kinds of activation functions. It was easy to change the code from sigmoid to other activation functions and I was curious about what changes if I used different functions. I tried some really weird ones, too.

This is why I choose your path of coding it myself because then it is much easier to change the things I wanted to change. With a library, one is in a straight-jacket and one can only change what the library allows you to change.

What made me analyze it more carefully was the fact that this shifted ReLU learned better in combination with dropout. So I tried to see why and noticed that the magnitude of the weights going from layer to layer stayed about the same when with ReLU they keep growing. I don’t have any good explanation for why this is better except that if there is a sort of additional bias the weights have to learn (their magnitude increases with deeper layers) then this will take longer in the learning process than if they do not have to learn this bias.

Then again, this is such a simple modification that I would be surprised if nobody has tried this before and noted the improvement. Searching online I do see shifted ReLUs being mentioned in lists of activation functions, but I have not found anything that mentions the improvement to learning they achieve and how this may be connected to the weight magnitude staying the same. We should also not forget that I only applied this to the MNIST data set. I don’t know if my observations hold in general.

Where Next?

Popular Pragmatic Bookshelf topics Top

New
iPaul
page 37 ANTLRInputStream input = new ANTLRInputStream(is); as of ANTLR 4 .8 should be: CharStream stream = CharStreams.fromStream(i...
New
GilWright
Working through the steps (checking that the Info,plist matches exactly), run the demo game and what appears is grey but does not fill th...
New
yulkin
your book suggests to use Image.toByteData() to convert image to bytes, however I get the following error: "the getter ‘toByteData’ isn’t...
New
jdufour
Hello! On page xix of the preface, it says there is a community forum "… for help if your’re stuck on one of the exercises in this book… ...
New
AndyDavis3416
@noelrappin Running the webpack dev server, I receive the following warning: ERROR in tsconfig.json TS18003: No inputs were found in c...
New
fynn
This is as much a suggestion as a question, as a note for others. Locally the SGP30 wasn’t available, so I ordered a SGP40. On page 53, ...
New
Charles
In general, the book isn’t yet updated for Phoenix version 1.6. On page 18 of the book, the authors indicate that an auto generated of ro...
New
brunogirin
When running tox for the first time, I got the following error: ERROR: InterpreterNotFound: python3.10 I realised that I was running ...
New
AufHe
I’m a newbie to Rails 7 and have hit an issue with the bin/Dev script mentioned on pages 112-113. Iteration A1 - Seeing the list of prod...
New

Other popular topics Top

New
DevotionGeo
I know that these benchmarks might not be the exact picture of real-world scenario, but still I expect a Rust web framework performing a ...
New
AstonJ
I ended up cancelling my Moonlander order as I think it’s just going to be a bit too bulky for me. I think the Planck and the Preonic (o...
New
AstonJ
I have seen the keycaps I want - they are due for a group-buy this week but won’t be delivered until October next year!!! :rofl: The Ser...
New
PragmaticBookshelf
Learn different ways of writing concurrent code in Elixir and increase your application's performance, without sacrificing scalability or...
New
foxtrottwist
A few weeks ago I started using Warp a terminal written in rust. Though in it’s current state of development there are a few caveats (tab...
New
First poster: joeb
The File System Access API with Origin Private File System. WebKit supports new API that makes it possible for web apps to create, open,...
New
New
First poster: bot
zig/http.zig at 7cf2cbb33ef34c1d211135f56d30fe23b6cacd42 · ziglang/zig. General-purpose programming language and toolchain for maintaini...
New
RobertRichards
Hair Salon Games for Girls Fun Girls Hair Saloon game is mainly developed for kids. This game allows users to select virtual avatars to ...
New

Sub Categories: