wasshuber

Programming Machine Learning: a tip and a gotcha

Tip: If you are on a slow or old machine like me, or if you want to run many different examples to explore the design space, you can speed up the calculations by removing a border from the MNIST image data. Every image has a 1-pixel white border. Removing this border shrinks each image from 28x28 to 26x26 and reduces the number of input variables by 108 (from 784 to 676), or more than 13%. In fact, you can drop even a 3-pixel border without any impact that I can notice. Dropping more is also possible, but then the maximum achievable accuracy starts to drop. Still, it is quite remarkable that even with only the innermost 8x8 image fragment one can easily get above 80% accuracy.
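The cropping can be sketched in a few lines of NumPy. This is only an illustration, assuming the images are stored as one flattened 28x28 row per example; the function name `crop_border` and the random stand-in data are made up for the example. If your loader adds a bias column, crop before that step.

```python
import numpy as np

def crop_border(images, border, side=28):
    """Drop `border` pixels from every edge of flattened square images.

    images: array of shape (n_images, side * side), one flattened image per row.
    Returns an array of shape (n_images, (side - 2 * border) ** 2).
    """
    n = images.shape[0]
    square = images.reshape(n, side, side)
    cropped = square[:, border:side - border, border:side - border]
    return cropped.reshape(n, -1)

# Stand-in data with the MNIST layout (random values, not real digits).
X = np.random.rand(5, 28 * 28)

X1 = crop_border(X, 1)    # 26 x 26 = 676 inputs, 108 fewer than 784
X3 = crop_border(X, 3)    # 22 x 22 = 484 inputs
X10 = crop_border(X, 10)  # innermost 8 x 8 = 64 inputs
print(X1.shape, X3.shape, X10.shape)
```

The same crop has to be applied to the training, validation, and test images, and the size of the first weight matrix has to match the new number of inputs.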

Gotcha: I ran the scenario with one hidden layer of 100 nodes against the original test set of 10,000 examples, without splitting it into 5,000 for validation and 5,000 for testing. I was surprised that the maximum accuracy I could achieve was only 97.8%, not the 98.6% stated in the book. However, this is purely an effect of the test set. When I split it into validation and test sets, with 5,000 examples for testing, I got 98.6% accuracy with the same network weights. It surprised me that the size of the test set makes that big a difference in accuracy.
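A bit of arithmetic on these numbers, as a rough sketch: since the 10,000 examples are the two 5,000-example halves taken together, the accuracy on the full set is the average of the accuracies on the halves. With the same weights, the two reported figures then imply what the other half must have scored. The function name is made up for the example, and the reported accuracies are rounded, so the result is approximate.

```python
def implied_other_half(full_acc, half_acc, n_full=10_000, n_half=5_000):
    """Accuracy on the remaining examples, given full-set and one-half accuracy."""
    correct_full = round(full_acc * n_full)   # correct answers on all examples
    correct_half = round(half_acc * n_half)   # correct answers on the known half
    return (correct_full - correct_half) / (n_full - n_half)

other = implied_other_half(0.978, 0.986)
print(f"{other:.1%}")  # prints 97.0%
```

So the half that is not used for testing would have scored roughly 97.0%, which suggests that the two halves differ in difficulty and that the gap comes from which examples end up in the test set.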

wasshuber

Another tip that seems to help speed up training: I do a batch-size ramp. I start with batches of about 2-3 times the number of classes (for MNIST there are 10 classes). For example, I start with a batch size of 20. I double the batch size with each epoch until I reach the final batch size of my choice, and then continue with that batch size until the end.

The advantage here is that at the beginning, when the weights are far from their optimum, a particularly good estimate of the gradient is not necessary, so small batch sizes are fine and faster. But as we approach the optimum, larger batch sizes help to get an accurate gradient.

This reduces the importance of setting a proper batch size. One can take a larger final batch size without negatively impacting the final accuracy of the model. A large batch size can sometimes mean that training gets stuck in a local minimum and the final accuracy of the model suffers. Ramping the batch size combines the advantages of small and large batch sizes.
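The ramp described above could look roughly like this. It is a minimal sketch: the function names, the final batch size of 256, and the 60,000-example training set size are assumptions for illustration, and the actual weight update is left out.

```python
def batch_size_schedule(n_epochs, start=20, final=256):
    """One batch size per epoch: double every epoch, capped at `final`."""
    sizes = []
    size = start
    for _ in range(n_epochs):
        sizes.append(min(size, final))
        size = min(size * 2, final)
    return sizes

def iterate_batches(n_examples, batch_size):
    """Yield (start, end) index pairs that cover the training set once."""
    for start in range(0, n_examples, batch_size):
        yield start, min(start + batch_size, n_examples)

schedule = batch_size_schedule(8, start=20, final=256)
print(schedule)  # [20, 40, 80, 160, 256, 256, 256, 256]

n_examples = 60_000  # assumed training set size
for epoch, batch_size in enumerate(schedule):
    n_batches = sum(1 for _ in iterate_batches(n_examples, batch_size))
    # In a real training loop, each (start, end) pair would select one
    # mini-batch, e.g. X_train[start:end], for a gradient step.
    print(epoch, batch_size, n_batches)
```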

wasshuber

If you like numerical issues, here is a problem I chased for three days. While implementing dropout regularization I ran into an issue with the implementation of softmax. In your book the implementation of softmax is fine but basic, meaning it does not protect against overflow or underflow in the exponentials. What some do, for example, is subtract the maximum value before the exponential is applied. Mathematically this is equivalent, because it simply multiplies the numerator and denominator of the softmax formula by the same constant factor. Nothing changes. Online I even found Python code for it that was something like

e = np.exp(x - np.max(x))

The problem with this code is subtle, but numerically it is stupid. np.max(x) returns the maximum of the entire matrix, meaning the maximum over the entire mini-batch. But we only need the maximum for each input (image), not across several inputs. Numerically this causes problems because for some rows it can push the arguments of the exponential so far into negative values that they all underflow and every exponential returns zero. The solution is to subtract only the row maximum, not the maximum across the entire mini-batch. Something like

e = np.exp(x - np.max(x,axis=1).reshape(-1,1))

This numerical issue manifested itself in the following way. Initially, the network was training perfectly fine. It reached about the accuracy it should reach. Then the accuracy started to drop, first slowly but then very quickly, and over the course of a few epochs the entire network blew up, with all weights increasing until everything was saturated. Nothing could stop it. I tried clipping the gradients, limiting the weight norms, and so on. The issue was the above-mentioned bad implementation of the softmax function.
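The difference between the two variants shows up on a tiny mini-batch. The following is a sketch, not the book's code: the function names and the example logits are made up, and the second row is deliberately far larger than the first to trigger the underflow.

```python
import numpy as np

def softmax_global_max(x):
    """Subtracts the maximum of the whole mini-batch (the problematic variant)."""
    e = np.exp(x - np.max(x))
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_row_max(x):
    """Subtracts each row's own maximum before exponentiating."""
    e = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

# Two inputs in one mini-batch; the second has much larger logits.
logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])

# Silence the 0/0 warning so that only the results are printed.
with np.errstate(invalid="ignore", divide="ignore"):
    bad = softmax_global_max(logits)
good = softmax_row_max(logits)

print(bad)   # first row is all nan: every exponential underflowed to 0
print(good)  # both rows are valid probabilities that sum to 1
```

In the first variant, the small row is shifted by the other row's maximum, so all of its exponentials underflow to zero and the normalization divides zero by zero. In the second variant, the largest entry of every row becomes exp(0) = 1, so the sum can never be zero.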

wasshuber

I discovered this myself by experimenting with all kinds of activation functions. It was easy to change the code from sigmoid to other activation functions and I was curious about what changes if I used different functions. I tried some really weird ones, too.

This is why I chose your path of coding it myself, because then it is much easier to change the things I want to change. With a library, one is in a straitjacket and can only change what the library allows.

What made me analyze it more carefully was the fact that this shifted ReLU learned better in combination with dropout. So I tried to see why, and noticed that the magnitude of the weights stayed about the same from layer to layer, whereas with plain ReLU it keeps growing. I don’t have a good explanation for why this is better, except that if there is a sort of additional bias the weights have to learn (their magnitude increases with deeper layers), then this will take longer in the learning process than if they do not have to learn this bias.

Then again, this is such a simple modification that I would be surprised if nobody has tried this before and noted the improvement. Searching online I do see shifted ReLUs being mentioned in lists of activation functions, but I have not found anything that mentions the improvement to learning they achieve and how this may be connected to the weight magnitude staying the same. We should also not forget that I only applied this to the MNIST data set. I don’t know if my observations hold in general.
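For readers who want to try something similar, here is one possible form of a shifted ReLU next to the plain one. This is an assumption: the post does not give the exact definition used, and the variant below, max(x, -shift) with shift = 1, is just one form that appears in lists of activation functions. The function names are made up for the example.

```python
import numpy as np

def relu(x):
    """Plain ReLU: max(0, x)."""
    return np.maximum(x, 0.0)

def shifted_relu(x, shift=1.0):
    """Assumed shifted variant: max(x, -shift), i.e. relu(x + shift) - shift."""
    return np.maximum(x, -shift)

def shifted_relu_gradient(x, shift=1.0):
    """Derivative of shifted_relu with respect to x (taken as 0 at the kink)."""
    return (x > -shift).astype(x.dtype)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))
print(shifted_relu(z))
print(shifted_relu_gradient(z))
```

Unlike plain ReLU, this variant can output negative values, so its outputs are not all pushed to one side of zero.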
