wasshuber

Programming Machine Learning: a tip and a gotcha

Tip: If you are on a slow or old machine like me, or if you want to run many different examples to explore the design space, you can speed up the calculations by removing a border from the MNIST image data. Every image has a 1-pixel white border. Removing this border reduces the number of input variables by 108 (from 784 to 676), or more than 13%. In fact, you can drop even a 3-pixel border without any impact that I can notice. Dropping more is also possible, but then the expected maximum accuracy starts to drop. Still, it is quite remarkable that even with only the innermost 8x8 image fragment one can easily get above 80% accuracy.
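For anyone who wants to try the border removal, here is a minimal sketch. It assumes the images are stored as a NumPy matrix with one flattened 28x28 image per row. The function name `crop_border` and the placeholder array `X` are made up for illustration and are not from the book.

```python
import numpy as np

def crop_border(images, border, side=28):
    """Drop `border` pixels from every edge of flattened square images.

    images: array of shape (n, side * side), one flattened image per row.
    Returns an array of shape (n, (side - 2 * border) ** 2).
    """
    n = images.shape[0]
    square = images.reshape(n, side, side)
    inner = square[:, border:side - border, border:side - border]
    return inner.reshape(n, -1)

# Placeholder data standing in for real MNIST rows (assumption: 784 columns).
X = np.zeros((5, 28 * 28))

X_1px = crop_border(X, 1)    # 26x26 images
X_3px = crop_border(X, 3)    # 22x22 images
X_8x8 = crop_border(X, 10)   # innermost 8x8 fragment
print(X_1px.shape, X_3px.shape, X_8x8.shape)
```

The same crop has to be applied to the training, validation, and test images, otherwise the input layer no longer matches the data.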

Gotcha: I ran the scenario with one hidden layer of 100 nodes against the original test set of 10,000 examples, without splitting it into 5,000 for validation and 5,000 for testing. I was surprised that the maximum accuracy I could achieve was only 97.8%, not the 98.6% stated in the book. However, this is purely an effect of the test set. When I split it into a validation and a test set, with 5,000 examples for testing, I got the 98.6% accuracy with the same network weights. It surprised me that the size of the test set changes the accuracy that much.
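A quick bit of arithmetic on these two numbers, as a sketch: if the 5,000 validation and 5,000 test examples together make up the full 10,000, the accuracy on the validation half follows from the other two figures. The helper `implied_accuracy` is a made-up name for illustration.

```python
def implied_accuracy(total_n, total_acc, part_n, part_acc):
    """Accuracy on the remaining examples, given the accuracy on the
    full set and on one part of it (assumes the parts do not overlap)."""
    correct_total = total_acc * total_n
    correct_part = part_acc * part_n
    return (correct_total - correct_part) / (total_n - part_n)

# Figures from the post: 97.8% on all 10,000, 98.6% on the 5,000 test examples.
validation_acc = implied_accuracy(10_000, 0.978, 5_000, 0.986)
print(f"implied accuracy on the other 5,000: {validation_acc:.1%}")
```

Under that assumption the other half comes out at about 97.0%, so with identical weights the two halves differ by roughly 1.6 percentage points.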

23 comments

/book-programming-machine-learning


2023-03-06 14:59:39 UTC

Most Liked

wasshuber

Another tip that seems to help speed up training: I do a batch-size ramp. I start with batches of about 2-3 times the number of classes (for MNIST the number of classes is 10). For example, I start with a batch size of 20. I double the batch size with each epoch until I reach the final batch size of my choice, and then continue with that batch size until the end.

The advantage here is that at the beginning, when the weights are far away from their optimum, it is not necessary to have a particularly good estimate of the gradient, so small batch sizes are fine and faster. But as we approach the optimum, larger batch sizes help to get an accurate gradient.

This reduces the importance of setting a proper batch size. One can take a larger final batch size without negatively impacting the final accuracy of the model. On its own, a large batch size can sometimes mean that one gets stuck in a local minimum and the final accuracy of the model suffers. Ramping the batch size combines the advantages of small and large batch sizes.
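The ramp described above can be sketched as a small schedule function. This is only an illustration of the doubling rule. The name `batch_size_schedule` and the final size of 256 are assumptions, not values from my runs.

```python
def batch_size_schedule(start, final, epochs):
    """Batch size for each epoch: start small, double every epoch,
    then stay at `final` once it is reached."""
    sizes = []
    size = start
    for _ in range(epochs):
        sizes.append(size)
        size = min(size * 2, final)
    return sizes

# Start at 2 times the number of MNIST classes and ramp up to a
# hypothetical final batch size of 256.
schedule = batch_size_schedule(start=20, final=256, epochs=6)
print(schedule)  # [20, 40, 80, 160, 256, 256]
```

In a training loop one would read `schedule[epoch]` at the start of each epoch and cut the shuffled training set into mini-batches of that size.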

Post #12

wasshuber

If you like numerical issues, here is a problem I chased for three days. While implementing dropout regularization I ran into an issue with the implementation of softmax. In your book the implementation of softmax is fine but basic, meaning it does not protect against overflow or underflow in the exponentials. What some do, for example, is subtract the maximum value before the exponential is applied. Mathematically this is equivalent, because it simply multiplies the numerator and the denominator of the softmax formula by the same constant factor. Nothing changes. Online I even found Python code for it that was something like

e = np.exp(x - np.max(x))

The problem with this code is subtle, but numerically it is stupid. What happens is the following: np.max(x) returns the maximum of the entire matrix, meaning the maximum across the entire mini-batch. But we only need the maximum for each input (image), not across several inputs. Numerically this causes problems because, for an input whose values all lie far below the mini-batch maximum, it can push the arguments of the exponential so far negative that they all underflow and every exponential returns zero. The solution is to subtract only the row maximum, not the maximum across the entire mini-batch. Something like

e = np.exp(x - np.max(x,axis=1).reshape(-1,1))
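To make the difference concrete, here is a small self-contained sketch that compares the two variants. The function names and the two rows of logits are made up for illustration, and this is not the book's implementation. It uses `keepdims=True`, which has the same effect as the reshape above.

```python
import numpy as np

def softmax_global_max(x):
    # Problematic variant: one maximum for the whole mini-batch.
    e = np.exp(x - np.max(x))
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_row_max(x):
    # Subtract each row's own maximum, so every row contains exp(0) = 1
    # and its sum can never be zero.
    e = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

# Two made-up rows of logits on very different scales.
logits = np.array([[0.0, 1.0, 2.0],
                   [1000.0, 1001.0, 1002.0]])

# Silence the 0/0 warning that the global-max variant triggers.
with np.errstate(invalid="ignore", divide="ignore"):
    bad = softmax_global_max(logits)
good = softmax_row_max(logits)

print(bad)   # first row is all NaN: every exponential underflowed to zero
print(good)  # both rows are valid probabilities
```

In the first row, the global maximum of 1002 pushes every argument to about -1000, far below the point where a 64-bit exponential becomes zero, so the row ends up as 0/0.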

This numerical issue manifested itself in the following way. Initially, the network trained perfectly fine and reached about the accuracy it should reach. Then the accuracy started to drop, first slowly but then very quickly, and over the course of a few epochs the entire network blew up, with all weights increasing until everything was saturated. Nothing could stop it. I tried clipping the gradients, limiting the weight norms, etc. The cause was the above-mentioned bad implementation of the softmax function.

Post #7

wasshuber

I discovered this myself by experimenting with all kinds of activation functions. It was easy to change the code from sigmoid to other activation functions and I was curious about what changes if I used different functions. I tried some really weird ones, too.

This is why I chose your path of coding it myself, because then it is much easier to change the things I want to change. With a library, one is in a straitjacket and can only change what the library allows.

What made me analyze it more carefully was the fact that this shifted ReLU learned better in combination with dropout. So I tried to see why, and noticed that the magnitude of the weights going from layer to layer stayed about the same, whereas with ReLU they keep growing. I don’t have any good explanation for why this is better, except that if there is a sort of additional bias the weights have to learn (their magnitude increases with deeper layers), then this will take longer in the learning process than if they do not have to learn this bias.

Then again, this is such a simple modification that I would be surprised if nobody has tried this before and noted the improvement. Searching online I do see shifted ReLUs being mentioned in lists of activation functions, but I have not found anything that mentions the improvement to learning they achieve and how this may be connected to the weight magnitude staying the same. We should also not forget that I only applied this to the MNIST data set. I don’t know if my observations hold in general.

Post #11