wasshuber

Programming Machine Learning: Help: weird results I don't understand

I encountered something that I can’t explain. Any help, tips, or explanations would be great.

I followed the one-hidden-layer example with 100 nodes and a sigmoid activation function. It works great, and I can get to 98.6% accuracy with a learning rate of 1.0, a batch size of 1000, and 100 epochs.

I then decided to replace the sigmoid activation function with ReLU. This is not done in the book at this point, but it is easy enough to program the ReLU and its derivative. Here is the Python code I used:

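import numpy as np

def relu(z):
    # ReLU: keep positive values, clamp negatives to zero
    return np.maximum(0.0, z)
def relu_gradient(z):
    # Derivative of ReLU: 1 where z > 0, otherwise 0
    return (z > 0) * 1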

This works fine as long as one reduces the learning rate, which I lowered to 0.1. It reaches about the same level of accuracy as with the sigmoid. I then made one seemingly insignificant change to the gradient of the ReLU: instead of z > 0 I wrote z >= 0. So the code for the gradient was now:

def relu_gradient(z):
    # Variant: also returns 1 where z is exactly 0
    return (z >= 0) * 1

I thought this should not make any difference, because how often would z be exactly zero? How often would the weighted sum of all inputs, in floating point, be exactly zero? Perhaps never. Even if it is occasionally zero, it should hardly make a big difference. But to my surprise, it makes a profound difference: I can only get to about 95%. Why is there a difference of almost 4 percentage points in accuracy for this insignificant change? There must be something weird happening.

I tried this several times to rule out that somehow the random initialization was unusual. I tried it with different learning rates and different batch sizes. None made any difference in the result. I checked for dead neurons. Found none. If somebody can tell me what is going on here I would really appreciate it.

Most Liked

wasshuber

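It turned out to be a bug. Using the nomenclature of the book, I was feeding h (the output of the activation function) into the gradient function when I should have fed a (the weighted sum before the activation) into it. Since h is the output of the ReLU, it is never negative, so with the >= comparison every gradient came out as 1, and thus in backpropagation it acted like the linear activation function. (The linear activation function does produce about 94% accuracy.) With the > comparison the bug stayed hidden, because h > 0 exactly when a > 0. Properly using the gradient function produces the expected results. It doesn’t matter if one uses > or >=.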
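To make the bug concrete, here is a minimal sketch with made-up numbers. It assumes that a holds the weighted sums going into the hidden layer and that h = relu(a) is the hidden layer's output; the two gradient function names and the sample values are just for illustration and are not from the book.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_gradient_gt(z):
    # Gradient written with the strict comparison
    return (z > 0) * 1

def relu_gradient_ge(z):
    # Gradient written with the non-strict comparison
    return (z >= 0) * 1

# Made-up pre-activation values (assumption: "a" is the weighted sum)
a = np.array([[-2.0, -0.5, 0.3, 1.7]])
h = relu(a)  # hidden layer output, never negative

# The bug: passing h instead of a into the gradient function
buggy_ge = relu_gradient_ge(h)    # h >= 0 is always true, so all ones
buggy_gt = relu_gradient_gt(h)    # h > 0 exactly where a > 0, so still right

# The intended call: passing a
correct_gt = relu_gradient_gt(a)
correct_ge = relu_gradient_ge(a)

print("buggy   >=:", buggy_ge)
print("buggy   > :", buggy_gt)
print("correct > :", correct_gt)
print("correct >=:", correct_ge)
```

Only the combination of the wrong input and >= gives a gradient of 1 everywhere, which is why the > version looked fine even with the bug in place.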

I am happy I found this bug. But this is also part of why your book is so great. Programming it yourself forces one to understand the little details and allows one to change and modify the algorithms at the very core, which leads to much deeper understanding of how this all works.

Here is an insight that my experimentation produced. I tested a bunch of different activation functions, including weird piecewise linear ones, periodic ones with sin and cos, combinations thereof, etc. It surprised me that many work just as well as ReLU or sigmoid with a single hidden layer. (I intend to extend this experimentation to multiple hidden layers.) For example, it is kind of shocking at first that the absolute value function works just as well as ReLU. This kind of makes sense in the biological case. A neuron, being a cell, would not be completely identical to its neighbor neuron. Neurons in nature would certainly have different activation functions. Perhaps not as different as the ones I experimented with, but they would perhaps be noisy and distorted versions of sigmoid or ReLU. It doesn’t matter, it still works fine.
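For anyone who wants to repeat the absolute value experiment, a sketch of that activation and its gradient could look like the following. The function names are my own, and the value of the gradient at exactly z == 0 (here 0, which is what np.sign returns) is an arbitrary choice. The finite-difference check at the end is only a sanity test on a few made-up points away from zero.

```python
import numpy as np

def abs_activation(z):
    return np.abs(z)

def abs_gradient(z):
    # -1 for negative z, +1 for positive z, 0 at exactly z == 0
    return np.sign(z)

# Sanity check against a central finite difference (made-up sample points)
z = np.array([-2.0, -0.5, 0.5, 3.0])
eps = 1e-6
numeric = (abs_activation(z + eps) - abs_activation(z - eps)) / (2 * eps)

print("analytic:", abs_gradient(z))
print("numeric: ", numeric)
```

As with the ReLU, the gradient function has to receive the weighted sum, not the output of the activation.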

Further, this makes me wonder if perhaps that variation in activation functions in nature is a benefit. I am wondering if folks have tried to make nets where each activation function of each neuron is different. Perhaps that confers a training advantage to the network because not everything behaves in exactly the same way? I will try to explore this question. But first I need to extend the code to allow for multiple hidden layers.
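One simple way to try this idea, purely as a sketch under my own assumptions, is to assign activations by column so that neighboring hidden nodes behave differently. Here the hidden layer's weighted sums are assumed to be a 2D array with one column per node, and the nodes cycle through ReLU, absolute value, and sigmoid. The function names and the cycling scheme are hypothetical, not something from the book.

```python
import numpy as np

def mixed_activation(a):
    # a: weighted sums, shape (examples, hidden nodes)
    h = np.empty_like(a, dtype=float)
    h[:, 0::3] = np.maximum(0.0, a[:, 0::3])         # ReLU nodes
    h[:, 1::3] = np.abs(a[:, 1::3])                  # absolute value nodes
    h[:, 2::3] = 1.0 / (1.0 + np.exp(-a[:, 2::3]))   # sigmoid nodes
    return h

def mixed_activation_gradient(a):
    # Matching derivatives, all expressed in terms of the weighted sum a
    g = np.empty_like(a, dtype=float)
    g[:, 0::3] = (a[:, 0::3] > 0) * 1.0
    g[:, 1::3] = np.sign(a[:, 1::3])
    s = 1.0 / (1.0 + np.exp(-a[:, 2::3]))
    g[:, 2::3] = s * (1.0 - s)
    return g

# Made-up weighted sums for one example and four hidden nodes
a = np.array([[-1.0, -2.0, 0.0, 3.0]])
print(mixed_activation(a))
print(mixed_activation_gradient(a))
```

Whether this confers any training advantage is exactly the open question; the sketch only shows that it is cheap to set up.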

This is the one critique I have to make. In my opinion, it would have been better to go further with the code and extend it to multiple hidden layers than to switch to libraries. The point of the book is programming it yourself to allow full, unrestricted experimentation. I would have added one or two chapters to extend the code further, even if that would have meant leaving out libraries altogether. NumPy should be fast enough to explore multilayer networks on a single average computer.
