### 0001
We use a lot of activation functions in our daily projects, but do we really know why we need an activation function in the first place?
Reason 1:
Well, if you chain several linear transformations, all you get is a linear transformation.
For example, if f(x) = 2x + 3 and g(x) = 5x - 1, then chaining these two linear functions gives you another linear function: f(g(x)) = 2(5x-1)+3 = 10x + 1. So if you don't have some nonlinearity between layers, then even a deep stack of layers is equivalent to a single layer, and you can't solve very complex problems with that.
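Here is a minimal NumPy sketch (not from the book) that checks this numerically for the two linear functions above: their composition is exactly the single linear function 10x + 1.

```python
import numpy as np

def f(x):
    return 2 * x + 3   # first linear function

def g(x):
    return 5 * x - 1   # second linear function

x = np.linspace(-5, 5, 11)
composed = f(g(x))          # f(g(x)) = 2*(5x - 1) + 3
single = 10 * x + 1         # the equivalent single linear function

print(np.allclose(composed, single))  # True: the chain collapses to one linear map
```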
Reason 2:
If you want to guarantee that the output will always be positive, then you can use the ReLU activation function in the output layer. Alternatively, you can use the "softplus" activation function, which is a smooth variant of ReLU: softplus(z) = log(1+exp(z)). It is close to 0 when z is negative and close to z when z is positive.
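As a rough illustration (my own sketch, with made-up layer sizes, not the book's code), you can check the softplus behavior with NumPy and put ReLU or softplus on the output layer of a Keras model when the target must not be negative:

```python
import numpy as np
import tensorflow as tf

def softplus(z):
    # smooth variant of ReLU: log(1 + exp(z))
    return np.log(1.0 + np.exp(z))

print(softplus(-10.0))  # ~0 when z is very negative
print(softplus(10.0))   # ~10, i.e. close to z when z is positive

# Hypothetical regression model whose output must be positive
# (e.g., a price); swap "softplus" for "relu" if you prefer hard zeros.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(1, activation="softplus"),
])
```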
Finally, if you want to guarantee that the predictions will fall within a given range of values, then you can use the logistic function or the hyperbolic tangent in the output layer and then scale the labels to the appropriate range: 0 to 1 for the logistic function and -1 to 1 for the hyperbolic tangent.
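A quick sketch of the label-scaling step (the target values here are hypothetical): squash the labels into [0, 1] so they match a sigmoid output, then invert the scaling at prediction time. For tanh you would scale into [-1, 1] instead.

```python
import numpy as np

y = np.array([12.0, 48.0, 30.0, 7.5])       # hypothetical regression targets
y_min, y_max = y.min(), y.max()

y_scaled = (y - y_min) / (y_max - y_min)    # now in [0, 1], matching a sigmoid output

# ... train a model whose output layer uses activation="sigmoid" on y_scaled ...

y_pred_scaled = np.array([0.2, 0.9])        # pretend model predictions in [0, 1]
y_pred = y_pred_scaled * (y_max - y_min) + y_min  # map back to the original range
print(y_pred)
```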
Source: *Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow*, Chapter 10