Building Blocks
Gradient Descent
finding the optimum in parameter space
Strategies:
- Batch: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$
- Stochastic: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta;x_i;y_i)$, faster but noisy (high-variance updates)
- Mini-batch: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta;x_{i:i+m};y_{i:i+m})$, intermediate (sketch below)
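A minimal numpy sketch of the three update strategies on a toy least-squares objective; the data, `grad` helper, and hyperparameters are illustrative assumptions, not from the original notes.

```python
import numpy as np

# Toy least-squares problem: J(theta) = 1/(2n) * ||X theta - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad(theta, Xb, yb):
    # Gradient of the mean squared error on the (mini-)batch (Xb, yb)
    return Xb.T @ (Xb @ theta - yb) / len(yb)

eta = 0.1

theta_batch = np.zeros(3)
for epoch in range(50):
    # Batch: one update per epoch using all samples
    theta_batch -= eta * grad(theta_batch, X, y)

theta_sgd = np.zeros(3)
for epoch in range(50):
    for i in rng.permutation(len(y)):
        # Stochastic: one update per sample (cheap but noisy)
        theta_sgd -= eta * grad(theta_sgd, X[i:i+1], y[i:i+1])

theta_mb, m = np.zeros(3), 10
for epoch in range(50):
    for i in range(0, len(y), m):
        # Mini-batch: update on m samples at a time
        theta_mb -= eta * grad(theta_mb, X[i:i+m], y[i:i+m])

print(theta_batch.round(2), theta_sgd.round(2), theta_mb.round(2))
```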
Challenges:
- choosing $\eta$: apply learning rate schedules
- saddle points that look like local minima: spiking NN for quantum tunneling?
Improvements for SGD:
- add momentum: $v_t = \gamma v_{t-1} + \eta \cdot \nabla_\theta J(\theta)$, $\theta = \theta - v_t$; dampens oscillations
- NAG, Nesterov accelerated gradient: add a look-ahead correction based on momentum, $J(\theta) \rightarrow J(\theta - \gamma v_{t-1})$, avoids overshooting from momentum
- Adagrad, adaptive gradient: a per-parameter $\eta_i$ for each $\theta_i$, $\theta_{t+1}=\theta_t - \frac{\eta}{\sqrt{G_t+\epsilon}}\odot g_t$, with $G_t$ the accumulated squared gradients
- Adadelta: replace the full accumulation $G_t$ with a window-based (decaying) average $E_t$
- RMSprop: $E_t = \gamma E_{t-1} + (1-\gamma)g_t^2$
- Adam, adaptive moment estimation: bias-corrected estimates of the gradient mean (first moment) and uncentered variance (second moment); momentum and Adam are sketched below
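Hedged numpy sketch of the momentum and Adam update rules written out from the formulas above; `grad_fn`, the toy objective, and the hyperparameter values are placeholders.

```python
import numpy as np

def sgd_momentum(theta, grad_fn, eta=0.01, gamma=0.9, steps=200):
    # v_t = gamma * v_{t-1} + eta * grad;  theta <- theta - v_t
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + eta * grad_fn(theta)
        theta = theta - v
    return theta

def adam(theta, grad_fn, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    # Bias-corrected estimates of the gradient's first and second moments
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment (mean)
        v = beta2 * v + (1 - beta2) * g**2       # second moment (uncentered variance)
        m_hat = m / (1 - beta1**t)               # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2*theta;
# both optimizers end up near the origin.
theta0 = np.array([3.0, -2.0])
print(sgd_momentum(theta0, lambda th: 2 * th).round(3))
print(adam(theta0, lambda th: 2 * th).round(3))
```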
Asynchronous SGD for parallelization:
- Hogwild
- Downpour SGD
- Delay-tolerant Algorithms for SGD
- TensorFlow
- EASGD, Elastic Averaging SGD: local variables are allowed to fluctuate around a center variable
Extras to explore:
- Shuffling and Curriculum Learning
- Batch normalization: keep activations close to $N(0,1)$ (sketch below)
- Early stopping: avoid overfitting
- Gradient noise: helps escape shallow local minima
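A minimal sketch of the batch-normalization idea noted above: normalize each feature over the batch toward $N(0,1)$, then rescale with learnable scale/shift. The function name and the scalar `gamma`/`beta` defaults are illustrative.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch toward mean 0 / variance 1,
    # then apply the learnable scale (gamma) and shift (beta).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

activations = np.random.default_rng(1).normal(loc=5.0, scale=3.0, size=(64, 16))
normed = batch_norm(activations)
print(normed.mean(axis=0).round(3))  # ~0 per feature
print(normed.std(axis=0).round(3))   # ~1 per feature
```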
Overview of Gradient Descent refs
Activation Function
back-propagatable, non-linear projection
keep nonlinearity & non-vanishing gradient
- Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$, pair with the cross-entropy cost $-[y\ln{a}+(1-y)\ln{(1-a)}]$ to avoid the vanishing gradient of the quadratic cost
- tanh: $\tanh(x) = 2\sigma(2x) - 1$, better centered but not the final solution
- ReLU, $f(x) = \max(0, x)$
- constant gradient on one side
- non-negative outputs shift the activation mean & make convergence harder
- PReLU, $f(y_i) = y_i$ if $y_i > 0$, else $a_i y_i$: a learnable slope $a_i$ regulates the negative side
- Maxout, $\max_i(w_i^T x + b_i)$
- ELU, $f(x) = x$ if $x > 0$, else $\alpha(e^x - 1)$ (numpy sketch of these functions after this list)
- Noisy Activation Functions (Gulcehre, C., et al., ICML 2016)
- CReLU, concatenated ReLU: motivated by the pair-grouping phenomenon of learned filters
- MPELU
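Sketch of the activation functions listed above (sigmoid, tanh, ReLU, PReLU with a fixed slope, ELU) in vectorized numpy form; the slope `a=0.25` and `alpha=1.0` defaults are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x):
    # f(x) = x if x >= 0 else 0
    return np.maximum(0.0, x)

def prelu(x, a=0.25):
    # Learnable slope a on the negative side (kept fixed here for illustration)
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    # Smooth negative saturation toward -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, prelu, elu):
    print(f.__name__, f(x).round(2))
```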
Practical tips:
- ReLU
- ELU
- PReLU/MPELU with regularizer/penalty
Regularization
minimize the loss function with reasonable complexity
Supervised learning: $w^* = \arg\min_w \sum_i L(y_i, f(x_i;w)) + \lambda \Omega(w)$; the penalty also tames ill-conditioned problems
L0/L1, minimize the absolute sum: Lasso regularization, makes $w$ sparse, $\|w\|_1 \le C$, linear (diamond-shaped) bound
- Feature Selection: fewer features needed
- Interpretability: a less complex model built on fewer features
L2, minimize the square sum: Ridge Regression/Weight Decay, $\|w\|_2 \le C$, quadratic bound
- won't zero out features
- but reduces dependence on any single feature (comparison sketch below)
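A hedged comparison of L1 (Lasso, sparse $w$) vs L2 (Ridge, shrunk but dense $w$), assuming scikit-learn is available; the toy data and the `alpha` values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy data: only 3 of 10 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.1 * rng.normal(size=200)

# L1 (Lasso) drives irrelevant weights exactly to zero -> feature selection
print("L1:", Lasso(alpha=0.1).fit(X, y).coef_.round(2))
# L2 (Ridge) shrinks all weights but keeps them nonzero
print("L2:", Ridge(alpha=10.0).fit(X, y).coef_.round(2))
```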
Tuning Methods
the evil approach
Neural Networks
- Feedforward Neural Network: word2vec
- Huffman tree (hierarchical softmax)
- CBOW: context words -> center word
- skip-gram: center word -> context words (toy training step below)
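A toy skip-gram training step (center word predicts a context word), using a full softmax instead of the Huffman-tree hierarchical softmax; the vocabulary size, embedding dimension, and word ids are made up. CBOW would instead average the context embeddings to predict the center word.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 20, 8                                  # toy vocabulary size and embedding dimension
W_in = rng.normal(scale=0.1, size=(V, D))     # input (center word) embeddings
W_out = rng.normal(scale=0.1, size=(D, V))    # output (context word) weights

def skipgram_step(center, context, lr=0.1):
    """One SGD step: center word id -> predict context word id (full softmax)."""
    global W_in, W_out
    h = W_in[center]                          # hidden layer = embedding lookup
    scores = h @ W_out                        # logits over the vocabulary
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    dscores = probs.copy()
    dscores[context] -= 1.0                   # cross-entropy gradient w.r.t. the logits
    grad_h = W_out @ dscores                  # backprop into the embedding
    W_out -= lr * np.outer(h, dscores)
    W_in[center] -= lr * grad_h

# Example: pretend word 3 appeared with context word 7
skipgram_step(3, 7)
print(W_in[3].round(3))
```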
- Denoising Autoencoders: obtain good representation
- perform noise mapping: $x \rightarrow \tilde{x}$
- reconstruct from $\tilde{x}$ and keep the loss against the clean input, $\mathcal{L}(x, x')$ (toy forward pass below)
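Toy denoising-autoencoder forward pass: corrupt $x$ to $\tilde{x}$, encode/decode, and measure the loss against the clean $x$. The tied-weight linear decoder, noise level, and sizes are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))             # clean inputs
W = rng.normal(scale=0.1, size=(10, 4))   # encoder weights (decoder uses W.T, tied)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_tilde = x + 0.3 * rng.normal(size=x.shape)   # noise mapping x -> x~
code = sigmoid(x_tilde @ W)                    # encode the corrupted input
x_recon = code @ W.T                           # decode back to input space
loss = np.mean((x - x_recon) ** 2)             # L(x, x'): compare with the *clean* x
print(loss)
```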
- Restricted Boltzmann Machine: unsupervised learning
- generative approach: obtain $P(X,Y)$ to get $P(Y|X)$; in contrast to the discriminative approach, which only models $P(Y|X)$
- methods: Markov process & Gibbs sampling
- metrics: KL divergence (relative Shannon entropy), $\sum_i P(i)\log\frac{P(i)}{Q(i)}$
- probability distribution: $P(v,h) = \frac{1}{Z}e^{-E(v,h)}$
- energy function: $E(v,h) = -v^T W h - a^T v - b^T h$, $v$ visible units, $h$ hidden units (Gibbs-step sketch below)
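A minimal sketch of the RBM energy function above plus one Gibbs-sampling step ($v \rightarrow h \rightarrow v'$), the building block of contrastive-divergence style training; the layer sizes, biases, and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = rng.normal(scale=0.1, size=(n_v, n_h))
a = np.zeros(n_v)          # visible bias
b = np.zeros(n_h)          # hidden bias

def energy(v, h):
    # E(v, h) = -v^T W h - a^T v - b^T h
    return -v @ W @ h - a @ v - b @ h

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v):
    # Sample h given v, then resample v given h (one step of the Markov chain)
    p_h = sigmoid(v @ W + b)
    h = (rng.random(n_h) < p_h).astype(float)
    p_v = sigmoid(W @ h + a)
    v_new = (rng.random(n_v) < p_v).astype(float)
    return v_new, h

v0 = (rng.random(n_v) < 0.5).astype(float)
v1, h0 = gibbs_step(v0)
print("E(v0,h0) =", energy(v0, h0), " E(v1,h0) =", energy(v1, h0))
```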
- Generative Adversarial Network: Turing learning
- discriminator: convolutional
- generator: deconvolutional
- applications
- Residual Neural Network
- plain layer + shortcuts
- residual function: $\mathcal{F}(x):=h(x) - x$
- shortcut: $y = w_s x + \mathcal{F}(x, \{w_i\})$
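Sketch of a residual block matching the formulas above, $y = \mathcal{F}(x, \{w_i\}) + \text{shortcut}$, with an optional projection $w_s$ when dimensions differ; a fully-connected numpy toy, not an actual ResNet layer.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2, w_s=None):
    """y = F(x; {w_i}) + shortcut, where F(x) := h(x) - x is what the block learns."""
    f = relu(x @ w1) @ w2                        # residual function F(x, {w_i})
    shortcut = x if w_s is None else x @ w_s     # identity or projection w_s
    return relu(f + shortcut)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16))
w1 = rng.normal(scale=0.1, size=(16, 16))
w2 = rng.normal(scale=0.1, size=(16, 16))
print(residual_block(x, w1, w2).shape)   # (2, 16), same as the input
```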
- Convolutional Neural Network: computer vision
- convolutional layer with kernels
- Recurrent Neural Network
- LSTM: a memory cell $c_t$, an input gate $i_t$, an output gate $o_t$, and a forget gate $f_t$; maps $x_t \rightarrow h_t$ (single-step sketch after this list)
- Attention: potential of memory
- applications
- Semantic MEDLINE: MEDLINE intelligence, NLP on medical papers
- ORiGAMI: term weighting, etc.
- Acoustic modeling: ASR
- Image description
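A single LSTM step showing the four gates noted above (input $i_t$, forget $f_t$, output $o_t$, memory cell $c_t$) mapping $x_t$ to $h_t$; the weight shapes and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step: (x_t, h_{t-1}, c_{t-1}) -> (h_t, c_t) with gates i, f, o and candidate g."""
    z = x_t @ W + h_prev @ U + b            # all four gate pre-activations at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g                # memory cell: forget old, write new
    h_t = o * np.tanh(c_t)                  # output gate controls what is exposed
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 5, 8
W = rng.normal(scale=0.1, size=(d_in, 4 * d_h))
U = rng.normal(scale=0.1, size=(d_h, 4 * d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
print(h.shape, c.shape)   # (8,) (8,)
```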
Spiking Neural Network (3rd generation)
Spiking neuron: accumulates its activation into a membrane potential over time and only sends out a signal (a "spike") when this potential crosses a threshold, after which the neuron is reset (integrate-and-fire sketch below).
Integrator: memory or nonlinear response
- BP on SNN
- SNN for very-low-bit-rate speech coding: replace the HMM
- DL in SNNs: low energy cost, but no straightforward BP; information carried by spike timing & spike rate
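Sketch of the integrate-and-fire behavior described above: the neuron accumulates input into a potential, spikes when it crosses a threshold, then resets. The leak factor, threshold, and random input current are illustrative constants.

```python
import numpy as np

def leaky_integrate_and_fire(inputs, threshold=1.0, leak=0.95, v_reset=0.0):
    """Accumulate input current into a membrane potential; emit a spike (1) and
    reset whenever the potential crosses the threshold."""
    v, spikes = v_reset, []
    for current in inputs:
        v = leak * v + current            # integrate (with leak) over time
        if v >= threshold:
            spikes.append(1)
            v = v_reset                   # reset after the spike
        else:
            spikes.append(0)
    return spikes

rng = np.random.default_rng(0)
spike_train = leaky_integrate_and_fire(rng.uniform(0, 0.4, size=50))
print(spike_train)  # information carried by spike timing / spike rate
```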
Refs
To Explore
Extras
- ethics principles: bias imposed by training data