## Building Blocks

### Gradient Descent

`finding the optimum in parameter space`

Strategies:

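- Batch: \(\theta = \theta - \eta \cdot \nabla_\theta J(\theta)\), exact gradient over the whole training set, slow per update
- Stochastic: \(\theta = \theta - \eta \cdot \nabla_\theta J(\theta;x_i;y_i)\), faster per update but noisy (the single-sample gradient is unbiased yet high-variance)
- Mini-batch: \(\theta = \theta - \eta \cdot \nabla_\theta J(\theta;x_{i:i+m};y_{i:i+m})\), a compromise between the two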
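
A minimal sketch of the three strategies on a toy 1-D regression, where only the batch size changes. The data, the squared-error objective, and the names `grad` and `gradient_descent` are assumptions made for illustration.

```python
import random

def grad(theta, xs, ys):
    """Gradient of J = mean((theta*x - y)^2) with respect to theta."""
    n = len(xs)
    return sum(2.0 * (theta * x - y) * x for x, y in zip(xs, ys)) / n

def gradient_descent(xs, ys, batch_size, eta=0.05, epochs=200, seed=0):
    """batch_size=len(xs): batch; batch_size=1: stochastic; otherwise mini-batch."""
    rng = random.Random(seed)
    theta = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # reshuffle the samples every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            bx = [xs[i] for i in batch]
            by = [ys[i] for i in batch]
            theta = theta - eta * grad(theta, bx, by)  # theta = theta - eta * grad J
    return theta

# Toy data from y = 3x (noise-free, so every variant shares the same optimum).
xs = [0.1 * i for i in range(1, 11)]
ys = [3.0 * x for x in xs]

theta_batch = gradient_descent(xs, ys, batch_size=len(xs))
theta_sgd = gradient_descent(xs, ys, batch_size=1)
theta_mini = gradient_descent(xs, ys, batch_size=4)
```

On this noise-free problem all three end up close to 3; with noisy data the stochastic variant would keep jittering around the optimum unless \(\eta\) is decayed.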

Challenges:

- \(\eta\): hard to choose by hand, apply learning rate schedules
- Saddle points that trap the optimizer like local minima: spiking NN for quantum tunneling?
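
Two common schedule shapes, as a minimal sketch. The function names and the constants `drop`, `every` and `k` are illustrative assumptions, not values from these notes.

```python
import math

def step_decay(eta0, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return eta0 * drop ** (epoch // every)

def exponential_decay(eta0, epoch, k=0.1):
    """Smoothly decay the learning rate as eta0 * exp(-k * epoch)."""
    return eta0 * math.exp(-k * epoch)

# Learning rates at a few epochs, starting from eta0 = 0.1.
step_rates = [step_decay(0.1, e) for e in (0, 9, 10, 25)]
exp_rates = [exponential_decay(0.1, e) for e in (0, 10, 20)]
```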

Improvements for SGD:

- Momentum: \(v_t = \gamma v_{t-1} + \eta \cdot \nabla_\theta J(\theta); \theta = \theta - v_t\), dampens oscillation
- NAG, Nesterov accelerated gradient: add correction based on momentum, \(J(\theta) \rightarrow J(\theta - \gamma v_{t-1})\), avoid momentum crash
- Adagrad, adaptive gradient: \(\eta_i\) for each \(\theta_i\), \(\theta_{t+1}=\theta_t - \frac{\eta}{\sqrt{G_t+\epsilon}}\odot g_t\)
- Adadelta: replace the ever-growing sum \(G_t\) with a window-like decaying average of squared gradients, \(E_t\)
- RMSprop: \(E_t = \gamma E_{t-1} + (1-\gamma)g_t^2\)
- Adam, adaptive moment estimation: bias-corrected exponentially weighted averages of the gradient (mean) and of the squared gradient (uncentered variance)
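
The update rules above can be sketched on a toy 1-D objective. The objective \(J(\theta) = (\theta - 2)^2\), the step counts, and the hyperparameter values are assumptions for illustration; the Adam defaults `beta1=0.9`, `beta2=0.999` are the commonly quoted ones.

```python
import math

def grad(theta):
    """Gradient of the toy objective J(theta) = (theta - 2)^2."""
    return 2.0 * (theta - 2.0)

def momentum(theta=0.0, eta=0.1, gamma=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        v = gamma * v + eta * grad(theta)  # v_t = gamma * v_{t-1} + eta * g
        theta = theta - v
    return theta

def nesterov(theta=0.0, eta=0.1, gamma=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        # Evaluate the gradient at the look-ahead point theta - gamma * v_{t-1}.
        v = gamma * v + eta * grad(theta - gamma * v)
        theta = theta - v
    return theta

def adam(theta=0.0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g      # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```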

Asynchronous SGD for parallelization:

- Hogwild
- Downpour SGD
- Delay-tolerant Algorithms for SGD
- TensorFlow
- EASGD, Elastic Averaging SGD: local variables are allowed to fluctuate around a center variable

Extra to explore:

- Shuffling and Curriculum Learning
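- Batch normalization: keep layer inputs normalized to \(N(0,1)\)
- Early stopping: stop when the validation error stops improving, to avoid overfitting
- Gradient noise: added noise helps shake the parameters out of local minima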

Overview of Gradient Descent refs

### Activation Function

back-propagatable, non-linear projection

`keep nonlinearity & non-vanishing gradient`

- Sigmoid: \(f(x) = \frac{1}{1 + e^{-x}}\), pair with the cross-entropy cost function \(-[y\ln{a}+(1-y)\ln{(1-a)}]\) to avoid gradient vanishing
- tanh, \(\tanh(x) = 2\sigma(2x) - 1\) with \(\sigma\) the sigmoid: zero-centered, better but not the final solution
- ReLU, \(y = x \ge 0\ ?\ x : 0\)
    - constant gradient on one side
    - output shift & hard to converge

- PReLU, \(f(y_i)= y_i>0? y_i: a_iy_i\): regulate the left side
- Maxout, \(\max_i (w_i x + b_i)\)
- ELU, \(f(x)= x>0?x:\alpha (\exp(x)-1)\)
- Noisy Activation Functions (Gulcehre, C., et al., ICML 2016)
- CReLU, pair-grouping phenomenon
- MPELU
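
The formulas above, written out as scalar functions in a minimal sketch. The default slopes `a=0.25` and `alpha=1.0` are assumed example values, and `maxout` is shown for a scalar input only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return x if x >= 0 else 0.0

def prelu(x, a=0.25):
    """Like ReLU, but the negative side keeps a (learnable) slope a."""
    return x if x > 0 else a * x

def elu(x, alpha=1.0):
    """Smoothly saturates to -alpha on the negative side."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def maxout(x, weights, biases):
    """Maximum over several affine pieces w_i * x + b_i."""
    return max(w * x + b for w, b in zip(weights, biases))

# tanh is a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
tanh_gap = abs(math.tanh(0.7) - (2.0 * sigmoid(1.4) - 1.0))
```

With weights `[1, -1]` and zero biases, `maxout` reduces to the absolute value, which shows how it can represent piecewise-linear shapes that ReLU cannot.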

Practical tips:

- ReLU
- ELU
- PReLU/MPELU with **regularizer**/penalty

### Regularization

`minimize the loss function with reasonable complexity`

Supervised learning: \(w^* = \arg\min_w \sum_i L(y_i, f(x_i;w)) + \lambda \Omega(w)\); the penalty \(\Omega(w)\) also stabilizes the solution when the problem is ill-conditioned

L0/L1: L0 counts the non-zero weights; L1 (Lasso regularization) minimizes the absolute sum as its tractable surrogate. Both make \(w\) sparse, \(||w||_1 \le C\), linear bound

- Feature selection: fewer features needed
- Interpretability: a less complex model

L2, minimize the square sum: Ridge Regression/Weight Decay, \(||w||_2 \le C\), quadratic bound

- shrinks all weights toward zero but won't reduce the number of features
- reduces dependence on any single feature
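
A minimal sketch of why L1 gives sparsity and L2 does not, in the simplified case of one weight at a time: each penalty is applied to an unregularized weight \(w\) under a squared distance. The example weights and \(\lambda = 0.5\) are assumed values.

```python
def l1_shrink(w, lam):
    """Soft-thresholding: argmin_v 0.5*(v - w)^2 + lam*|v|."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small weights are set exactly to zero

def l2_shrink(w, lam):
    """argmin_v 0.5*(v - w)^2 + 0.5*lam*v^2."""
    return w / (1.0 + lam)  # every weight is scaled down, none becomes zero

weights = [3.0, -0.2, 0.05, -1.5]
lasso_like = [l1_shrink(w, 0.5) for w in weights]
ridge_like = [l2_shrink(w, 0.5) for w in weights]
```

Here `lasso_like` drops the two small weights entirely, while `ridge_like` keeps all four at reduced magnitude.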

### Tuning Methods

the evil approach

## Neural Networks

- Feedforward Neural Network: word2vec
    - Huffman tree
    - CBOW: words -> word
    - skip-gram: word -> words
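
The difference between the two word2vec setups is only the direction of the training pairs. A minimal sketch, assuming a symmetric context window; the function names are hypothetical.

```python
def cbow_pairs(tokens, window=1):
    """CBOW: (context words) -> target word."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=1):
    """Skip-gram: target word -> each context word."""
    return [(target, c)
            for context, target in cbow_pairs(tokens, window)
            for c in context]

sentence = ["the", "cat", "sat"]
cbow = cbow_pairs(sentence)
skip = skipgram_pairs(sentence)
```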

- Denoising Autoencoders: obtain good representation
    - perform noise mapping: \(x \rightarrow \tilde{x}\)
    - keep loss as \(\mathcal{L}(x,\tilde{x}')\), where \(\tilde{x}'\) is the reconstruction from \(\tilde{x}\), compared against the clean input

- Restricted Boltzmann Machine: unsupervised learning
    - generative approach: model the joint \(P(X,Y)\) and derive \(P(Y|X)\) from it, in contrast to the discriminative approach, which only cares about \(P(Y|X)\)
    - methods: Markov process & Gibbs sampling
    - metrics: KL divergence \(\sum_i P(i)\log\frac{P(i)}{Q(i)}\), Shannon entropy
    - probability distribution: \(P(v,h) = \frac{1}{Z}e^{-E(v,h)}\), with \(Z\) the partition function
    - energy function: \(E(v,h) =-v^T Wh -a^Tv -b^Th\), \(v\) visible unit, \(h\) hidden unit
- Generative Adversarial Network: Turing learning
    - discriminator: convolutional
    - generator: deconvolutional
    - applications

- Residual Neural Network
    - plain layer + shortcuts
    - residual function: \(\mathcal{F}(x):=h(x) - x\)
    - shortcut: \(y=w_s x + \mathcal{F}(x, \{w_i\})\)
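
The shortcut formula can be sketched as a small block on plain lists. The two-layer form of \(\mathcal{F}\) with a ReLU in between, and the identity shortcut when `Ws` is omitted, are assumptions for illustration.

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, xi) for xi in x]

def residual_block(x, W1, W2, Ws=None):
    """y = Ws x + F(x, {W1, W2}), with F(x) = W2 relu(W1 x)."""
    f = matvec(W2, relu(matvec(W1, x)))            # residual function F
    shortcut = x if Ws is None else matvec(Ws, x)  # identity unless Ws is given
    return [s + fi for s, fi in zip(shortcut, f)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
y_zero = residual_block([1.0, -2.0], zeros, zeros)
y_id = residual_block([1.0, -2.0], identity, identity)
```

With zero weights the block passes its input through unchanged, which is why the layers only have to learn the residual \(h(x) - x\).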

- Convolutional Neural Network: computer vision
    - convolutional layer with kernels

- Recurrent Neural Network
    - LSTM: a memory cell \(c_t\), an input gate \(i_t\), an output gate \(o_t\) and a forget gate \(f_t\). \(x_t \rightarrow h_t\)
    - Attention: potential of `memory`
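
One LSTM step \(x_t \rightarrow h_t\), as a minimal scalar sketch. The gate equations follow the commonly used formulation, which these notes do not spell out, so treat them and the parameter names in `p` as assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p holds input weights w*, recurrent weights u*, biases b*."""
    i_t = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])    # input gate
    f_t = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])    # forget gate
    o_t = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])    # output gate
    g_t = math.tanh(p["wg"] * x_t + p["ug"] * h_prev + p["bg"])  # candidate value
    c_t = f_t * c_prev + i_t * g_t                               # memory cell update
    h_t = o_t * math.tanh(c_t)                                   # hidden state / output
    return h_t, c_t

keys = ["wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg"]
zero_params = {k: 0.0 for k in keys}
h_t, c_t = lstm_step(1.0, 0.0, 1.0, zero_params)
```

With all parameters at zero every gate sits at 0.5, so the cell simply halves its previous memory; training moves the gates away from this neutral point.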

- applications
    - Semantic MEDLINE: MEDLINE intelligence, NLP on medical papers
    - ORiGAMI: term weight and so on
    - Acoustic modeling: ASR
    - Image description

### Spiking Neural Network (3rd generation)

Spiking neuron: accumulates its activation into a potential over time, and only sends out a signal (a "spike") when this potential crosses a threshold, after which the neuron is reset.

Integrator: memory or nonlinear response
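
The accumulate, threshold, reset cycle can be sketched as a simple integrate-and-fire neuron. The `leak` parameter and the default threshold are assumptions added for illustration; with `leak=0.0` this is the plain integrator described above.

```python
def integrate_and_fire(inputs, threshold=1.0, leak=0.0, reset=0.0):
    """Return a 0/1 spike train for a sequence of input currents."""
    potential = reset
    spikes = []
    for current in inputs:
        # Accumulate the input into the potential (optionally leaking a fraction).
        potential = (1.0 - leak) * potential + current
        if potential >= threshold:
            spikes.append(1)   # fire a spike
            potential = reset  # and reset the neuron
        else:
            spikes.append(0)
    return spikes

spike_train = integrate_and_fire([0.4] * 6)
```

A constant sub-threshold input still produces spikes at a regular rate, so the input strength ends up encoded in spike timing and spike rate.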

- BP on SNN
- SNN on very low bit rate speech coding: replace HMM
- DL in SNN: low energy cost, but no BP, spike time & spike rate

### Refs

### To Explore

### Extras

- ethics principles: bias imposed by training data
