Neural Networks

Building Blocks

Gradient Descent

finding the optimum in parameter space by following the negative gradient of the loss

Strategies:

  • Batch: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$
  • Stochastic: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x_i; y_i)$, faster but noisy (high-variance) updates
  • Mini-batch: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x_{i:i+m}; y_{i:i+m})$, intermediate between the two (see the sketch after this list)
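
A minimal sketch of the three variants on a synthetic least-squares problem (data, names, and hyperparameters here are illustrative, not from the post); batch, stochastic, and mini-batch GD differ only in how many samples feed each gradient estimate:

```python
# Mini-batch SGD on least-squares linear regression with synthetic data.
# Batch GD and pure SGD fall out as the special cases batch_size = N and 1.
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w + 0.1 * rng.normal(size=N)

def sgd(batch_size, eta=0.1, epochs=50):
    w = np.zeros(D)
    for _ in range(epochs):
        perm = rng.permutation(N)                   # shuffle each epoch
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)  # gradient of mean squared error
            w -= eta * grad                         # theta = theta - eta * grad
    return w

w_batch = sgd(batch_size=N)   # batch gradient descent
w_sgd   = sgd(batch_size=1)   # stochastic gradient descent
w_mini  = sgd(batch_size=32)  # mini-batch gradient descent
```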

Challenges:

  • $\eta$: picking a good learning rate is hard; apply learning rate schedules (see the sketch after this list)
  • Saddle points and poor local minima can trap the optimizer: spiking NN for quantum tunneling?
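
Two common learning-rate schedules, as a minimal sketch (the decay constants below are illustrative, not from the post):

```python
import math

def step_decay(eta0, epoch, drop=0.5, every=10):
    # halve the learning rate every `every` epochs
    return eta0 * (drop ** (epoch // every))

def exp_decay(eta0, epoch, k=0.05):
    # smooth exponential decay: eta_t = eta_0 * exp(-k * t)
    return eta0 * math.exp(-k * epoch)

# usage inside the training loop: eta = step_decay(0.1, epoch)
```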

Improvements for SGD:

  • Momentum: $v_t = \gamma v_{t-1} + \eta \cdot \nabla_\theta J(\theta)$; $\theta = \theta - v_t$, dampens oscillations and accelerates along consistent directions
  • NAG, Nesterov accelerated gradient: add a look-ahead correction based on the momentum, $J(\theta) \to J(\theta - \gamma v_{t-1})$, which keeps the momentum term from overshooting
  • Adagrad, adaptive gradient: a per-parameter $\eta_i$ for each $\theta_i$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$, where $G_t$ accumulates the squared gradients
  • Adadelta: replace the ever-growing $G_t$ with a decaying, window-based average $E_t$
  • RMSprop: $E_t = \gamma E_{t-1} + (1-\gamma) g_t^2$
  • Adam, adaptive moment estimation: bias-corrected running averages of the gradient mean (first moment) and uncentered variance (second moment); see the sketch after this list
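
A minimal sketch of two of the update rules above (not the post's code), written as stateless step functions over parameters, gradient, and optimizer state:

```python
import numpy as np

def momentum_step(theta, grad, v, eta=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + eta * grad;  theta = theta - v_t
    v = gamma * v + eta * grad
    return theta - v, v

def adam_step(theta, grad, m, s, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # running estimates of the first and second moments of the gradient
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad**2
    # bias correction compensates for the zero-initialized moments
    m_hat = m / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s
```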

Asynchronous SGD for parallelization:

  • Hogwild
  • Downpour SGD
  • Delay-tolerant Algorithms for SGD
  • TensorFlow
  • EASGD, Elastic Averaging SGD: local variables are allowed to fluctuate around a shared center variable (see the sketch after this list)
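
A rough, sequential sketch of the EASGD idea (illustrative only, not a real parallel implementation): each worker's parameters are pulled elastically toward a shared center variable, and the center is pulled toward the workers.

```python
def easgd_round(workers, center, grads, eta=0.01, rho=0.1):
    # workers: list of parameter arrays; grads: their local gradients
    new_workers = []
    for theta_i, g_i in zip(workers, grads):
        # local step: gradient plus elastic force toward the center
        new_workers.append(theta_i - eta * (g_i + rho * (theta_i - center)))
    # center step: move toward the (previous) local variables
    center = center + eta * rho * sum(theta_i - center for theta_i in workers)
    return new_workers, center
```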

Extra to explore:

  • Shuffling and Curriculum Learning
  • Batch normalization: keep layer inputs roughly N(0,1) (see the sketch after this list)
  • Early stopping: avoid overfitting
  • Gradient noise: added noise helps nudge the optimizer away from poor local minima and saddle points
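
A minimal sketch of batch normalization at training time (illustrative; no running statistics or backward pass): standardize each feature over the batch, then rescale with learnable gamma and beta.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); gamma, beta: (features,)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # roughly N(0, 1) per feature
    return gamma * x_hat + beta
```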

Refs: Overview of Gradient Descent

Activation Function

a back-propagatable (differentiable), non-linear projection

must keep nonlinearity & a non-vanishing gradient (see the sketch below)
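
A minimal sketch of two common activations and their derivatives (illustrative, not from the post); sigmoid saturates and its gradient vanishes for large inputs, while ReLU keeps a unit gradient for positive inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, tends to 0 for large |x|

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)  # 1 for x > 0, so gradients do not shrink
```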

Practical tips:

Regularization

minimize the loss function while keeping model complexity reasonable

Supervised learning: $w^* = \arg\min_w \sum_i L(y_i, f(x_i; w)) + \lambda \Omega(w)$; the penalty term also stabilizes ill-conditioned problems

L1 (and L0), minimize the absolute sum (or the count of nonzero weights): Lasso regularization, makes $w$ sparse, $||w||_1 \le C$, a linear (diamond-shaped) bound

  • Feature Selection: fewer features needed
  • Interpretability: a less complex model

fewer features overall

L2, minimize the squared sum: Ridge Regression / Weight Decay, $||w||_2 \le C$, a quadratic (spherical) bound

won't zero out features,
but reduces dependence on any single feature (see the sketch below)
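
A minimal sketch (illustrative only) of how the two penalties enter the gradient of a regularized least-squares objective $J(w) = \frac{1}{N}||Xw - y||^2 + \lambda \Omega(w)$:

```python
import numpy as np

def grad_l2(w, X, y, lam):
    # Omega(w) = ||w||_2^2 -> gradient term 2 * lam * w (shrinks every weight)
    return 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w

def grad_l1(w, X, y, lam):
    # Omega(w) = ||w||_1 -> subgradient term lam * sign(w) (pushes weights to exactly 0)
    return 2 * X.T @ (X @ w - y) / len(y) + lam * np.sign(w)
```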

Tuning Methods

mostly trial and error (the evil approach)

Neural Networks

Spiking Neural Network (3rd generation)

Spiking neuron: accumulates its inputs into a membrane potential over time and only sends out a signal (a "spike") when this potential crosses a threshold, after which the neuron is reset.

Integrator: memory or nonlinear response
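
A minimal sketch of a leaky integrate-and-fire neuron (constants are illustrative, not from the post): the membrane potential integrates the input current, leaks toward rest, and spikes and resets when it crosses the threshold.

```python
import numpy as np

def lif(input_current, dt=1e-3, tau=0.02, v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    v = v_rest
    spikes = []
    for i_t in input_current:
        # leaky integration: dv/dt = (-(v - v_rest) + i_t) / tau
        v += dt * (-(v - v_rest) + i_t) / tau
        if v >= v_thresh:      # threshold crossing -> spike and reset
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return spikes

# usage: spikes = lif(np.full(1000, 1.2))  # constant supra-threshold input
```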

Refs

To Explore

Extras

Published: Thu 05 July 2018. By Dongming Jin
