notes are mainly from DLCV 2021 (NTU COMME5052)

For DNN

For the weights $W$ in $y = W * x + b$

  • $W_i$ looks very much like the average of the training data of class $i$ ($W_i$ can be viewed as an exemplar of the corresponding class), since $y_1 = W_1^T * x + b_1, \dots, y_n = W_n^T * x + b_n$ for $n$-class classification; a large inner product means the two vectors being compared are similar (see the sketch below)
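A minimal NumPy sketch of this view (the class count, dimensions, and random values are made up for illustration): each row of $W$ acts as a template that is compared against $x$ via an inner product.

```python
import numpy as np

# Hypothetical 3-class linear classifier on 4-dimensional inputs.
W = np.random.randn(3, 4)   # row i ~ "template" / exemplar of class i
b = np.random.randn(3)
x = np.random.randn(4)      # one input sample

scores = W @ x + b          # y_i = W_i^T x + b_i for every class i
pred = np.argmax(scores)    # the class whose template matches x best
```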

Why softmax?

==> interpret classifier scores as probabilities $$P(Y=k|X=x_i) = \frac{\exp(s_k)}{\sum_{j}\exp(s_j)}$$ with $s = f(x_i;W)$ as the classifier output

  • exp –> makes every score positive
  • normalize –> outputs sum to 1, so they can be read as probabilities
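A small sketch of the softmax step; subtracting the max before exponentiating is a standard numerical-stability detail added here, not something stated in the notes.

```python
import numpy as np

def softmax(s):
    # exp makes every score positive; subtracting the max avoids overflow
    e = np.exp(s - np.max(s))
    # normalize so the outputs sum to 1 and can be read as probabilities
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())   # [0.659 0.242 0.099] 1.0
```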

Why $-\log$ in the loss function?

  • $L_i = -\log(1) = 0$ ==> if the prediction is very good, the loss is 0
  • $L_i = -\log(0.00001) \to \infty$ ==> if the prediction is very bad, the loss blows up

Cross-Entropy loss

==> measures how similar the predicted probability vector and the ground-truth vector are
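A sketch of the cross-entropy loss for one sample, assuming a one-hot ground-truth vector (so the loss reduces to $-\log$ of the probability assigned to the correct class); the small epsilon is only a numerical guard against $\log(0)$.

```python
import numpy as np

def cross_entropy(probs, target_onehot):
    # H(p, q) = -sum_k p_k * log(q_k); with a one-hot target this is
    # simply -log(probability of the correct class)
    return -np.sum(target_onehot * np.log(probs + 1e-12))

target = np.array([0.0, 1.0, 0.0])
print(cross_entropy(np.array([0.01, 0.98, 0.01]), target))      # ~0.02, good prediction
print(cross_entropy(np.array([0.99, 0.00001, 0.00999]), target))  # ~11.5, bad prediction
```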

Gradient Descent vs. Stochastic Gradient Descent

  • GD: uses all training data to compute the gradient for each update
  • SGD: uses only a minibatch of training data per iteration (the minibatch is re-sampled every iteration); see the sketch below
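A rough sketch of the difference, assuming some `grad_fn` that returns the gradient of the loss on whatever data it is given (all names and values here are placeholders, not a specific framework API).

```python
import numpy as np

def gd_step(w, X, y, grad_fn, lr=0.1):
    # Gradient Descent: gradient computed on ALL training data each iteration
    return w - lr * grad_fn(w, X, y)

def sgd_step(w, X, y, grad_fn, lr=0.1, batch_size=32):
    # SGD: re-sample a minibatch every iteration and use only its gradient
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    return w - lr * grad_fn(w, X[idx], y[idx])
```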

Why Sigmoid?

$$\sigma(t) = \frac{1}{1+e^{-t}}$$ ==> turns the model from linear to non-linear. Stacking multiple NN layers only makes sense with a non-linearity in between; otherwise, if everything is linear, $W_1 \times W_2 \times W_3$ could just be folded into a single big $W$.
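A quick NumPy check of the collapse argument (shapes chosen arbitrarily): without a non-linearity, three stacked linear layers equal one linear layer with the product of the weight matrices; inserting a sigmoid between layers breaks that equivalence.

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

W1, W2, W3 = (np.random.randn(5, 5) for _ in range(3))
x = np.random.randn(5)

linear_stack = W3 @ (W2 @ (W1 @ x))
single_big_W = (W3 @ W2 @ W1) @ x
print(np.allclose(linear_stack, single_big_W))        # True: the layers collapse

nonlinear_stack = W3 @ sigmoid(W2 @ sigmoid(W1 @ x))  # cannot be rewritten as one W @ x
```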

Why multi-layer?

So that the data can be separated better: stacking layers lets the network learn more complex decision boundaries.

Input-output function of a single neuron

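A sketch of a single neuron's input-output function, assuming a sigmoid activation: a weighted sum of the inputs plus a bias, passed through the non-linearity (the numbers below are arbitrary).

```python
import numpy as np

def neuron(x, w, b):
    # single neuron: activation(w^T x + b)
    z = np.dot(w, x) + b              # weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

print(neuron(np.array([1.0, 2.0]), np.array([0.5, -0.3]), 0.1))  # 0.5 (since z = 0)
```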

Why regularization?

$$E(w) = \frac{1}{2} \sum_{i}{w_i^2}$$ ==> the regulariser discourages the network from using extreme weights ==> helps avoid **overfitting**
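A sketch of adding this L2 penalty to a task loss; `lambda_reg` and the loss value are placeholder numbers for illustration.

```python
import numpy as np

def l2_penalty(weights):
    # E(w) = 1/2 * sum_i w_i^2 -- penalizes extreme weight values
    return 0.5 * np.sum(weights ** 2)

w = np.random.randn(10)
data_loss = 0.7          # placeholder value for the task loss
lambda_reg = 1e-3        # regularization strength (hyperparameter)
total_loss = data_loss + lambda_reg * l2_penalty(w)
```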

For CNN

  • Property 1: local connectivity ==> each unit only looks at a small region of the image at a time
  • Property 2: weight sharing (a left eye and a right eye are both eyes, so they can probably share the same filter weights); see the comparison below
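A rough PyTorch comparison (the layer sizes are arbitrary) showing how these two properties cut the parameter count relative to a fully-connected layer producing an output of the same size.

```python
import torch.nn as nn

# 3x32x32 input; 16 output channels vs. 16*32*32 fully-connected output units
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
fc = nn.Linear(3 * 32 * 32, 16 * 32 * 32)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv))  # 448 = 16*3*3*3 + 16  (local, shared 3x3 filters)
print(count(fc))    # ~50M (every output unit connects to every input pixel)
```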

Meaning of convolution

==> a weighted moving sum; it is like taking the inner product of the filter and the image patch beneath it, and when the two look alike the inner product is large (close to 1 if both are normalized). In other words, convolution searches the input image for patterns that resemble the filter.
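A small NumPy sketch of this "pattern matching" view (written as cross-correlation, which is what DL frameworks actually compute under the name convolution); the response peaks where the image patch looks like the filter.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # slide the kernel over the image; at each position take the inner
    # product (weighted moving sum) of the kernel and the patch under it
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

kernel = np.eye(3)                  # a diagonal-edge filter
image = np.zeros((5, 5))
image[1:4, 1:4] = np.eye(3)         # place the same pattern inside the image
print(conv2d_valid(image, kernel))  # peak response (3.0) at the matching location
```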


Why padding?

So that the pixels at the image border also get covered by the filter, and so that the output feature map does not shrink.

Why stride?

Move the filter more than one pixel at a time (downsamples the output and reduces computation).

Output size?

$$\left\lfloor \frac{W + 2P - K}{S} \right\rfloor + 1$$ i.e. floor((input_size + 2*padding - kernel_size) / stride) + 1
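A quick check of this formula against PyTorch (the sizes below are arbitrary); note the floor division matters whenever $(W + 2P - K)$ is not a multiple of $S$.

```python
import torch
import torch.nn as nn

W_in, K, P, S = 32, 3, 1, 2
expected = (W_in + 2 * P - K) // S + 1   # floor((32 + 2 - 3) / 2) + 1 = 16

conv = nn.Conv2d(1, 1, kernel_size=K, stride=S, padding=P)
out = conv(torch.zeros(1, 1, W_in, W_in))
print(expected, out.shape[-1])           # 16 16
```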