Notes are mainly from DLCV 2021 (NTU COMME5052)
For DNN
For the weights in $y = Wx + b$
- $W_i$ looks much like the average of all the training data of class $i$ ($W_i$ can be viewed as an exemplar of the corresponding class), since $y_1 = W_1^T x + b_1, \ldots, y_n = W_n^T x + b_n$ for $n$-class classification. A large inner product means the two vectors involved are similar.
Why softmax?
==> interpret classifier scores as probabilities $$P(Y=k|X=x_i) = \frac{\exp(s_k)}{\sum_{j}{\exp(s_j)}}$$ with $s = f(x_i;W)$ as the classifier output
- exp –> makes every value positive
- normalize –> values sum to 1
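A minimal NumPy sketch of the softmax above (the max-subtraction trick is mine, added for numerical stability; it cancels in the ratio and does not change the result):

```python
import numpy as np

def softmax(s):
    # Subtract the max score before exponentiating so exp() cannot
    # overflow; the shift cancels out in the normalization.
    e = np.exp(s - np.max(s))
    return e / e.sum()  # exp makes values positive; division makes them sum to 1

scores = np.array([2.0, 1.0, 0.1])  # raw classifier scores s = f(x; W)
p = softmax(scores)                 # a valid probability distribution
```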
Why $-\log$ in the loss function?
- $L_i = -\log(1) = 0$ ==> if the prediction is good, the loss is 0
- $L_i = -\log(0.00001) \to \infty$ ==> if the prediction is bad, the loss blows up
Cross-Entropy loss
==> How similar do the predicted probability vector and the ground-truth (one-hot) vector look?
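A small sketch of the two loss behaviors described above, assuming a one-hot ground-truth vector (the epsilon inside the log is my addition to avoid `log(0)`):

```python
import numpy as np

def cross_entropy(p_pred, y_true):
    # y_true is one-hot, so only the log-probability assigned to the
    # correct class contributes: L_i = -log(p_correct).
    return -np.sum(y_true * np.log(p_pred + 1e-12))

y = np.array([1.0, 0.0, 0.0])

# Confident, correct prediction -> loss near 0
good = cross_entropy(np.array([0.99, 0.005, 0.005]), y)

# Prediction assigns ~0.00001 to the true class -> loss is huge
bad = cross_entropy(np.array([0.00001, 0.5, 0.49999]), y)
```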
Gradient Descent v.s. Stochastic Gradient Descent
- GD: using all training data to update the gradient per iteration
- SGD: only using a minibatch of training data per iteration (re-sampling the batch of data every iteration)
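The SGD loop above can be sketched on a toy least-squares problem (all data, sizes, and hyperparameters here are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: N samples, D features
N, D = 100, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D)      # targets from a hidden linear model
w = np.zeros(D)
lr, batch_size = 0.1, 16

for it in range(300):
    # SGD: re-sample a fresh minibatch every iteration
    # (plain GD would instead use all N rows each step)
    idx = rng.choice(N, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size  # gradient of 1/2 MSE on the batch
    w -= lr * grad
```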
Why Sigmoid?
$$\sigma(t) = \frac{1}{1+e^{-t}}$$ ==> from linear to non-linear. Stacking multiple NN layers is only meaningful with non-linearity; otherwise, if every layer were linear, $W_1 \times W_2 \times W_3$ could simply be collapsed into one big $W$.
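The collapse of stacked linear layers can be checked numerically (shapes are arbitrary, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(3, 4))

# Three stacked linear layers with no activation in between...
deep = W3 @ (W2 @ (W1 @ x))

# ...are exactly one linear layer with W = W3 W2 W1
W = W3 @ W2 @ W1
single = W @ x
```

With a non-linearity such as $\sigma$ applied between layers, no such single $W$ exists, which is what makes depth useful.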
Why multi-layer?
So the data can be separated better: each layer transforms the features so that the classes become easier to separate.
Input-output function of a single neuron
$$y = \sigma\left(\sum_i w_i x_i + b\right)$$
Why regularization?
$$E(w) = \frac{1}{2} \sum_{i}{w_i^2}$$ ==> the regulariser discourages the network from using extreme weights ==> to avoid **overfitting**
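A one-function sketch of the penalty above (in practice it is scaled by a strength hyperparameter $\lambda$ and added to the data loss; that scaling is omitted here to match the notes):

```python
import numpy as np

def l2_penalty(w):
    # E(w) = 1/2 * sum_i w_i^2 -- grows quadratically, so extreme
    # weights are punished far more than small ones.
    return 0.5 * np.sum(np.asarray(w) ** 2)

small = l2_penalty([0.1, -0.1, 0.2])
large = l2_penalty([10.0, -10.0, 20.0])  # same direction, 100x the scale
```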
For CNN
- Property 1: local connectivity ==> only look at a small region of the image at a time
- Property 2: weight sharing (the left eye and the right eye are both eyes, so they can perhaps share weights)
Meaning of convolution
==> a weighted moving sum: like taking the inner product between the filter and a small patch of the image; when the two are similar, the inner product is large (close to 1 for normalized vectors). In other words, convolution searches the input image for patterns that resemble the filter.
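The "sliding inner product" view can be sketched directly (a valid cross-correlation, which is what DL frameworks call convolution; the edge image and filter below are my examples):

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image; each output value is the inner
    # product of the kernel with the patch underneath it.
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# A vertical dark-to-bright edge, and a filter looking for that pattern
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1]], dtype=float)
edge = np.array([[-1, 1],
                 [-1, 1]], dtype=float)
resp = conv2d(img, edge)  # responds most strongly at the edge column
```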
Why padding?
So that the image border also gets scanned, and so that the output feature map does not shrink.
Why stride?
Move several pixels per step, which downsamples the output.
Output size?
$$\left\lfloor\frac{W+2P-K}{S}\right\rfloor+1$$ i.e., floor((input_size + 2*padding - kernel_size) / stride) + 1
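The formula as a one-liner, with two standard sanity checks (3x3 kernel with padding 1 preserving size; stride 2 downsampling):

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    # floor((W + 2P - K) / S) + 1, per spatial dimension
    return (input_size + 2 * padding - kernel_size) // stride + 1

same = conv_output_size(32, 3, padding=1, stride=1)  # 3x3, "same"-style padding
down = conv_output_size(7, 3, padding=0, stride=2)   # strided downsampling
```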