Programming Practice: MIT OCW, Machine Learning Day 08: Kernels

Rohit Singh, Tommi Jaakkola, and Ali Mohammad. 6.867 Machine Learning. Fall 2006. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA.

- Lecture 6.
  - Active Learning (cont)
  - Non-linear Predictions, Kernels
- Lecture 7.
  - Linear Regression and Kernels
  - Kernels

Lecture 6.

Active Learning (cont)

For the linear model $y = \theta^{*T}\mathbf{x} + \theta^*_0 + \epsilon, \ \ \ \epsilon \sim N(0,\sigma^{*2})$, the MSE of the maximum-likelihood estimates $\hat{\theta}, \hat{\theta}_0$ is
$E\left[\left\|\left[\begin{array}{c} \hat{\theta} \\ \hat{\theta}_0 \end{array} \right] - \left[\begin{array}{c} \theta^{*} \\ \theta_0^* \end{array} \right] \right\|^2 | \mathbf{X} \right] = \sigma^{*2} Tr[(\mathbf{X^TX})^{-1}]$
Active learning was the idea of designing $\mathbf{X}$ well, so that we get better estimates from fewer examples. The simplest design strategy is greedy: given $\mathbf{x}_1, ..., \mathbf{x}_k$, repeatedly choose $\mathbf{x}_{k+1}$ so that $Tr[(\mathbf{X^TX})^{-1}]$ is minimized. Suppose we already have $\mathbf{X}$, and let $\mathbf{A}=(\mathbf{X^TX})^{-1}$. Consider appending $[\mathbf{x}^T, 1]$ as a new row of $\mathbf{X}$.
$\left[\begin{array}{c} \mathbf{X} \\ \mathbf{x}^T \ \ 1 \end{array} \right]^T\left[\begin{array}{c} \mathbf{X} \\ \mathbf{x}^T \ \ 1 \end{array} \right] = (\mathbf{X^TX}) + \left[\begin{array}{c} \mathbf{x} \\ 1 \end{array} \right]\left[\begin{array}{c} \mathbf{x} \\ 1 \end{array} \right]^T = \mathbf{A}^{-1} + \mathbf{vv^T} \ \ \ (\mathbf{v}=[\mathbf{x}^T, 1]^T)$
We look for the $\mathbf{v}$ that minimizes
$Tr[(\mathbf{A}^{-1} + \mathbf{vv^T})^{-1}]$
By the Sherman-Morrison formula,
$(\mathbf{A}^{-1} + \mathbf{vv^T})^{-1} = \mathbf{A} - \frac{1}{1 + \mathbf{v^TAv}} \mathbf{Avv^TA}$
and, using $Tr(A+B) = Tr(A)+Tr(B), Tr(AB)=Tr(BA)$, we get
$Tr[(\mathbf{A}^{-1}+\mathbf{vv^T})^{-1}] = Tr[\mathbf{A}] - \frac{\mathbf{v^T AAv}}{1 + \mathbf{v^TAv}}$
(Here $\mathbf{v^TAAv}$ is a scalar, so its trace is the scalar itself.)
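For any $\mathbf{v} \neq 0$ we have $\frac{\mathbf{v^TAAv}}{1+\mathbf{v^TAv}}>0$ (because $\mathbf{A}$ is positive definite), so adding any $\mathbf{x}$ reduces the MSE. What we want is the $\mathbf{x}$ that gives the largest reduction. The reduction
$\frac{\mathbf{v^TAAv}}{1+\mathbf{v^TAv}}$
is bounded above by the largest eigenvalue of $\mathbf{A}$. In other words, a new example can remove at most one degree of freedom from the parameter space. If $\mathbf{x}$ were unconstrained, we would take $\mathbf{v}$ to be an infinitely long vector parallel to the eigenvector belonging to the largest eigenvalue of $\mathbf{A}$. Under a constraint $\|\mathbf{v}\|\leq c$, we take $\mathbf{v}$ to be the vector of length $c$ parallel to that eigenvector. If $\mathbf{x}$ is subject to other constraints, the choice of $\mathbf{v}$ has to respect them as well.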
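
The rank-one update and the eigenvector choice can be checked numerically. The following is a minimal NumPy sketch, not code from the lecture: the design matrix `X`, the bound `c`, and the helper `trace_after_adding` are hypothetical names chosen for illustration. Like the text, it ignores the fact that the last component of $\mathbf{v}$ is fixed to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: 5 examples with 2 inputs plus a constant column.
X = np.hstack([rng.normal(size=(5, 2)), np.ones((5, 1))])
A = np.linalg.inv(X.T @ X)


def trace_after_adding(A, v):
    """Tr[(A^{-1} + v v^T)^{-1}] computed with the rank-one update formula."""
    Av = A @ v  # A is symmetric, so v^T A A v = (A v) . (A v)
    return np.trace(A) - (Av @ Av) / (1.0 + v @ Av)


# Compare the formula with a direct matrix inverse for an arbitrary v.
v = rng.normal(size=3)
direct = np.trace(np.linalg.inv(np.linalg.inv(A) + np.outer(v, v)))
formula = trace_after_adding(A, v)

# Under ||v|| <= c, use the top eigenvector of A scaled to length c.
c = 2.0
eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order
v_best = c * eigvecs[:, -1]
reduction_best = np.trace(A) - trace_after_adding(A, v_best)
```

With these definitions `reduction_best` stays below `eigvals[-1]`, and no other vector of length `c` gives a larger reduction.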

So far we have used the MSE as the measure of how good an estimator is. Now we look at the variance of the prediction.
$\begin{aligned} var[y|\mathbf{x, X}] &= E\left[(\hat{\theta}^T\mathbf{x}+\hat{\theta}_0 - \theta^{*T}\mathbf{x} -\theta^{*}_0)^2 | \mathbf{x, X} \right]\\ &=E\left[\left[\begin{array}{c} \mathbf{x} \\ 1 \end{array} \right]^T \left(\left[ \begin{array}{c} \hat{\theta} \\ \hat{\theta}_0 \end{array} \right] - \left[ \begin{array}{c} \theta^* \\ \theta_0^* \end{array} \right]\right) \left(\left[ \begin{array}{c} \hat{\theta} \\ \hat{\theta}_0 \end{array} \right] - \left[ \begin{array}{c} \theta^* \\ \theta_0^* \end{array} \right]\right)^T\left[\begin{array}{c} \mathbf{x} \\ 1 \end{array} \right]| \mathbf{x, X}\right]\\ &= \left[\begin{array}{c} \mathbf{x} \\ 1 \end{array} \right]^T \sigma^{*2}(\mathbf{X^TX})^{-1}\left[\begin{array}{c} \mathbf{x} \\ 1 \end{array} \right] \\ &= \sigma^{*2}\cdot \mathbf{v^TAv}\end{aligned}$

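Hence we want the $\mathbf{v}$ that maximizes $\mathbf{v^TAv}$, that is, the point where the current prediction is most uncertain. Under the norm constraint above, this is the same $\mathbf{v}$ that gives the largest reduction in MSE.
(I expected the point to be that reducing the MSE and reducing the variance pull in opposite directions, but it turns out we want to query where the variance is large...)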
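
To make the variance criterion concrete, here is a small sketch that picks the most uncertain point from a finite candidate pool and then applies the rank-one update. The pool, its range, and all variable names are assumptions made for illustration, and the variance is computed up to the factor $\sigma^{*2}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical current design: 6 examples with 2 inputs plus a constant column.
X = np.hstack([rng.normal(size=(6, 2)), np.ones((6, 1))])
A = np.linalg.inv(X.T @ X)

# Hypothetical pool of candidate queries x, each extended to v = [x, 1].
pool = rng.uniform(-3.0, 3.0, size=(50, 2))
V = np.hstack([pool, np.ones((50, 1))])

# v^T A v for every candidate (predictive variance divided by sigma^2).
variance = np.einsum("ij,jk,ik->i", V, A, V)

# Query the candidate with the largest predictive variance.
i_var = int(np.argmax(variance))
v = V[i_var]
s = variance[i_var]

# Rank-one update of A after adding the row v^T to X.
Av = A @ v
A_new = A - np.outer(Av, Av) / (1.0 + s)

# Variance at the queried point after the update.
s_new = v @ A_new @ v
```

After the update, the variance at the queried point drops from `s` to `s / (1 + s)`, which follows directly from the rank-one formula.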

Non-linear Predictions, Kernels

The methods discussed so far also apply to the image $\phi(\mathbf{x})$ of $\mathbf{x}$ under a non-linear map. For example, starting from the linear model $y = \theta x + \theta_0 + \epsilon, \epsilon \sim N(0, \sigma^2)$, mapping $x$ to a higher-dimensional vector that includes $x^2$ gives a quadratic model, and mapping it to a vector that also includes $x^3$ gives a third order model.
For instance, $\phi(x) = [1, \sqrt{2}x, x^2]^T$ or $\phi(x)=[1, \sqrt{3}x, \sqrt{3}x^2, x^3]^T$. The meaning of $\sqrt{2}$ and $\sqrt{3}$ will become clear later.
The new polynomial regression model is
$y = \theta^T \phi(x) + \theta_0 + \epsilon, \ \ \epsilon \sim N(0, \sigma^2)$
We map to a higher-dimensional space and then run linear regression there, but without regularization this often overfits (figure 2).

When $\mathbf{x}$ is multi-dimensional, we can map it to a higher-dimensional space in the same way, for example
$\mathbf{x}=[x_1, x_2]^T\mapsto^{\phi} [1, x_1, x_2, \sqrt{2}x_1x_2, x_1^2,x_2^2]^T = \phi(\mathbf{x})$
Transforming to a high-dimensional space can be very expensive. However, $\phi$ does not always have to be computed explicitly. For example,
$\begin{aligned} \phi(x) &= [1, \sqrt{3}x, \sqrt{3}x^2, x^3]^T \\ \phi(x') &= [1, \sqrt{3}x', \sqrt{3}x'^{2}, x'^{3}]^T \\ \phi(x)^T\phi(x') &= 1 + 3xx' + 3(xx')^2 + (xx')^3 = (1+xx')^3 \end{aligned}$
so there can be an easy-to-compute $K$ with $\phi(x)^T\phi(x')=K(x,x')$ that represents $\phi$ implicitly (indeed, $\phi$ was chosen so that such a $K$ exists). The plan is to rewrite the problem so that it uses the cheap $K$ instead of $\phi$.
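
The identity is easy to confirm numerically. A minimal sketch follows; the function names `phi` and `K` simply mirror the symbols in the text, and the test values are arbitrary.

```python
import numpy as np


def phi(x):
    """Explicit cubic feature map [1, sqrt(3) x, sqrt(3) x^2, x^3]."""
    return np.array([1.0, np.sqrt(3.0) * x, np.sqrt(3.0) * x**2, x**3])


def K(x, xp):
    """Kernel (1 + x x')^3, evaluated without building phi."""
    return (1.0 + x * xp) ** 3


x, xp = 0.7, -1.3  # arbitrary scalar inputs
lhs = phi(x) @ phi(xp)  # inner product in feature space
rhs = K(x, xp)          # the same number from the kernel
```

The $\sqrt{3}$ factors are exactly what turns the inner product into the binomial expansion of $(1+xx')^3$.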

Lecture 7.

Linear Regression and Kernels

Estimating the model without $\theta_0$, $y = \theta^T \phi(\mathbf{x}) + \epsilon$, amounts to minimizing
$J(\theta) = \sum_{t=1}^n (y_t-\theta^T \phi(\mathbf{x}_t))^2 + \lambda\|\theta\|^2$
As stated in the previous section, we express this optimization problem in terms of $K$ rather than $\phi$.
Regularization shrinks $\theta$ toward $0$, and any component of $\theta$ orthogonal to the training feature vectors is driven to $0$. Hence the solution lies in the span of $\{\phi(\mathbf{x}_t)\}$.
proof.

Consider the stationarity condition
$\frac{dJ}{d\theta} = -2 \sum_{t=1}^n \underbrace{(y_t-\theta^T\phi(\mathbf{x}_t))}_{\alpha_t}\phi(\mathbf{x}_t) + 2\lambda \theta=0$
Then
$\theta = \frac{1}{\lambda} \sum_{t=1}^n \alpha_t \phi(\mathbf{x}_t)$
satisfies $\frac{dJ}{d\theta}=0$ and, since $J$ is strictly convex for $\lambda > 0$, is the optimal solution.

Substituting this back into the definition of $\alpha_t$ gives
$\alpha_t = y_t - \theta^T \phi(\mathbf{x}_t)=y_t - \frac{1}{\lambda}\sum_{t'=1}^n \alpha_{t'} \phi(\mathbf{x}_{t'})^T \phi(\mathbf{x}_t)$
so $\alpha_t$ is determined by the $y_t$ and the inner products $\phi(\mathbf{x}_{t'})^T\phi(\mathbf{x}_t)$ alone.
Using the Gram matrix
$\mathbf{K} = \left[\begin{array}{cccc} \phi(\mathbf{x}_1)^T\phi(\mathbf{x}_1) & \phi(\mathbf{x}_1)^T\phi(\mathbf{x}_2) & \cdots & \phi(\mathbf{x}_1)^T\phi(\mathbf{x}_n) \\ \cdots & \cdots & \cdots & \cdots \\ \phi(\mathbf{x}_n)^T\phi(\mathbf{x}_1) & \cdots & \cdots &\phi(\mathbf{x}_n)^T\phi(\mathbf{x}_n) \end{array} \right]$
this can be written in vector form as
$\begin{aligned} \mathbf{a} &= [\alpha_1, ..., \alpha_n]^T \\ \mathbf{y} &= [y_1, ..., y_n]^T \\ \mathbf{a} &= \mathbf{y} - \frac{1}{\lambda} \mathbf{Ka} \end{aligned}$
and the solution is
$\hat{\mathbf{a}} = \lambda(\lambda \mathbf{I} + \mathbf{K})^{-1} \mathbf{y}$
Once the $\hat{\alpha}_t$ are obtained,
$y = \hat{\theta}^T \phi(\mathbf{x}) = \sum_{t=1}^n (\hat{\alpha}_t/\lambda)\phi(\mathbf{x}_{t})^T\phi(\mathbf{x})=\sum_{t=1}^n(\hat{\alpha}_t/\lambda)K(\mathbf{x}_{t}, \mathbf{x})$
gives the predicted response $y$ for a new example $\mathbf{x}$. Here $K(\mathbf{x}_{t}, \mathbf{x}) = \phi(\mathbf{x}_t)^T\phi(\mathbf{x})$ is called the kernel function.
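
The derivation above can be turned into a few lines of NumPy. The sketch below is my own illustration under stated assumptions: the toy data, the value of `lam`, the cubic kernel, and the function names are all hypothetical. It solves for $\hat{\mathbf{a}}$ in kernel form and compares the predictions with those of the explicit (primal) ridge solution $\hat{\theta} = (\Phi^T\Phi + \lambda \mathbf{I})^{-1}\Phi^T\mathbf{y}$.

```python
import numpy as np

rng = np.random.default_rng(0)


def phi(x):
    """Explicit cubic feature map for scalar inputs, vectorised over x."""
    x = np.asarray(x, dtype=float)
    r3 = np.sqrt(3.0)
    return np.stack([np.ones_like(x), r3 * x, r3 * x**2, x**3], axis=-1)


def kernel(x, xp):
    """K(x, x') = (1 + x x')^3 for all pairs of scalar inputs."""
    return (1.0 + np.outer(x, xp)) ** 3


# Hypothetical toy data and regularization strength.
n, lam = 20, 0.5
x_train = rng.uniform(-1.0, 1.0, size=n)
y_train = np.sin(3.0 * x_train) + 0.1 * rng.normal(size=n)

# Kernel form: a = lam * (lam I + K)^{-1} y.
K = kernel(x_train, x_train)
alpha = lam * np.linalg.solve(lam * np.eye(n) + K, y_train)


def predict(x_new):
    """y(x) = sum_t (alpha_t / lam) K(x_t, x)."""
    return kernel(np.atleast_1d(x_new), x_train) @ alpha / lam


# Primal form with explicit features, for comparison.
Phi = phi(x_train)
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y_train)

x_test = np.linspace(-1.0, 1.0, 5)
y_kernel = predict(x_test)
y_primal = phi(x_test) @ theta
```

Both routes give the same predictions. The kernel form solves an $n \times n$ system and never needs $\phi$, which matters when the feature space is large or infinite-dimensional.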

Kernels

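We have now rewritten regularized linear regression in kernel form. By changing the kernel function $K$ we can realize, for example, a polynomial expansion of any degree, as well as linear regression on the image of $\mathbf{x}$ under other high-dimensional maps.
Kernels are sometimes classified by the kind of high-dimensional map they realize. For example: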
- Polynomial kernel
$K(\mathbf{x}, \mathbf{x}')=(1 + \mathbf{x}^T\mathbf{x}')^p, \ \ p = 1,2,...$
- Radial basis kernel
$K(\mathbf{x}, \mathbf{x}') = \exp \left(-\frac{\beta}{2}\|\mathbf{x}-\mathbf{x}'\|^2 \right), \ \ \beta>0$

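The polynomial kernel is the kernel function of the map $\phi$ that sends $\mathbf{x}=[x_1,...,x_d]^T$ to the monomials appearing in the multinomial expansion of $(1 + x_1+\cdots +x_d)^p$, each scaled by the square root of its multinomial coefficient. The radial basis kernel is the kernel function of a map into an infinite-dimensional space. The radial basis kernel can be read as a measure of how close $\mathbf{x}$ and $\mathbf{x}'$ are.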
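
Both kernels take only a few lines to implement. The following sketch is an illustration under my own assumptions: the function names, the random inputs, and the explicit map `phi2` are not from the lecture. `phi2` is one feature map that reproduces the polynomial kernel with $p=2$ for two-dimensional inputs, and it carries a $\sqrt{2}$ on the linear terms as well as on the cross term.

```python
import numpy as np


def polynomial_kernel(X, Xp, p=2):
    """Gram matrix of K(x, x') = (1 + x^T x')^p between rows of X and Xp."""
    return (1.0 + X @ Xp.T) ** p


def rbf_kernel(X, Xp, beta=1.0):
    """Gram matrix of K(x, x') = exp(-beta/2 * ||x - x'||^2)."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Xp**2, axis=1)[None, :]
          - 2.0 * X @ Xp.T)
    return np.exp(-0.5 * beta * np.maximum(sq, 0.0))  # clip rounding noise


def phi2(X):
    """Hypothetical explicit map for p = 2, d = 2."""
    x1, x2 = X[:, 0], X[:, 1]
    r2 = np.sqrt(2.0)
    return np.stack(
        [np.ones_like(x1), r2 * x1, r2 * x2, r2 * x1 * x2, x1**2, x2**2],
        axis=1,
    )


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))  # 8 arbitrary two-dimensional inputs

K_poly = polynomial_kernel(X, X, p=2)
K_explicit = phi2(X) @ phi2(X).T
K_rbf = rbf_kernel(X, X, beta=1.0)
```

`K_poly` and `K_explicit` agree, while `K_rbf` has ones on its diagonal and decays toward zero as two points move apart, which is the sense in which it measures closeness.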

Thursday, August 24, 2017
