Information about Murpy's Machine Learning:14. Kernel

14.1 Introduction • how do we represent a text document or protein sequence, which can be of variable length? • One approach to such problems is to define a generative model for the data, and use the inferred latent representation and/or the parameters of the model as features, and then to plug these features in to standard methods. • For example, in Chapter 28, we discuss deep learning, which is essentially an unsupervised way to learn good feature representations. • Another approach is to assume that we have some way of measuring the similarity between objects, that doesn’t require preprocessing them into feature vector format. • For example, when comparing strings, we can compute the edit distance between them. • In this chapter, we will discuss several kinds of kernel functions • 길이가 일정치 않은 feature는 어떻게 표현할까 • Latent representation LDA 각 문서의 토픽 분포 • Kernel: 두 feature 간의 similarity만으로 feature를 표현 • e.g 두 스트링 비교 editing distance • Support Vector Machine • kernelized ridge regression 을 sparse linear machine으로 바꾸는 관점 • maximum margin classifier로 보는 관점

14.2 Kernel functions • We define a kernel function to be a real-valued function of two arguments, κ(x,x)∈R,for x,x’∈X. • symmetric • non-negative • can be interpreted as a measure of similarity

14.2.1 RBF kernels • squared exponential kernel(SE kernel) or Gaussian kernel • If Σ is diagonal, this can be written as • We can interpret the σj as defining the characteristic length scale of dimension j. radial basis function or RBF kernel Here σ2 is known as the bandwidth.

• 시그마가 커질수록, 두 feature간 거리가 멀어도 어느정도 수치는 나오게 된다. • 시그마가 작으면, 거리가 좀만 멀어져도 kernel 유사도는 급히 작아진다

14.2.2 Kernels for comparing documents • If we use a bag of words representation, where xij is the number of times words j occurs in document i, we can use the cosine similarity, • Unfortunately, this simple method does not work very well, (stop word)

14.2.3 Mercer (positive definite) kernels • In general, if the kernel is Mercer, then there exists a function φ mapping x∈X to RD such that • For example, consider the (non-stationary)polynomial kernel

14.2.3 Mercer (positive definite) kernels • Gram matrix, • kernel function satisfy the requirement that the Gram matrix be positive definite for any set of inputs • It can be shown (Schoelkopf and Smola 2002) that the Gaussian kernel is a Mercer kernel as is the cosine similarity kernel (Sahami and Heilman 2006) • Mercer’s theorem. • If the Gram matrix is positive definite, we can compute an eigenvector decomposition of it as follows

14.2.4 Linear kernels • Deriving the feature vector implied by a kernel is in general quite difficult, and only possible if the kernel is Mercer. • However, deriving a kernel from a feature vector is easy: we just use • If φ(x)=x,we get the linear kernel, defined by κ • This is useful if the original data is already high dimensional, and if the original features are individually informative, e.g., a bag of words representation where the vocabulary size is large, or the expression level of many genes. • s. In such a case, the decision boundary is likely to be representable as a linear combination of the original features, so it is not necessary to work in some other feature space. • Of course, not all high dimensional problems are linearly separable. For example, images are high dimensional, but individual pixels are not very informative, so image classification typically requires non-linear kernels (see e.g., Section 14.2.7).

14.2.5 Matern kernels • The Matern kernel, which is commonly used in Gaussian process regression (see Section 15.2) • Where r=||x−x ||, ν>0, l>0, and Kν is a modified Bessel function. • As ν→∞, this approaches the SE kernel. If ν=12, the kernel simplifies to

14.2.6 String kernels • The real power of kernels arises when the inputs are structured objects. • As an example, we now describe one way of comparing two variable length strings using a string kernel. • Consider two strings x, and x of lengths D, D’, each defined over the alphabet A • A={A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V} • Now let φs(x) denote the number of times that substrings appears in string x • More formally and more generally, let us say that’s is a substring of x if we can write x=usv for some (possibly empty) strings u, s and v . • . x string, x' string의 커널은, substring을 많이 공유하면 커진다

14.2.7 Pyramid match kernels • 이미지에서 점을 추출하고 bag of words 로 표현한다(SIFT) • 여러 레벨을 가진 grid로 나누고, 얼마나 일치하는지 본다(같은 grid에 얼마나 같이 등장하는지?) • This is a Mercer kernel.

14.3 Using kernels inside GLMs • We define a kernel machine to be a GLM where the input feature vector has the form • If κ is an RBF kernel, this is called an RBF network We discuss ways to choose the μk parameters below. • We will call Equation 14.28 a kernelised feature vector. Note that in this approach, the kernel need not be a Mercer kernel. • We can use the kernelized feature vector for logistic regression by defining • This provides a simple way to define a non-linear decision boundary. • As an example, consider the data coming from the exclusive or or xor function.

14.3 Using kernels inside GLMs • Linear regression의 kernelized 버전

14.3.2 L1VMs, RVMs, and other sparse vector machines • The main issue with kernel machines is: how do we choose the centroids μk? • If the input is low-dimensional Euclidean space, we can uniformly tile the space occupied by the data with prototypes, as we did in Figure 14.2(c). = 공간을 균일하게 나눈다 • However, this approach breaks down in higher numbers of dimensions because of the curse of dimensionality. 차원이 커지게 되면, 각 공간이 너무 많아 져서 차원의 저주에 걸리게 된다 • A simpler approach is to make each example xi be a prototype, so we get • Now we see D = N, so we have as many parameters as data points. However, we can use any of the sparsity- promoting priors for w discussed in Chapter 13 to efficiently select a subset of the training exemplars. 이런 접근법은 13장 sparse linear model에서 했던 w를 sparse하게 만드는 접근법을 training example의 subset을 고르는 문제로 바꾸는 것이면 가능하게 한다. wi=0이면, xi는 없어도 예측 가능 • We call this a sparse vector machine

• w에 대해서 1 regularization으로 penalty를 주는 sparse vector machine을 L1VM • w에 대해서 2 regularization으로 penalty를 주는 sparse vector machine을 L2VM • Automatic relevance determination (ARD)/sparse Bayesian learning (SBL) (13.7)로 sparsity를 얻는 모델을 relevance vector machine or RVM (Tipping 2001)라 한다 • Another very popular approach to creating a sparse kernel machine is to use a support vector machine or SVM Rather than using a sparsity-promoting prior, it essentially modifies the likelihood term, which is rather unnatural from a Bayesian point of view. Nevertheless, the effect is similar, as we will see.

14.4 The kernel trick • X자체나 X의 고차원 mapping을다루는 것이 아니라 kernel 자체를 다룸 • <x,x’> 를 k(x,x’)로 교체 • linear regression 같은데에서 커널을 쓰면, 고차원으로 사상된 training example 을 다루어야 했기 때문에, 학습 이나 예측 계산이 무거워졌었는데, 그런한 단점을 회피하는 trick 같은 것, • 심지어 무한 차원 매핑까지 할 수 있다(RBF= Gaussian Kernel) • 또는 structured object 의 다루기 어려운 feature 자체를 쓰는 것이 아니라, feature 간의 유사도라고 할 수 있는 kernel 을 씀으로써, 기존 알고리즘들이 structured object 에도 수행될 수 있게 해줌 • This is called the kernel trick.

14.4.1 Kernelized nearest neighbor classification • Recall that in a 1NN classifier (Section 1.4.2), we just need to compute the Euclidean distance of a test vector to all the training points, find the closest one, and look up its label. • This can be kernelized by observing that • This allows us to apply the nearest neighbor classifier to structured data objects. • 예를 들어서, 전에 다루었던 문서 커널을 써서, KNN으로 문서 분류를 할 수 있다.(tf-idf가 가장 비슷한 문서 k개 로 test 문서를 분류) Similarity를 kernel로만 계산

14.4.2 Kernelized K-medoids clustering • K –means가 각 cluster에 속한 feature들의 평균을 중심으로 하는 것처럼, k-medoids는 각 cluster의 중간값 • feature 자체를 다루는 것이 아니기 때문에 평균을 내지 못한다 • e-step = 각 데이터 xi가 어느 cluster에 속하는지 정한다(가장 가까운 centroid), k means와 같음 • m-step = 각 cluster k의 중심은 k cluster에 속한 애들 중에서, 중간값을 구한다 = k cluster에 속한 애들 중에서 다 른 애들과의 거리의 합이 최소 • 예를 들어서 string kernel 을 써서, 공통된 substring이 많은 비슷한 string을 clustering 할 수 있다

14.4.3 Kernelized ridge regression • Applying the kernel trick to distance-based methods was straightforward. • It is not so obvious how to apply it to parametric models such as ridge regression. • 14.4.3.1 The primal problem • Let x ∈ RD be some feature vector, and X be the corresponding N × D design matrix • Minimize • The optimal solution is given b • 14.4.3.2 The dual problem • using the matrix inversion lemma (Equation 4.107) • takes O(N3 + N2D) time to compute. This can be advantageous if D is large

14.4.3 Kernelized ridge regression • we can partially kernelize this, by replacing XXT with the Gram matrix K. • But what about the leading XT term? • Let us define the following dual variables: • Then we can rewrite the primal variables as follows • This tells us that the solution vector is just a linear sum of the N training vectors • When we plug this in at test time to compute the predictive mean, we get • kernelize ridge regression 의 쓰임새? 만약, 다루고자 하는 training example의 차원이 100000000이라면, 원래의 ridge regression이라면, 역행렬을 계산하는데 O(D^3)의 시간복잡도가 들 것이지만, Kernel ridge regression은 O(N^3)의 시간 복잡도를 가진다. • 하지만, 예측 계산은 O(D)에서 O(ND)로 증가한다. 그러므로 α를 svm처럼 sparse하게 만들어서(예측에 support vector만 써서) 해결해야 한다

14.5 Support vector machines (SVMs) • If L is quadratic loss, this is equivalent to ridge regression, • Kernelized ridge regression is kernelized, but not sparse • loss function을 hinge(classification)나 epsilon insensitive loss function(regression)을 쓰면, solution이 sparse해지 고, prediction은 오직 training data의 일부(support vectors)만을 가지고, 예측할 수 있다. • This combination of the kernel trick plus a modified loss function is known as a support vector machine or SVM

14.5.1 SVMs for regression • The problem with kernelized ridge regression is that the solution vector w depends on all the training inputs. • We now seek a method to produce a sparse estimate • epsilon insensitive loss function • This means that any point lying inside an-tube around the prediction is not penalized, as in Figure 14.10. 예측치의 오차가 특정 범위(epsilon)에 있으면, 페 널티는 가하지 않는다. 하지만, 그 범위를 넘어가 면 오차-epsilon만큼 페널티를 가한다

• his objective is convex and unconstrained, but not differentiable, because of the absolute value function in the loss term. • One popular approach is to formulate the problem as a constrained optimization problem. • In particular, we introduce slack variables to represent the degree to which each point lies outside the tube 제약조건이 true가 되도록 ξ를 늘려주고, 늘린 만큼 페널티를 준다

• One can show (see e.g., (Schoelkopf and Smola 2002)) that the optimal solution has the form • Once the model is trained, we can then make predictions using

14.5.2 SVMs for classification • 14.5.2.1 Hinge loss yη가 1보다 크면 penalty는 없다 f(X)가 > 1이고 y =1 이거나 f(X)가 < 1이고 y =-1 일 때 즉 정답일 때 pernality 없음

• The overall objective has the form • Once again, this is non-differentiable, because of the max term. However, by introducing slack variables ξi, one can show that this is equivalent to solving • Standard solvers take O(N3) time. • However, specialized algorithms, which avoid the use of generic QP solvers, have been developed for this problem, such as the sequential minimal optimization or SMO algorithm (Platt 1998). In practice this can take O(N2) • In such settings, it is common to use linear SVMs, which take O(N) time to train (Joachims 2006; Bottou et al. 2007). 제약조건이 true가 되도록 ξ를 늘려 제약조건을 완화해주고, 완화해준 만큼 페널티를 준다

• One can show that the solution has the form • where αi = λiyi and where α is sparse (because of the hinge loss). • The xi for which αi > 0 are called support vectors; these are points which are either incorrectly classified, or are classified correctly but are on or inside the margin (we discuss margins below). 서포트 벡터는 오분류된 점들이거나, 잘 분류됐지만, margin 안에 있는 점들 • At test time, prediction is done using • his takes O(sD) time to compute, where s ≤ N is the number of support vectors. • This depends on the sparsity level, and hence on the regularizer C 모든 training example의 알파의 weighted sum

14.5.2.2 The large margin principle • In this section, we derive Equation 14.58 form a completely different perspective. • 목표는 feature가 커널에 의해서 사상된 space 상에서 linear한 discriminant function f(x)을 구하는 것이다. X는 f(x)에 사영시킨 x⊥에서 r만큼 전진

14.5.2.2 The large margin principle • We would like to make this distance r = f(x)/||w|| as large as possible, for reasons illustrated in Figure 14.11. • intuitively, the best one to pick is the one that maximizes the margin, i.e., the perpendicular distance to the closest point • In addition, we want to ensure each point is on the correct side of the boundary, hence we want f(xi) yi > 0. • So our objective becomes • Note that by rescaling the parameters using w → kw and w0 → kw0, we do not change the distance of any point to the boundary, since the k factor cancels out when we divide by ||w||. • Therefore let us define the scale factor such that yifi = 1 for the point that is closest to the decision boundary. yi(wT xi + w0) = 1 이 되게 rescaling • We therefore want to optimize () • The constraint says that we want all points to be on the correct side of the decision boundary with a margin of at least 1. large margin classifier. 모든 training example에서 가장 가까운 점의 margin이 가장 큰 w 를 구하라

soft margin constraints • If the data is not linearly separable (even after using the kernel trick), there will be no feasible solution in which yifi ≥ 1 for all i. • We replace the hard constraints that yifi ≥ 0 with the soft margin constraints that yifi ≥ 1 − ξi. The new objective becomes • 1 • We therefore introduce slack variables ξi ≥ 0 such that ξi = 0 if the point is on or inside the correct margin boundary, and ξi = |yi −fi| otherwise • 0 < ξi ≤ 1 the point lies inside the margin, but on the correct side of the decision boundary. • If ξi > 1, the point lies on the wrong side of the decision boundary. 제약조건이 true가 되도록 ξ를 늘려 제약조건을 완화해주고, 완화해준 만큼 페널티를 준다

14.5.2.3 Probabilistic output • An SVM classifier produces a hard-labeling, ˆy(x) = sign(f(x)). • However, we often want a measure of confidence in our prediction. • One heuristic approach is to interpret f(x) as the log-odds ratio, log (p(y=1|x)/p(y=0|x) ). • where a, b can be estimated by maximum likelihood on a separate validation set. • However, the resulting probabilities are not particularly well calibrated

14.5.2.4 SVMs for multi-class classification • In Section 8.3.7, we saw how we could “upgrade” a binary logistic regression model to the multiclass case, by replacing the sigmoid function with the softmax, and the Bernoulli distribution with the multinomial. • Upgrading an SVM to the multi-class case is not so easy, since the outputs are not on a calibrated scale and hence are hard to compare to each other • The obvious approach is to use a one-versus-the-rest approach (also called one-vs-all), in which we train C binary classifiers, fc(x), where the data from class c is treated as positive, and the data from all the other classes is treated as negative. • However, this can result in regions of input space which are ambiguously labeled, as shown in Figure 14.14(a). • Another approach is to use the one-versus-one or OVO approach, also called all pairs, in which we train C(C−1)/2 classifiers to discriminate all pairs fc,c . C(C−1)/2개의 분류기C개의 분류기 ambiguously labeled

14.5.3 Choosing C • Typically C is chosen by cross-validation. • however, that C interacts quite strongly with the kernel parameters. • To choose C efficiently, one can develop a path following algorithm in the spirit of lars (Section 13.3.4). • The basic idea is to start with λ large, so that the margin 1/||w(λ)|| is wide, and hence all points are inside of it and have αi = 1. • By slowly decreasing λ, a small set of points will move from inside the margin to outside, and their αi values will change from 1 to 0, as they cease to be support vectors.

14.5.4 Summary of key points • Summarizing the above discussion, we recognize that SVM classifiers involve three key ingredients: • the kernel trick :prevent underfitting, • Sparsity, the large margin principle : prevent overfitting

14.6 Comparison of discriminative kernel methods • We have mentioned several different methods for classification and regression based on kernels, which we summarize in Table 14.1. • GPs and L2VM are generally the slowest, taking O(N3) time • if speed matters, use an RVM, • if well-calibrated probabilistic output matters (e.g., for active learning or control problems), use a GP. “tune” the kernel parameter, use efficient gradient based optimizers to maximize the marginal likelihood 확률 모델이냐? kernel 이 mercer가 아니어도 되냐?

14.7 Kernels for building generative models • There is a different kind of kernel known as a smoothing kernel which can be used to create non-parametric density estimates. • This can be used for unsupervised density estimation, p(x), as well as for creating generative models for classification and regression by making models of the form p(y, x). • 14.7.1 Smoothing kernels • A smoothing kernel is a function of one argument which satisfies the following properties: • A simple example is the Gaussian kernel, • We can control the width of the kernel by introducing a band width parameter h:

14.7.2 Kernel density estimation (KDE) aka Parzen window density estimator • Recall Gaussian mixture model • However, it requires specifying the number K and locations μk of the clusters • Gaussian 갯수가 커지면 RD 의 모든 분포를 표현할 수 있고, 각 training example 마다 Gaussian을 가지고 있다면 • We can generalize the approach by writing • simple non-parametric density model: no model fitting is required • The disadvantage is that the model takes a lot of memory to store, and a lot of time to evaluate. (training example 자체가 모델이고, 계산할 때도 training example을 모두 이용한다 ) 특정 feature x의 확률은, training example xi마다 있는 N개의 Gaussian에 의해 서 결정되는데, x랑 가까이 있는 xi의 Gaussian이 더 큰 영향을 준다 x - xi 가 0에 가까울수록 커짐

• 각 data example xi마다 자기 자신을 평균으로 하는 N 개의 정규분포가 있고, 확률을 알고자 하는 x가 N 개의 정규 분포에 얼마나 떨어져 있는지 정규분포로 알아내서 합산 이 x는 3개의 정규분포와 가까워서 수치 높음 각 정규 분포의 분산을 늘리면 smooth 해짐

14.7.4 Kernel regression wi(x) x가 각 xi와 가까우면 weighting이 커지므로, 가까운 training set의 y값을 다 많이 따른다

14.7.5 Locally weighted regression • If we define κh(x−xi)=κ(x,xi), we can rewrite the prediction made by kernel regression as follows • Note tha tκ(x,xi) need not be a smoothing kernel. If it is not, we no longer need the normalization term, so we can just write • N개의 function κ(x∗,xi)을 융합 • linear regression 은 모든 데이터에 대해서 동일한 w를 쓰는 반면 LWR은 test set x*마다 fitting 함으로 y를 예측한다. 즉 test set마다 다른 weight β(x∗)를 가지고 다음과 같이 fit • solution은 현재 예측하고자 하는 x*에 대한 weight β(x∗) 로 모든 training set을 예측했을 때(커널로 가까우면 큰 에러를 갖 도록 조정) 에러가 최소되게 weight 학습

Machine Learning. CS4780 / CS 5780 ... Optimal hyperplane, margin, kernels, stability; Generative Models : ... Kevin Murphy, "Machine Learning ...

Read more

Machine learning is an exciting and fast ... whereas Murphy's book has more ... A Few Useful Things to Know About Machine Learning: 14: Oct 24 ...

Read more

... using a radial-basis function kernel. ... Support Vector Machines: ... K. Murphy. Machine Learning: A Probabilistic Perspective.

Read more

Machine learning is the science of getting computers to act without being ... support vector machines, kernels, neural networks). (ii) Unsupervised ...

Read more

It has long been a central concern of Computer Science to create machines capable ... Vector Machines and Kernels. ... Murphy. Machine Learning: ...

Read more

Kevin Murphy's Machine learning: ... If they tell me that I ask them to explain the difference between Logistic Regression and Linear Kernel SVMs, ...

Read more

Stony Brook University CSE 512 ... Scholkopf video tutorial on kernels ... http://www.kernel-machines.org (Optional) Murphy, ...

Read more

MACHINE LEARNING THEORY Maria Florina ... Course description: Machine learning studies automatic methods for learning to ... Kernels as Features ...

Read more

## Add a comment