If you're here, you probably checked out a few blog posts or Wikipedia to understand Kernels. So this is just a few additional thoughts. Firstly, why even do this 'kernel trick'?
The application we will be looking at in this post is clustering a set of points. For this, it is of course important to have some kind of distance measure between the points, in fact that is the main function needed.
Consider a dataset of 4-dimensional points. Pick a point (x1,x2,x3,x4). For the sake of this example you find that to separate the points in multiple clusters you will have to move them into the 10-dimensional space using the following formula:
Fully expanded, each of the 4-dimensional points is now the following 10-dimensional point:
You could just convert all points into this and then apply a clustering algorithm, which will use some kind of distance measure on those points such as euclidean distance, but observe all these calculations.. For each point you have to do calculations. Now imagine if you did not have to convert all points. You use the polynomial kernel:
The crucial point here is that the kernel gives us a kind of similarity measure (due to the dot product), which is what we want. Since this is basically the only thing we need for a successful clustering, using a kernel works here. Of course, if you needed the higher dimension for something more complicated, a kernel would not suffice any longer. The second crucial point here is that calculating the kernel is much faster. You do not have to first convert all the points to 10 dimensions and then apply some kind of distance measure, no, you do that in just one dot operation, i.e. one sum loop. To be precise, you iterate times with a kernel, but you do it times with the naive way of converting points to higher dimension.