This method is an implementation of the Kernel Fisher Discriminant (KFD) [Mika99a]. The aim is to find a binary kernel classifier which is the linear decision function in a feature space induced by the selected kernel function.
Friday, March 02, 2007
Tuesday, February 27, 2007
Kernel Principal Component Analysis
Kernel Principal Component Analysis (Kernel PCA) [5,6] is a method of non-linear feature extraction, closely related to methods applied in Support Vector Machines (SVMs)[3].
A cite of code of KPCA can be seen as follows:
http://cmp.felk.cvut.cz/cmp/software/stprtool/manual/kernels/extraction/list/kpca.html
A cite of code of KPCA can be seen as follows:
http://cmp.felk.cvut.cz/cmp/software/stprtool/manual/kernels/extraction/list/kpca.html
Thursday, February 15, 2007
Hidden Markov Model (HMM) Toolbox for Matlab
Written by Kevin Murphy, 1998.Last updated: 8 June 2005. The new version computes the summed two-slice marginals, to save space (suggestion of Herbert Jaeger).
This toolbox supports inference and learning for HMMs with discrete outputs (dhmm's), Gaussian outputs (ghmm's), or mixtures of Gaussians output (mhmm's). The Gaussians can be full, diagonal, or spherical (isotropic). It also supports discrete inputs, as in a POMDP. The inference routines support filtering, smoothing, and fixed-lag smoothing. For more general models, please see my Bayes Net Toolbox.
This toolbox supports inference and learning for HMMs with discrete outputs (dhmm's), Gaussian outputs (ghmm's), or mixtures of Gaussians output (mhmm's). The Gaussians can be full, diagonal, or spherical (isotropic). It also supports discrete inputs, as in a POMDP. The inference routines support filtering, smoothing, and fixed-lag smoothing. For more general models, please see my Bayes Net Toolbox.
Ensembles and Resampling
To explain why resampling and ensembles are so useful, it is helpful to formulate the neural network training process in statistical terms (Bishop, 1995). We regard the problem as that of estimating an unknown nonlinear function, which has additive noise, on the basis of a limited data set of examples, D. There are several sources of error in our neural network's predictions. First, and unavoidably, even a "perfect" network that exactly modeled the underlying function would make errors due to the noise. However, there is also error due to the fact that we need to fit the neural network model using the finite sample data set, D. This remaining error can be split into two components, the model bias and variance. The bias is the average error that a particular model training procedure will make across different particular data sets (drawn from the unknown function's distribution). The variance reflects the sensitivity of the modeling procedure to a particular choice of data set.
We can trade off bias versus variance. At one extreme, we can arbitrarily select a function that entirely ignores the data. This has zero variance, but presumably high bias, since we have not actually taken into account the known aspects of the problem at all. At the opposite extreme, we can choose a highly complex function that can fit every point in a particular data set, and thus has zero bias, but high variance as this complex function changes shape radically to reflect the exact points in a given data set. The high bias, low variance solutions can have low complexity (e.g., linear models), whereas the low bias, high variance solutions have high complexity. In neural networks, the low complexity models have smaller numbers of units.
How does this relate to ensembles and resampling? We necessarily divide the data set into subsets for training, selection, and test. Intuitively, this is a shame, as not all the data gets used for training. If we resample, using a different split of data each time, we can build multiple neural networks, and all the data gets used for training at least some of them. If we then form the networks into an ensemble, and average the predictions, an extremely useful result occurs. Averaging across the models reduces the variance, without increasing the bias. Arguably, we can afford to build higher bias models than we would otherwise tolerate (i.e., higher complexity models), on the basis that ensemble averaging can then mitigate the resulting variance.
The generalization performance of an ensemble can be better than that of the best member network, although this does depend on how good the other networks in the ensemble are. Unfortunately, it is not possible to show whether this is actually the case for a given ensemble. However, there are some reassuring pieces of theory to back up the use of ensembles.
First, it can be shown (Bishop, 1995) that, on the assumption that the ensemble members' errors have zero mean and are uncorrelated, the ensemble reduces the error by a factor of N, where N is the number of members. In practice, of course, these errors are not uncorrelated. An important corollary is that an ensemble is more effective when the members are less correlated, and we might intuitively expect that to be the case if diverse network types and structures are used.
Second, and perhaps more significantly, it can be shown that the expected error of the ensemble is at least as good as the average expected error of the members, and usually better. Typically, some useful reduction in error does occur. There is of course a cost in processing speed, but for many applications this is not particularly problematic.
There are a number of approaches to resampling available.
The simplest approach is random (monte carlo) resampling, where the training, selection and test sets are simply drawn at random from the data set, keeping the sizes of the subsets constant. Alternatively, you CAN sometimes resample the training and selection set, but keep the test set the same, to support a simple direct comparison of results. The second approach is the popular cross-validation algorithm. Here, the data set is divided into a number of equal sized divisions. A number of neural networks are created. For each of these, one division is used for the test data, and the others are used for training and selection. In the most extreme version of this algorithm, leave-one-out cross validation, N divisions are made, where N is the number of cases in the data set, and on each division the network is trained on all bar one of the cases, and tested on the single case that is omitted. This allows the training algorithm to use virtually the entire data set for training, but is obviously very intensive.
The third approach is bootstrap sampling. In the bootstrap, a new training set is formed by sampling with replacement from the available data set. In sampling with replacement, cases are drawn at random from the data set, with equal probability, and any one case may be selected any number of times. Typically the bootstrap set has the same number of cases as the data set, although this is not a necessity. Due to the sampling process, it is likely that some of the original cases will not be selected, and these can be used to form a test set, whereas other cases will have been duplicated.
The bootstrap procedure replicates, insofar as is possible with limited data, the idea of drawing multiple data sets from the original distribution. Once again, the effect can be to generate a number of models with low bias, and to average out the variance. Ensembles can also be beneficial at averaging out bias. If we include different network types and configurations in an ensemble, it may be that different networks make systematic errors in different parts of the input space. Averaging these differently configured networks may iron out some of this bias.
We can trade off bias versus variance. At one extreme, we can arbitrarily select a function that entirely ignores the data. This has zero variance, but presumably high bias, since we have not actually taken into account the known aspects of the problem at all. At the opposite extreme, we can choose a highly complex function that can fit every point in a particular data set, and thus has zero bias, but high variance as this complex function changes shape radically to reflect the exact points in a given data set. The high bias, low variance solutions can have low complexity (e.g., linear models), whereas the low bias, high variance solutions have high complexity. In neural networks, the low complexity models have smaller numbers of units.
How does this relate to ensembles and resampling? We necessarily divide the data set into subsets for training, selection, and test. Intuitively, this is a shame, as not all the data gets used for training. If we resample, using a different split of data each time, we can build multiple neural networks, and all the data gets used for training at least some of them. If we then form the networks into an ensemble, and average the predictions, an extremely useful result occurs. Averaging across the models reduces the variance, without increasing the bias. Arguably, we can afford to build higher bias models than we would otherwise tolerate (i.e., higher complexity models), on the basis that ensemble averaging can then mitigate the resulting variance.
The generalization performance of an ensemble can be better than that of the best member network, although this does depend on how good the other networks in the ensemble are. Unfortunately, it is not possible to show whether this is actually the case for a given ensemble. However, there are some reassuring pieces of theory to back up the use of ensembles.
First, it can be shown (Bishop, 1995) that, on the assumption that the ensemble members' errors have zero mean and are uncorrelated, the ensemble reduces the error by a factor of N, where N is the number of members. In practice, of course, these errors are not uncorrelated. An important corollary is that an ensemble is more effective when the members are less correlated, and we might intuitively expect that to be the case if diverse network types and structures are used.
Second, and perhaps more significantly, it can be shown that the expected error of the ensemble is at least as good as the average expected error of the members, and usually better. Typically, some useful reduction in error does occur. There is of course a cost in processing speed, but for many applications this is not particularly problematic.
There are a number of approaches to resampling available.
The simplest approach is random (monte carlo) resampling, where the training, selection and test sets are simply drawn at random from the data set, keeping the sizes of the subsets constant. Alternatively, you CAN sometimes resample the training and selection set, but keep the test set the same, to support a simple direct comparison of results. The second approach is the popular cross-validation algorithm. Here, the data set is divided into a number of equal sized divisions. A number of neural networks are created. For each of these, one division is used for the test data, and the others are used for training and selection. In the most extreme version of this algorithm, leave-one-out cross validation, N divisions are made, where N is the number of cases in the data set, and on each division the network is trained on all bar one of the cases, and tested on the single case that is omitted. This allows the training algorithm to use virtually the entire data set for training, but is obviously very intensive.
The third approach is bootstrap sampling. In the bootstrap, a new training set is formed by sampling with replacement from the available data set. In sampling with replacement, cases are drawn at random from the data set, with equal probability, and any one case may be selected any number of times. Typically the bootstrap set has the same number of cases as the data set, although this is not a necessity. Due to the sampling process, it is likely that some of the original cases will not be selected, and these can be used to form a test set, whereas other cases will have been duplicated.
The bootstrap procedure replicates, insofar as is possible with limited data, the idea of drawing multiple data sets from the original distribution. Once again, the effect can be to generate a number of models with low bias, and to average out the variance. Ensembles can also be beneficial at averaging out bias. If we include different network types and configurations in an ensemble, it may be that different networks make systematic errors in different parts of the input space. Averaging these differently configured networks may iron out some of this bias.
NEURAL NETWORKS
An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurones) working in unison to solve specific problems. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurones. This is true of ANNs as well. An example of a simple feedforward network
Wednesday, February 14, 2007
Subscribe to:
Posts (Atom)