Discuss the limitations of a linear classifier
Let X be a 4 × 2 matrix, Y be a 2 × 2 matrix, and Z be a matrix such that X = ZY. What is the size of Z?
Compute a formula for Ak , i.e., the matrix power operation defined as the matrix product of k copies of A. Your answer must be a single matrix. Show the necessary steps of your derivation.
Suppose A is an n× n matrix with eigenvalue λ and the corresponding eigenvector v.
If A is invertible, is v an eigenvector of A-1 If so, what will the new eigenvalue be? If not, explain why not.
Is 3v an eigenvector of A? If so, what will the eigenvalue be? If not, explain why not.
Identify, with proof, the possible values of the determinant of an orthogonal matrix A ∈ Rn×n.
Assume A is an invertible matrix. Prove that
det (A-1) = 1/det (A)
Consider the following dataset of three points in R2: x1 = (-1, -1)T, x2 = (0, 0)T , x3 = (1, 1)T .
What is the first principal component when normalized to unit length? Write down the vector.
Use the vector you found in Part 1 to reduce the dimensionality of the data to one dimen- sion. What are the new coordinates? What is the variance of the projected data?
Compute the reconstruction error when the data is projected back to two dimensional space.
In this problem, and several others, you will use the classic MNIST digit dataset. MNIST is made up of 70,000 28 × 28 images of the handwritten digits 0-9, where 60,000 of the images have been designated for use in training algorithms and 10,000 images have been designated for testing/evaluating algorithms to see how well they work on data they have never seen before (i.e., how well they generalize to new data). I have provided you with some helper files to load the MNIST images into MATLAB, and similar functions can be readily written (or found online) for Python. The images should be turned into floating point images, and using single precision will make the code run faster and use less memory.
Go to this webpage http://yann.lecun.com/exdb/mnist/ and download these four files:
train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
After downloading them, you likely will need to unzip them and place them in a folder. Note that if you used the MATLAB helper files, the images will be stored as 784-dimensional vectors. You can use the ‘reshape’ command to reshape each one of these image vectors to be 28 × 28 to display it.
Important: mnist hard.mat is not used for this problem. It is used later on a different problem.
Compute the mean and standard deviation of each digit in the training set, i.e., compute the mean and standard deviation of all the zeros, all the ones, etc. Your answer should display 20 images in which the top row has the image mean for each digit (i.e., the average ‘0′, the average ‘1′, etc.) and the second row has their corresponding variance. You should provide your code as part of your answer (my solution was 10 lines of MATLAB).
Principal Component Analysis (PCA) is typically explained using an eigendecomposition of the d × d data covariance matrix C; however, due to finite-precision arithmetic on a computer this algorithm for PCA can be numerically unstable. In practice, Singular Value Decomposition (SVD) of the data itself (instead of the covariance matrix) is typically used, i.e., X = UΣVT (see Appendix A.1 of Szeliski).
Given a data matrix X show mathematically how SVD can be used to compute the principal components instead of using the eigendecomposition of C. Assume that X is mean zero (i.e., centered). What would be the formula for the eigenvalues using the SVD algorithm for PCA?
Hint: This problem requires you to use the formula for PCA using the covariance matrix C to find a formula for PCA using SVD instead.
Write a function that implements PCA using SVD. Your algorithm should take a d × m matrix D and an integer k as input, where d is the dimensionality of the m data points, and k is the number of principal components to retain. Do not assume that D is mean zero. The function should return the top k principal components in a matrix and the d dimensional mean.
Apply your function to the MNIST training data to reduce the dimensionality to 10 dimen- sions. Display the first 10 principal components as images. When displaying the image, contrast normalize it by constraining the values to be between 0 and 1 by subtracting the minimum and then dividing by the maximum (this is called contrast stretching). (This problem took about 13 lines of MATLAB code, including code to display the answer)
Plot the mean reconstruction error (squared distance between the original data and the reconstructed data) as a function of the number of principal components, i.e., from 1 to 783. To do this efficiently, you should make a new version of your function for PCA to in which k is varied. Use vectorization to ensure your code runs fast. Even if you use vectorization, expect this script to take decent amount of time to run (30-90 minutes). Make sure to label your plot’s axes.
Implement k nearest neighbors using Euclidean distance and evaluate your algorithm by “training” it on the 60,000 MNIST training images and evaluating it on the 10,000 testing images. To do this, write a function that takes as input the test features (images), training images, training labels, and k, and it should output predicted class labels for each test feature vector (or image). For debugging purposes, I suggest you only use a small portion of the training images (first 300 images) and that you use vectorization to ensure your code is fast. You may find the file Compute L2 Distance Matrix useful for this. Note that with vectorization you may need to manage memory to ensure your machine doesn’t run out of resources.
What’s the percent accuracy with k equal to 1, 3, and 5 on the test images? What would the accuracy be on the training images be when k = 1?
One of the major problems with nearest neighbor is that it doesn’t have a way to weigh the importance of each dimension of the feature vector. It also is computationally expensive because it requires comparing a vector of test features to every training data point. A computationally faster approach are linear classifiers.
For a K category classification problem, a multi-class linear algorithm will use a collection of K d-dimensional vector to classify a d dimensional vector of features (e.g., an image). Formally, this prediction would be given by y = arg maxkwTx, where wk is the weight vector for category k and x is the input vector of features.
Many algorithms have been devised to find the collection wk vectors, and they each perform differently. We will look at some of the simplest methods.
A linear machine is a generalization of the Perceptron algorithm to multiple categories. To train a linear machine, it sequentially classifies each vector in the training data xt, i.e.,
yt = arg maxwkTxt.
If the predicted category yt is not the correct category v, then the weights are updated as follows:
wv ← wv + αxt
wyt ← wyt – αxt
where α is the learning rate. During training, the algorithm will review all of the training data multiple times (each loop through the entire dataset is called an epoch).
Implement a linear machine and compute the accuracy on the training data and the test data. Use a learning rate of α = 1 and use 2 epochs, and initialize all of your weight vectors, i.e., all 10 of the wk vectors, to be all zeros. How do the results on the train and test data compare to the nearest neighbor results from the previous problem? If there is a significant difference, explain why this might be the case.
While linear machines are easy to program and understand, they are not regularized and cannot output probabilities. One of the best regularized linear classifiers is the Support Vector Machine (with a linear kernel), and one of the best methods for generating proba- bilities is logistic regression.
Download LIBLINEAR (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) and train a linear SVM model and a logistic regression model and compute the test accuracy for each of the two models on the MNIST test data. Use options ‘-s 0’ and ‘-s 1’ Try tuning the cost parameter.
Discuss the limitations of a linear classifier. What are some ways in which these limitations could be overcome?
Discuss the limitations of the naive nearest neighbor algorithm. What are some ways in which these limitations could be overcome?
Download the file mnist hard.mat. Note that you can load MATLAB data files (*.mat files) into Python using the function loadmat in SciPy. It should be clear from the variables in the file which are the appropriate data and labels.
Train 1-nearest neighbor and report the accuracy on the test data.
Apply PCA to the training data to reduce the dimensionality to 600, and then re-train the 1-nearest neighbor. The PCA transformation should be found using the training data, and then that transformation should be applied to both the train and test data. Report the accuracy on the test data. If PCA helped, explain why you think it might have done so.
Why did accuracy differ significantly from when MNIST was used earlier? List two ways in which you might be able to increase accuracy and explain why you think they will help.
Implement one of the ideas you came up with in the previous section and evaluate it on the test data. How did it affect the accuracy on the test data? Make sure to provide whatever code you wrote (which you should have been doing for all of the programming questions anyway).