In this laboratory we will address the problem of data analysis, with reference to a classification problem.
Follow the instructions below. Think hard before you call the instructors!
Download the zipfile and unzip it in a local folder.
Set the MATLAB path to include the local folder.
Use the colormap command to change the color palette of the plots if you need to increase visibility. Check the documentation for the available options.
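For example, to switch to one of the built-in palettes:
colormap(jet);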
You will generate a training and a test set of D-dimensional points (N points for each class), with N = 100 and D = 30.
1.A For each point, the first two features will be generated by MixGauss, drawn from two Gaussian distributions with centroids (1, 1) and (-1, -1) and sigmas 0.7 (the first one with Y = 1, the second with Y = -1):
[X2tr, Ytr] = MixGauss(…);
Ytr(Ytr==2) = -1;
[X2ts, Yts] = MixGauss(…);
Yts(Yts==2) = -1;
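MixGauss ships with the lab zipfile, so check its help for the exact signature. As a self-contained sketch of the data it is meant to produce (assuming N points per class drawn from isotropic Gaussians with the centroids and sigma above), the training set could be built by hand as:
N = 100; sigma = 0.7;
X2tr = [repmat([ 1  1], N, 1) + sigma * randn(N, 2);  % class Y = +1
        repmat([-1 -1], N, 1) + sigma * randn(N, 2)]; % class Y = -1
Ytr  = [ones(N, 1); -ones(N, 1)];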
1.B You may want to plot the relevant features of the data:
scatter(X2tr(:,1), X2tr(:,2), 25, Ytr);
scatter(X2ts(:,1), X2ts(:,2), 25, Yts);
1.C The remaining D-2 variables will be generated as Gaussian noise, with sigma_noise = 0.01:
sigma_noise = 0.01;
Xtr_noise = sigma_noise * randn(2*N, D-2);
Xts_noise = sigma_noise * randn(2*N, D-2);
To compose the final data matrix, concatenate the features by running:
Xtr = [X2tr, Xtr_noise];
Xts = [X2ts, Xts_noise];
2.A Compute the data principal components (see help PCA)
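The PCA function comes with the lab material, so check help PCA for its exact interface. A sketch of the call, assuming it returns the projected data X_proj and the eigenvalues d used below, plus the eigenvectors (here called V, a name assumed):
k = D;                         % number of components to compute
[V, d, X_proj] = PCA(Xtr, k);  % assumed interface; X_proj and d are used in 2.B-2.D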
2.B Plot the first two components of X_proj using the following line:
scatter(X_proj(:,1), X_proj(:,2), 25, Ytr);
2.C Now try with the first 3 components, using:
scatter3(X_proj(:,1), X_proj(:,2), X_proj(:,3), 25, Ytr);
Reason about the meaning of the results you are obtaining. Is the third component relevant?
2.D Display the square roots of the first 10 eigenvalues (disp(sqrt(d(1:10)))) and plot the coefficients (the eigenvector) associated with the largest eigenvalue:
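Assuming the eigenvectors are returned column-wise in V (see the sketch in 2.A), the plot could be:
figure; scatter(1:D, abs(V(:,1)));  % magnitude of each coefficient of the top eigenvector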
2.E Repeat the above steps on datasets generated with different values of sigma_noise (0, 0.01, 0.1, 0.5, 0.7, 1, 1.2, 1.4, 1.6, 2).
To what extent is the data visualization obtained by PCA affected by the noise?
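One way to organize the repetition (a sketch; it reuses the assumed PCA interface from 2.A):
for sigma_noise = [0, 0.01, 0.1, 0.5, 0.7, 1, 1.2, 1.4, 1.6, 2]
    Xtr_noise = sigma_noise * randn(2*N, D-2);
    Xtr = [X2tr, Xtr_noise];
    [V, d, X_proj] = PCA(Xtr, k);  % assumed interface, see 2.A
    figure; scatter(X_proj(:,1), X_proj(:,2), 25, Ytr);
    title(sprintf('sigma\\_noise = %g', sigma_noise));
end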
3.A Use the data generated in section 1. Standardize the data matrix, so that each column has mean 0 and standard deviation 1:
m = mean(Xtr);  % computes the mean of each column (see "help mean")
s = std(Xtr);   % computes the standard deviation of each column
for i = 1:2*N
    Xtr(i,:) = (Xtr(i,:) - m) ./ s;  % center and scale each row
end
Do the same for Xts, by using m and s computed on Xtr.
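With implicit expansion (MATLAB R2016b and later), the loop above can also be written in one line; for the test set, using the training statistics:
Xts = (Xts - m) ./ s;  % note: m and s come from Xtr, not Xts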
3.B Use the orthogonal matching pursuit algorithm (type 'help OmatchingPursuit')
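OmatchingPursuit is provided in the zipfile, so take its actual signature from its help text. As an assumed sketch (a call taking the training data, the labels, and a number of iterations T, and returning the coefficient vector w used in 3.C):
T = 10;                              % number of iterations (an arbitrary starting value)
w = OmatchingPursuit(Xtr, Ytr, T);   % assumed signature; check 'help OmatchingPursuit'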
3.C You may want to check the predicted labels on the test set:
Ypred = sign(Xts * w);
err = calcErr(Yts, Ypred);
and plot the coefficients w with scatter(1:D, abs(w)).
How does the error change with the number of iterations of the method?
3.D Using holdoutCVOMP, find the best number of iterations over intIter = 2:D. Moreover, plot the training and validation errors against the number of iterations.
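holdoutCVOMP is part of the zipfile, so check 'help holdoutCVOMP' for its real outputs. Assuming it returns the mean training and validation errors over the candidate iteration counts (called Tm and Vm here, names assumed), the plot could look like:
intIter = 2:D;
figure;
plot(intIter, Tm, 'b', intIter, Vm, 'r');  % Tm, Vm: assumed outputs of holdoutCVOMP
legend('training error', 'validation error');
xlabel('number of iterations'); ylabel('error');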
What is the behavior of the training and validation errors with respect to the number of iterations?
4.A Analyse the results you obtain in sections 2 and 3 as you choose:
N >> D
N ~ D
N << D
and evaluate the benefits of the two different algorithms.
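The regimes can be reproduced by regenerating the data with different N and D; the specific values below are only illustrative, not prescribed by the lab:
regimes = [1000  30;   % N >> D
            100 100;   % N ~ D
             10 300];  % N << D
for r = 1:size(regimes, 1)
    N = regimes(r, 1); D = regimes(r, 2);
    % regenerate the data as in section 1, then rerun sections 2 and 3
end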