Principal Component Analysis (PCA)

The function iapca computes the principal components transformation matrix of a given data set. The input is the data set organized as one row per sample and one column per attribute. The output consists of three items: the eigenvector matrix, the eigenvalues (the diagonal of the covariance matrix in the new coordinate system), and the sample mean.

Function definition

import numpy as np

def iapca(X):
    n, dim = X.shape
    mu = X.mean(axis=0)
    X = X - mu

    C = (X.T).dot(X)/(n-1)        # covariance matrix
    e, V = np.linalg.eigh(C)      # eigenvalues and eigenvectors of the covariance matrix
    indexes = np.argsort(e)[::-1] # sort eigenvalues in decreasing order
    e = e[indexes]                # reorder e and V accordingly
    V = V[:,indexes]
    return V, e, mu
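As a quick sanity check (a standalone sketch using only NumPy, with a synthetic data set not from the original text), the eigenvalues should come out sorted in decreasing order, and projecting the centered data onto the eigenvectors should decorrelate the attributes, i.e. yield a diagonal covariance matrix:

```python
import numpy as np

def iapca(X):
    # same implementation as above
    n, dim = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    C = Xc.T.dot(Xc) / (n - 1)
    e, V = np.linalg.eigh(C)
    indexes = np.argsort(e)[::-1]
    return V[:, indexes], e[indexes], mu

# synthetic 2-D data, stretched along the first axis
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

V, e, mu = iapca(X)
Xr = (X - mu).dot(V)                   # project onto the principal axes
Cr = Xr.T.dot(Xr) / (Xr.shape[0] - 1)  # covariance of the transformed data

print(e[0] >= e[1])           # True: eigenvalues come out sorted
print(abs(Cr[0, 1]) < 1e-10)  # True: attributes are decorrelated
```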

Example - Two-dimensional Gaussian distribution

Creating the data:

A matrix of 50 samples of (y,x) coordinates is created. The samples are drawn from a two-dimensional Gaussian distribution whose principal axis is rotated by 60 degrees.

import ia636 as ia
import numpy as np

mean = [1,3]
cov = np.array([[0.14,      0.2078461],
                [0.2078461, 0.38     ]])
X = np.random.multivariate_normal(mean, cov, 50)

adshow(ia.iaplot(X[:,1], X[:,0],
                 colors = 'r',
                 shapes = 'x',
                 axis = 'equal'), 'Gaussian distribution')
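adshow and ia.iaplot belong to the ia636 toolbox environment. For readers without that toolbox, an equivalent scatter plot can be sketched with matplotlib (an assumption on my part: matplotlib is not used by the original text; the Agg backend writes the figure to a file instead of a display):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; the figure is saved to a file
import matplotlib.pyplot as plt

mean = [1, 3]
cov = np.array([[0.14,      0.2078461],
                [0.2078461, 0.38     ]])
X = np.random.multivariate_normal(mean, cov, 50)

# same convention as the example: column 1 is x, column 0 is y
plt.plot(X[:, 1], X[:, 0], 'rx')
plt.axis('equal')
plt.title('Gaussian distribution')
plt.savefig('gaussian.png')
```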

Gaussian distribution

Computing the transformation matrix

V, e, mu_X = ia.iapca(X)
sigma = np.sqrt(e)

# Plot the data and its eigenvectors in the original coordinate system
arrow = [[mu_X[0], mu_X[1], sigma[1]*V[0,1], sigma[1]*V[1,1]],
         [mu_X[0], mu_X[1], sigma[0]*V[0,0], sigma[0]*V[1,0]]]

adshow(ia.iaplot(X[:,1], X[:,0],
                 arrows_list = arrow,
                 colors = 'r',
                 shapes = 'x',
                 axis = 'equal'), 'eigenvectors plotted')

Xr = (X - mu_X).dot(V)
Cr = (Xr.T).dot(Xr)/(Xr.shape[0]-1)  # covariance matrix of the transformed data

print('mean:', mu_X)
print('sigma**2', sigma**2)
print('Cr:\n', Cr)

arrow_r = [[mu_X[0], mu_X[1], 0       , sigma[1]],
           [mu_X[0], mu_X[1], sigma[0], 0       ]]

adshow(ia.iaplot(Xr[:,1]+mu_X[1], Xr[:,0]+mu_X[0],
                 arrows_list = arrow_r,
                 colors = 'b',
                 shapes = '+',
                 axis = 'equal'), 'Transformed distribution with eigenvectors')
mean: [ 1.04635382  3.1188504 ]
sigma**2 [ 0.60693601  0.01712658]
Cr:
[[  6.06936006e-01   3.90843820e-17]
 [  3.90843820e-17   1.71265806e-02]]

eigenvectors plotted

Transformed distribution with eigenvectors
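The near-zero off-diagonal entries of Cr above show that V diagonalizes the covariance matrix, i.e. V^T C V = diag(e). This can be checked directly with a standalone NumPy sketch (the data are regenerated here with a fixed seed, so the numbers differ from the run above):

```python
import numpy as np

rng = np.random.default_rng(42)
mean = [1, 3]
cov = np.array([[0.14,      0.2078461],
                [0.2078461, 0.38     ]])
X = rng.multivariate_normal(mean, cov, 50)

# recompute the PCA quantities as in iapca
n = X.shape[0]
mu = X.mean(axis=0)
Xc = X - mu
C = Xc.T.dot(Xc) / (n - 1)
e, V = np.linalg.eigh(C)
idx = np.argsort(e)[::-1]
e, V = e[idx], V[:, idx]

# V is orthogonal, and V^T C V is (numerically) the diagonal matrix of eigenvalues
D = V.T.dot(C).dot(V)
print(np.allclose(D, np.diag(e)))          # True
print(np.allclose(V.T.dot(V), np.eye(2)))  # True
```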

Method

  1. Build the data matrix X (n x m), where n (rows) gives the number of samples and m (columns) gives the number of attributes; m gives the dimensionality of the problem.
  2. Compute the mean of every attribute over all samples:

     mu = (1/n) * sum_{i=1..n} x_i

  3. Compute the covariance matrix of the zero-mean data:

     C = (X - mu)^T (X - mu) / (n - 1)

  4. Compute the eigenvalues (lambda_i) and the eigenvectors (v_i) which diagonalize the covariance matrix (C):

     C v_i = lambda_i v_i

  5. The transformation matrix (V) of the PCA is built as follows: the first column is the eigenvector related to the largest eigenvalue, the second column is the eigenvector related to the second largest eigenvalue, and so on.

The transformation of the data to the new coordinate system is given by:

    Xr = (X - mu) V

The eigenvalues give the variance of the data projected onto the new coordinate system. To compute the variance percentage related to each eigenvector direction of the transformed coordinate system, divide the corresponding eigenvalue by the sum of all eigenvalues:

    lambda_i / (lambda_1 + lambda_2 + ... + lambda_m)
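As an illustration, using the eigenvalues printed in the example above (sigma**2), the variance percentages come out as roughly 97% and 3%:

```python
import numpy as np

e = np.array([0.60693601, 0.01712658])  # eigenvalues from the example run
ratio = e / e.sum()
print(ratio)  # roughly [0.973, 0.027]: the first direction holds ~97% of the variance
```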

Contributions

  • Roberto Medeiros de Souza, Sept 2013