Principal Component Analysis (PCA) in a Nutshell

Ritik Dutta
3 min read · Nov 12, 2022


Problem:

An increase in dimensions (e.g., hundreds of features) leads to problems like:

  1. Time complexity increases.
  2. It becomes difficult to generalize relationships between features.
  3. Not all features contribute meaningfully all the time.
  4. It becomes hard to visualize the shape of the data.

Solution:

So, we can transform the dataset onto a new set of axes in such a way that it retains the original meaning of the data. We can do this with the help of PCA.

Steps:

Step 1: Requirements:

For PCA to work well, each feature should have mean = 0 and standard deviation ≈ 1.

In other words, convert each feature's distribution from a normal distribution to the standard normal distribution (i.e., standardize the data).
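The standardization step can be sketched in Python; the dataset values here are hypothetical, purely for illustration:

```python
import numpy as np

# Toy 2-feature dataset (hypothetical values, for illustration only).
X = np.array([[10.0, 6.0],
              [11.0, 4.0],
              [ 8.0, 5.0],
              [ 3.0, 3.0],
              [ 2.0, 2.8],
              [ 1.0, 1.0]])

# Standardize: subtract each feature's mean and divide by its
# standard deviation, so every feature ends up with mean 0 and
# standard deviation 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~ [0, 0]
print(X_std.std(axis=0))   # ~ [1, 1]
```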

Step 2: Fit the best-fitting line passing through the origin:

The best-fitting line is the one that maximizes the sum of squared distances between the projected points on the line and the origin.

These individual distances are called d1, d2, d3, d4, …

We can make this direction a unit vector by dividing a, b, and c by the length of a.

Hence, our new mixture (recipe) for PC1 would be written as a combination of the original features.

This 1-unit-long vector, consisting of 0.97 parts of x1 and 0.242 parts of x2, is called the singular vector, or the eigenvector, for PC1, and the proportions of each axis are called the loading scores.
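One way to obtain this unit-length eigenvector and its loading scores in code is via SVD on centered data; a minimal sketch with synthetic data (not the article's exact numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2-feature data (assumed, not the article's example).
x1 = rng.normal(size=200)
x2 = 0.25 * x1 + 0.05 * rng.normal(size=200)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)  # center so the best-fit line passes through the origin

# SVD: the rows of Vt are the unit-length singular vectors (eigenvectors);
# the first row is the direction of PC1.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = Vt[0]

print("loading scores for PC1:", pc1)
print("length of PC1 vector:", np.linalg.norm(pc1))  # 1.0 by construction
```

The components of `pc1` are the loading scores: how many "parts" of each original feature go into PC1.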

Step 3: PC2

Repeat the procedure for PC2: the next best-fitting line through the origin, perpendicular to PC1.
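In PCA, PC2 is constrained to be perpendicular to PC1, which we can check numerically (synthetic data, assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]

# The singular vectors are mutually orthogonal unit vectors,
# so the dot product of PC1 and PC2 is (numerically) zero.
print(np.dot(pc1, pc2))  # ~ 0.0
```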

Step 4: Calculating the variation accounted for by each PC, and the Scree Plot

SS(distances for PC1) = eigenvalue for PC1
SS(distances for PC2) = eigenvalue for PC2

We can convert these into the variation around the origin (0, 0) by dividing by n − 1.
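This eigenvalue-as-sum-of-squares relationship can be verified numerically; a sketch with synthetic data (assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
X = X - X.mean(axis=0)
n = X.shape[0]

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Distances d1, d2, ... of the projected points from the origin along PC1.
d = X @ Vt[0]

# SS(distances for PC1) = eigenvalue for PC1 (it also equals S[0] ** 2).
eigenvalue_pc1 = np.sum(d ** 2)

# Dividing by n - 1 converts the eigenvalue into the variation
# around the origin for PC1.
variation_pc1 = eigenvalue_pc1 / (n - 1)
print(eigenvalue_pc1, variation_pc1)
```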

For example, let the variation for PC1 = 15 and the variation for PC2 = 3.

The total variation around both PCs = 15 + 3 = 18, which means PC1 accounts for 15/18 ≈ 0.83 = 83% of the total variation around the PCs.

This ratio is also called the EVR (Explained Variance Ratio).
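With the article's example numbers, the EVR computation is just:

```python
# Variations from the example above: PC1 = 15, PC2 = 3.
variations = [15, 3]
total = sum(variations)  # 18

# Explained Variance Ratio: each PC's share of the total variation.
evr = [v / total for v in variations]
print(evr)  # PC1 explains ~83% of the variation, PC2 ~17%
```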

Step 5: Scree Plot

Plot the EVR for every PC.

In the scree plot we can see that with just 4 PCs we already explain about 90% of the variation, so we keep only these 4 PCs.
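Choosing the number of PCs from a scree plot amounts to finding where the cumulative EVR crosses the chosen threshold; a sketch with hypothetical EVR values (assumed, not from the article):

```python
import numpy as np

# Hypothetical EVR values for 6 PCs (assumed for illustration).
evr = np.array([0.50, 0.22, 0.12, 0.08, 0.05, 0.03])
cum = np.cumsum(evr)

# Smallest number of PCs whose cumulative EVR reaches 90%.
k = int(np.argmax(cum >= 0.90) + 1)
print(k)  # 4 PCs capture ~90% of the variation
```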
