Principal Component Analysis (PCA) in a Nutshell
Problem:
An increase in dimensions (e.g., hundreds of features) leads to problems like:
- Time complexity increases.
- It becomes difficult to generalize relationships between features.
- Not all features contribute meaningfully all the time.
- It is hard to visualize the shape of the data.
Solution:
So, we can try to project the dataset onto a new set of axes in such a way that it retains the original meaning of the data as closely as possible. We can do this with the help of PCA.
Steps:
Step 1: Requirements:
For PCA to work well, each feature should have mean = 0 and standard deviation = 1.
In other words, standardize each feature: convert its distribution to a standard normal distribution by subtracting the mean and dividing by the standard deviation.
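As a minimal sketch of this standardization step (the data values here are made up purely for illustration; scikit-learn's StandardScaler does the same thing):

```python
import numpy as np

# Illustrative data: 6 samples, 2 features (x1, x2); values are made up
X = np.array([[10.0, 6.0],
              [11.0, 4.0],
              [ 8.0, 5.0],
              [ 3.0, 3.0],
              [ 2.0, 2.8],
              [ 1.0, 1.0]])

# Standardize each feature: subtract the mean, divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~[0, 0] -> mean is now 0
print(X_std.std(axis=0))   # ~[1, 1] -> standard deviation is now 1
```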
Step 2: Fit the best-fitting line passing through the origin (PC1):
The best-fitting line is the one that maximizes the sum of squared distances between the points projected onto the line and the origin.
These individual distances are called d1, d2, d3, d4, …, so PC1 maximizes SS(distances) = d1² + d2² + d3² + …
Here a and b are the steps the best-fitting line takes along the x1 and x2 axes, and c = √(a² + b²) is the length of that direction vector. We can turn this direction into a unit vector by dividing a, b, and c by the length of c.
Hence our new mixture for PC1 can be written as PC1 = (a/c) · x1 + (b/c) · x2, which with the example's values works out to PC1 = 0.97 · x1 + 0.242 · x2.
This 1-unit-long vector, consisting of 0.97 parts of x1 and 0.242 parts of x2, is called the singular vector or the eigenvector for PC1,
and the proportions of each axis (here 0.97 and 0.242) are called the loading scores.
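As a sketch of this step in NumPy, continuing with the X_std matrix from Step 1 (the 0.97 / 0.242 values are this article's example; the illustrative data above will produce its own loading scores):

```python
# Covariance matrix of the standardized data (np.cov divides by n - 1)
cov = np.cov(X_std, rowvar=False)

# Eigendecomposition of a symmetric matrix; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# The eigenvector paired with the largest eigenvalue is the unit vector for PC1
pc1 = eigenvectors[:, -1]

print(pc1)                  # the loading scores (0.97 and 0.242 in the article's example)
print(np.linalg.norm(pc1))  # 1.0 -> it is a unit (singular) vector
```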
Step 3: PC2
PC2 is simply the line through the origin that is perpendicular to PC1; in higher dimensions, every further PC is orthogonal to all the PCs found before it.
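Continuing the sketch, PC2 falls out of the same eigendecomposition as the eigenvector with the next-largest eigenvalue, and we can check the orthogonality numerically:

```python
# The eigenvector with the next-largest eigenvalue is the unit vector for PC2
pc2 = eigenvectors[:, -2]

print(np.dot(pc1, pc2))  # ~0 -> PC2 is perpendicular to PC1
```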
Step 4: Calculate the percent variation accounted for by each PC:
SS(distances for PC1) = eigenvalue for PC1
SS(distances for PC2) = eigenvalue for PC2
We can convert these into variation around the origin (0, 0) by dividing by n - 1:
variation for a PC = SS(distances for that PC) / (n - 1)
For example, let the variation for PC1 = 15 and the variation for PC2 = 3.
Total variation around both PCs = 15 + 3 = 18.
That means PC1 accounts for 15/18 ≈ 0.83 = 83% of the total variation around the PCs.
This ratio is also called the EVR (Explained Variance Ratio).
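A sketch of this arithmetic using the example numbers above; note that scikit-learn's PCA exposes the same quantity directly as the explained_variance_ratio_ attribute:

```python
import numpy as np
from sklearn.decomposition import PCA

# The example above: variation (eigenvalue / (n - 1)) for PC1 and PC2
variations = np.array([15.0, 3.0])
evr = variations / variations.sum()
print(evr)  # [0.833..., 0.166...] -> PC1 accounts for ~83% of the variation

# The same ratio computed by scikit-learn on the standardized data from Step 1
pca = PCA().fit(X_std)
print(pca.explained_variance_ratio_)
```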
Step 5: Scree Plot
Plot the EVR for every PC.
If the scree plot shows, as in this example, that the first 4 PCs together explain about 90% of the variation,
then we keep only these 4 PCs.
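A minimal matplotlib sketch of a scree plot; the EVR values here are made up so that, like the example above, 4 PCs reach roughly 90%:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up EVR values for illustration; in practice use pca.explained_variance_ratio_
evr = np.array([0.50, 0.22, 0.12, 0.07, 0.05, 0.03, 0.01])
pcs = np.arange(1, len(evr) + 1)

plt.bar(pcs, evr * 100, label="per-PC EVR")
plt.plot(pcs, np.cumsum(evr) * 100, marker="o", color="black", label="cumulative EVR")
plt.xlabel("Principal component")
plt.ylabel("Explained variation (%)")
plt.title("Scree plot")
plt.legend()
plt.show()

# Keep the smallest number of PCs whose cumulative EVR reaches 90%
n_keep = int(np.argmax(np.cumsum(evr) >= 0.90)) + 1
print(n_keep)  # 4 with the values above (cumulative: 50, 72, 84, 91, ...)
```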