
# Principal Component Analysis (PCA)

As the number of features or dimensions in a dataset increases, the amount of data required to obtain a statistically significant result increases exponentially. This can lead to issues such as overfitting, increased computation time, and reduced accuracy of machine learning models. These problems, which arise when working with high-dimensional data, are known as the curse of dimensionality.

As the number of dimensions increases, the number of possible combinations of features increases exponentially, which makes it computationally difficult to obtain a representative sample of the data, and tasks such as clustering or classification become expensive to perform. Additionally, some machine learning algorithms are sensitive to the number of dimensions and require more data to achieve the same level of accuracy as on lower-dimensional data.
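The exponential growth described above can be made concrete with a small, hypothetical calculation: if each feature is split into a fixed number of bins, the number of grid cells needed to cover the feature space is the bin count raised to the number of features. The bin count of 10 and the helper name `cells_needed` are assumptions chosen only for illustration.

```python
# Hypothetical illustration: how many grid cells are needed to cover the
# feature space if each feature is split into a fixed number of bins.
bins_per_feature = 10  # assumed resolution, chosen only for illustration


def cells_needed(n_features, bins=bins_per_feature):
    """Number of cells in a regular grid with `bins` bins per feature."""
    return bins ** n_features


for d in (1, 2, 3, 10, 30):
    print(f"{d:>2} features -> {cells_needed(d):.3e} cells")
```

With 30 features, as in the dataset used later in this article, a fixed number of samples covers only a vanishing fraction of such a grid.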

To address the curse of dimensionality, feature engineering techniques are used, which include feature selection and feature extraction. Dimensionality reduction is a type of feature extraction technique that aims to reduce the number of input features while retaining as much of the original information as possible.

In this article, we will discuss one of the most popular dimensionality reduction techniques, i.e., Principal Component Analysis (PCA).

## Principal Component Analysis (PCA)

The Principal Component Analysis (PCA) technique was introduced by the mathematician Karl Pearson in 1901. It works on the condition that when data in a higher-dimensional space is mapped to a lower-dimensional space, the variance of the data in the lower-dimensional space should be as large as possible.

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is one of the most widely used tools in exploratory data analysis and in machine learning for predictive models. Moreover, PCA is an unsupervised learning technique used to examine the interrelations among a set of variables. It is also known as a general factor analysis, where regression determines a line of best fit.

The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a dataset while preserving the most important patterns or relationships between the variables without any prior knowledge of the target variables.

PCA reduces the dimensionality of a dataset by finding a new set of variables, smaller than the original set, that retains most of the sample's information and is useful for regression and classification.

Principal Component Analysis

1. PCA is a technique for dimensionality reduction that identifies a set of orthogonal axes, called principal components, that capture the maximum variance in the data. The principal components are linear combinations of the original variables in the dataset and are ordered in decreasing order of importance. The total variance captured by all the principal components is equal to the total variance in the original dataset.
2. The first principal component captures the most variation in the data, the second principal component captures the maximum remaining variance in a direction orthogonal to the first, and so on.
3. PCA can be used for a variety of purposes, including data visualization, feature selection, and data compression. In data visualization, PCA can be used to plot high-dimensional data in two or three dimensions, making it easier to interpret. In feature selection, PCA can be used to identify the most important variables in a dataset. In data compression, PCA can be used to reduce the size of a dataset without losing important information.
4. In PCA, it is assumed that the information is carried in the variance of the features, that is, the higher the variation in a feature, the more information that feature carries.

Overall, PCA is a powerful tool for data analysis and can help to simplify complex datasets, making them easier to understand and work with.
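Two of the properties listed above, that the principal axes are orthogonal and that the components together capture the total variance of the data, can be checked numerically. The following is a minimal sketch on synthetic data; the mixing matrix and the random seed are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data: 200 samples, 3 features (illustrative only)
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.3]])
X = rng.normal(size=(200, 3)) @ mixing

cov = np.cov(X, rowvar=False)
# eigh is meant for symmetric matrices; it returns ascending eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# Reverse so that the components are ordered by decreasing variance
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]

# The principal axes are orthonormal: V^T V = I
orthonormal = np.allclose(eigenvectors.T @ eigenvectors, np.eye(3))
# The eigenvalues sum to the total variance (trace of the covariance matrix)
total_variance_preserved = np.isclose(eigenvalues.sum(), np.trace(cov))

print(orthonormal, total_variance_preserved)
print("variance per component:", eigenvalues)
```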

### Important Concepts for Principal Component Analysis (PCA)

#### Standardization

First, we need to standardize our dataset to ensure that each variable has a mean of 0 and a standard deviation of 1.

$$Z = \frac{X - \mu}{\sigma}$$

Here,

• $\mu$ is the mean of the independent features
• $\sigma$ is the standard deviation of the independent features
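As a quick check of the standardization step, the sketch below applies it to a small made-up array whose two features live on very different scales. The values are invented for illustration, and `ddof=1` is used so that the standard deviation matches the pandas default used later in this article.

```python
import numpy as np

# Toy data: 4 samples, 2 features on very different scales (made-up values)
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)  # ddof=1 matches the pandas default
Z = (X - mu) / sigma

print(Z.mean(axis=0))          # approximately 0 for every feature
print(Z.std(axis=0, ddof=1))   # approximately 1 for every feature
```

After standardization both features contribute on the same scale, so neither dominates the covariance matrix simply because of its units.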

#### Covariance

Covariance measures the strength of joint variability between two or more variables, indicating how much they change in relation to each other. To find the covariance we can use the formula:

$$\operatorname{cov}(x_1, x_2) = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{n - 1}$$

The value of covariance can be positive, negative, or zero.

• Positive: As x1 increases, x2 also increases.
• Negative: As x1 increases, x2 decreases.
• Zero: No direct (linear) relation.
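The sign of the covariance can be illustrated with a few hand-picked values. This is a small sketch, assuming the sample covariance with an n - 1 denominator; the helper `sample_cov` and the arrays are hypothetical.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2_up = np.array([2.0, 4.0, 6.0, 8.0, 10.0])    # rises with x1
x2_down = np.array([10.0, 8.0, 6.0, 4.0, 2.0])  # falls as x1 rises


def sample_cov(a, b):
    """Sample covariance with the n - 1 denominator."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)


print(sample_cov(x1, x2_up))    # positive
print(sample_cov(x1, x2_down))  # negative
```

The same numbers can be read off the off-diagonal entries of `np.cov(x1, x2_up)` and `np.cov(x1, x2_down)`.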

#### Eigenvalues and Eigenvectors

Let A be a square n×n matrix and X be a non-zero vector for which

$$AX = \lambda X$$

for some scalar value $\lambda$. Then $\lambda$ is known as an eigenvalue of matrix A, and X is known as an eigenvector of matrix A for the corresponding eigenvalue.

It can also be written as:

$$AX - \lambda X = 0$$

$$(A - \lambda I)X = 0$$

where I is the identity matrix of the same shape as matrix A. For a non-zero X, the above conditions hold only if $(A - \lambda I)$ is non-invertible (i.e., a singular matrix). That means,

$$|A - \lambda I| = 0$$

From the above equation, we can find the eigenvalues $\lambda$, and the corresponding eigenvectors can then be found using the equation $AX = \lambda X$.
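The eigenvalue conditions can be verified numerically. The sketch below uses a small symmetric matrix chosen only for illustration and checks, for every eigenpair returned by NumPy, that the matrix times the eigenvector equals the eigenvalue times the eigenvector, and that subtracting the eigenvalue from the diagonal makes the matrix singular.

```python
import numpy as np

# Small symmetric example matrix (chosen only for illustration)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# The eigenvectors are the columns of the returned matrix
for lam, v in zip(eigenvalues, eigenvectors.T):
    satisfies_definition = np.allclose(A @ v, lam * v)
    # A - lambda * I should be singular, i.e. its determinant is ~0
    singular = np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0)
    print(lam, satisfies_definition, singular)
```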

### How Principal Component Analysis (PCA) Works

PCA employs a linear transformation that preserves as much of the variance in the data as possible using the fewest dimensions. It involves the following steps, shown here on the breast cancer dataset that ships with scikit-learn:

```python
import pandas as pd
import numpy as np

# Here we are using the built-in breast cancer dataset of scikit-learn
from sklearn.datasets import load_breast_cancer

# Load the dataset as pandas objects
cancer = load_breast_cancer(as_frame=True)
# Create the dataframe
df = cancer.frame

# Check the shape
print('Original Dataframe shape :', df.shape)

# Input features
X = df[cancer['feature_names']]
print('Inputs Dataframe shape   :', X.shape)
```

Output:

Original Dataframe shape : (569, 31)
Inputs Dataframe shape   : (569, 30)

```python
# Mean
X_mean = X.mean()

# Standard deviation
X_std = X.std()

# Standardization
Z = (X - X_mean) / X_std
```

```python
# Covariance matrix
c = Z.cov()

# Plot the covariance matrix
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(c)
plt.show()
```

Output:

Covariance Map

```python
eigenvalues, eigenvectors = np.linalg.eig(c)
print('Eigen values:\n', eigenvalues)
print('Eigen values Shape:', eigenvalues.shape)
print('Eigen Vector Shape:', eigenvectors.shape)
```

Output:

Eigen values:
[1.32816077e+01 5.69135461e+00 2.81794898e+00 1.98064047e+00
1.64873055e+00 1.20735661e+00 6.75220114e-01 4.76617140e-01
4.16894812e-01 3.50693457e-01 2.93915696e-01 2.61161370e-01
2.41357496e-01 1.57009724e-01 9.41349650e-02 7.98628010e-02
5.93990378e-02 5.26187835e-02 4.94775918e-02 1.33044823e-04
7.48803097e-04 1.58933787e-03 6.90046388e-03 8.17763986e-03
1.54812714e-02 1.80550070e-02 2.43408378e-02 2.74394025e-02
3.11594025e-02 2.99728939e-02]
Eigen values Shape: (30,)
Eigen Vector Shape: (30, 30)

```python
# Index the eigenvalues in descending order
idx = eigenvalues.argsort()[::-1]

# Sort the eigenvalues in descending order
eigenvalues = eigenvalues[idx]

# Sort the corresponding eigenvectors accordingly
eigenvectors = eigenvectors[:, idx]
```

```python
# Cumulative share of the total variance explained by the first k components
explained_var = np.cumsum(eigenvalues) / np.sum(eigenvalues)
explained_var
```

Output:

array([0.44272026, 0.63243208, 0.72636371, 0.79238506, 0.84734274,
0.88758796, 0.9100953 , 0.92598254, 0.93987903, 0.95156881,
0.961366  , 0.97007138, 0.97811663, 0.98335029, 0.98648812,
0.98915022, 0.99113018, 0.99288414, 0.9945334 , 0.99557204,
0.99657114, 0.99748579, 0.99829715, 0.99889898, 0.99941502,
0.99968761, 0.99991763, 0.99997061, 0.99999557, 1.        ])

#### Step 6: Determine the number of principal components

We can either pick the number of principal components directly or derive it from a threshold on the cumulative explained variance. Here we keep enough components to explain at least 50% of the variance. Let's check how many principal components that requires.

```python
# First position where the cumulative explained variance reaches 50%
n_components = np.argmax(explained_var >= 0.50) + 1
n_components
```

Output:

2
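The 50% threshold is only one possible choice. As a sketch of how the count changes with the threshold, the block below reuses the first ten cumulative explained-variance values copied from the output above; the helper `components_for` is a hypothetical name, and `np.searchsorted` is used here as an alternative to `np.argmax` that relies on the cumulative sums being non-decreasing.

```python
import numpy as np

# First ten cumulative explained-variance ratios, copied from the output above
explained_var = np.array([0.44272026, 0.63243208, 0.72636371, 0.79238506,
                          0.84734274, 0.88758796, 0.9100953, 0.92598254,
                          0.93987903, 0.95156881])


def components_for(threshold, cumulative=explained_var):
    """Smallest number of components whose cumulative ratio reaches `threshold`.

    Only valid for thresholds up to the last value in `cumulative`,
    because this array is truncated to ten entries.
    """
    # searchsorted returns the first index with cumulative[idx] >= threshold
    return int(np.searchsorted(cumulative, threshold, side="left")) + 1


for t in (0.50, 0.90, 0.95):
    print(f"{t:.0%} of variance -> {components_for(t)} components")
```

On these values, 50% needs 2 components, 90% needs 7, and 95% needs 10, so a stricter threshold keeps noticeably more of the 30 original dimensions.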

#### Step 7: Project the data onto the selected principal components

Steps:

• Find the projection matrix. It is the matrix of eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the data, and it projects the high-dimensional dataset onto a lower-dimensional subspace.
• The eigenvectors of the covariance matrix of the data are referred to as the principal axes of the data, and the projections of the data instances onto these principal axes are called the principal components.

```python
# PCA component or unit matrix
u = eigenvectors[:, :n_components]
pca_component = pd.DataFrame(u,
                             index=cancer['feature_names'],
                             columns=['PC1', 'PC2'])

# Plotting heatmap
plt.figure(figsize=(5, 7))
sns.heatmap(pca_component)
plt.title('PCA Component')
plt.show()
```

Output:

Principal component

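• Then, we project our dataset using the formula:

$$Z_{pca} = Z \, U$$

where $Z$ is the standardized data and $U$ is the projection matrix whose columns are the selected eigenvectors.

• Dimensionality reduction is then obtained by only retaining those axes (dimensions) that account for most of the variance, and discarding all others.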

Finding Projection in PCA

```python
# Matrix multiplication or dot product
Z_pca = Z @ pca_component

Z_pca = pd.DataFrame(Z_pca.values,
                     columns=['PCA1', 'PCA2'])
Z_pca.head()
```

Output:

PCA
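One way to sanity-check a projection like this is to look at the covariance of the projected data: the new variables should be uncorrelated, and their variances should equal the largest eigenvalues. The following is a minimal sketch on synthetic data standing in for the standardized matrix; the data, the seed, and the choice of two components are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for standardized data: 300 samples, 4 correlated features
raw = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

cov = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # descending order
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
U = eigenvectors[:, :k]   # projection matrix, shape (4, k)
Z_pca = Z @ U             # projected data, shape (300, k)

score_cov = np.cov(Z_pca, rowvar=False)
print(Z_pca.shape)
# Diagonal holds the top-k eigenvalues, off-diagonal entries are ~0
print(np.round(score_cov, 6))
```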

The same projection can be obtained with the PCA class from scikit-learn, fitted on the standardized data Z:

```python
# Importing PCA
from sklearn.decomposition import PCA

# Let's say, components = 2
pca = PCA(n_components=2)
pca.fit(Z)
x_pca = pca.transform(Z)

# Create the dataframe
df_pca1 = pd.DataFrame(x_pca,
                       columns=['PC{}'.format(i + 1)
                                for i in range(n_components)])
df_pca1.head()
```

Output:

PCA

Comparing this with the Z_pca result above, the values are the same (up to a possible sign flip of a whole component, since the direction of an eigenvector is arbitrary).
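The agreement between the manual computation and the library result is expected, because the eigen-decomposition of the covariance matrix and a singular value decomposition (SVD) of the centered data yield the same principal axes. scikit-learn's PCA is SVD-based; the sketch below uses `np.linalg.svd` directly as a stand-in rather than the library class, on synthetic data chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic standardized data: 100 samples, 3 correlated features
raw = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Route 1: eigen-decomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
scores_eig = Z @ eigenvectors[:, order][:, :2]

# Route 2: SVD of the centered data (rows of Vt are the principal axes)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
scores_svd = Z @ Vt[:2].T

# Each component is only defined up to sign, so compare absolute values
same_up_to_sign = np.allclose(np.abs(scores_eig), np.abs(scores_svd))
print(same_up_to_sign)
```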

```python
# Giving a larger plot
plt.figure(figsize=(8, 6))

plt.scatter(x_pca[:, 0], x_pca[:, 1],
            c=cancer['target'],
            cmap='plasma')

# Labeling x and y axes
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
```

Output:

PCA

#### Find the PCA components (the principal axes)

The components_ attribute holds the principal axes, one per row. They should match the columns of the unit matrix u that we calculated earlier, up to sign.

```python
# Components
pca.components_
```

Output:

array([[ 0.21890244,  0.10372458,  0.22753729,  0.22099499,  0.14258969,
0.23928535,  0.25840048,  0.26085376,  0.13816696,  0.06436335,
0.20597878,  0.01742803,  0.21132592,  0.20286964,  0.01453145,
0.17039345,  0.15358979,  0.1834174 ,  0.04249842,  0.10256832,
0.22799663,  0.10446933,  0.23663968,  0.22487053,  0.12795256,
0.21009588,  0.22876753,  0.25088597,  0.12290456,  0.13178394],
[-0.23385713, -0.05970609, -0.21518136, -0.23107671,  0.18611302,
0.15189161,  0.06016536, -0.0347675 ,  0.19034877,  0.36657547,
-0.10555215,  0.08997968, -0.08945723, -0.15229263,  0.20443045,
0.2327159 ,  0.19720728,  0.13032156,  0.183848  ,  0.28009203,
-0.21986638, -0.0454673 , -0.19987843, -0.21935186,  0.17230435,
0.14359317,  0.09796411, -0.00825724,  0.14188335,  0.27533947]])

1. Dimensionality Reduction: PCA is a popular technique used for dimensionality reduction, which is the process of reducing the number of variables in a dataset. By reducing the number of variables, PCA simplifies data analysis, improves performance, and makes it easier to visualize data.
2. Feature Selection: PCA can be used for feature selection, which is the process of selecting the most important variables in a dataset. This is useful in machine learning, where the number of variables can be very large, and it is difficult to identify the most important variables.
3. Data Visualization: PCA can be used for data visualization. By reducing the number of variables, PCA can plot high-dimensional data in two or three dimensions, making it easier to interpret.
4. Multicollinearity: PCA can be used to deal with multicollinearity, a common problem in regression analysis where two or more independent variables are highly correlated. PCA can help identify the underlying structure in the data and create new, uncorrelated variables that can be used in the regression model.
5. Noise Reduction: PCA can be used to reduce the noise in data. By removing the principal components with low variance, which are assumed to represent noise, PCA can improve the signal-to-noise ratio and make it easier to identify the underlying structure in the data.
6. Data Compression: PCA can be used for data compression. By representing the data using a smaller number of principal components, which capture most of the variation in the data, PCA can reduce the storage requirements and speed up processing.
7. Outlier Detection: PCA can be used for outlier detection. Outliers are data points that are significantly different from the other data points in the dataset. PCA can identify these outliers by looking for data points that are far from the other points in the principal component space.
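The data compression use case can be sketched in a few lines: keep only the first k components, map back to the original space, and measure what was lost. The example below uses synthetic centered data, and the variable names are hypothetical. It relies on the fact that the squared reconstruction error, normalized the same way as the covariance matrix, equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
Z = raw - raw.mean(axis=0)            # centered synthetic data

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
U = eigenvectors[:, :k]
Z_compressed = Z @ U                  # 500 x 2 instead of 500 x 5
Z_restored = Z_compressed @ U.T       # approximate reconstruction

# Squared reconstruction error with the n - 1 denominator used by np.cov
error = np.sum((Z - Z_restored) ** 2) / (len(Z) - 1)
discarded_variance = eigenvalues[k:].sum()
print(error, discarded_variance)      # the two numbers should agree
```

The smaller the discarded eigenvalues, the less information is lost by storing only the compressed representation.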