Skip to content
Related Articles
Open in App
Not now

Related Articles

ML | Label Encoding of datasets in Python

Improve Article
Save Article
  • Difficulty Level : Easy
  • Last Updated : 11 Jan, 2023
Improve Article
Save Article

In machine learning, we usually deal with datasets that contain multiple labels in one or more than one columns. These labels can be in the form of words or numbers. To make the data understandable or in human-readable form, the training data is often labelled in words. 

Label Encoding refers to converting the labels into a numeric form so as to convert them into the machine-readable form. Machine learning algorithms can then decide in a better way how those labels must be operated. It is an important pre-processing step for the structured dataset in supervised learning.

Example : 
Suppose we have a column Height in some dataset. 

After applying label encoding, the Height column is converted into: 

where 0 is the label for tall, 1 is the label for medium, and 2 is a label for short height.

We apply Label Encoding on iris dataset on the target column which is Species. It contains three species Iris-setosa, Iris-versicolor, Iris-virginica


# Import libraries 
import numpy as np
import pandas as pd
# Import dataset
df = pd.read_csv('../../data/Iris.csv')


array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

After applying Label Encoding – 


# Import label encoder
from sklearn import preprocessing
# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'species'.
df['species']= label_encoder.fit_transform(df['species'])


array([0, 1, 2], dtype=int64)

Limitation of label Encoding 
Label encoding converts the data in machine-readable form, but it assigns a unique number(starting from 0) to each class of data. This may lead to the generation of priority issues in the training of data sets. A label with a high value may be considered to have high priority than a label having a lower value.

An attribute having output classes Mexico, Paris, Dubai. On Label Encoding, this column lets Mexico is replaced with 0, Paris is replaced with 1, and Dubai is replaced with 2. 
With this, it can be interpreted that Dubai has high priority than Mexico and Paris while training the model, But actually, there is no such priority relation between these cities here.

My Personal Notes arrow_drop_up
Related Articles

Start Your Coding Journey Now!