ML | Overview of Data Cleaning

Data cleaning is one of the most important parts of machine learning and plays a significant role in building a model. It isn’t the fanciest part of machine learning, and there aren’t any hidden tricks or secrets to uncover, but the success or failure of a project often depends on proper data cleaning. Professional data scientists usually invest a large portion of their time in this step because of the belief that “better data beats fancier algorithms.”

If we have a well-cleaned dataset, we can often achieve good results even with simple algorithms, which is especially beneficial in terms of computation when the dataset is large. Different types of data require different types of cleaning, but the systematic approach described here can serve as a good starting point.

Steps involved in Data Cleaning: Data cleaning is a crucial step in the machine learning (ML) pipeline. It involves identifying and handling missing, duplicate, or irrelevant data. The goal is to ensure that the data is accurate, consistent, and free of errors, because incorrect or inconsistent data can degrade the performance of the ML model.

Data cleaning, also known as data cleansing and often treated as part of data preprocessing, is a crucial step in the data science pipeline that involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in the data to improve its quality and usability. Data cleaning is essential because raw data is often noisy, incomplete, and inconsistent, which can reduce the accuracy and reliability of the insights derived from it.

The following are the most common steps involved in data cleaning:

  1. Data inspection and exploration: This step involves understanding the data by inspecting its structure, identifying missing values, outliers, and inconsistencies.
  2. Handling missing data: Missing data is a common issue in real-world datasets, and it can occur due to various reasons such as human errors, system failures, or data collection issues. Various techniques can be used to handle missing data, such as imputation, deletion, or substitution.
  3. Handling outliers: Outliers are extreme values that deviate significantly from the majority of the data. They can negatively impact the analysis and model performance. Techniques such as clustering, interpolation, or transformation can be used to handle outliers.
  4. Data transformation: Data transformation involves converting the data from one form to another to make it more suitable for analysis. Techniques such as normalization, scaling, or encoding can be used to transform the data.
  5. Data integration: Data integration involves combining data from multiple sources into a single dataset to facilitate analysis. It involves handling inconsistencies, duplicates, and conflicts between the datasets.
  6. Data validation and verification: Data validation and verification involve ensuring that the data is accurate and consistent by comparing it with external sources or expert knowledge.
  7. Data formatting: Data formatting involves converting the data into a standard format or structure that can be easily processed by the algorithms or models used for analysis.
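
The inspection step (step 1 above) can be sketched with pandas as follows. This is a minimal illustration rather than a prescribed procedure; the toy DataFrame and its column names (age, city, income) are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the columns and values are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32],
    "city": ["Delhi", "delhi", "Mumbai", None, "delhi"],
    "income": [30000, 42000, 39000, 1000000, 42000],
})

# Structure: one dtype per column
print(df.dtypes)

# Missing values: count of NaN/None per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Duplicates: number of rows that exactly repeat an earlier row
n_duplicates = int(df.duplicated().sum())
print("duplicate rows:", n_duplicates)

# Inconsistencies: distinct spellings of a categorical value
print(df["city"].value_counts(dropna=False))

# Possible outliers: summary statistics for numerical columns
print(df.describe())
```

Here the value counts expose "Delhi" and "delhi" as separate categories, and the summary statistics make the unusually large income value easy to spot before any cleaning decision is taken.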

In summary, data cleaning is a crucial step in the data science pipeline that involves identifying and correcting errors, inconsistencies, and inaccuracies in the data to improve its quality and usability. It involves various techniques such as handling missing data, handling outliers, data transformation, data integration, data validation and verification, and data formatting. The goal of data cleaning is to prepare the data for analysis and ensure that the insights derived from it are accurate and reliable.


1. Removal of unwanted observations: This includes deleting duplicate, redundant, or irrelevant observations from your dataset. Duplicate observations most frequently arise during data collection, and irrelevant observations are those that don’t actually fit the specific problem that you’re trying to solve.

  • Redundant observations distort results: because the same data is repeated, it is over-weighted, which can push the model’s estimates toward either the correct or the incorrect side and produce unreliable results.
  • Irrelevant observations are any type of data that is of no use for the problem at hand and can be removed directly.
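
A minimal sketch of this step with pandas is given here. The orders table, its columns, and the rule that only rows with country "IN" are relevant are all hypothetical assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: order 2 was recorded twice during collection.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "country": ["IN", "IN", "IN", "US", "IN"],
    "amount": [250, 400, 400, 120, 90],
})

# Remove exact duplicate rows, keeping the first occurrence
deduped = orders.drop_duplicates()

# Remove irrelevant observations. Assumption: the problem being
# solved concerns only orders from country "IN".
relevant = deduped[deduped["country"] == "IN"].reset_index(drop=True)

print(relevant)
```

What counts as irrelevant depends entirely on the problem, so the filter condition should come from the problem definition and not from the data alone.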

2. Fixing Structural errors: The errors that arise during measurement, transfer of data, or other similar situations are called structural errors. Structural errors include typos in feature names, the same attribute appearing under different names, mislabeled classes (i.e. separate classes that should really be the same), and inconsistent capitalization.

For example, the model will treat America and america as different classes or values, though they represent the same value. Likewise, it may treat red, yellow, and red-yellow as different classes or attributes, though the third can be represented by the other two. Structural errors like these make the model less effective and give poor-quality results.
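
The capitalization example above can be sketched in pandas roughly as follows. The column name, the sample values, and the synonyms mapping are hypothetical; in practice such a mapping has to be built from knowledge of the dataset.

```python
import pandas as pd

# Hypothetical data with a padded column name and inconsistent labels.
df = pd.DataFrame(
    {"Country ": ["America", "america", " AMERICA", "USA", "India", "india "]}
)

# Fix the feature name: strip stray whitespace and lower-case it
df = df.rename(columns=lambda name: name.strip().lower())

# Fix inconsistent capitalization and whitespace in the values
df["country"] = df["country"].str.strip().str.lower()

# Merge labels that mean the same thing (assumed mapping)
synonyms = {"usa": "america"}
df["country"] = df["country"].replace(synonyms)

print(df["country"].value_counts())
```

After these steps the six raw spellings collapse into two classes, which is what the model should have seen in the first place.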

3. Managing Unwanted outliers: Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models. Generally, we should not remove outliers until we have a legitimate reason to remove them. Sometimes, removing them improves performance, sometimes not. So, one must have a good reason to remove the outlier, such as suspicious measurements that are unlikely to be part of real data.
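
One way to find candidate outliers without deleting anything is sketched here. The 1.5 × IQR rule used in it is a common convention, not something the text above prescribes, and the sample readings are made up for illustration.

```python
import pandas as pd

# Hypothetical sensor readings; the last value looks suspicious.
values = pd.Series([12, 14, 13, 15, 14, 13, 16, 120], name="reading")

# Interquartile range (IQR) of the data
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences instead of dropping them
is_outlier = (values < lower) | (values > upper)

print(values[is_outlier])
```

Flagging keeps the decision separate from the detection: each flagged value can then be checked for a legitimate reason to remove it, such as a known measurement fault.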

4. Handling missing data: Missing data is a deceptively tricky issue in machine learning. We cannot just ignore or remove observations with missing values. They must be handled carefully, as the missingness can be an indication of something important.

The two most common ways to deal with missing data are listed below, each with its drawbacks:

  1. Dropping observations with missing values.
    • The fact that the value was missing may be informative in itself.
    • Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!
  2. Imputing the missing values from other observations.
    • Again, “missingness” is often informative in itself, and you should tell your algorithm if a value was missing.
    • Even if you build a model to impute your values, you’re not adding any real information. You’re just reinforcing the patterns already provided by other features.

Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle. 
So, missingness is often informative and can be an indication of something important, and we must make our algorithm aware of missing data by flagging it. With this technique of flagging and filling (adding an indicator for missingness and then filling the original value with a constant such as 0), you are essentially allowing the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
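
A minimal sketch of flagging and filling with pandas is given here. The income column, the name of the indicator column, and the fill value of 0 are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing income values.
df = pd.DataFrame({"income": [30000.0, np.nan, 45000.0, np.nan, 52000.0]})

# Flag: record which rows were missing before anything is filled
df["income_missing"] = df["income"].isna().astype(int)

# Fill: replace the missing values with a constant so that models
# which cannot handle NaN can still use the column
df["income"] = df["income"].fillna(0)

print(df)
```

The order matters: the indicator has to be computed before filling, otherwise the information about which values were missing is lost.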

Some data cleansing tools:

  • OpenRefine
  • Trifacta Wrangler
  • TIBCO Clarity
  • Cloudingo
  • IBM InfoSphere QualityStage

Data cleaning is an important step in the machine learning process because it can have a significant impact on the quality and performance of a model. Data cleaning involves identifying and correcting or removing errors and inconsistencies in the data.

Here is a simple example of data cleaning in Python:


import pandas as pd

# Load the data
df = pd.read_csv("data.csv")
# Drop rows with missing values (the simplest option; see the caveats on missing data above)
df = df.dropna()
# Remove duplicate rows
df = df.drop_duplicates()
# Remove unnecessary columns
df = df.drop(columns=["col1", "col2"])
# Standardize a numerical column to zero mean and unit variance
df["col3"] = (df["col3"] - df["col3"].mean()) / df["col3"].std()
# One-hot encode a categorical column. get_dummies returns one column per
# category, so it replaces col4 in the frame instead of being assigned back to it
df = pd.get_dummies(df, columns=["col4"])
# Save the cleaned data
df.to_csv("cleaned_data.csv", index=False)

This code has no explicit output statements, so it produces no output when it is run. Instead, it modifies the data stored in the df DataFrame and saves it to a new CSV file.

If you want to see the cleaned data, you can print the df DataFrame, for example by adding print(df) at the end of the code, or read the saved CSV file.


Advantages of Data Cleaning in Machine Learning:

  1. Improved model performance: Data cleaning helps improve the performance of the ML model by removing errors, inconsistencies, and irrelevant data, which can help the model to better learn from the data.
  2. Increased accuracy: Data cleaning helps ensure that the data is accurate, consistent, and free of errors, which can help improve the accuracy of the ML model.
  3. Better representation of the data: Data cleaning allows the data to be transformed into a format that better represents the underlying relationships and patterns in the data, making it easier for the ML model to learn from the data.
  4. Improved data quality: Data cleaning helps to improve the quality of the data, making it more reliable and accurate. This ensures that the machine learning models are trained on high-quality data, which can lead to better predictions and outcomes.
  5. Improved data security: Data cleaning can help to identify and remove sensitive or confidential information that could compromise data security. By eliminating this information, data cleaning can help to ensure that only the necessary and relevant data is used for machine learning.

Disadvantages of Data Cleaning in Machine Learning:

  1. Time-consuming: Data cleaning can be a time-consuming task, especially for large and complex datasets.
  2. Error-prone: Data cleaning can be error-prone, as it involves transforming and cleaning the data, which can result in the loss of important information or the introduction of new errors.
  3. Limited understanding of the data: Data cleaning can lead to a limited understanding of the data, as the transformed data may not be representative of the underlying relationships and patterns in the data.
  4. Data loss: Data cleaning can result in the loss of important information that may be valuable for machine learning analysis. In some cases, data cleaning may result in the removal of data that appears to be irrelevant or inconsistent, but which may contain valuable insights or patterns.
  5. Cost and resource-intensive: Data cleaning can be a resource-intensive process that requires significant time, effort, and expertise. It can also require the use of specialized software tools, which can add to the cost and complexity of data cleaning.
  6. Overfitting: Overfitting occurs when a machine learning model is trained too closely on a particular dataset, resulting in poor performance when applied to new or different data. Data cleaning can inadvertently contribute to overfitting by removing too much data, leading to a loss of information that could be important for model training and performance.
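
The data-loss risk can be checked before committing to a cleaning step. This sketch, using a made-up DataFrame with hypothetical columns a, b, and c, measures how many rows a blanket dropna() would discard.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column is missing only one or two values.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "b": [np.nan, 2.0, 3.0, np.nan, 5.0, 6.0],
    "c": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
})

# Share of missing values per column
print(df.isna().mean())

# Rows that survive dropping every row with any missing value
rows_before = len(df)
rows_after = len(df.dropna())
fraction_lost = 1 - rows_after / rows_before

print(f"rows kept: {rows_after} of {rows_before}")
print(f"fraction lost: {fraction_lost:.2f}")
```

Although no single column is missing more than a third of its values, the missing entries fall in different rows, so dropping incomplete rows removes most of the dataset.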

Conclusion: We have discussed four steps in data cleaning (removing unwanted observations, fixing structural errors, managing outliers, and handling missing data) that make the data more reliable and help produce good results. After properly completing these steps, we’ll have a robust dataset that avoids many of the most common pitfalls. This step should not be rushed, as it benefits every later stage of the process.

Last Updated : 05 May, 2023