Data Cleaning in R

  • Last Updated : 27 Jun, 2022

This article briefly covers data cleaning, its purpose, and how to implement it in the R programming language.

Data Cleaning in R

Data cleaning is the process of transforming raw data into consistent data that can be easily analyzed. It aims to improve the reliability of the statistical statements that are based on the data. In doing so, it also improves data quality and overall productivity.

Purpose of Data Cleaning

The following are the various purposes of data cleaning:

  • Eliminate Errors
  • Eliminate Redundancy
  • Increase Data Reliability
  • Deliver Accuracy
  • Ensure Consistency
  • Assure Completeness
  • Standardize your approach

Overview of a typical data analysis chain

This section gives an overview of a typical data analysis chain. Each rectangle in the figure represents data in a certain state, while each arrow represents the activities needed to get from one state to the next. The first state (Raw data) is the data as it comes in. Raw data may lack headers or contain wrong data types, wrong category labels, unknown or unexpected character encoding, and so on. Once these problems have been fixed by preprocessing, the data can be deemed Technically correct data. That is, in this state the data can be read into an R data.frame, with correct names, types, and labels, without further trouble. However, this does not mean that the values are error-free or complete. Consistent data is the stage where the data is ready for statistical inference. It is the data that most statistical theories use as a starting point.
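As a rough illustration of the step from raw data to technically correct data, the sketch below reads a small headerless text sample and gives it names and types. The sample values and the column names `age` and `group` are invented for this example and are not part of the article's dataset.

```r
# Hypothetical raw input: no header row, every field arrives as text
raw_text <- "23,a\n35,b\nunknown,a"

raw <- read.csv(text = raw_text, header = FALSE,
                colClasses = "character")

# Assign column names, then coerce each column to its intended type
names(raw) <- c("age", "group")
typed <- data.frame(
  age   = suppressWarnings(as.numeric(raw$age)),  # "unknown" becomes NA
  group = factor(raw$group, levels = c("a", "b"))
)

# Inspect the resulting names and types
str(typed)
```

The result is technically correct (right names and types) but not yet consistent, because the third `age` value is still missing.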

 

How to clean data in R

Cleaning moves the data from its initial raw state toward consistent data that is ready for analysis and yields accurate statistical results. The exact steps vary from dataset to dataset, so the user should know the data being analyzed. The characteristics of clean data are general, but the symptoms of messy data depend on the data at hand.

Characteristics of clean data:

  •   Free of duplicate rows/values
  •   Error-free (no misspellings)
  •   Relevant (no stray special characters)
  •   Stored in the appropriate data type for analysis
  •   Free of outliers (or contains only outliers that have been identified/understood)
  •   Follows a “tidy data” structure

Common symptoms of messy data:

  •   Special characters (e.g. commas in numeric values)
  •   Numeric values stored as text/character data types
  •   Duplicate rows
  •   Misspellings
  •   Inaccuracies
  •   White space
  •   Missing data
  •   Zeros used in place of null values
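Several of these symptoms can be handled with base R functions. The following is a minimal sketch on a made-up data frame: the `messy` data, its columns, and its values are all hypothetical, and treating zero as missing is an assumption that only holds when zero is known to mean "not recorded".

```r
# Hypothetical messy data: all values are invented for illustration
messy <- data.frame(
  city   = c(" Delhi", "Mumbai ", "Mumbai ", "Pune"),
  income = c("1,200", "3,450", "3,450", "0"),
  stringsAsFactors = FALSE
)

# White space: strip leading and trailing blanks
messy$city <- trimws(messy$city)

# Special characters and numbers stored as text: drop commas, then convert
messy$income <- as.numeric(gsub(",", "", messy$income, fixed = TRUE))

# Zeros standing in for missing values: recode as NA (assumption, see above)
messy$income[messy$income == 0] <- NA

# Duplicate rows: keep the first occurrence of each
cleaned <- messy[!duplicated(messy), ]

print(cleaned)
```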

Implementing Data Cleaning in R

For this, we will use the built-in airquality dataset, which is available in R.

R




head(airquality)


Output:

 

In the above dataset, we can see NA values in some columns. These can cause errors or lead to inaccurate predictions from a machine learning model.

Handling missing values in R

To handle missing values, we first check the columns of the dataset. If a column contains missing data, functions such as mean() return NA, which is not suitable for most models. Let’s check this using the mean() function.

R




mean(airquality$Solar.R)


Output:

<NA>

Checking another column

R




mean(airquality$Ozone)


Output:

<NA>

Checking the Wind column

Here we get a numeric mean for the Wind column, which means this column has no missing values.

R




mean(airquality$Wind)


Output:

9.95751633986928
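Calling mean() on each column is one way to spot missing values. As an alternative sketch, the NA values can be counted directly for every column at once with is.na() and colSums(). The variable names `na_counts` and `cols_with_na` are chosen here for illustration.

```r
# Count missing values in every column of the built-in airquality data
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Names of the columns that contain at least one NA
cols_with_na <- names(na_counts)[na_counts > 0]
print(cols_with_na)
```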

Handling NA values

Passing na.rm = TRUE tells mean() to drop the NA values before computing, which gives a usable mean for both columns.

R




mean(airquality$Solar.R, na.rm = TRUE)


Output:

185.931506849315

Also performing the same operation on another column.

R




mean(airquality$Ozone, na.rm = TRUE)


Output:

42.1293103448276
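The na.rm argument only ignores NA values within a single calculation; the data frame itself still contains them. The sketch below applies the same idea to all columns at once and, as one possible alternative, drops every row that has a missing value using complete.cases(). The names `col_means` and `complete_rows` are illustrative.

```r
# Column means with missing values ignored, for all columns at once
col_means <- sapply(airquality, mean, na.rm = TRUE)
print(round(col_means, 2))

# Alternative: drop every row that has an NA in any column
complete_rows <- airquality[complete.cases(airquality), ]
print(nrow(airquality))     # number of rows before
print(nrow(complete_rows))  # number of rows after
```

Dropping rows is simple but discards the observed values in those rows, which is one reason to consider filling in the missing values instead.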

Data Cleaning Operation

The summary of the dataset shows the number of NA values in the two affected columns (Ozone and Solar.R).

R




summary(airquality)


Output:

 

A boxplot gives a quick visual summary of each column's spread and highlights extreme values.

R




boxplot(airquality)


Output:

 

Replacing the missing values with the column median, using is.na() to locate them.

R




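# Work on a copy so the original airquality data stays unchanged
New_df <- airquality

# Replace each missing Ozone value with the median of the observed values
New_df$Ozone <- ifelse(is.na(New_df$Ozone),
                       median(New_df$Ozone,
                              na.rm = TRUE),
                       New_df$Ozone)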



Performing the same operation in another column.

R




# Replace each missing Solar.R value with the median of the observed values
New_df$Solar.R <- ifelse(is.na(New_df$Solar.R),
                         median(New_df$Solar.R,
                                na.rm = TRUE),
                         New_df$Solar.R)
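When several columns need the same treatment, the repeated ifelse() calls can be wrapped in a small function. This is only a sketch: `impute_median` is a hypothetical helper defined here, not a built-in R function, and median imputation is just one of several possible strategies.

```r
# Hypothetical helper: replace NAs in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper to every numeric column of a copy of airquality
imputed <- airquality
numeric_cols <- sapply(imputed, is.numeric)
imputed[numeric_cols] <- lapply(imputed[numeric_cols], impute_median)

# Confirm that no missing values remain
print(anyNA(imputed))
```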


The summary of the new data frame now shows no NA counts for any column.

R




summary(New_df)


Output:

 

The first rows of the data frame confirm that the missing values have been filled in.

R




head(New_df)


Output:

 

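Finally, we redraw the boxplot for the cleaned data frame. Median imputation fills in the missing values; any outliers present in the observed values remain.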

R




boxplot(New_df)
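To see which values a boxplot flags as outliers without reading them off the plot, the underlying statistics can be inspected with boxplot.stats(). The sketch below does this for the Ozone column of the original airquality data; the variable name `bp` is illustrative.

```r
# Statistics that a boxplot of the Ozone column is drawn from
bp <- boxplot.stats(airquality$Ozone)

# Values plotted as individual points beyond the whiskers
print(bp$out)

# Whisker ends: first and last elements of the five-number summary
print(bp$stats[c(1, 5)])
```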


 

