Data Manipulation in R with Dplyr Package

Last Updated: 22 Aug, 2022

This article discusses how to manipulate data in the R programming language.

For data manipulation, R offers a package called dplyr, which provides many functions for working with data frames. To use these functions, first load the package with library(dplyr). The table below lists a few of the data manipulation functions available in dplyr.

| Function Name | Description |
|---------------|-------------|
| filter()      | Returns the subset of rows of a data frame that satisfy a condition. |
| distinct()    | Removes duplicate rows from a data frame. |
| arrange()     | Reorders the rows of a data frame. |
| select()      | Returns only the required columns of a data frame. |
| rename()      | Renames the columns (variables) of a data frame. |
| mutate()      | Creates new variables without dropping the old ones. |
| transmute()   | Creates new variables and drops the old ones. |
| summarize()   | Gives summarized data such as average, sum, etc. |

filter() method

The filter() function returns the subset of rows that satisfy the condition passed to it. The condition can use comparison operators, logical operators, checks for NA values such as is.na(), range checks, and so on. Rows for which the condition evaluates to NA are dropped. The syntax of the filter() function is given below:

filter(dataframeName, condition)

Example:

In the code below, we use the filter() function to fetch the players who scored more than 100 runs from the “stats” data frame. Player A, with exactly 100 runs, is not included.

R




# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, NA, 5))
 
# fetch players who scored more
# than 100 runs
filter(stats, runs>100)


Output

  player runs wickets
1      B  200      20
2      C  408      NA
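
The logical operators, NA checks, and range checks mentioned above can be sketched as follows. This is a minimal illustration on the same “stats” data frame; the result names (both, missing_wickets, mid_range) are chosen here for the example and are not part of the original article.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# logical AND: more than 50 runs and at least 10 wickets
# player C is dropped because NA >= 10 evaluates to NA, not TRUE
both <- filter(stats, runs > 50 & wickets >= 10)

# keep only the rows where wickets is missing
missing_wickets <- filter(stats, is.na(wickets))

# range check: runs from 100 to 300, both ends inclusive
mid_range <- filter(stats, between(runs, 100, 300))

print(both)
print(missing_wickets)
print(mid_range)
```

Note that filter() does not modify “stats”; each call returns a new data frame.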

distinct() method

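The distinct() function removes duplicate rows from a data frame. If column names are given, rows are treated as duplicates when they match on those columns only, and the first such row is kept. By default only the specified columns are returned; pass .keep_all = TRUE to keep all the other columns as well. The syntax of the distinct() function is given below:

distinct(dataframeName, col1, col2, ..., .keep_all = TRUE)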

Example: 

In this example, we use distinct() first to remove fully duplicated rows from the data frame, and then to remove duplicates based on the player column.

R




# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D', 'A', 'A'),
                runs=c(100, 200, 408, 19, 56, 100),
                wickets=c(17, 20, NA, 5, 2, 17))
 
# removes duplicate rows
distinct(stats)
 
# remove duplicates based on the player column
distinct(stats, player, .keep_all = TRUE)


Output

  player runs wickets
1      A  100      17
2      B  200      20
3      C  408      NA
4      D   19       5
5      A   56       2
  player runs wickets
1      A  100      17
2      B  200      20
3      C  408      NA
4      D   19       5
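
To show what the .keep_all argument changes, the sketch below calls distinct() on specific columns without it. The result names (only_players, player_runs) are illustrative and not from the original article.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D', 'A', 'A'),
                    runs = c(100, 200, 408, 19, 56, 100),
                    wickets = c(17, 20, NA, 5, 2, 17))

# without .keep_all, only the listed column is returned
only_players <- distinct(stats, player)

# duplicates judged on the combination of two columns
player_runs <- distinct(stats, player, runs)

print(only_players)
print(player_runs)
```

Here only_players has a single column with four rows, while player_runs has five rows because (A, 100) and (A, 56) are different combinations.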

arrange() method

The arrange() function orders the rows of a data frame by the values of a specified column, in ascending order by default. The syntax of the arrange() function is given below:

arrange(dataframeName, columnName)

Example:

In the code below, we order the data by runs from low to high using the arrange() function.

R




# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, NA, 5))
 
# ordered data based on runs
arrange(stats, runs)


Output

  player runs wickets
1      D   19       5
2      A  100      17
3      B  200      20
4      C  408      NA
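
For the opposite order, dplyr provides the desc() helper. The sketch below is a small illustration on the same data; the result names (by_runs_desc, by_wickets_desc) are assumptions made for this example.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# order by runs from high to low
by_runs_desc <- arrange(stats, desc(runs))

# order by wickets from high to low;
# arrange() places rows with NA at the end
by_wickets_desc <- arrange(stats, desc(wickets))

print(by_runs_desc)
print(by_wickets_desc)
```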

select() method

The select() function extracts the required columns from a data frame and returns them as a new data frame. The columns are chosen by listing their names. The syntax of the select() function is given below:

select(dataframeName, col1,col2,…)

Example:

In the code below, we fetch only the player and wickets columns using the select() function.

R




# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, NA, 5))
 
# fetch required column data
select(stats, player,wickets)


Output

  player wickets
1      A      17
2      B      20
3      C      NA
4      D       5
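
Besides listing names, select() also accepts negation and selection helpers. The following is a minimal sketch of a few common forms; the result names (no_runs, w_cols, reordered) are illustrative only.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# drop a column by negating its name
no_runs <- select(stats, -runs)

# pick columns whose names start with "w"
w_cols <- select(stats, starts_with("w"))

# move wickets to the front and keep everything else after it
reordered <- select(stats, wickets, everything())

print(names(no_runs))
print(names(w_cols))
print(names(reordered))
```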

rename() method

The rename() function changes column names. The new name goes on the left of the equals sign and the existing name on the right, as shown in the syntax below:

rename(dataframeName, newName=oldName)

Example: 

In this example, we change the column name “runs” to “runs_scored” in the stats data frame.

R




# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, NA, 5))
 
# renaming the column
rename(stats, runs_scored=runs)


Output

  player runs_scored wickets
1      A         100      17
2      B         200      20
3      C         408      NA
4      D          19       5
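
Like the other dplyr functions, rename() returns a new data frame and leaves the original unchanged, so the result has to be assigned to keep it. The sketch below also renames two columns in one call; the names renamed and wickets_taken are assumptions made for this illustration.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# rename two columns at once and store the result
renamed <- rename(stats, runs_scored = runs, wickets_taken = wickets)

# the original data frame still has its old column names
print(names(stats))
print(names(renamed))
```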

mutate() & transmute() methods

Both functions create new variables. mutate() adds the new variables and keeps all the existing columns, whereas transmute() keeps only the variables named in the call and drops the rest. The syntax of both functions is given below:

mutate(dataframeName, newVariable=formula)

transmute(dataframeName, newVariable=formula)

Example:

In this example, we create a new column avg (runs divided by 4) using the mutate() and transmute() functions.

R




# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, 7, 5))
 
# add new column avg
mutate(stats, avg=runs/4)
 
# drop all and create a new column
transmute(stats, avg=runs/4)


Output

  player runs wickets    avg
1      A  100      17  25.00
2      B  200      20  50.00
3      C  408       7 102.00
4      D   19       5   4.75
     avg
1  25.00
2  50.00
3 102.00
4   4.75

Here, mutate() adds the new avg column to the existing data frame without dropping the old columns, whereas transmute() returns only the new avg column and drops all the old ones.
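
A formula can combine several columns, and a column created in a call can be used later in the same call. The sketch below illustrates this, and also shows that transmute() keeps an existing column if it is named explicitly. The column names runs_per_wicket and high_ratio are hypothetical and chosen only for this example.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, 7, 5))

# create two columns; the second one uses the first
with_ratio <- mutate(stats,
                     runs_per_wicket = runs / wickets,
                     high_ratio = runs_per_wicket > 10)

# transmute() keeps player because it is listed by name
ratio_only <- transmute(stats, player, runs_per_wicket = runs / wickets)

print(with_ratio)
print(ratio_only)
```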

summarize() method

The summarize() function collapses a data frame into a single row of summary values computed with aggregate functions such as sum(), mean(), etc. The syntax of the summarize() function is given below:

summarize(dataframeName, aggregate_function(columnName))

Example:

In the code below, we compute the sum and the mean of the runs column using the summarize() function.

R




# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, 7, 5))
 
# summarize method
summarize(stats, sum(runs), mean(runs))


Output

  sum(runs) mean(runs)
1       727     181.75
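
The default column names such as sum(runs) are awkward to refer to later, so the summaries can be given names. The sketch below is a small illustration that also handles a missing value with na.rm = TRUE; the output names (total_runs, avg_runs, avg_wickets, players) and the NA in wickets are assumptions made for this example.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

result <- summarize(stats,
                    total_runs = sum(runs),
                    avg_runs = mean(runs),
                    # ignore the missing value when averaging wickets
                    avg_wickets = mean(wickets, na.rm = TRUE),
                    # n() counts the rows being summarized
                    players = n())

print(result)
```

Without na.rm = TRUE, mean(wickets) would return NA because one value is missing.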
