Skip to content
Related Articles

Related Articles

How to Transform Data in R?

Improve Article
Save Article
  • Last Updated : 29 Sep, 2022
Improve Article
Save Article

In this article, we will learn how to transform data in the R programming language.

The data transformation is mostly handled by the external packages tidyverse and dplyr in R. These packages provide many methods to carry out the data simulations. There are a large number of ways to simulate data transformation in R. These methods are widely available using these packages, which can be downloaded and installed using the following command : 

install.packages("tidyverse")

Method 1: Using Arrange() method

The arrange() method in R is used to create an order for the sequence of the observations given. It takes a single column or a set of columns as the input to the method and creates an order for these. 

The arrange() method in the tidyverse package inputs a list of column names to rearrange them in a specified order. By default, the arrange() method arranges the data in ascending order. It has the following syntax : 

Syntax: arrange(col-name) 

Parameter:

col-name – Name of the column.

The data frame can be supplied with a pipe operator followed by the application of arrange() method to reflect the changes.

R




# Importing tidyvverse
library(tidyverse)
 
# Creating a data frame
data_frame = data.frame(
  col1 = c(2,4,1,7,5,3,5,8),
  col2 = letters[1:8],
  l3 = c(0,1,1,1,0,0,0,0))
 
# Assigning row names
rownames(data_frame) <- c("r1",
"r2","r3","r4","r5","r6","r7","r8")
print("Data Frame")
print(data_frame)
 
# Arranging a single column in ascending order
arr_data_frame <- data_frame %>% arrange(col1)
print("Arranged Data Frame")
print(arr_data_frame)


Output:

   col1 col2 col3
r1    2    a    0
r2    4    b    1
r3    1    c    1
r4    7    d    1
r5    5    e    0
r6    3    f    0
r7    5    g    0
r8    8    h    0
[1] "Arranged Data Frame"
   col1 col2 col3
r3    1    c    1
r1    2    a    0
r6    3    f    0
r2    4    b    1
r5    5    e    0
r7    5    g    0
r4    7    d    1
r8    8    h    0

Explanation : 

The minimum col1 value is 1, and then the largest is 8. All the col1 values in the data frame are arranged in ascending order, and the rows are shuffled accordingly. Similarly, we can arrange the data in descending order using desc() method inside arrange() method and the data frame rows are shuffled accordingly as in the below example.

The column values can also be arranged in descending order by specifying the order explicitly using the following syntax : 

Syntax: arrange(desc(col-name))

R




# Importing tidyverse
library(tidyverse)
 
# Creating a data frame
data_frame = data.frame(
  col1 = c(2,4,1,7,5,3,5,8),
  col2 = letters[1:8],
  col3 = c(0,1,1,1,0,0,0,0))
 
# Assigning row names
rownames(data_frame) <- c("r1",
"r2","r3","r4","r5","r6","r7","r8")
print("Data Frame")
# Printing data frame
print(data_frame)
 
# Arranging column in descending order
arr_data_frame <- data_frame %>%
              arrange(desc(col1))
print("Arranged Data Frame")
print(arr_data_frame)


Output:

 col1 col2 col3
r1    2    a    0
r2    4    b    1
r3    1    c    1
r4    7    d    1
r5    5    e    0
r6    3    f    0
r7    5    g    0
r8    8    h    0
[1] "Arranged Data Frame"
   col1 col2 col3
r8    8    h    0
r4    7    d    1
r5    5    e    0
r7    5    g    0
r2    4    b    1
r6    3    f    0
r1    2    a    0
r3    1    c    1

Method 2: Using select() method

The specific columns of the data frame can also be fetched using the select() method in tidyverse package. The columns are fetched in the order of their specification in the argument list of the select() method call. This method results in a subset of the data frame as the output. The following syntax is followed : 

Syntax: select(list-of-col-names)

Parameter:

list-of-col-names – List of column names separated by comma.

R




# Importing tidyverse
library(tidyverse)
 
# Creating a data frame
data_frame = data.frame(
  col1 = c(2,4,1,7,5,3,5,8),
  col2 = letters[1:8],
  col3 = c(0,1,1,1,0,0,0,0),
  col4 = c(9:16))
 
print("Data Frame")
 
# Printing data_frame
print(data_frame)
 
# Selecting a range of columns in df
arr_data_frame <- data_frame %>%
                select(col2,col4)
print("Selecting col2 and col4 in Data Frame")
print(arr_data_frame)


Output:

 col1 col2 col3 col4
1    2    a    0    9
2    4    b    1   10
3    1    c    1   11
4    7    d    1   12
5    5    e    0   13
6    3    f    0   14
7    5    g    0   15
8    8    h    0   16
> 
> # selecting a range of columns in df 
> arr_data_frame <- data_frame %>% select(col2,col4)
> print("Selecting col2 and col4 in Data Frame")
[1] "Selecting col2 and col4 in Data Frame"
> print(arr_data_frame)
  col2 col4
1    a    9
2    b   10
3    c   11
4    d   12
5    e   13
6    f   14
7    g   15
8    h   16

The col2 and col4 of the data frame are selected and displayed as the output. 

A consecutive range of columns can also be fetched from the data frame by specifying the colon operator. For instance, the following code snippet indicates that a range of columns can be extracted using the command col2:col4, which fetches all the columns in order beginning from col2 of the data frame. 

Since there is a colon operator between col2 and col4, all the columns in the data frame are selected beginning from col2 and ending at col4 in order. Therefore, col2, col3, and col4 are returned as the output data frame. 

Syntax: select(begin-col : end-col)

R




# Importing tidyverse
library(tidyverse)
 
# Creating a data frame
data_frame = data.frame(
  col1 = c(2,4,1,7,5,3,5,8),
  col2 = letters[1:8],
  col3 = c(0,1,1,1,0,0,0,0),
  col4 = c(9:16))
 
print("Data Frame")
print(data_frame)
 
# Selecting a range of columns
# in df
arr_data_frame <- data_frame %>%
  select(col2:col4)
print("Selecting col2 to col4 in Data Frame")
print(arr_data_frame)


Output:

  col1 col2 col3 col4
1    2    a    0    9
2    4    b    1   10
3    1    c    1   11
4    7    d    1   12
5    5    e    0   13
6    3    f    0   14
7    5    g    0   15
8    8    h    0   16
[1] "Selecting col2 to col4 in Data Frame"
  col2 col3 col4
1    a    0    9
2    b    1   10
3    c    1   11
4    d    1   12
5    e    0   13
6    f    0   14
7    g    0   15
8    h    0   16

Method 3: Using filter() method

The filter() method in the tidyverse package is used to apply a range of constraints and conditions to the column values of the data frame. It filters the data and results in the smaller output returned by the column values satisfying the specified condition. The conditions are specified using the logical operators, and values are validated then.  A data frame can be supplied with the pipe operator and then using the filter condition. 

Syntax: filter(cond1, cond2)

Parameter:

cond1, cond2 – Condition to be applied on data.

The following code snippet indicates that all the column values are returned where the col1 value is greater than 4. Only those rows are returned in the output of the data frame.

R




# Importing tidyverse
library(tidyverse)
 
# Creating a data frame
data_frame = data.frame(
  col1 = c(2,4,1,7,5,3,5,8),
  col2 = letters[1:8],
  col3 = c(0,1,1,1,0,0,0,0),
  col4 = c(9:16))
 
print("Data Frame")
 
# Printing data frame
print(data_frame)
 
# Selecting values where
# col1 value is greater than 4
arr_data_frame <- data_frame %>%
    filter(col1>4)
print("Selecting col1 >4 ")
 
# Printing data frame after
# applying filter
print(arr_data_frame)


Output:

 col1 col2 col3 col4
1    2    a    0    9
2    4    b    1   10
3    1    c    1   11
4    7    d    1   12
5    5    e    0   13
6    3    f    0   14
7    5    g    0   15
8    8    h    0   16
[1] "Selecting col1 >4 "
  col1 col2 col3 col4
1    7    d    1   12
2    5    e    0   13
3    5    g    0   15
4    8    h    0   16

Explanation : 

The output data frame contains the rows of the original data frame where the col1 value is greater than 4. The rows are fetched in the order in which they occur in the original data frame.

Multiple values can also be checked using the filter tag. For instance, the range of values to be checked is specified in the c() vector method. The following code snippet illustrates that the col3 value is checked either for the “there” or “this” value. A row satisfying any of these equality constraints is returned in the final output of the data frame.

R




# Importing tidyverse
library(tidyverse)
 
# Creating a data frame
data_frame = data.frame(
  col1 = c(2,4,1,7,5,3,5,8),
  col2 = letters[1:8],
  col3 = c("this","that","there",
"here","there","this","that","here"),
  col4 = c(9:16))
 
print("Data Frame")
 
# Printing data frame
print(data_frame)
 
# Selecting values where
# col1 value is greater than 4
arr_data_frame <- data_frame %>%
  filter(col3 == c("there","this"))
print("Selecting col1>4 ")
 
# Printing data frame after
# applying filter
print(arr_data_frame)


Output:

  col1 col2  col3 col4
1    2    a  this    9
2    4    b  that   10
3    1    c there   11
4    7    d  here   12
5    5    e there   13
6    3    f  this   14
7    5    g  that   15
8    8    h  here   16
[1] "Selecting col3 value is either there or this"
  col1 col2  col3 col4
1    1    c there   11
2    5    e there   13
3    3    f  this   14

Explanation :

The output data frame contains the rows of the original data frame where the col3 value is either “this” or “there”. The rows are fetched in the order in which they occur in the original data frame.

Multiple conditions can also be checked in the filter method and combined using the comma using filter() method. For instance, the below code checks for the col3 value equal to “there” and col1 value equivalent to 5, respectively. The output data frame contains the rows of the original data frame where the col1 value is equivalent to 5 and the col3 value is equivalent to “there.” The rows are fetched in the order in which they occur in the original data frame.

R




# Importing tidyverse
library(tidyverse)
 
# Creating a data frame
data_frame = data.frame(
  col1 = c(2,4,1,7,5,3,5,8),
  col2 = letters[1:8],
  col3 = c("this","that","there","here",
           "there","this","that","here"),
  col4 = c(9:16))
 
print("Data Frame")
print(data_frame)
 
# Selecting values where
# col3 value is there and col1 is 5
arr_data_frame <- data_frame %>%
    filter(col3=="there",col1==5)
print("Selecting col3 value
    is there and col1 is 5")
print(arr_data_frame)


Output:

  col1 col2  col3 col4
1    2    a  this    9
2    4    b  that   10
3    1    c there   11
4    7    d  here   12
5    5    e there   13
6    3    f  this   14
7    5    g  that   15
8    8    h  here   16
[1] "Selecting col3 value is there and col1 is 5"
  col1 col2  col3 col4
1    5    e there   13

Method 4: Using spread() method

The spread method in R is used to spread any key-value pair in multiple columns in the data frame. It is used to increase the readability of the data specified in the data frame. The data is rearranged according to the list of columns in the spread() method. All the data in col2 is repeated until the values in col3 are exhausted. The entire data frame is returned as the output. It has the following syntax : 

Syntax: spread(col-name)

Parameter:

col-name – Name of one or more columns according to which data is to be structured.

The following code arranges the data such that the values in col2 are assigned as column headings and their corresponding values as cell values of the data frame :

R




# Importing tidyr
library(tidyr)
 
# Creating a data frame
data_frame = data.frame(
  col1 = c("A","A","A","A","A","A",
           "B","B","B","B","B","B"),
  col2 = c("Eng","Phy","Chem","MAQ","Bio","SST",
           "Eng","Phy","Chem","MAQ","Bio","SST"),
  col3 = c(34,56,46,23,72,67,89,43,88,45,78,99)
  )
 
print("Data Frame")
print(data_frame)
 
# Selecting values by col2 and col3
arr_data_frame <- data_frame %>%
    spread(col2,col3)
print("Spread using col2 and col3")
print(arr_data_frame)


Output:

   col1 col2 col3
1     A  Eng   34
2     A  Phy   56
3     A Chem   46
4     A  MAQ   23
5     A  Bio   72
6     A  SST   67
7     B  Eng   89
8     B  Phy   43
9     B Chem   88
10    B  MAQ   45
11    B  Bio   78
12    B  SST   99
[1] "Spread using col2 and col3"
  col1 Bio Chem Eng MAQ Phy SST
1    A  72   46  34  23  56  67
2    B  78   88  89  45  43  99

Explanation : 

The data in the data frame is spread using the col2 and col3 values. The col2 unique values are assigned as column headings, and their corresponding cell values are assigned in the data frame values. 

The following code arranges the data such that the values in col2 are assigned as row headings and their corresponding values as cell values of the data frame :

R




# Importing tidyr
library(tidyr)
 
# Creating a data frame
data_frame = data.frame(
  col1 = c("A","A","A","A","A","A",
           "B","B","B","B","B","B"),
  col2 = c("Eng","Phy","Chem","MAQ","Bio","SST",
           "Eng","Phy","Chem","MAQ","Bio","SST"),
  col3 = c(34,56,46,23,72,67,89,43,88,45,78,99)
)
 
print("Data Frame")
print(data_frame)
 
# Selecting values by col1 and col3
arr_data_frame <- data_frame %>%
    spread(col1,col3)
print("Spread using col1 and col3")
print(arr_data_frame)


Output:

   col1 col2 col3
1     A  Eng   34
2     A  Phy   56
3     A Chem   46
4     A  MAQ   23
5     A  Bio   72
6     A  SST   67
7     B  Eng   89
8     B  Phy   43
9     B Chem   88
10    B  MAQ   45
11    B  Bio   78
12    B  SST   99
[1] "Spread using col1 and col3"
  col2  A  B
1  Bio 72 78
2 Chem 46 88
3  Eng 34 89
4  MAQ 23 45
5  Phy 56 43
6  SST 67 99

Method 5: Using mutate() method

The mutate() method in R is used to create and modify new variables in the specified data frame. A new column name can be assigned to the data frame and evaluated to an expression where constants or column values can be used. The output data frame has the new columns created. The method has the following syntax :

Syntax: mutate (new-col-name = expr)

Parameters:

  • new-col-name – Name of column to be created.
  • expr –  Expression which is applied on new column.

Multiple columns can also be added to the data frame and separated using the comma operator. The following code snippet illustrates the addition of two new columns, col5, which is the summation of col1 and col4 values, and col6, which is constant 1 added to the col3 values. For example, in first row col1 = 2 and col4 = 9 , therefore col5 = 2 + 9 = 11. And col6 is the addition of col3 value and 1. 

R




# Importing tidyverse
library(tidyverse)
 
# Creating a data frame
data_frame = data.frame(
  col1 = c(2,4,1,7,5,3,5,8),
  col2 = letters[1:8],
  col3 = c(0,1,1,1,0,0,0,0),
  col4 = c(9:16))
 
print("Data Frame")
print(data_frame)
 
# Added new columns
data_frame_mutate <- data_frame %>%
  mutate(col5 = col1 + col4 ,
         col6 = col3+1)
print("Mutated Data Frame")
print(data_frame_mutate)


Output:

 col1 col2 col3 col4
1    2    a    0    9
2    4    b    1   10
3    1    c    1   11
4    7    d    1   12
5    5    e    0   13
6    3    f    0   14
7    5    g    0   15
8    8    h    0   16
[1] "Mutated Data Frame"
  col1 col2 col3 col4 col5 col6
1    2    a    0    9   11    1
2    4    b    1   10   14    2
3    1    c    1   11   12    2
4    7    d    1   12   19    2
5    5    e    0   13   18    1
6    3    f    0   14   17    1
7    5    g    0   15   20    1
8    8    h    0   16   24    1

Method 6: Using group_by() and summarise() method

The group_by() and summarise() methods in R are used collectively to group by variables of the data frame and reduce multiple values down to a single value. It is used to make the data more readable. The column name can be specified in R’s group_by() method. The data can be arranged in groups and then further summarised using the base aggregate methods in this package. 

Syntax: group_by(col-name) 

Syntax: group_by(col,..) %>% summarise(action)

The data in the data frame are grouped according to the col3 value. The count column indicates the number of records in each group; for instance, there are five rows with col3 = 0. The mean is then calculated for all the elements in a. particular group.

R




# Importing dplyr
library(dplyr)
 
# Creating a data frame
data_frame = data.frame(
  col1 = c(2,4,1,7,5,3,5,8),
  col2 = letters[1:8],
  col3 = c(0,1,1,1,0,0,0,0),
  col4 = c(9:16))
 
print("Data Frame")
print(data_frame)
 
# Mutate data using group_by()
# and summarise()
data_frame_mutate <- data_frame %>%
    group_by(col3) %>%
  summarise(
    count = n(),
    mean_col1 = mean(col1)
  )
print("Mutated Data Frame")
print(data_frame_mutate)


Output:

 col1 col2 col3 col4
1    2    a    0    9
2    4    b    1   10
3    1    c    1   11
4    7    d    1   12
5    5    e    0   13
6    3    f    0   14
7    5    g    0   15
8    8    h    0   16
[1] "Mutated Data Frame"
# A tibble: 2 x 3
   col3 count mean_col1
  <dbl> <int>     <dbl>
1     0     5       4.6
2     1     3       4  

Method 7: Using the gather() method

Sometimes, many data columns indicate the values that should be stored in a single column of the data frame and are unnecessarily bifurcated. This can be done using the gather() method. It collects the key-value pairs and rearranges them in new columns. The column values are specified as arguments of the gather() method. For instance, the following code snippet illustrates the combination of col2 to col4 of the data frame under argument 2 of the gather() method. The particular column from which the value was chosen is assigned the tag under the column name “Subject” given by argument 1 of the called method.

Syntax: gather(data, key, value)

The col2, col3, and col4 values in the data frame are clubbed under the column “Subject” in the output data frame. The marks of each student in each subject are then assigned in the col3 value. 

R




# Importing dplyr
library(dplyr)
 
# Creating a data frame
data_frame = data.frame(col1 =
  c("Jack","Jill","Yash","Mallika",
    "Muskan","Keshav","Meenu","Sanjay"),
     Maths = c(26,47,14,73,65,83,95,48),
     Physics = c(24,53,45,88,68,35,78,24),
     Chemistry = c(67,23,79,67,33,66,25,78)
     )
 
print("Data Frame")
print(data_frame)
 
data_frame_mutate <- data_frame %>%
    gather("Subject","Marks",2:4)
print("Mutated Data Frame")
 
# Printing after rearrange data
print(data_frame_mutate)


Output:

     col1 Maths Physics Chemistry
1    Jack    26      24        67
2    Jill    47      53        23
3    Yash    14      45        79
4 Mallika    73      88        67
5  Muskan    65      68        33
6  Keshav    83      35        66
7   Meenu    95      78        25
8  Sanjay    48      24        78
[1] "Mutated Data Frame"
      col1   Subject Marks
1     Jack     Maths    26
2     Jill     Maths    47
3     Yash     Maths    14
4  Mallika     Maths    73
5   Muskan     Maths    65
6   Keshav     Maths    83
7    Meenu     Maths    95
8   Sanjay     Maths    48
9     Jack   Physics    24
10    Jill   Physics    53
11    Yash   Physics    45
12 Mallika   Physics    88
13  Muskan   Physics    68
14  Keshav   Physics    35
15   Meenu   Physics    78
16  Sanjay   Physics    24
17    Jack Chemistry    67
18    Jill Chemistry    23
19    Yash Chemistry    79
20 Mallika Chemistry    67
21  Muskan Chemistry    33
22  Keshav Chemistry    66
23   Meenu Chemistry    25
24  Sanjay Chemistry    78

My Personal Notes arrow_drop_up
Related Articles

Start Your Coding Journey Now!