Data Munging in R Programming
Data Munging is the general technique of transforming data from unusable or erroneous form to useful form. Without a few degrees of data munging (irrespective of whether a specialized user or automated system performs it), the data can’t be ready for downstream consumption. Basically the procedure of cleansing the data manually is known as data munging. In R Programming the following ways are oriented with data munging process:
- apply() Family
- aggregate()
- dplyr package
- plyr package
Using apply() Family for Data Munging
In apply() collection of R the most basic function is the apply() function. Apart from that, there exists lapply(), sapply() and tapply(). The entire collection of apply() can be considered a substitute for a loop. It is the most restrictive type of function. It should be performed on a matrix which contains all homogeneous element. If the apply() function is performed using a data frame or any other kind of object, the function will first change it to a matrix and then perform its operation. It is basically used to avoid the explicit use of loop structure or construct.
Syntax:
apply(X, margin, function)
Parameters:
x: an array or matrix
margin: a value between 1 and 2 in order to decide where to apply the function [ 1- row; 2- column]
function: the function to apply
Example:
R
# Using apply() m <- matrix (C <- (1:10), nrow = 5, ncol = 6) m a_m <- apply (m, 2, sum) a_m |
Output:
[,1] [,2] [,3] [,4] [,5] [,6] [1,] 1 6 1 6 1 6 [2,] 2 7 2 7 2 7 [3,] 3 8 3 8 3 8 [4,] 4 9 4 9 4 9 [5,] 5 10 5 10 5 10 [1] 15 40 15 40 15 40
In the above example, we are calculating the sum of the elements column-wise. Hence for a large set of data, we can easily produce the desired output.
The lapply() function is used to perform operations on a list and it returns a resultant list of the same size as the input list. The ‘l’ in lapply() refers to lists. The lapply() function does not need the margin parameter.
Syntax:
lapply(X, func)
Parameters:
X: the list or a vector or an object
func: the function to apply
Example:
R
# Using lapply() movies <- c ( "SPIDERMAN" , "BATMAN" , "AVENGERS" , "FROZEN" ) movies movies_lower <- lapply (movies, tolower) str (movies_lower) |
Output:
[1] "SPIDERMAN" "BATMAN" "AVENGERS" "FROZEN" List of 4 $ : chr "spiderman" $ : chr "batman" $ : chr "avengers" $ : chr "frozen"
The sapply() function takes any vector or object or list and performs the exact operation as the lapply() function. Both of them have the same syntax.
The tapply() function is used to calculate or measure mean, median, maximum, and so on, or to perform a function on each and every factor of the variable. It is efficiently used to create a subset of any vector and then to apply or perform any function on them.
Syntax:
tapply(X, index, func = NULL)
Parameters:
X: an object or vector
index: a list of factor
func: the function to apply
Example:
R
# Using tapply() data (iris) tapply (iris$Sepal.Width, iris$Species, median) |
Output:
setosa versicolor virginica 3.4 2.8 3.0
Using aggregate() in Data Munging
In R, aggregate() function is used to combine or aggregate the input data frame by applying a function on each column of a sub-data frame. In order to perform aggregation or to apply the aggregate() function we must include the following:
- The input data that we wish to aggregate
- The variable within the data that will be used to group by
- The function or calculation to apply
The aggregate() function will always return a data frame which contains all unique values from the input data frame after applying the specific function. We can only apply a single function inside an aggregate function. In order to include multiple functions inside the aggregate() function, we need to use the plyr package.
Syntax:
aggregate(formula, data, function)
Parameters:
formula: the variable(s) of the input data frame we want to apply functions on.
data: the data that we want to use for group by operation.
function: the function or calculation to be applied.
Example:
R
# R program to illustrate # aggregate() function assets <- data.frame ( asset.class = c ( "equity" , "equity" , "equity" , "option" , "option" , "option" , "bond" , "bond" ), rating = c ( "AAA" , "A" , "A" , "AAA" , "BB" , "BB" , "AAA" , "A" ), counterparty.a = c ( runif (3), rnorm (5)), counterparty.b = c ( runif (3), rnorm (5)), counterparty.c = c ( runif (3), rnorm (5))) assets exposures <- aggregate ( x = assets[ c ( "counterparty.a" , "counterparty.b" , "counterparty.c" )], by = assets[ c ( "asset.class" , "rating" )], FUN = function (market.values){ sum ( pmax (market.values, 0)) }) exposures |
Output:
asset.class rating counterparty.a counterparty.b counterparty.c 1 equity AAA 0.08250275 0.5474595 0.9966172 2 equity A 0.33931258 0.6442402 0.2348197 3 equity A 0.68078755 0.5962635 0.6126720 4 option AAA -0.47624689 -0.4622881 -1.2362731 5 option BB -0.78860284 0.3219559 -1.2847157 6 option BB -0.59461727 -0.2840014 -0.5739735 7 bond AAA 1.65090747 1.0918564 0.6179858 8 bond A -0.05402813 0.1602164 1.1098481 asset.class rating counterparty.a counterparty.b counterparty.c 1 bond A 0.00000000 0.1602164 1.1098481 2 equity A 1.02010013 1.2405038 0.8474916 3 bond AAA 1.65090747 1.0918564 0.6179858 4 equity AAA 0.08250275 0.5474595 0.9966172 5 option AAA 0.00000000 0.0000000 0.0000000 6 option BB 0.00000000 0.3219559 0.0000000
We can see that in the above example the values of assets data frame have been aggregated on “asset.class” and “rating” columns.
Using the plyr Package for Data Munging
The plyr package is used for splitting, applying, and combining data. The plyr is a set of tools that can be used for splitting up huge or big data for creating a homogeneous piece, then applying a function on each and every piece and finally combine all the resultant values. We can already perform these actions in R, but on using plyr we can do it easily since:
- The names, arguments, and outputs are totally consistent
- Convenient parallelism
- Both input and output involves data frames, matrices or lists
- To track the long execution or running programs it provides a progress bar
- Built-in informative error messages and error recovery
- Labels which are maintained through all transformations.
The two functions that we are going to discuss in this section are ddply() and llply(). For each subset of a given data frame, the ddply() applies a function and then combine the result.
Syntax:
ddply(.data, .variables, .fun = NULL, …, .progress = “none”, .inform = FALSE,
.drop = TRUE, .parallel = FALSE, .paropts = NULL)
Parameters:
data: the data frame that is to be processed
variable: the variable based on which it will split the data frame
fun: the function to be applied
…: other arguments that are passed to fun
progress: name of the progress bar
inform: whether to produce any informative error message
drop: combination of variables that is not in the input data frame should be preserved or dropped.
parallel: whether to apply function parallel
paropts: list of extra or additional options passed
Example:
R
# Using ddply() library (plyr) dfx <- data.frame ( group = c ( rep ( 'A' , 8), rep ( 'B' , 15), rep ( 'C' , 6)), sex = sample ( c ( "M" , "F" ), size = 29, replace = TRUE ), age = runif (n = 29, min = 18, max = 54) ) ddply (dfx, . (group, sex), summarize, mean = round ( mean (age), 2), sd = round ( sd (age), 2)) |
Output:
group sex mean sd 1 A F 41.00 9.19 2 A M 35.76 12.14 3 B F 34.75 11.70 4 B M 40.01 10.10 5 C F 25.13 10.37 6 C M 43.26 7.63
Now we will see how to use llply() to work on data munging. The llply() function is used on each element of lists, where we apply a function on them, and the combined resultant output is also a list.
Syntax:
llply(.data, .fun = NULL,
…, .progress = “none”, .inform = FALSE,
.parallel = FALSE, .paropts = NULL)
Example:
R
# Using llply() library (plyr) x <- list (a = 1:10, beta = exp (-3:3), logic = c ( TRUE , FALSE , FALSE , TRUE )) llply (x, mean) llply (x, quantile, probs = 1:3 / 4) |
Output:
$a [1] 5.5 $beta [1] 4.535125 $logic [1] 0.5 $a 25% 50% 75% 3.25 5.50 7.75 $beta 25% 50% 75% 0.2516074 1.0000000 5.0536690 $logic 25% 50% 75% 0.0 0.5 1.0
Using dplyr package for Data Munging
The dplyr package can be considered as a grammar of data manipulation which is providing us a consistent set of verbs that helps us to solve some most common challenges of data manipulation:
- arrange() is used to change the order of the rows.
- filter() is used to pick cases depending on their value or based on the value.
- mutate() is used to add new variables that are functions of already existing variables.
- select() is used to pick or select variables based on their names.
- summarize() is used to reduce multiple values to a single summary.
There are many more functions under dplyr. The dplyr uses a very efficient backend which leads to less waiting time for the computation. It is more efficient than the plyr package.
Syntax:
arrange(.data, …, .by_group = FALSE)
filter(.data, …)
mutate(.data, …)
select(.data, …)
summarize(X, by, fun, …, stat.name = deparse(substitute(X)),
type = c(“variable”,”matrix”), subset = TRUE, keepcolnames = FALSE)
Example:
R
# Using dplyr package # Import the library library (dplyr) # Using arrange() starwars %>% arrange ( desc (mass)) # Using filter() starwars %>% filter (species == "Droid" ) # Using mutate() starwars %>% mutate (name, bmi = mass / ((height / 100) ^ 2)) %>% select (name:mass, bmi) # Using select() starwars %>% select (name, ends_with ( "color" )) # Using summarise() starwars %>% group_by (species) %>% summarise (n = n (), mass = mean (mass, na.rm = TRUE )) %>% filter (n > 1) |
Output:
> starwars %>% arrange(desc(mass)) # A tibble: 87 x 13 name height mass hair_color skin_color eye_color birth_year gender homeworld species films vehicles starships <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <lis> <list> <list> 1 Jabba D~ 175 1358 NA green-tan, ~ orange 600 hermap~ Nal Hutta Hutt <chr~ <chr [0~ <chr [0]> 2 Grievous 216 159 none brown, white green, ye~ NA male Kalee Kaleesh <chr~ <chr [1~ <chr [1]> 3 IG-88 200 140 none metal red 15 none NA Droid <chr~ <chr [0~ <chr [0]> 4 Darth V~ 202 136 none white yellow 41.9 male Tatooine Human <chr~ <chr [0~ <chr [1]> 5 Tarfful 234 136 brown brown blue NA male Kashyyyk Wookiee <chr~ <chr [0~ <chr [0]> 6 Owen La~ 178 120 brown, grey light blue 52 male Tatooine Human <chr~ <chr [0~ <chr [0]> 7 Bossk 190 113 none green red 53 male Trandosha Trando~ <chr~ <chr [0~ <chr [0]> 8 Chewbac~ 228 112 brown unknown blue 200 male Kashyyyk Wookiee <chr~ <chr [1~ <chr [2]> 9 Jek Ton~ 180 110 brown fair blue NA male Bestine ~ Human <chr~ <chr [0~ <chr [1]> 10 Dexter ~ 198 102 none brown yellow NA male Ojom Besali~ <chr~ <chr [0~ <chr [0]> # ... with 77 more rows > starwars %>% filter(species == "Droid") # A tibble: 5 x 13 name height mass hair_color skin_color eye_color birth_year gender homeworld species films vehicles starships <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <list> <list> <list> 1 C-3PO 167 75 NA gold yellow 112 NA Tatooine Droid <chr [6]> <chr [0]> <chr [0]> 2 R2-D2 96 32 NA white, blue red 33 NA Naboo Droid <chr [7]> <chr [0]> <chr [0]> 3 R5-D4 97 32 NA white, red red NA NA Tatooine Droid <chr [1]> <chr [0]> <chr [0]> 4 IG-88 200 140 none metal red 15 none NA Droid <chr [1]> <chr [0]> <chr [0]> 5 BB8 NA NA none none black NA none NA Droid <chr [1]> <chr [0]> <chr [0]> > starwars %>% mutate(name, bmi = mass / ((height / 100) ^ 2)) %>% select(name:mass, bmi) # A tibble: 87 x 4 name height mass bmi <chr> <int> <dbl> <dbl> 1 Luke Skywalker 172 77 26.0 2 C-3PO 167 75 26.9 3 R2-D2 96 32 34.7 4 Darth Vader 202 136 33.3 5 Leia Organa 150 49 21.8 6 Owen Lars 178 120 37.9 7 Beru Whitesun lars 165 75 27.5 8 R5-D4 97 32 34.0 9 Biggs Darklighter 183 84 25.1 10 Obi-Wan Kenobi 182 77 23.2 # ... with 77 more rows > starwars %>% select(name, ends_with("color")) # A tibble: 87 x 4 name hair_color skin_color eye_color <chr> <chr> <chr> <chr> 1 Luke Skywalker blond fair blue 2 C-3PO NA gold yellow 3 R2-D2 NA white, blue red 4 Darth Vader none white yellow 5 Leia Organa brown light brown 6 Owen Lars brown, grey light blue 7 Beru Whitesun lars brown light blue 8 R5-D4 NA white, red red 9 Biggs Darklighter black light brown 10 Obi-Wan Kenobi auburn, white fair blue-gray # ... with 77 more rows > starwars %>% group_by(species) %>% + summarise(n = n(),mass = mean(mass, na.rm = TRUE)) %>% + filter(n > 1) # A tibble: 9 x 3 species n mass <chr> <int> <dbl> 1 Droid 5 69.8 2 Gungan 3 74 3 Human 35 82.8 4 Kaminoan 2 88 5 Mirialan 2 53.1 6 Twi'lek 2 55 7 Wookiee 2 124 8 Zabrak 2 80 9 NA 5 48
Please Login to comment...