Split the Dataset into the Training & Test Set in R

Last Updated : 25 Jul, 2022

In this article, we will see how to split a dataset into training and test sets using the R programming language.

Method 1: Using base R 

The sample() function in base R draws a random sample of a specified size from its input. The input is usually a vector (a matrix is treated as a vector of its elements). The elements of the sample can be chosen either with or without replacement.

The function has the following signature in R : 

Syntax: sample(x, size, replace = FALSE, prob = NULL)

Arguments : 

  • x – A vector of elements from which to choose the sample.
  • size – The number of elements to choose. 
  • replace – Whether sampling is done with replacement (TRUE) or without replacement (FALSE, the default).
  • prob – An optional vector of probability weights, one per element of x, giving the chance of each element being drawn. 
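As a rough illustration of these arguments, the sketch below calls sample() on a small vector. The seed value and the variable names are arbitrary choices made for this example.

```r
# fix the seed so that the draws are reproducible (42 is arbitrary)
set.seed(42)

values <- 1:10

# four distinct elements, drawn without replacement (the default)
without_repl <- sample(values, size = 4)
print(without_repl)

# six draws with replacement, so elements may repeat
with_repl <- sample(values, size = 6, replace = TRUE)
print(with_repl)

# weighted draws: TRUE is drawn with probability 0.7, FALSE with 0.3
flags <- sample(c(TRUE, FALSE), size = 10, replace = TRUE,
                prob = c(0.7, 0.3))
print(flags)
```

The last call is the pattern used for splitting: one TRUE/FALSE flag is drawn per row, and the flags then select the rows.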

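The following code snippet first creates the dataset as a matrix, then draws one TRUE/FALSE flag per row and uses the flags to separate the rows into training and testing sets. Because each row is assigned independently, the resulting split is only approximately 70:30.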

R




# fix the random seed so the split is reproducible
set.seed(42)

# creating the data set
mat <- matrix(
  # values from 1 to 21
  1:21,
  # number of rows
  nrow = 7,
  # number of columns
  ncol = 3,
  byrow = TRUE
)
print("Dataset")
print(mat)

# draw one TRUE/FALSE flag per row; TRUE (training) has
# probability 0.7 and FALSE (testing) has probability 0.3,
# so the split is only approximately 70:30
split_mask <- sample(c(TRUE, FALSE), nrow(mat),
                     replace = TRUE, prob = c(0.7, 0.3))

# creating training dataset (drop = FALSE keeps a matrix
# even when only one row is selected)
train_dataset <- mat[split_mask, , drop = FALSE]

# creating testing dataset
test_dataset <- mat[!split_mask, , drop = FALSE]

print("Training Dataset")
print(train_dataset)
print("Testing Dataset")
print(test_dataset)
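
Drawing one flag per row gives a split that is only approximately 70:30. When an exact number of training rows is wanted, one common alternative (not part of the example above) is to sample row indices without replacement. A minimal sketch, where the seed and the names n_train and train_idx are illustrative:

```r
set.seed(42)

mat <- matrix(1:21, nrow = 7, ncol = 3, byrow = TRUE)

# number of training rows: 70% of the 7 rows, rounded down
n_train <- floor(0.7 * nrow(mat))

# pick n_train distinct row indices
train_idx <- sample(seq_len(nrow(mat)), size = n_train)

# negative indexing selects every row that was not picked
train_dataset <- mat[train_idx, , drop = FALSE]
test_dataset  <- mat[-train_idx, , drop = FALSE]

print(dim(train_dataset))
print(dim(test_dataset))
```

With this approach the two sets are disjoint by construction and together contain every row exactly once.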



Method 2: Using dplyr package in R

The dplyr package in R is used for data manipulation. It can be installed using the following command, and is then loaded into the R session with library("dplyr") : 

install.packages("dplyr")

A data frame is first created using the data.frame() function. The sample_frac() function of the dplyr package is then applied using the pipe operator. It selects a random sample containing the specified fraction of the rows of the input dataset, and this sample serves as the training dataset. Since dplyr 1.0.0, sample_frac() has been superseded by slice_sample(), but it still works. It has the following syntax : 

Syntax: sample_frac(dataset, perc)

Arguments : 

  • dataset – The input dataset
  • perc – The fraction of rows (between 0 and 1) to include in the training dataset; in dplyr this argument is named size 
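
As a small illustration of how the fraction translates into a row count, the sketch below draws 70% of a ten-row data frame. The data frame df and its columns are made up for this example.

```r
library(dplyr)

set.seed(42)

# a hypothetical ten-row data frame
df <- data.frame(id = 1:10, value = letters[1:10])

# draw 70% of the rows at random, without replacement
train <- sample_frac(df, 0.7)

print(nrow(train))
print(train)
```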

To create the testing dataset, the anti_join() function of this package can be used. It returns the rows of the first dataset that have no match in the second dataset, so the two resulting datasets are disjoint. The column given in by should identify each row uniquely; otherwise, testing rows that share a key value with a training row are dropped as well. The function has the following syntax : 

Syntax: anti_join(dataset, dataframe, by = col_name)

Arguments : 

  • dataset – The input dataset
  • dataframe – The data frame whose matching rows are excluded from the result 
  • by – The name of the column (or columns) used to match rows between the two datasets
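
To make the behaviour of anti_join() concrete, the following sketch uses a hand-picked training subset instead of a random one, so the result is easy to check by eye. The data frames and the id column are hypothetical.

```r
library(dplyr)

full <- data.frame(id = 1:5, label = c("a", "b", "c", "d", "e"))

# pretend rows 1, 3 and 5 were selected for training
train <- full[c(1, 3, 5), ]

# rows of `full` whose id does not occur in `train`
test <- anti_join(full, train, by = "id")
print(test)
```

Here test holds the rows with id 2 and 4, so the training and testing rows together account for every row of the original data frame.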

R




# loading the required library
library("dplyr")

# fix the random seed so the split is reproducible
set.seed(42)

# creating a data frame
data_frame <- data.frame(col1 = 1:15,
                         col2 = letters[1:15],
                         col3 = c(0, 1, 1, 1, 0, 0, 0, 0,
                                  0, 1, 1, 0, 1, 1, 0))
print("Data Frame")
print(data_frame)

# randomly select 70% of the rows for training
print("Training Dataset")
training_dataset <- data_frame %>% dplyr::sample_frac(0.7)
print(training_dataset)

# keep the rows whose col1 value is absent from the training set
print("Testing Dataset")
testing_dataset <- dplyr::anti_join(data_frame,
                                    training_dataset, by = "col1")
print(testing_dataset)



Method 3: Using caTools package in R

The sample.split() function in the caTools package can be used to divide the input dataset into training and testing components. It takes a vector of data labels and splits it in the fixed ratio given as the second argument, while trying to preserve the relative proportions of the different labels in both parts.

Syntax: sample.split(vec, SplitRatio = x)

Arguments : 

  • vec – The vector of data labels
  • SplitRatio – The splitting ratio; for a value between 0 and 1, the fraction of entries to mark as TRUE 
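
sample.split() takes the labels into account when it splits. The sketch below illustrates this on a made-up label vector with two equally sized classes; the names labels and mask are illustrative. Tabulating the result against the labels shows how many entries of each class were marked TRUE.

```r
library(caTools)

set.seed(42)

# hypothetical labels: ten 0s followed by ten 1s
labels <- rep(c(0, 1), each = 10)

# mark about 60% of the entries as TRUE
mask <- sample.split(labels, SplitRatio = 0.6)

# cross-tabulate the labels against the split
print(table(labels, mask))
```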

This function returns a logical vector with one entry per element of the label vector. The subsets of the main input dataset can then be extracted using this vector and the subset() function from base R. The training dataset can be accessed using the following syntax : 

Syntax: subset(data-frame, sample == TRUE/FALSE)

Arguments : 

  • data-frame – The data set to create the sample from 
  • sample – The logical vector returned by sample.split(); the rows for which the condition holds are kept. 
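
As a minimal, self-contained illustration of this pattern, the sketch below applies subset() with a hand-written logical vector in place of the one produced by sample.split(). The data frame df and the vector keep are hypothetical.

```r
df <- data.frame(id = 1:5, label = c("a", "b", "c", "d", "e"))

# one logical flag per row, written by hand for illustration
keep <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# rows where the flag is TRUE
train <- subset(df, keep == TRUE)

# rows where the flag is FALSE
test <- subset(df, keep == FALSE)

print(train)
print(test)
```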

The training dataset is created by calling subset() with the condition set to TRUE, and the testing dataset by calling it with the condition set to FALSE.

R




# loading the required library
library("caTools")

# fix the random seed so the split is reproducible
set.seed(42)

# creating a data frame
data_frame <- data.frame(col1 = 1:15,
                         col2 = letters[1:15],
                         col3 = c(0, 1, 1, 1, 0, 0, 0,
                                  0, 0, 1, 1, 0, 1, 1, 0))
print("Data Frame")
print(data_frame)

# creating a logical vector that divides the rows in the ratio 60:40
split_mask <- sample.split(data_frame$col2, SplitRatio = 0.6)
print("Training Dataset")

# keep the rows where split_mask is TRUE
training_dataset <- subset(data_frame, split_mask == TRUE)
print(training_dataset)
print("Testing Dataset")

# keep the rows where split_mask is FALSE
testing_dataset <- subset(data_frame, split_mask == FALSE)
print(testing_dataset)

