© 2018 - Analytics Link

Subsetting your datasets in R - multiple solutions!

July 28, 2017

Often you'll want to access certain elements from your dataset, sometimes for performance and sometimes for keeping the data manageable!

 

Here are some ways you can do that using the features built into base R, with no extra packages needed.

 

                                                                                                                                    

Use case 1:  I want to select or keep only certain columns from my dataset:

 

In this scenario we have a dataset with 10 columns, named c1 through to c10 and we want to keep c1, c2, c3, and c7

 

# method 1 - using column names

mydata_new <- mydata_old[c("c1","c2","c3","c7")]

 

# method 2 - using column numbers

mydata_new <- mydata_old[c(1,2,3,7)]

 

# method 2a - using column numbers (including a sequence)

mydata_new <- mydata_old[c(1:3,7)]
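
To try these out, here is a minimal sketch that builds a small stand-in dataset and checks that the three approaches give the same result. Only the column names c1 to c10 come from the scenario above; the 5 rows of random values and the object names by_name, by_number and by_sequence are made up for illustration.

```r
# Hypothetical 10-column data frame named c1..c10 (values are arbitrary)
set.seed(1)
mydata_old <- as.data.frame(matrix(rnorm(50), nrow = 5, ncol = 10))
names(mydata_old) <- paste0("c", 1:10)

# The three approaches select the same four columns
by_name     <- mydata_old[c("c1", "c2", "c3", "c7")]
by_number   <- mydata_old[c(1, 2, 3, 7)]
by_sequence <- mydata_old[c(1:3, 7)]

names(by_name)
identical(by_name, by_number)
identical(by_name, by_sequence)
```

Selecting by name is usually the safer choice, since it keeps working if the column order of the dataset changes later on.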

 

 

Use case 2:  I want to remove or drop certain columns from my dataset:

 

In this scenario we have a dataset with 10 columns, named c1 through to c10 and we want to drop c4 and c9

 

# method 1 - using column names

mydata_new <- mydata_old[!(names(mydata_old) %in% c("c4","c9"))]

 

# method 2 - using column numbers

mydata_new <- mydata_old[-c(4,9)]

# or, equivalently

mydata_new <- mydata_old[c(-4,-9)]
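
When dropping by name, a pattern you may come across is negating the result of which(), as in -which(names(...) %in% ...). It works when at least one name matches, but it returns zero columns when nothing matches. The sketch below, on a made-up data frame, compares it with a logical mask and with setdiff(); the object names drop_cols, kept_mask and so on are hypothetical.

```r
# Hypothetical 10-column data frame named c1..c10 (values are arbitrary)
mydata_old <- as.data.frame(matrix(1:30, nrow = 3, ncol = 10))
names(mydata_old) <- paste0("c", 1:10)

# Dropping by name with a logical mask: keep every column NOT in the drop list
drop_cols <- c("c4", "c9")
kept_mask <- mydata_old[!(names(mydata_old) %in% drop_cols)]

# Equivalent: work out the names to keep with setdiff()
kept_setdiff <- mydata_old[setdiff(names(mydata_old), drop_cols)]

# Pitfall: if no names match, which() returns integer(0), and the
# negated index then selects zero columns rather than all of them
no_match   <- c("not_a_column")
with_which <- mydata_old[, -which(names(mydata_old) %in% no_match)]
with_mask  <- mydata_old[!(names(mydata_old) %in% no_match)]

ncol(kept_mask)   # 8
ncol(with_which)  # 0
ncol(with_mask)   # 10
```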

 

 

Use case 3:  I want to select or keep only certain observations/rows from my dataset:

 

# selecting the first 20 observations/rows

mydata_new <- mydata_old[1:20,]

 

# selecting observations/rows that meet certain criteria using 'which'

mydata_new <- mydata_old[which(mydata_old$gender == 'Male' & mydata_old$age > 30), ]
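
Wrapping the condition in which() matters when the data contains missing values. A condition that evaluates to NA produces an all-NA row under plain logical indexing, whereas which() returns only the positions that are TRUE. A minimal sketch, using a made-up four-row dataset:

```r
# Hypothetical data with one missing age (values are made up for illustration)
mydata_old <- data.frame(
  gender = c("Male", "Male", "Female", "Male"),
  age    = c(45, NA, 50, 28)
)

# The condition evaluates to TRUE, NA, FALSE, FALSE
keep <- mydata_old$gender == "Male" & mydata_old$age > 30

# Plain logical indexing returns an extra all-NA row for the NA condition
without_which <- mydata_old[keep, ]

# which() keeps only the positions that are TRUE, so the NA is dropped
with_which <- mydata_old[which(keep), ]

nrow(without_which)  # 2
nrow(with_which)     # 1
```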

 

 

Use case 4:  I want to select or keep only certain observations/rows from my dataset AND certain columns:

 

This can be done by combining the approaches above, but the 'subset' function does it in a single step.

 

# selecting observations/rows for males aged 30+, and only selecting certain columns in the process

mydata_new <- subset(mydata_old, gender == "Male" & age >= 30, select=c(gender, age, height, weight, ethnicity))

 

# similar to above, but using a range of ages (males between 30 and 40)

mydata_new <- subset(mydata_old, gender == "Male" & age >= 30 & age <= 40, select=c(gender, age, height, weight, ethnicity))
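
A note on combining conditions: R evaluates & before |, so an expression that mixes the two is grouped differently from how it reads left to right. A range such as "between 30 and 40" needs both bounds joined with &. The sketch below shows the difference on a made-up four-row dataset; the object names mixed and range_only are hypothetical.

```r
# Hypothetical data (values are made up for illustration)
mydata_old <- data.frame(
  gender = c("Male", "Male", "Female", "Male"),
  age    = c(35, 55, 25, 20)
)

# & is evaluated before |, so this condition is grouped as
# (gender == "Male" & age >= 30) | age <= 40
mixed <- subset(mydata_old, gender == "Male" & age >= 30 | age <= 40)

# Joining both bounds with & keeps only males aged 30 to 40
range_only <- subset(mydata_old, gender == "Male" & age >= 30 & age <= 40)

nrow(mixed)       # 4
nrow(range_only)  # 1
```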

 

 

 

Use case 5:  I want to select a random subset of observations/rows from my dataset:

 

# selecting a random sample of 100 observations from the mydata_old dataset

mydata_sample <- mydata_old[sample(nrow(mydata_old), 100, replace = FALSE), ]

 

Note: in the above we're using replace = FALSE, or 'sampling without replacement'.  This means that once an observation is selected it can't be selected again, so we end up with 100 distinct rows.  (The dataset therefore needs at least 100 rows; otherwise sample() stops with an error.)  If we used replace = TRUE this would be 'sampling WITH replacement', meaning that once a row is randomly selected it goes back into the pot and may be chosen again.  This is often used in 'bootstrapping', where we draw many resamples from the same data to estimate how much a statistic varies, which is especially useful when the dataset is small.
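
The sketch below puts both kinds of sampling side by side and adds a very simple bootstrap of a mean. It assumes a made-up dataset of 500 rows with a hypothetical height column; the 200 resamples and the call to set.seed() (which makes the random draws repeatable) are illustrative choices, not requirements.

```r
# Hypothetical dataset of 500 rows (values are made up for illustration)
set.seed(42)  # fixing the seed makes the random draws repeatable
mydata_old <- data.frame(id = 1:500, height = rnorm(500, mean = 170, sd = 10))

# Without replacement: 100 distinct rows
idx_unique <- sample(nrow(mydata_old), 100, replace = FALSE)
mydata_sample <- mydata_old[idx_unique, ]

# With replacement: the same row can be drawn more than once
idx_boot <- sample(nrow(mydata_old), nrow(mydata_old), replace = TRUE)
mydata_boot <- mydata_old[idx_boot, ]

# A simple bootstrap: resample the rows 200 times and record the mean height
boot_means <- replicate(200, {
  idx <- sample(nrow(mydata_old), replace = TRUE)
  mean(mydata_old$height[idx])
})

anyDuplicated(idx_unique)  # 0 means no row was drawn twice
sd(boot_means)             # rough estimate of the standard error of the mean
```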

 

 

 
