Below is an exercise from Datacamp.

Use the cbind() call to include all three sheets. Make sure the first column of urban_sheet2 and of urban_sheet3 is removed, so you don't have duplicate columns. Store the result in urban.

Code:

# Load gdata, which provides read.xls()
library(gdata)

# Add code to import data from all three sheets in urbanpop.xls
path <- "urbanpop.xls"
urban_sheet1 <- read.xls(path, sheet = 1, stringsAsFactors = FALSE)
urban_sheet2 <- read.xls(path, sheet = 2, stringsAsFactors = FALSE)
urban_sheet3 <- read.xls(path, sheet = 3, stringsAsFactors = FALSE)

# Extend the cbind() call to include urban_sheet3: urban
urban <- cbind(urban_sheet1, urban_sheet2[-1], urban_sheet3[-1])

# Remove all rows with NAs from urban: urban_clean
urban_clean <- na.omit(urban)

My question is why [-1] is used to remove the first column inside cbind(). Is it a special use of square brackets inside cbind()? Does that mean that if I want to remove the first two columns, the code should be urban_sheet2[-2]? I only know that square brackets are used for selecting certain columns or rows. This confuses me.

Eva
  • Plenty of examples out there to start learning about subsetting. http://www.statmethods.net/management/subset.html – Eric Watt Jul 10 '17 at 17:20
  • Possible duplicate of [Remove an entire column from a data.frame in R](https://stackoverflow.com/questions/6286313/remove-an-entire-column-from-a-data-frame-in-r) – Eric Watt Jul 10 '17 at 17:22

1 Answer

This is not specific to cbind(). You can use - inside square brackets to remove any particular row or column you want. If your data frame is df, df[,-1] will have its first column removed. df[,-2] will have its second (and only second) column removed. df[,-c(1,2)] will have both its first and second columns removed. Likewise, df[-1,] will have its first row removed, etc.
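To see the negative indexing described above in action, here is a minimal sketch. The data frame df and its columns a, b and c are made up for illustration and are not part of the exercise.

```r
# Hypothetical data frame for illustration only
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

no_first  <- df[, -1]                       # columns b and c remain
no_second <- df[, -2]                       # columns a and c remain
no_both   <- df[, -c(1, 2), drop = FALSE]   # only column c remains
no_row1   <- df[-1, ]                       # rows 2 and 3 remain

names(no_first)
names(no_second)
names(no_both)
nrow(no_row1)
```

The drop = FALSE in the third call is there because, when only one column is left, df[, -c(1, 2)] would otherwise collapse to a plain vector instead of staying a data frame.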

This cannot be done with column names, e.g., df[,-"var1"] will not work. To use column names, you can use which(), as in df[,-which(names(df) %in% "var1")], but simply df[,!names(df) %in% "var1"] is easier and yields the same result. You can also use subset(): subset(df, select = -c(var1, var2)); this will remove the columns named "var1" and "var2".
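The name-based options above can be compared side by side. This is only a sketch: the data frame df and the column names var1, var2 and var3 are hypothetical.

```r
# Hypothetical data frame for illustration only
df <- data.frame(var1 = 1:3, var2 = 4:6, var3 = 7:9)

# Negating a character string is an error, so wrap it in try()
res <- try(df[, -"var1"], silent = TRUE)
failed <- inherits(res, "try-error")

by_which  <- df[, -which(names(df) %in% "var1")]
by_logic  <- df[, !names(df) %in% "var1"]
by_subset <- subset(df, select = -c(var1, var2))

names(by_which)
names(by_logic)
names(by_subset)

# Caveat: when the name does not exist, which() returns integer(0),
# and the two approaches no longer agree
none_which <- df[, -which(names(df) %in% "nope")]
none_logic <- df[, !names(df) %in% "nope"]
ncol(none_which)
ncol(none_logic)
```

One reason to prefer the logical form is the last case: with a name that matches nothing, the which() version selects zero columns, while the logical version leaves the data frame untouched.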

Note that removing rows and columns only affects the output of the call, and will not affect the original object unless the output is assigned to the original object.
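A short sketch of that last point, again with a made-up data frame df:

```r
# Hypothetical data frame for illustration only
df <- data.frame(a = 1:3, b = 4:6)

trimmed <- df[, -1, drop = FALSE]   # result stored elsewhere
cols_before <- ncol(df)             # df itself is unchanged

df <- df[, -1, drop = FALSE]        # assign back to overwrite df
cols_after <- ncol(df)

cols_before
cols_after
```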

Noah
  • Thanks, this makes perfect sense now. But why does the code 'urban_sheet2[-1]' work without the comma in Datacamp? Is it a mistake, or is [, -1] equivalent to [-1]? – Eva Jul 11 '17 at 08:39
  • If you have a data frame, you can omit the comma to refer to columns. I prefer to always include it so that it's clear you are referencing a column of a data frame or matrix rather than an item in a vector. Omitting the comma is just a shortcut which, in my opinion, just makes things more confusing. – Noah Jul 11 '17 at 15:51
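
A rough sketch of the comma question from the comments, using made-up objects (df, m and v are hypothetical). For a data frame the two forms select the same columns, but they behave differently once a single column is left, and the comma-free form means something else for a matrix or vector.

```r
# Hypothetical objects for illustration only
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# On a data frame, [-1] and [, -1] both drop the first column
same_cols <- identical(names(df[-1]), names(df[, -1]))

# With one column left, the comma form collapses to a vector
list_style   <- df[-c(1, 2)]     # still a data frame
matrix_style <- df[, -c(1, 2)]   # plain integer vector

# On a matrix or a vector, [-1] drops the first element instead
m <- matrix(1:6, nrow = 2)
v <- c(10, 20, 30)
m_no_first_elem <- m[-1]     # vector of the remaining 5 elements
m_no_first_col  <- m[, -1]   # 2 x 2 matrix
v_no_first      <- v[-1]     # 20 30

same_cols
is.data.frame(list_style)
is.data.frame(matrix_style)
```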