Looking up values in data.frame for given rows

Question

I have the following dataset (sample). Obviously, the real data set that I am using is much larger:

gvkey     tic  stko   year
001689   AEP1     1   2011
017096    BA3     1   2011
001440    AEP     0   2011
002285     BA     0   2011
001689   AEP1     1   2012
017096    BA3     1   2012
001440    AEP     0   2012
002285     BA     0   2012

Here is the code to generate the data:

dat <- data.frame(gvkey=c("001689", "017096", "001440", "002285"), tic=c("AEP1", "BA3", "AEP", "BA"), stko=c(1, 1, 0, 0), year=c(2011,2011,2011,2011,2012,2012,2012,2012))

Here is what I would like to do: Each row represents a year-firm pair with tic being the firm ticker. Firms where stko equals 1 are subsidiaries and share the same tic as their parent company, but have a number attached to the ticker, e.g. AEP1 belongs to AEP. Basically, I would like to create a new variable parent, which indicates for each subsidiary (row with stko=1) the gvkey of the parent company. And I would like to do that for each year. The final data set should look like this:

gvkey     tic  stko   year  parent
001689   AEP1     1   2011  001440
017096    BA3     1   2011  002285
001440    AEP     0   2011  
002285     BA     0   2011  
001689   AEP1     1   2012  001440
017096    BA3     1   2012  002285
001440    AEP     0   2012  
002285     BA     0   2012

Now, my initial approach would be to write a couple of for loops that for a given year iterate over the rows. Whenever, stko=1, then extract the part of the ticker without the number at the end (e.g. for row one AEP) and find the row with this exact ticker in the given year (e.g. row 3 for year 2011) and copy the gvkey of that row to the initial observation with stko=1.

However, this procedure will be really slow given the size of my data set. I would be grateful if anyone can think of a faster and easier approach.

Many thanks!!

Using my main data set, the output for dput(droplevels(head(dat))) is:

structure(list(gvkey = c("176017", "128663", "61586", "278120", 
"14062", "285313"), datadate = structure(c(4L, 4L, 1L, 3L, 2L, 
1L), .Label = c("31dec2010", "31dec2011", "31dec2012", "31dec2013"
), class = "factor"), fyear = c(2013, 2013, 2010, 2012, 2011, 
2010), indfmt = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "INDL", class = "factor"), 
consol = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "C", class = "factor"), 
popsrc = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "D", class = "factor"), 
datafmt = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "STD", class = "factor"), 
tic = c("ATHX", "SQNM", "IMH", "FNLIF", "CCDBF", "BSBR"), 
cusip = structure(c(1L, 6L, 5L, 4L, 3L, 2L), .Label = c("04744L106", 
"05967A107", "124900309", "33564P103", "45254P508", "817337405"
), class = "factor"), conm = structure(c(1L, 6L, 5L, 4L, 
3L, 2L), .Label = c("ATHERSYS INC", "BANCO SANTANDER BRASIL  -ADR", 
"CCL INDUSTRIES  -CL B", "FIRST NATIONAL FINL CORP", "IMPAC MORTGAGE HOLDINGS INC", 
"SEQUENOM INC"), class = "factor"), curcd = structure(c(2L, 
2L, 2L, 1L, 1L, 2L), .Label = c("CAD", "USD"), class = "factor"), 
costat = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor"), 
stko = c(0, 0, 0, 0, 0, 0)), .Names = c("gvkey", "datadate", 
"fyear", "indfmt", "consol", "popsrc", "datafmt", "tic", "cusip", 
"conm", "curcd", "costat", "stko"), row.names = c(NA, 6L), class = "data.frame")

talat · Accepted Answer · 2014-08-01T20:54:22.417

Another option with dplyr:

require(dplyr)

dat %>%
  mutate(tic2 = gsub("[0-9]", "", tic)) %>%
  group_by(tic2, year) %>%
  mutate(parent = ifelse(stko == 1, as.character(gvkey[stko == 0][1]), "")) %>%
  ungroup() %>%
  select(-tic2)


#Source: local data frame [8 x 5]
#
#   gvkey  tic stko year parent
#1 001689 AEP1    1 2011 001440
#2 017096  BA3    1 2011 002285
#3 001440  AEP    0 2011       
#4 002285   BA    0 2011       
#5 001689 AEP1    1 2012 001440
#6 017096  BA3    1 2012 002285
#7 001440  AEP    0 2012       
#8 002285   BA    0 2012

Edit: if there can be companies without a matching parent, try this code:

dat %>%
  mutate(tic2 = gsub("[0-9]", "", tic)) %>%
  group_by(tic2, year) %>%
  mutate(parent = ifelse(stko == 1 & sum(stko == 0) > 0, 
                         as.character(gvkey[stko == 0][1]), "")) %>%
  ungroup() %>%
  select(-tic2)

score 1 · Answer 2 · edited May 23 '17 at 11:45

1

Looks like you have a working solution, but you're looking for speed. For that, you're going to want to use subsetting and vectorized operations instead of loops (see this post). I'd also find a solution using dplyr or data.table because they're both dramatically faster than base R.

Here's my attempt at a solution based on those ideas.

require(dplyr)

parents <- dat %>%
  filter(stko == 0) %>%
  select(tic, gvkey) %>%
  unique(.)
rownames(parents) <- parents$tic

dat2 <- dat %>%
  mutate(
    parentTic = sub("[1-9]$", "", tic),
    parentGvkey = parents[parentTic, "gvkey"])

edited May 23 '17 at 11:45

Community

1
1

answered Aug 01 '14 at 18:37

rsoren

4,036
3
26
37

This one works well for my data set (I have additional variables that I didn't include in the sample in my post). I just substituted the `sub()` with the `gsub()` function in order to make it work for subsidiary tickers that have multiple digit numbers in the end. – rp1 Aug 01 '14 at 19:10
Ideally, it should take care of the years. However, I haven't manage to get the two other solutions working for my main data set. – rp1 Aug 01 '14 at 19:18
What do you mean in the OP: "And I would like to do that for each year." All ```gvkey``` for a parent company are identical between years, at least in the provided data frame. – rsoren Aug 01 '14 at 20:47
Yes, in the sample the companies are identical among the years, however, this is not usually the case in my main data set. – rp1 Aug 02 '14 at 01:04
It would have been good to have that detail in your sample data set. In case you haven't read it already, this post on reproducible examples is essential reading on SO: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – rsoren Aug 02 '14 at 21:15

score 0 · Answer 3 · answered Aug 01 '14 at 18:09

This is a fairly straightforward exercise in the base::merge function

 rbind(transform(dat[dat$stko == 0, ], parent = ''),
   merge(
     transform(dat[dat$stko != 0, ], tic.parent = gsub('[0-9]', '', tic)),
     unique(transform(dat[dat$stko == 0, ], parent = gvkey)[, c('tic', 'parent', 'year')]),
     by.x = c('tic.parent', 'year'), by.y = c('tic', 'year')
   )[, -1]
 )

 #     gvkey  tic stko year parent
 # 3  001440  AEP    0 2011
 # 4  002285   BA    0 2011
 # 7  001440  AEP    0 2012
 # 8  002285   BA    0 2012
 # 5  001689 AEP1    1 2011 001440
 # 6  001689 AEP1    1 2012 001440
 # 71 017096  BA3    1 2011 002285
 # 81 017096  BA3    1 2012 002285

Thanks for pointing out my error. I tried to come up with a solution with `dplyr` but @beginneR beat me fair and square. — Masato Nakazawa, Aug 01 '14 at 18:46

Looking up values in data.frame for given rows

3 Answers3