I have the following dataset (sample). Obviously, the real data set that I am using is much larger:
gvkey     tic  stko   year
001689   AEP1     1   2011
017096    BA3     1   2011
001440    AEP     0   2011
002285     BA     0   2011
001689   AEP1     1   2012
017096    BA3     1   2012
001440    AEP     0   2012
002285     BA     0   2012
Here is the code to generate the data:
dat <- data.frame(gvkey=c("001689", "017096", "001440", "002285"), tic=c("AEP1", "BA3", "AEP", "BA"), stko=c(1, 1, 0, 0), year=c(2011,2011,2011,2011,2012,2012,2012,2012))
Here is what I would like to do: Each row represents a year-firm pair with tic being the firm ticker. Firms where stko equals 1 are subsidiaries and share the same tic as their parent company, but have a number attached to the ticker, e.g. AEP1 belongs to AEP. Basically, I would like to create a new variable parent, which indicates for each subsidiary (row with stko=1) the gvkey of the parent company. And I would like to do that for each year. The final data set should look like this:
gvkey     tic  stko   year  parent
001689   AEP1     1   2011  001440
017096    BA3     1   2011  002285
001440    AEP     0   2011  
002285     BA     0   2011  
001689   AEP1     1   2012  001440
017096    BA3     1   2012  002285
001440    AEP     0   2012  
002285     BA     0   2012  
Now, my initial approach would be to write a couple of for loops that for a given year iterate over the rows. Whenever, stko=1, then extract the part of the ticker without the number at the end (e.g. for row one AEP) and find the row with this exact ticker in the given year (e.g. row 3 for year 2011) and copy the gvkey of that row to the initial observation with stko=1.
However, this procedure will be really slow given the size of my data set. I would be grateful if anyone can think of a faster and easier approach.
Many thanks!!
Using my main data set, the output for dput(droplevels(head(dat))) is:
structure(list(gvkey = c("176017", "128663", "61586", "278120", 
"14062", "285313"), datadate = structure(c(4L, 4L, 1L, 3L, 2L, 
1L), .Label = c("31dec2010", "31dec2011", "31dec2012", "31dec2013"
), class = "factor"), fyear = c(2013, 2013, 2010, 2012, 2011, 
2010), indfmt = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "INDL", class = "factor"), 
consol = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "C", class = "factor"), 
popsrc = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "D", class = "factor"), 
datafmt = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "STD", class = "factor"), 
tic = c("ATHX", "SQNM", "IMH", "FNLIF", "CCDBF", "BSBR"), 
cusip = structure(c(1L, 6L, 5L, 4L, 3L, 2L), .Label = c("04744L106", 
"05967A107", "124900309", "33564P103", "45254P508", "817337405"
), class = "factor"), conm = structure(c(1L, 6L, 5L, 4L, 
3L, 2L), .Label = c("ATHERSYS INC", "BANCO SANTANDER BRASIL  -ADR", 
"CCL INDUSTRIES  -CL B", "FIRST NATIONAL FINL CORP", "IMPAC MORTGAGE HOLDINGS INC", 
"SEQUENOM INC"), class = "factor"), curcd = structure(c(2L, 
2L, 2L, 1L, 1L, 2L), .Label = c("CAD", "USD"), class = "factor"), 
costat = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor"), 
stko = c(0, 0, 0, 0, 0, 0)), .Names = c("gvkey", "datadate", 
"fyear", "indfmt", "consol", "popsrc", "datafmt", "tic", "cusip", 
"conm", "curcd", "costat", "stko"), row.names = c(NA, 6L), class = "data.frame")
 
     
     
     
    