I have a dataframe which contains stock prices, something like:
id     date        price
1     2/3/2016     4
1     2/4/2016     4.2
1     2/5/2016     3
2     2/4/2016     3
2     2/5/2016     4.1
2     2/6/2016     3
.     .            . 
.     .            . 
.     .            . 
However, not all stocks cover the same time period, as companies leave and join the exchange. I'm looking to subset my data to a continuous time series (ignoring weekends). I face a tradeoff between number of stocks and length of time (i.e. the longer the the time series, the less companies have data available over the entire series).
I believe the optimal method for picking a series is that it contains a start date from a stock run and an end date of a (possibly different) stock run. Any other time period would not be maximizing the time coverage. In the end, I'm expecting a dataframe containing possible start dates, end dates, and number of companies. I can then decide which time range is best suited.
I hope this is clear and please let me know if any clarifications are needed.
Edit: This should help. Here is an crude dataframe
date1 <- seq(as.Date("2006/1/1"), as.Date("2010/1/1"), "days")
date2 <- seq(as.Date("2006/6/12"), as.Date("2008/1/1"), "days")
date3 <- seq(as.Date("2009/10/11"), as.Date("2010/12/1"), "days")
date4 <- seq(as.Date("2002/1/1"), as.Date("2007/1/13"), "days")
row_to_drop1 <- !(weekdays(as.Date(date1)) %in% c('Saturday','Sunday'))
row_to_drop2 <- !(weekdays(as.Date(date2)) %in% c('Saturday','Sunday'))
row_to_drop3 <- !(weekdays(as.Date(date3)) %in% c('Saturday','Sunday'))
row_to_drop4 <- !(weekdays(as.Date(date4)) %in% c('Saturday','Sunday'))
date1 <- date1[row_to_drop1]
date2 <- date2[row_to_drop2]
date3 <- date3[row_to_drop3]
date4 <- date4[row_to_drop4]
mydf <- rbind.data.frame(data.frame(id = 1, date = date1),data.frame(id=2,date=date2),data.frame(id=2,date=date3),data.frame(id=3,date=date4))
I'm hoping for this output:
start       end         #ofids
1/1/2006    1/1/2010    1
6/12/2006   1/1/2010    2
10/11/2009  1/1/2010    3
1/1/2002    1/1/2010    3
1/1/2006    1/1/2008    3
6/12/2006   1/1/2008    1
10/11/2009  1/1/2008    2
1/1/2002    1/1/2008    3
1/1/2006    12/1/2010   1
6/12/2006   12/1/2010   3
10/11/2009  12/1/2010   2
1/1/2002    12/1/2010   2
1/1/2006    1/13/2007   2
6/12/2006   1/13/2007   1
10/11/2009  1/13/2007   1
1/1/2002    1/13/2007   1
The #ofids is just made up, but hopefully it gets the point across. Some rows will be dropped as the start date can't be greater than the end date.
Here are the runs (combo of start dates with all end dates):
start       end
1/1/2006    1/1/2010
6/12/2006   1/1/2010
10/11/2009  1/1/2010
1/1/2002    1/1/2010
1/1/2006    1/1/2008
6/12/2006   1/1/2008
1/1/2002    1/1/2008
1/1/2006    12/1/2010
6/12/2006   12/1/2010
10/11/2009  12/1/2010
1/1/2002    12/1/2010
1/1/2006    1/13/2007
6/12/2006   1/13/2007
1/1/2002    1/13/2007
That should make for 14x3(ids)=42 rows.
 
    