Using the data.table package for R, I would like to be able to pull a specific record based on multiple criteria. In reality, I am dealing with a table with millions or tens of millions of records, and I want to extract the penultimate entries of a given field for records which have terminating 0 entries in that field. I can identify which records are of interest, and I can identify the dates of 0 entry, so with some creative use of lubridate I can create a list of ID and "Date I should be looking for". How can I take this two-column, 100 row list and use it to return the specific 100 values I need from a table of 10 million records?
Here is a simple example of what I am trying to do.
Given the following data table A:
Name Date Amount
1: A 1 100
2: A 2 100
3: A 3 100
4: A 4 99
5: A 5 98
6: A 6 97
7: A 7 96
8: A 8 95
9: A 9 94
10: A 10 93
11: A 11 92
12: A 12 0
13: B 2 200
14: B 3 200
15: B 4 190
16: B 5 180
17: B 6 170
18: B 7 160
19: B 8 150
20: B 9 0
21: C 2 100
22: C 3 95
23: C 4 90
24: C 5 90
25: C 6 85
26: C 7 80
27: C 8 0
What I would like to do is pull the last non-zero Amount for each record. What I can do is create a table B:
Name Date
1: A 11
2: B 8
3: C 7
What I want is the Amount in A for each Name & Date in B. The answer should be c(92, 150, 80).
Another approach would be to extract the subset of all records such that the Name and Date field pairs are valid. I can probably concatenate the two and search that way, but in actuality, Name is a long alphanumeric string and Date is converted into POSIX, so that could get ugly.
Furthermore, I may be doing this the wrong way, if there is a simpler way to return the penultimate record by a specific field, that may be all I need, something that would function like: A[Name %in% X, second-to-last record, by = Name] where X is the list of records in which I am interested.
Thank you.
Code
A <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "C",
"C", "C", "C", "C", "C", "C"), Date = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 2L,
3L, 4L, 5L, 6L, 7L, 8L), Amount = c(100L, 100L, 100L, 99L, 98L,
97L, 96L, 95L, 94L, 93L, 92L, 0L, 200L, 200L, 190L, 180L, 170L,
160L, 150L, 0L, 100L, 95L, 90L, 90L, 85L, 80L, 0L)), .Names = c("Name",
"Date", "Amount"), row.names = c(NA, -27L), class = c("data.table",
"data.frame"))
B <- data.table(Name = c('A', 'B', 'C'), Date = c(11, 8, 7))