Given your description of the problem, the following should work:
library(dplyr)
library(stats)
# df is the data.frame (see below)
df <- cbind(ID=seq_len(nrow(df)),df)
r.stolen <- which(df$is_stolen == 1)
r.not <- which(df$is_stolen != 1)
print(df[rep(r.not, times = length(r.stolen)), ] %>%
        setNames(., paste0(names(.), "_not")) %>%
        bind_cols(df[rep(r.stolen, each = length(r.not)), ], .) %>%
        mutate(in_range = as.numeric(telematic_trip_no != telematic_trip_no_not &
                                     time_of_day == time_of_day_not &
                                     day_of_week == day_of_week_not &
                                     lat_dec >= lat_min_not & lat_dec <= lat_max_not &
                                     lon_dec >= lon_min_not & lon_dec <= lon_max_not)) %>%
        group_by(ID) %>%
        summarise(count = sum(in_range)) %>%
        arrange(desc(count)))
The first line just adds a column named ID to df that identifies each row by its row number, so that we can later dplyr::group_by it to make the count.
The next two lines divide the rows into stolen and not-stolen cars. The key is to:
- replicate each row of stolen cars N times, where N is the number of not-stolen car rows,
- replicate the rows of not-stolen cars (as a block) M times, where M is the number of stolen car rows, and
- append the result of (2) to (1) as new columns, changing the names of these new columns so that we can reference them in the condition
The result of (3) has rows that enumerate all pairs of stolen and not-stolen rows from the original data frame, so that your condition can be applied in an array fashion. The dplyr pipeline (wrapped in a print()) does this:
- the first command replicates the not-stolen car rows using times=
- the second command appends _not to the column names to distinguish them from the stolen-car columns when we bind the columns. Thanks to this SO answer for that gem.
- the third command replicates the stolen car rows using each= and appends the previous result as new columns using dplyr::bind_cols
- the fourth command uses dplyr::mutate to create a new column named in_range that is the result of applying the condition. The logical result is converted to {0, 1} to allow for easy accumulation
- the rest of the commands in the pipe count in_range grouped by ID and arrange the results in decreasing order of the count. Note that ID now identifies the rows of the original data frame for which is_stolen = 1, whereas ID_not identifies the rows for which is_stolen = 0
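The counting step works because as.numeric() turns TRUE/FALSE into 1/0, so sum() counts the matches within each group. A minimal sketch of that pattern on toy data (assuming dplyr is attached as above):

```r
library(dplyr)

# Toy data: two groups, each with some rows matching a condition
toy <- data.frame(ID    = c(1, 1, 2, 2),
                  match = c(TRUE, FALSE, TRUE, TRUE))

toy %>%
  mutate(in_range = as.numeric(match)) %>%  # TRUE/FALSE -> 1/0
  group_by(ID) %>%
  summarise(count = sum(in_range)) %>%      # count matches per group
  arrange(desc(count))
# ID 2 has 2 matches, ID 1 has 1
```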
This assumes that you want the count for each row with is_stolen = 1 in the original data frame, which is what you said in your question. If instead you really want the count for each stolen telematic_trip_no, then you can use
group_by(telematic_trip_no) %>%
in the pipe instead.
I've tested this using the following data snippet:
df <- structure(list(position_time = structure(c(1L, 1L, 1L, 2L, 3L,
4L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), .Label = c("2016-06-05 00:00:01",
"2016-06-05 00:00:04", "2016-06-05 00:00:05", "2016-06-05 00:00:19",
"2016-06-05 00:00:20", "2016-06-05 00:00:22", "2016-06-05 00:00:23",
"2016-06-05 00:00:35", "2016-06-05 00:09:34", "2016-06-06 01:00:06"
), class = "factor"), telematic_trip_no = c(526132109L, 526028387L,
526081476L, 526140512L, 526140518L, 526006880L, 526017880L, 526027880L,
526006880L, 526006890L, 526106880L, 526005880L, 526007880L),
lat_dec = c(-26.6641, -26.6402, -26.5545, -26.531, -26.531,
-26.501, -26.5315, -26.5325, -26.501, -26.5315, -26.5007,
-26.5315, -26.5315), lon_dec = c(27.8733, 27.8059, 28.3263,
27.8704, 27.8704, 27.849, 27.88, 27.87, 27.849, 27.87, 27.8493,
27.87, 27.87), is_stolen = c(0L, 0L, 0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), hour_of_day = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), time_of_day = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 9L, 0L), day_of_week = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("Monday",
"Sunday"), class = "factor"), lat_min = c(-26.6651, -26.6412,
-26.5555, -26.532, -26.532, -26.502, -26.532, -26.532, -26.502,
-26.532, -26.502, -26.532, -26.532), lat_max = c(-26.6631,
-26.6392, -26.5535, -26.53, -26.53, -26.5, -26.53, -26.53,
-26.5, -26.53, -26.5, -26.53, -26.53), lon_max = c(27.8743,
27.8069, 28.3273, 27.8714, 27.8714, 27.85, 27.8714, 27.8714,
27.85, 27.8714, 27.85, 27.8714, 27.8714), lon_min = c(27.8723,
27.8049, 28.3253, 27.8694, 27.8694, 27.848, 27.8694, 27.8694,
27.848, 27.8694, 27.848, 27.8694, 27.8694)), .Names = c("position_time",
"telematic_trip_no", "lat_dec", "lon_dec", "is_stolen", "hour_of_day",
"time_of_day", "day_of_week", "lat_min", "lat_max", "lon_max",
"lon_min"), class = "data.frame", row.names = c(NA, -13L))
Here, I appended 7 new rows with is_stolen = 1 to your original 6 rows that all have is_stolen = 0:
- the first added row, with telematic_trip_no = 526005880, violates the longitude condition for all not-stolen rows, so its count should be 0
- the second added row, with telematic_trip_no = 526006880, violates the latitude condition for all not-stolen rows, so its count should be 0
- the third added row, with telematic_trip_no = 526007880, violates the telematic_trip_no condition for all not-stolen rows, so its count should be 0
- the fourth added row, with telematic_trip_no = 526006890, satisfies the condition for not-stolen rows 4 and 5, so its count should be 2
- the fifth added row, with telematic_trip_no = 526106880, satisfies the condition for not-stolen row 6, so its count should be 1
- the sixth added row, with telematic_trip_no = 526017880, violates the time_of_day condition for all not-stolen rows, so its count should be 0
- the seventh added row, with telematic_trip_no = 526027880, violates the day_of_week condition for all not-stolen rows, so its count should be 0
Running the code on this data gives:
# A tibble: 7 x 2
ID count
<int> <dbl>
1 10 2
2 11 1
3 7 0
4 8 0
5 9 0
6 12 0
7 13 0
which is as expected, recalling that the appended rows with is_stolen = 1 start at row 7 with ID = 7.
If we instead group by telematic_trip_no, we get:
# A tibble: 7 x 2
telematic_trip_no count
<int> <dbl>
1 526006890 2
2 526106880 1
3 526005880 0
4 526006880 0
5 526007880 0
6 526017880 0
7 526027880 0
As a caveat, the above approach does cost memory: in the worst case, the number of rows grows to N^2/4, where N is the number of rows in the original data frame, and the number of columns doubles for the data frame that is used to evaluate the condition. As with most array-processing techniques, there is a trade-off between speed and memory.
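If memory becomes the bottleneck, the same counts can be computed one stolen row at a time with a plain loop, which never materializes the full cross join. Here is a rough base-R sketch of that alternative (count_matches is a hypothetical helper name, not part of the answer above; it expects the same columns as df):

```r
# Memory-light alternative: loop over stolen rows one at a time, so only
# length(r.not) comparisons are held in memory per iteration (base R only).
count_matches <- function(df) {
  r.stolen <- which(df$is_stolen == 1)
  r.not    <- which(df$is_stolen != 1)
  n <- df[r.not, ]  # the not-stolen block, reused every iteration
  counts <- sapply(r.stolen, function(i) {
    s <- df[i, ]    # one stolen row; its length-1 values recycle against n
    sum(s$telematic_trip_no != n$telematic_trip_no &
        s$time_of_day == n$time_of_day &
        s$day_of_week == n$day_of_week &
        s$lat_dec >= n$lat_min & s$lat_dec <= n$lat_max &
        s$lon_dec >= n$lon_min & s$lon_dec <= n$lon_max)
  })
  data.frame(row = r.stolen, count = counts)
}
```

This trades the vectorized cross-join for M smaller vectorized comparisons, one per stolen row, keeping peak memory proportional to the number of not-stolen rows.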
Hope this helps.