I have the following data frame.
Input:
class   id  q1  q2  q3  q4
Ali     12  1   2   3   3
Tom     16  1   2   4   2
Tom     18  1   2   3   4
Ali     24  2   2   4   3
Ali     35  2   2   4   3
Tom     36  1   2   4   2
- class indicates the teacher's name,
- id indicates the student user ID, and,
- q1, q2, q3 and q4 indicate marks on different test questions
Requirement:
I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated. For this, I want to calculate absolute distance or difference, grouped by class name, across multiple columns, i.e., all the test questions q1, q2, q3 and q4. And I want to store this information in a couple of new columns as below:
- difference: For a given class name, it contains the pairwise distance or difference with all other students' id. For a given class name, it stores the information as (id1, id2 = difference)
- cheating: This column lists any id's based on the previously created new column where the difference was zero (or some threshold value). This will be a flag to alert the teacher that their student might have cheated.
class   id  q1  q2  q3  q4  difference                  cheating
Ali     12  1   2   3   3   (12,24 = 2), (12,35 = 2)    NA
Tom     16  1   2   4   2   (16,18 = 3), (16,36 = 0)    36
Tom     18  1   2   3   4   (16,18 = 3), (18,36 = 3)    NA
Ali     24  2   2   4   3   (12,24 = 2), (24,35 = 0)    35
Ali     35  2   2   4   3   (12,35 = 2), (24,35 = 0)    24
Tom     36  1   2   4   2   (16,36 = 0), (18,36 = 3)    16
Is it possible to achieve this using dplyr?
Related posts:
I have tried to look for related solutions but none of them address the exact problem that I am facing e.g.,
- This post calculates the difference between all pairs of rows. It does not incorporate the group_by situation plus the solution is extremely slow: R - Calculate the differences in the column values between rows/ observations (all combinations) 
- This one compares only two columns using stringdist(). I want my solution over multiple columns and with a group_by() condition: Creating new field that shows stringdist between two columns in R? 
- The following post compares the initial values in a column with their preceding values: R Calculating difference between values in a column 
- This one compares values in one column to all other columns. I would want this but done row wise and through group_by(): R Calculate the difference between values from one to all the other columns 
dput()
For your convenience, I am sharing data dput():
structure(list(class = 
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"), 
id = c(12L, 16L, 18L, 24L, 35L, 36L), 
q1 = c(1L, 1L, 1L, 2L, 2L, 1L), 
q2 = c(2L, 2L, 2L, 2L, 2L, 2L), 
q3 = c(3L, 4L, 3L, 4L, 4L, 4L), 
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
Any help would be greatly appreciated!
 
    