EDIT: I improved the question by adding a reproducible example and clarifying my issue.
Hi, I need to translate the following Stata code to R so that it can be run on a large dataset:
sort UF UPA Ano Trimestre
loc j = 1
loc stop = 0
loc count = 0
while `stop' == 0 {
loc lastcount = `count'
count if p201 == . & n_p == `i'+1
loc count = r(N)
if `count' == `lastcount' {
loc stop = 1
}
else {
if r(N) != 0 {
replace p201 = p201[_n - `j'] if
UF == UF[_n - `j'] &
UPA == UPA[_n - `j'] &
n_p == `i'+1 & n_p[_n - `j'] == `i' &
p201 ==. & forw[_n - `j'] != 1 &
replace forw = 1 if UF == UF[_n + `j'] &
UPA == UPA[_n + `j'] &
p201 == p201[_n + `j'] &
n_p == `i' & n_p[_n + `j']==`i'+ 1 &
forw != 1
loc j = `j' + 1
}
else {
loc stop = 1
}
}
}
replace back = p201 !=. if n_p == `i'+1
replace forw = 0 if forw != 1 & n_p == `i'
}
My dataset is huge and more complex than the example posted below. Mainly, I would like to understand what the while loop involving j is for.
Here is a toy example and the desired result in R:
start <- data.frame(
Ano = c(2012, 2012, 2012, 2012),
Trimestre = c("1", "2", "3", "4"),
UF = c(28, 28, 28, 28),
UPA = c(280020150, 280020150, 280020150, 280020150),
n_p = c(1, 2, 3, 4),
p201 = c(1, NA, NA, NA),
back = c(NA, NA, NA, NA),
forw = c(NA, NA, NA, NA)
)
end <- data.frame(
Ano = c(2012, 2012, 2012, 2012),
Trimestre = c("1", "2", "3", "4"),
UF = c(28, 28, 28, 28),
UPA = c(280020150, 280020150, 280020150, 280020150),
n_p = c(1, 2, 3, 4),
p201 = c(1, 1, 1, 1),
back = c(NA, 1, 1, 1),
forw = c(1, 1, 1, 0)
)
In the dataset, each combination of UF and UPA identifies an individual, and there are many such combinations. Ano and Trimestre denote the year and the quarter.
It seems as if the code simply matches all rows with the same UF-UPA by giving every row the first value of p201 in its group. The variables back and forw equal 1 if an observation is paired with another one at an earlier or a later date, respectively.
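To make that hypothesis concrete, here is a minimal base R sketch of what I think the code does. It is only my guess, not a verified translation of the Stata code: it assumes one row per UF-UPA per quarter, and the names fill_forward, grp, pos, size and out are my own.

```r
# Toy input, identical to `start` in the question.
start <- data.frame(
  Ano = c(2012, 2012, 2012, 2012),
  Trimestre = c("1", "2", "3", "4"),
  UF = c(28, 28, 28, 28),
  UPA = c(280020150, 280020150, 280020150, 280020150),
  n_p = c(1, 2, 3, 4),
  p201 = c(1, NA, NA, NA),
  back = c(NA, NA, NA, NA),
  forw = c(NA, NA, NA, NA)
)

# Carry the most recent non-missing value forward; leading NAs stay NA.
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[idx + 1]
}

# Same sort order as the Stata code.
out <- start[order(start$UF, start$UPA, start$Ano, start$Trimestre), ]
grp <- interaction(out$UF, out$UPA, drop = TRUE)

# ASSUMPTION: every row in a UF-UPA group is the same individual.
out$p201 <- ave(out$p201, grp, FUN = fill_forward)

# Position of each row within its group, and the group size.
pos  <- ave(seq_along(grp), grp, FUN = seq_along)
size <- ave(seq_along(grp), grp, FUN = length)

# back: 1 if the row was paired with an earlier one, NA otherwise.
out$back <- ifelse(pos > 1 & !is.na(out$p201), 1, NA)
# forw: 1 if a later row exists in the group, 0 for the last row.
out$forw <- ifelse(pos < size, 1, 0)
```

On the toy data this gives the same p201, back and forw columns as `end`, but that may only be because each group here has exactly one row per quarter.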
My question is: what are the while loop and the j counter for? I suspect the code could be greatly simplified in R using only group_by from dplyr, possibly without even a for loop.
However, I am not sure whether this holds only for the particular subset of the data I have posted here, or whether these parts are indeed necessary. Is there a clever way to find out, for example by testing other inputs?
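For reference, this is how I currently read the `[_n - j]` subscripts: as a lag by j rows, returning missing when the offset falls before the first row. A small R sketch of that reading (lag_by is a hypothetical helper name of mine, not something from the Stata code):

```r
# Value of x taken j rows earlier, NA where no such row exists.
# Meant to mimic Stata's x[_n - j] for j >= 0.
lag_by <- function(x, j) {
  c(rep(NA, j), x)[seq_along(x)]
}

p201 <- c(1, NA, NA, NA)
lag_by(p201, 1)  # candidate donor one row back
lag_by(p201, 2)  # candidate donor two rows back
```

If this reading is right, increasing j just makes the code look further back in the sorted data for a donor row, which is why I am unsure whether it matters when each group has one row per quarter.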