I'm working with a panel data set with the following variables. Here's a snippet of my data:
i     region urban year
8431  3      1     1979
8431  3      1     1980
8431  3      1     1981
8431  3      1     1982
8431  3      0     1983
8431  3      0     1984
8431  3      0     1985
8431  3      0     1986
8431  3      0     1987
8431  3      0     1988
8431  3      0     1989
8431  3      1     1990
8431  3      1     1991
8431  3      1     1992
8431  3      1     1993
8431  3      1     1994
8431  3      1     1996
8431  3      0     1998
8431  3      1     2000
8431  3      1     2002
8431  3      1     2004
8431  3      1     2006
8431  2      0     2008
8431  3      1     2010
8431  3      1     2012
The panel is strongly balanced, but there are some missing observations in the data that I deal with through interpolation.
I want to calculate the total number of times region and urban have changed during the observation period for each individual.
First thing I tried was lagged values:
bysort i: gen urban_lag = l1.urban
bysort i (year): gen urbanchange = 0 if urban==urban_lag & !missing(urban)
replace urbanchange = 1 if urban!=urban_lag & !missing(urban)
For some reason, this generates a lot of missing values, even for individuals where there are not missing values for region. Does anyone know why this is the case?
I then tried this:
sort i year
by i: gen byte urbanchange = urban != urban[_n-1]
This mostly works and gets me slightly closer to where I want, except the first observation of each individual is equal to 1, because change[0] is before the start of the data and so Stata returns it as missing, so I get urbanchange[1] = 1 for every individual.
This is the output I want:
i     region urban year  regionchange  urbanchange
8431  3      1     1979  0             0
8431  3      1     1980  0             0
8431  3      1     1981  0             0
8431  3      1     1982  0             0
8431  3      0     1983  0             1
8431  3      0     1984  0             0
8431  3      0     1985  0             0
8431  3      0     1986  0             0
8431  3      0     1987  0             0
8431  3      0     1988  0             0
8431  3      0     1989  0             0
8431  3      1     1990  0             1
8431  3      1     1991  0             0
8431  3      1     1992  0             0
8431  3      1     1993  0             0
8431  3      1     1994  0             0
8431  3      1     1996  0             0
8431  3      0     1998  0             1
8431  3      1     2000  0             1
8431  3      1     2002  0             0
8431  3      1     2004  0             0
8431  3      1     2006  0             0
8431  2      0     2008  1             1
8431  3      1     2010  1             1
8431  3      1     2012  0             0
This question is basically the Stata version of this question I asked a year ago Counting the number of changes of a categorical variable during repeated measurements within a category
Edit regarding interpolation: since there are a lot of gaps in my data, for observations with missing values, I make the assumption that, for example, the individual's current urban value is the same as the last observed (non-missing) urban value.
For example:
region  urban   year
1       1       1979
1       1       1980
1       1       1981
3       1       1982
3       1       1983
3       1       1984
.       .       1985
1       1       1986
1       1       1987
1       1       1988
1       1       1989
.       .       1990
.       .       1991
.       .       1992
.       .       1993
.       .       1994
        
I fill in the missing values like such:
gen urban1 = urban
bysort i (year): replace urban1 = urban1[_n-1] if missing(urban1)
gsort i -year 
replace urban1 = urban1[_n-1] if urban1 >=. 
and do the same for region.
Output:
region  urban   year   region1  urban1
1       1       1979   1        1
1       1       1980   1        1
1       1       1981   1        1        
3       1       1982   3        1
3       1       1983   3        1        
3       1       1984   3        1        
.       .       1985   3        1
1       1       1986   1        1
1       1       1987   1        1
1       1       1988   1        1
1       1       1989   1        1
.       .       1990   1        1
.       .       1991   1        1
.       .       1992   1        1
.       .       1993   1        1
.       .       1994   1        1
I'm making some strong assumptions here and some may disagree with what I'm doing here, but for the purpose of counting the number of changes, it doesn't affect the results I want and the missing values don't mess with my output.
 
    