I am using R to get some data from Google Analytics API. In this particular scenario, I get the information about the Affinity Interest of my users segmented by gender and age group. The structure of the data I get looks something similiar to:
gender ageGroup interest        sessions
male   18-24    Autos           4
male   18-24    Autos/Luxury    1
male   18-24    Autos/Vans      1
male   25-34    Autos           8
male   25-34    Autos/Luxury    2
male   25-34    Autos/Vans      2
male   25-34    Autos/Compacts  1
...
female 65+      Fashion         20
The problem with this structure is, however, Autos as the main interest contains also the sessions of the sub-categories, and if I use this data in a Pivot Table I will get the wrong information.
Therefore, I am adding the subcategory "Generalists" to every main category as its own subcategory and splitting this column in two:
for (i2 in 1:nrow(ga.genderAgeAffinityTable) ) {
# main categories <- chrFound = integer(0)            
chrFound <- grep("[/]", ga.genderAgeAffinityTable$interest[i2] )
if (length(chrFound) < 1) {
ga.genderAgeAffinityTable$interest[i2] <- 
sprintf("%s/Generalists", ga.genderAgeAffinityTable$interest[i2])
}
ga.genderAgeAffinityTable <- as.data.frame
(cSplit(ga.genderAgeAffinityTable, "interest", sep = "/"))
}
View(ga.genderAgeAffinityTable)
            gender ageGroup interest        subcategory        sessions
            male   18-24    Autos           Generalists        4
            male   18-24    Autos           Luxury             1
            male   18-24    Autos           Vans               1
            male   25-34    Autos           Generalists        8
            male   25-34    Autos           Luxury             2
            male   25-34    Autos           Vans               2
            male   25-34    Autos           Compacts           1
            ...
            female 65+      Fashion         Generalists        20
I still have to get rid of the wrong sessions calculation, as for the first group (males, 18-24 years, Auto lovers), Generalists should have only 2 sessions (sessions - sum(other subcategories)). I am doing that using an auxId (genderAgeInterestSubcategory), summing all sessions by that auxId, merging the aggregated sessions as a new column in my dataframe and recalculating sessions for subcategory "Generalists":
ga.genderAgeAffinityTable$auxId <- sprintf("%s%s%s",
ga.genderAgeAffinityTable$gender, ga.genderAgeAffinityTable$age,
ga.genderAgeAffinityTable$interest_1 )
ga.interestAggregated <- aggregate(ga.genderAgeAffinityTable[,c("sessions")],
by=list(ga.genderAgeAffinityTable$auxId), "sum")
colnames(ga.interestAggregated) <- c("auxId", "aggregated")
ga.genderAgeAffinityTable <- (merge(ga.genderAgeAffinityTable,
ga.interestAggregated, by = 'auxId'))
for (i3 in 1:nrow(ga.genderAgeAffinityTable) ) {
if (ga.genderAgeAffinityTable$interest_2[i3] == "Generalists" ) {
# Do not recalculate sessions for interests with only Generalists as subcategory          
if (ga.genderAgeAffinityTable$aggregated[i3] -
ga.genderAgeAffinityTable$sessions[i3] != 0 ) {
ga.genderAgeAffinityTable$sessions[i3] <-
ga.genderAgeAffinityTable$aggregated[i3] -
ga.genderAgeAffinityTable$sessions[i3]
}
}
}
Do you know a more straightforward way to do the same without using an auxid?
 
    