Using the file posts.csv in the data folder of this repo (a sample of 10,000 public Facebook posts by members of the US Congress from 2017), solve the following with dplyr:
- Do not consider posts with zero likes.
- Compute the comment-to-like ratio (i.e. `comments_count / likes_count`) for each post and store it in a column `clr`.
- For each `screen_name`, compute `normaliser_based_on_even_months = max(clr) - min(clr)`, i.e. the maximum minus the minimum `clr` value of posts by that `screen_name`; however, only take into account the posts made in even months, i.e. posts made in February, April, June, August, October, December, when computing `max(clr) - min(clr)` for each `screen_name`.
- Set all `normaliser_based_on_even_months` that have a value of zero to NA or delete them.
- Afterwards create a column `normalised_clr` which stores the `clr` of all posts from the original data frame (other than those with zero likes, which were deleted in the first step) divided by the `normaliser_based_on_even_months` of the associated screen name. The only exception are posts from screen names whose `normaliser_based_on_even_months` value was zero and was deleted/set to NA before; for these posts, set the value in `normalised_clr` to NA as well, or drop the post from the final data frame.
- In other words, the value of a single post/line `i` (written by a politician `p`) in that `normalised_clr` column can be computed as `normalised_clr_{i,p} = clr_{i} / normaliser_based_on_even_months_{p}` for all observations for which there is a non-NA `normaliser_based_on_even_months` (no need to use a loop for this, though; `dplyr` allows computing it in a vectorised way).
- Keep only those rows with `normalised_clr > 0`.
- Arrange the data frame according to `normalised_clr` in ascending order.
- Print out only `screen_name` and `normalised_clr` for the first 10 rows, i.e. the posts with the 10 lowest `normalised_clr`.
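To make sure I understand the steps, here is a toy example with invented numbers (same column names as the task, but this is my reading of the spec, not necessarily identical to the code below):

```r
library(dplyr)

# Invented data: two politicians, months given as numbers, 10 likes each
toy <- data.frame(
  screen_name    = c("A", "A", "A", "B", "B", "B"),
  month          = c(2, 4, 5, 2, 6, 3),
  comments_count = c(1, 4, 2, 3, 7, 2),
  likes_count    = rep(10, 6)
)

res <- toy %>%
  mutate(clr = comments_count / likes_count) %>%
  group_by(screen_name) %>%
  # max/min over even-month posts only, but assigned to every post of the group
  mutate(normaliser_based_on_even_months =
           max(clr[month %% 2 == 0]) - min(clr[month %% 2 == 0])) %>%
  ungroup() %>%
  filter(normaliser_based_on_even_months > 0) %>%
  mutate(normalised_clr = clr / normaliser_based_on_even_months) %>%
  filter(normalised_clr > 0) %>%
  arrange(normalised_clr)

res[, c("screen_name", "normalised_clr")]
# A's even-month clr values are 0.1 and 0.4, so its normaliser is 0.3;
# B's are 0.3 and 0.7, so its normaliser is 0.4
```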
I have written an R program for this. However, my task is to convert it into a SINGLE SQLite query. Here's what I've done:
# Load the database packages
library(DBI)
library(RSQLite)

# Create the database
posts_db <- dbConnect(RSQLite::SQLite(), "C:/Users/HP/Documents/posts.sqlite")
# Read the CSV into R and copy it into the database
data <- read.csv("C:/Users/HP/Documents/posts.csv",
    stringsAsFactors = FALSE)
dbWriteTable(posts_db, "posts", data, overwrite = TRUE)
dbListFields(posts_db, "posts")
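I wasn't sure the even-month check would work on the text that `strftime` returns, so I first verified it on a scratch in-memory connection (this assumes the `date` column in posts.csv holds ISO `yyyy-mm-dd` strings):

```r
library(DBI)
library(RSQLite)

scratch <- dbConnect(RSQLite::SQLite(), ":memory:")
chk <- dbGetQuery(scratch, "SELECT strftime('%m', '2017-02-03') AS m,
                                   strftime('%m', '2017-02-03') % 2 AS parity")
# strftime('%m', ...) returns the text '02'; SQLite coerces it to the
# number 2 for the modulo, so even months give parity 0
chk
dbDisconnect(scratch)
```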
dbGetQuery(posts_db, "WITH
cte1 AS (SELECT screen_name, comments_count * 1.0 / likes_count AS clr,
                strftime('%m', date) AS month
         FROM posts WHERE likes_count > 0),
cte2 AS (SELECT MAX(clr) - MIN(clr) AS normaliser_based_on_even_months
         FROM cte1
         WHERE month % 2 = 0
         GROUP BY screen_name),
cte3 AS (SELECT screen_name, clr, normaliser_based_on_even_months,
                clr / normaliser_based_on_even_months AS normalized_clr
         FROM cte1, cte2
         WHERE normaliser_based_on_even_months > 0)
SELECT screen_name, normalized_clr FROM cte3
WHERE normalized_clr > 0
ORDER BY normalized_clr")
Here's the output:
               screen_name normalized_clr
1             repmarkpocan   0.0002546821
2             repmarkpocan   0.0002690018
3  CongressmanRalphAbraham   0.0002756995
4  CongressmanRalphAbraham   0.0002912010
5             derek.kilmer   0.0003549631
6           RepJimMcGovern   0.0003664136
7      CongresswomanNorton   0.0003687929
8             derek.kilmer   0.0003749212
9           RepJimMcGovern   0.0003870155
10     CongresswomanNorton   0.0003895286
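To understand what `FROM cte1, cte2` actually produces when `cte2` has more than one group, I also tried a minimal in-memory reproduction (invented table and values, not the real posts data):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
# A stand-in for cte1: three posts from two screen names
dbWriteTable(con, "cte1_demo",
             data.frame(screen_name = c("A", "A", "B"),
                        clr         = c(0.1, 0.4, 0.3)))

res <- dbGetQuery(con, "WITH
cte2 AS (SELECT MAX(clr) - MIN(clr) AS normaliser
         FROM cte1_demo
         GROUP BY screen_name)
SELECT cte1_demo.screen_name, clr, normaliser FROM cte1_demo, cte2")

# Every row of cte1_demo is paired with every row of cte2:
# 3 posts x 2 per-group normalisers = 6 result rows
nrow(res)
dbDisconnect(con)
```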
For reference here is the R code that produces the result I want the SQLite Query to produce:
library(dplyr)
library(lubridate)

posts <- read.csv("C:/Users/HP/Documents/posts.csv")
# Remove rows with zero likes
posts <- posts %>% filter(likes_count > 0)
# Create `clr`, the comment-to-like ratio
posts <- posts %>% mutate(clr = comments_count / likes_count)
# Compute the maximum minus the minimum `clr` value of posts by each `screen_name`, only taking into account the posts made in even months, i.e. February, April, June, August, October, December. Code adapted from https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group
posts$date <- ymd(posts$date)
posts$date <- month(posts$date)
posts_normaliser <- posts %>% group_by(screen_name) %>% mutate(normaliser_based_on_even_months = case_when(date %% 2 == 0 ~ (max(clr) - min(clr))))
# Set all `normaliser_based_on_even_months` that have a value of zero to NA or delete them
posts_normaliser <- posts_normaliser %>% filter(normaliser_based_on_even_months > 0)
# Create a column `normalised_clr` which stores the `clr` of all posts from the original data frame (other than those with zero likes, which were deleted in the first step) divided by the `normaliser_based_on_even_months` of the associated screen name
merged_df <- merge(posts, posts_normaliser)
merged_df <- merged_df %>% group_by(screen_name) %>% mutate(normalised_clr = clr / normaliser_based_on_even_months)
# Keep only those rows with `normalised_clr` > 0
merged_df <- merged_df %>% filter(normalised_clr > 0)
# Arrange the data frame according to `normalised_clr` in ascending order
merged_df <- merged_df %>% arrange(normalised_clr)
# Print only `screen_name` and `normalised_clr` for the first 10 rows, i.e. the posts with the 10 lowest `normalised_clr`
merged_df[1:10, c("screen_name", "normalised_clr")]
Here's the Output from R:
> merged_df[1:10, c("screen_name", "normalised_clr")]
# A tibble: 10 × 2
# Groups:   screen_name [5]
   screen_name                   normalised_clr
   <chr>                                  <dbl>
 1 CongresswomanSheilaJacksonLee        0.00214
 2 CongresswomanSheilaJacksonLee        0.00218
 3 CongresswomanSheilaJacksonLee        0.00277
 4 RepMullin                            0.00342
 5 SenDuckworth                         0.00342
 6 CongresswomanSheilaJacksonLee        0.00357
 7 replahood                            0.00477
 8 SenDuckworth                         0.00488
 9 SenDuckworth                         0.00505
10 RepSmucker                           0.00516
I keep going through the statements one by one, but I can't figure out why the two outputs are different.