Background
I am creating a Sankey Diagram in R and I am struggling with labeling the nodes.
As an example, I will reuse a dataset of 10 imaginary patients who are screened for COVID-19. At baseline, all patients are negative for COVID-19. After, let’s say, 1 week, all patients are tested again: now 3 patients are positive, 6 are negative and 1 has an inconclusive result. Yet another week later, the 3 positive patients remain positive, 1 patient goes from negative to positive, and the others are negative.
data <- data.frame(patient = 1:10,
                   baseline = rep("neg", 10),
                   test1 = c(rep("pos", 3), rep("neg", 6), "inconcl"),
                   test2 = c(rep(NA, 3), "pos", rep("neg", 6)))
Attempt
To create the Sankey diagram, I am using the ggsankey package:
library(tidyverse)
# devtools::install_github("davidsjoberg/ggsankey")
library(ggsankey)  # provides make_long(), geom_sankey() and geom_sankey_label()

# Reshape to long format: one row per patient per stage
df <- data %>%
  make_long(baseline, test1, test2)

ggplot(df, aes(x = x, next_x = next_x, node = node, next_node = next_node,
               fill = factor(node), label = node)) +
  geom_sankey() +
  geom_sankey_label(aes(fill = factor(node)), size = 3, color = "white") +
  scale_fill_manual(values = c("grey", "green", "red")) +
  theme(legend.position = "bottom", legend.title = element_blank())
Question
I would like to label the nodes with the number of patients that are present in each node (e.g., the first node would be labeled as 10, and the inconclusive node would be labeled as 1, and so on...).
How do you do this in R without hardcoding the values?
Parts of solution
To extract the numbers from the data, I thought the initial step should be something like:
data %>% count(baseline, test1, test2)
#   baseline   test1 test2 n
# 1      neg inconcl   neg 1
# 2      neg     neg   neg 5
# 3      neg     neg   pos 1
# 4      neg     pos  <NA> 3
I think that if I can add the proper counts as an extra column of the long data frame df, I should be able to map that column to the label aesthetic, i.e. label = variable_name inside aes()?
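A minimal sketch of that idea, not a confirmed solution: it assumes that counting the rows of the long data per stage (x) and node gives the number of patients in each node, and it uses dplyr's add_count() to store that count on every row. The names df_n and n are introduced here only for illustration. Note that patients with NA at test2 are counted as a group of their own.

```r
library(dplyr)
library(ggplot2)
library(ggsankey)  # assumed installed from GitHub: davidsjoberg/ggsankey

data <- data.frame(patient = 1:10,
                   baseline = rep("neg", 10),
                   test1 = c(rep("pos", 3), rep("neg", 6), "inconcl"),
                   test2 = c(rep(NA, 3), "pos", rep("neg", 6)))

# Long format: one row per patient per stage
# (columns x, node, next_x, next_node)
df <- data %>%
  make_long(baseline, test1, test2)

# Add the number of rows per stage and node to every row,
# so that the count can be mapped to the label aesthetic
df_n <- df %>%
  add_count(x, node, name = "n")

ggplot(df_n, aes(x = x, next_x = next_x, node = node, next_node = next_node,
                 fill = factor(node), label = n)) +
  geom_sankey() +
  geom_sankey_label(size = 3, color = "white") +
  scale_fill_manual(values = c("grey", "green", "red")) +
  theme(legend.position = "bottom", legend.title = element_blank())
```

Grouping by both x and node matters here: counting by node alone would add up the "neg" rows of all three stages instead of giving one count per stage. To show the node name as well, something like label = paste0(node, " (n = ", n, ")") could be mapped instead.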

