lump factor based on another column

Question

The example shows measurements of production output of different factories, where the first column denotes the factory and the last column the amount produced.

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production)
df
  factory production
1       A         15
2       A          2
3       B          1
4       B          1
5       B          2
6       B          1
7       B          2
8       C         20
9       D          5

Now I want to lump together the factories into fewer levels, based on their total output in this data set.

With the normal forcats::fct_lump, I can lump them by the number of rows in which they appear, e.g. to get 3 levels:

library(tidyverse)    
df %>% mutate(factory=fct_lump(factory,2))
      factory production
    1       A         15
    2       A          2
    3       B          1
    4       B          1
    5       B          2
    6       B          1
    7       B          2
    8   Other         20
    9   Other          5

But I want to lump them based on sum(production) instead: keep the top n = 2 factories (by total output) and lump the remaining ones. Desired result:

1       A         15
2       A          2
3   Other          1
4   Other          1
5   Other          2
6   Other          1
7   Other          2
8       C         20
9   Other          5

Any suggestions?

Thanks!

AntoniosK · Accepted Answer · 2018-10-04T15:16:18.057

The key is to decide on a rule for grouping factories by their total production. Which rule makes sense depends on the actual values in your (real) dataset.

Option 1

Here's an example that lumps together factories whose total production is 15 or less. For a different grouping, change the threshold (e.g. use 18 instead of 15).

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

library(dplyr)

df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(sum(production) > 15, factory, "Other")) %>%
  ungroup()

# # A tibble: 9 x 3
#   factory production factory_new
#   <chr>        <dbl> <chr>      
# 1 A               15 A          
# 2 A                2 A          
# 3 B                1 Other      
# 4 B                1 Other      
# 5 B                2 Other      
# 6 B                1 Other      
# 7 B                2 Other      
# 8 C               20 C          
# 9 D                5 Other

I'm creating factory_new without removing the (original) factory column.
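
To see why a threshold of 15 keeps A and C, it can help to look at the per-factory totals first. This is a minimal sketch in base R that reuses the example data; the name `totals` is only illustrative.

```r
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

# total production per factory
totals <- tapply(df$production, df$factory, sum)
totals
#  A  B  C  D 
# 17  7 20  5 

# factories whose total exceeds the threshold keep their own name
kept <- names(totals)[totals > 15]
kept
# [1] "A" "C"
```

Raising the threshold to 18 would also move A (total 17) into "Other".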

Option 2

Here's an example that ranks the factories by total production; you then pick how many of the top factories to keep as they are, and the rest are grouped together.

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

library(dplyr)

# get ranked factories based on sum production
df %>%
  group_by(factory) %>%
  summarise(SumProd = sum(production)) %>%
  arrange(desc(SumProd)) %>%
  pull(factory) -> vec_top_factories

# input how many top factories you want to keep
# rest will be grouped together
n <- 2

# apply the grouping based on n provided
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(factory %in% vec_top_factories[1:n], factory, "Other")) %>%
  ungroup()

# # A tibble: 9 x 3
#   factory production factory_new
#   <chr>        <dbl> <chr>      
# 1 A               15 A          
# 2 A                2 A          
# 3 B                1 Other      
# 4 B                1 Other      
# 5 B                2 Other      
# 6 B                1 Other      
# 7 B                2 Other      
# 8 C               20 C          
# 9 D                5 Other
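
The two steps of Option 2 could also be wrapped into one function. The sketch below is only an illustration: `lump_top_n` is a hypothetical helper (not part of dplyr or forcats), and it assumes the columns are named `factory` (character) and `production` as in the example.

```r
library(dplyr)

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

# Hypothetical helper, not part of dplyr or forcats.
# Assumes columns named `factory` (character) and `production`.
lump_top_n <- function(data, n, other = "Other") {
  # names of the n factories with the largest total production
  top <- data %>%
    group_by(factory) %>%
    summarise(SumProd = sum(production)) %>%
    arrange(desc(SumProd)) %>%
    pull(factory) %>%
    head(n)   # head() returns nothing for n = 0, whereas [1:n] would pick element 1

  # keep the top factories as they are, relabel everything else
  data %>%
    mutate(factory_new = ifelse(factory %in% top, factory, other))
}

res <- lump_top_n(df, n = 2)
res$factory_new
# A, A, Other, Other, Other, Other, Other, C, Other
```

Using `head(n)` instead of `[1:n]` is a small design choice: `1:0` evaluates to `c(1, 0)`, so indexing with it would still keep the first factory.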

That is a good step, but it still does not keep the top n separate and lump the rest. — Peter, Oct 04 '18 at 15:03
For the example, yes. But forcats::fct_lump takes the argument n and keeps the top n levels (by abundance) and lumps the rest into one. Maybe I need to clarify that. — Peter, Oct 04 '18 at 15:06
In the solution above the philosophy is to change the `sum(production)` threshold in order to get different groups. So, you can use `18` instead of `15` and you'll get a different grouping. If you want to provide `n` instead I can modify the code... — AntoniosK, Oct 04 '18 at 15:09

tiptoebull · Answer 2 · 2021-02-24T20:54:08.103

4

Just specify the weight argument w:

df %>%
  mutate(factory = fct_lump_n(factory, 2, w = production))
  factory production
1       A         15
2       A          2
3   Other          1
4   Other          1
5   Other          2
6   Other          1
7   Other          2
8       C         20
9   Other          5

Note: use forcats::fct_lump_n, because the general-purpose fct_lump is no longer recommended for new code.
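
As a small standalone sketch of the same call outside a data frame: `fct_lump_n` also takes an `other_level` argument, so the lumped level does not have to be called "Other". The label "Rest" below is just an illustrative choice.

```r
library(forcats)

factory <- factor(c("A","A","B","B","B","B","B","C","D"))
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)

# keep the 2 levels with the largest summed weights, relabel the rest as "Rest"
lumped <- fct_lump_n(factory, n = 2, w = production, other_level = "Rest")
as.character(lumped)
# A, A, Rest, Rest, Rest, Rest, Rest, C, Rest
```

Without `w`, the same call would rank levels by row count and keep A and B instead.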

edited Feb 24 '21 at 20:54

answered Feb 24 '21 at 19:46


score 1 · Answer 3 · answered Oct 04 '18 at 15:15

We could use base R as well, by creating a logical condition with ave:

df$factory_new <- "Other"
i1 <- with(df, ave(production, factory, FUN = sum) > 15)
# as.character() guards against factory being a factor (the default in R < 4.0.0)
df$factory_new[i1] <- as.character(df$factory[i1])
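
The threshold version above can be adapted to keep the top n factories instead. This is a sketch under the assumption that ties do not matter; `totals` and `top` are illustrative names.

```r
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

n <- 2

# total production per factory, then the names of the n largest totals
totals <- tapply(df$production, df$factory, sum)
top <- head(names(totals)[order(totals, decreasing = TRUE)], n)
top
# [1] "C" "A"

# keep the top factories, relabel everything else
df$factory_new <- ifelse(df$factory %in% top, df$factory, "Other")
df$factory_new
# A, A, Other, Other, Other, Other, Other, C, Other
```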

akrun
