I have the following data:
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee", "coffee-croissant", "green-red-yellow", "green-red-blue", "green-red","black-white","black-white-purple")
data.frame(attributes)
           attributes 
1  apple-water-orange
2         apple-water
3        apple-orange
4              coffee
5    coffee-croissant
6    green-red-yellow
7      green-red-blue
8           green-red
9         black-white
10 black-white-purple
What I want is another column that assigns a category to each row, based on how similar the observations are to each other:
category <- c(1,1,1,2,2,3,3,3,4,4)
# Build the data frame directly; cbind() would coerce category to character
df <- data.frame(attributes, category)
df
           attributes category
1  apple-water-orange        1
2         apple-water        1
3        apple-orange        1
4              coffee        2
5    coffee-croissant        2
6    green-red-yellow        3
7      green-red-blue        3
8           green-red        3
9         black-white        4
10 black-white-purple        4
This is clustering in a broad sense, but I think most clustering methods work on numeric data only, and one-hot encoding has a lot of disadvantages (that's what I read on the internet).
Does anyone have an idea how to approach this task? Maybe some kind of word-matching approach?
It would also be great if I could adjust the degree of similarity (coarse vs. fine "clustering") via a parameter.
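To make the idea concrete, here is a rough sketch of the kind of thing I have in mind. It is only an assumption on my part, not a validated solution: split each string into words, compare the word sets with the Jaccard distance, and cut a hierarchical clustering tree at an adjustable height. The names `jaccard_dist` and `cut_height`, the choice of Jaccard distance, and the average linkage are all my own picks for illustration.

```r
# Example data from above
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee",
                "coffee-croissant", "green-red-yellow", "green-red-blue",
                "green-red", "black-white", "black-white-purple")

# Split every string into its set of words
tokens <- strsplit(attributes, "-", fixed = TRUE)

# Jaccard distance (hypothetical helper): 1 - shared words / all distinct words
jaccard_dist <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

# Pairwise distance matrix over all rows
d <- sapply(tokens, function(a) sapply(tokens, function(b) jaccard_dist(a, b)))

# Hierarchical clustering on the distances (average linkage is an assumption)
hc <- hclust(as.dist(d), method = "average")

# Tuning parameter: a lower cut height gives finer groups, a higher one coarser groups
cut_height <- 0.7
category <- cutree(hc, h = cut_height)

df <- data.frame(attributes, category)
df$category
# 1 1 1 2 2 3 3 3 4 4
```

With `cut_height <- 0.7` this reproduces the categories I want for the example, because rows that share no word have distance 1 and are never merged below that height. I do not know how well this scales or whether it behaves sensibly on messier data.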
Thanks in advance for any idea!