I’d like to bin (classify) a numeric vector in a certain way in R.
Assume a vector like the following:
> data = sample(1:500, 5000, replace = TRUE)
To bin this vector I define these classes:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
(0,10] (10,20] (20,30] (30,40] (40,50]
102 80 87 113 117
(50,60] (60,70] (70,80] (80,90] (90,100]
101 89 95 106 104
(100,200] (200,350] (350,480] (480,500]
1002 1492 1318 194
If I want 0 to be included I’d just have to add include.lowest = TRUE:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
[0,10] (10,20] (20,30] (30,40] (40,50]
102 80 87 113 117
(50,60] (60,70] (70,80] (80,90] (90,100]
101 89 95 106 104
(100,200] (200,350] (350,480] (480,500]
1002 1492 1318 194
In this example it makes no difference, because 0 doesn’t occur in the data at all. But if it did, e.g. 4 times, there would be 106 instead of 102 elements in class [0,10]:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE)
> table(data.cl)
data.cl
[0,10] (10,20] (20,30] (30,40] (40,50]
106 80 87 113 117
(50,60] (60,70] (70,80] (80,90] (90,100]
101 89 95 106 104
(100,200] (200,350] (350,480] (480,500]
1002 1492 1318 194
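To make the effect of include.lowest easier to see, here is a minimal sketch on a small toy vector. The vector x and the breaks brk are made up for illustration and are not the data above:

```r
# Hypothetical toy vector (not the data above); it contains the lowest break, 0.
x <- c(0, 0, 5, 10, 15, 20)
brk <- c(0, 10, 20)

# Default: the intervals are (0,10] and (10,20], so both zeros become NA.
default_cl <- cut(x, breaks = brk)
print(table(default_cl, useNA = "ifany"))

# include.lowest = TRUE closes the first interval, so [0,10] also holds the zeros.
lowest_cl <- cut(x, breaks = brk, include.lowest = TRUE)
print(table(lowest_cl))
```

Without include.lowest the zeros are not counted in any class; they end up as NA.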
There is another option for changing the class limits. The default for cut() is right = TRUE. If you change it to right = FALSE you’ll get:
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500),
+ include.lowest = TRUE, right = FALSE)
> table(data.cl)
data.cl
[0,10) [10,20) [20,30) [30,40) [40,50)
92 81 87 111 118
[50,60) [60,70) [70,80) [80,90) [90,100)
103 89 94 103 103
[100,200) [200,350) [350,480) [480,500]
1003 1497 1320 199
include.lowest now acts as “include.highest”: the last class becomes [480,500]. The price is that every class is now closed on the left and open on the right, so values lying exactly on a break move into the next class and some class counts change.
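As a small sketch of how right and include.lowest interact, consider values that sit exactly on the breaks. The vector x and the breaks brk are hypothetical and chosen only for illustration:

```r
# Hypothetical values that lie exactly on the breaks.
x <- c(0, 10, 20)
brk <- c(0, 10, 20)

# right = TRUE (the default): classes (0,10] and (10,20]; 0 becomes NA.
r_true <- as.character(cut(x, breaks = brk))

# right = FALSE: classes [0,10) and [10,20); now 20 becomes NA.
r_false <- as.character(cut(x, breaks = brk, right = FALSE))

# right = FALSE plus include.lowest = TRUE: the last class becomes [10,20].
r_false_incl <- as.character(cut(x, breaks = brk, right = FALSE,
                                 include.lowest = TRUE))

print(r_true)
print(r_false)
print(r_false_incl)
```

So with right = FALSE the value 10 moves from the first class into the second, which is the shift in class limits described above.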
But if I want the classification
> data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500))
> table(data.cl)
data.cl
(0,10] (10,20] (20,30] (30,40] (40,50]
102 80 87 113 117
(50,60] (60,70] (70,80] (80,90] (90,100]
101 89 95 106 104
(100,200] (200,350] (350,480] (480,500)
1002 1492 1318 194
to exclude 500, too, what shall I do?
Of course, one can say: “Just write data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 499)) instead of data.cl = cut(data, breaks = c(seq(0,100,by=10), 200, 350, 480, 500)), because you’re dealing with integer numbers.”
Well, that’s right, but what if that weren’t the case and I had floating-point values instead? How can I exclude 500 then?
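To illustrate why lowering the last break to 499 only works for integers, here is a minimal sketch with made-up non-integer values (x and brk_int are hypothetical, and only the last class is shown):

```r
# Hypothetical non-integer data near the upper limit.
x <- c(480.5, 499, 499.5, 500)

# The "lower the last break" workaround: last class is (480,499].
brk_int <- c(480, 499)
cl <- cut(x, breaks = brk_int)

# 500 is excluded as intended, but 499.5 is dropped as well.
print(data.frame(x = x, class = cl))
```

The value 499.5 should still belong to the last class, but it becomes NA together with 500.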