I'm (extremely) new to MLR3 and am using it to model flight delays. I have some numerical variables, like Z, and some categorical variables, like X. Let's say I want a very simple model that predicts delays from both X and Z. In theory, we would usually encode the X factor into dummy variables and then fit a linear regression. MLR3 seems to do this by itself, though: when I create a task and train the learner, I can see that it has produced coefficients for the different factor levels, i.e. it treats them as separate dummy variables.
However, I see that many other programmers still use one-hot encoding to turn their categorical variables into dummies first. So my question is: is one-hot encoding necessary, or does MLR3 do it for you?
edit: Below is an example of my data. My predictor variables are X (categorical) and Z (numerical). Y is the dependent variable and is numerical.
 Y    X    Z
-3    M    7.5
 5    W    9.2
 10   T    3.1
 4    T    2.2
 -13  M    10.1
 2    M    1.7
 4    T    4.5
This is the code I use
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
task <- TaskRegr$new('apples', backend=df2, target = 'Y')
set.seed(38)
train_set <- sample(task$nrow, 0.99 * task$nrow)
test_set <- setdiff(seq_len(task$nrow), train_set)
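glrn_lm <- lrn('regr.lm')  # definition was missing from the snippet; a plain regr.lm learner is assumed here
glrn_lm$train(task, row_ids = train_set)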
glrn_lm$predict(task, row_ids = test_set)$score()
summary(lm(formula = task$formula(), data = task$data()))
And the output of that last line will be something like:
Call:
lm(formula = task$formula(), data = task$data())
Residuals:
   Min     1Q Median     3Q    Max 
-39.62  -8.71  -4.77   0.27 537.12 
Coefficients:
                                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                4.888e+00  3.233e+00   1.512 0.130542    
XT                                         4.564e-03  3.776e-04  12.087  < 2e-16 ***
XW                                         4.564e-03  3.776e-04  12.087  < 2e-16 ***
Z                                         -4.259e+00  6.437e-01  -6.616 3.78e-11 ***
 
(The numbers up here are all way off - please don't mind that)
So as you can see, it derives two new variables called XT and XW, denoting level T of X and level W of X. I assume that, as in dummy coding, M is the reference level here. So, like I said earlier, regr.lm already seems to be doing the dummy coding for us. Is that really the case?
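To make what I'm seeing concrete, here is a minimal sketch in base R only (no mlr3), using the toy rows from the table above. The name `df` and the explicit `factor()` call are my own assumptions for illustration. It inspects the design matrix that `lm()` builds from the formula, which is where the XT and XW columns appear to come from:

```r
# Toy data copied from the example table above (df is a hypothetical name)
df <- data.frame(
  Y = c(-3, 5, 10, 4, -13, 2, 4),
  X = factor(c("M", "W", "T", "T", "M", "M", "T")),
  Z = c(7.5, 9.2, 3.1, 2.2, 10.1, 1.7, 4.5)
)

# Design matrix that lm() constructs internally from the formula
mm <- model.matrix(Y ~ X + Z, data = df)
print(colnames(mm))

# Contrast scheme used for the factor (treatment coding by default
# for unordered factors, with the first level as the reference)
print(contrasts(df$X))

# The fitted coefficients carry the same names as the design matrix columns
fit <- lm(Y ~ X + Z, data = df)
print(names(coef(fit)))
```

If I read this correctly, the design matrix has columns for XT and XW but none for XM, which matches treatment (dummy) coding with M as the reference level. That would suggest the encoding happens inside `lm()` when it is given a factor column, rather than in a separate preprocessing step, but I'd like confirmation that this is what is going on when the learner is trained through MLR3.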
 
    