I have a large dataset of forest fire records, and I want to predict when a fire ignites. Ignition is very rare: it happens 290 times out of 620 000.
A tibble: 62,905 x 13
   amplitude polarity DEM_avg    DC   DMC   DSR  FFMC    Pd    RH  TEMP    WS tree_cover fire
       <dbl>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl> <fct>
 1     -37.8        0    165.  269.  21.9 0.607  84.0     0  65.1  290.  4.36          8 0
 2     -68.1        0    303.  168.  44.5  1.41  89.9     0  46.6  296. 0.692       34.7 0
 3     -54.3        0    332.  168.  44.5  1.41  89.9     0  46.6  296. 0.692       35.8 1
 4     -108.        0    338.  168.  44.5  1.41  89.9     0  46.6  296. 0.692       30.3 0
 5     -60.3        0    374.  171.  35.7  2.30  88.9   0.3  51.7  295.  4.01       29.6 1
 6     -82.8        0    48.2  133.  18.4 0.210  84.9     0  65.1  289.  1.35       18.7 0
 7     -99.6        0    299.  219.  42.6  2.09  90.8     0  34.2  297.  1.42          7 1
 8     -98.1        0    116.  153.  44.7 0.988  89.0     0  41.3  298. 0.235       32.6 0
I tried to use SMOTE to balance my highly imbalanced dataset, with the changes suggested by StupidWolf. This is what I ran:
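library(readr)      # read_csv(), cols(), col_factor()
library(tidyverse)  # %>% pipe
library(caret)      # createDataPartition()
library(DMwR)       # SMOTE()

# Read the data, parsing the response `fire` as a two-level factor
data <- read_csv("data/fire2018.csv",
                 col_types = cols(fire = col_factor(levels = c("0", "1"))))

# 80/20 train/test split, sampled within each level of `fire`
training.samples <- data$fire %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- data[training.samples, ]
test.data  <- data[-training.samples, ]

# Oversample the minority class on the training set only
SMOTE(fire ~ amplitude + polarity_dummy + DEM_avg + DC + DMC + DSR + FFMC +
        Pd + RH + T + VPD + WS + tree_cover,
      data = data.frame(train.data), perc.over = 600, perc.under = 100)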
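For reference, this is what I expect the call to produce, as far as I understand the DMwR documentation: `perc.over = 600` should create 6 synthetic cases per minority case, and `perc.under = 100` should sample one majority case per synthetic case. The sketch below only does that arithmetic. The helper `smote_sizes` is a hypothetical name of my own, and the minority count of 232 is an assumption (roughly 80% of 290), not a number taken from my actual training split.

```r
# Expected class sizes after DMwR::SMOTE, under my reading of its documentation.
# `smote_sizes` is a hypothetical helper, not part of DMwR.
smote_sizes <- function(n_minority, perc.over, perc.under) {
  # Synthetic minority cases: floor(perc.over / 100) per original minority case
  n_synthetic <- as.integer(perc.over / 100) * n_minority
  # Majority cases kept: perc.under percent of the number of synthetic cases
  n_majority <- as.integer((perc.under / 100) * n_synthetic)
  c(minority = n_minority + n_synthetic, majority = n_majority)
}

# Assumed minority count in the training set (about 80% of 290)
sizes <- smote_sizes(n_minority = 232, perc.over = 600, perc.under = 100)
print(sizes)
```

So the result should be a much smaller and roughly balanced data frame, not an error.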
However, when I use SMOTE from the DMwR package, I now get the following error:
Error in factor(newCases[, a], levels = 1:nlevels(data[, a]), labels = levels(data[,  : 
  invalid 'labels'; length 0 should be 1 or 2
In addition: Warning messages:
1: In if (class(data[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
2: In smote.exs(data[minExs, ], ncol(data), perc.over, k) :
  NAs introduced by coercion
3: In smote.exs(data[minExs, ], ncol(data), perc.over, k) :
  NAs introduced by coercion
I have looked at several suggested solutions. One of them recommended converting the variables to numeric and factor, but my variables already have the right types: the dependent variable is a factor with 2 levels, the independent variables are numeric, and none of the variables contains NA values. Applying that suggestion did not help; I got a similar error.
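The checks I describe above can be sketched roughly as follows. This is a minimal, self-contained illustration in base R on a small made-up data frame: `toy`, its values, and the shortened formula `form` are hypothetical stand-ins for `data.frame(train.data)` and my real formula. The formula deliberately names one variable (`VPD`) that is not in `toy`, to show how a mismatch between formula and column names would show up.

```r
# Hypothetical stand-in for data.frame(train.data); values are made up.
toy <- data.frame(
  amplitude  = c(-37.8, -68.1, -54.3, -108.0),
  DEM_avg    = c(165, 303, 332, 338),
  tree_cover = c(8, 34.7, 35.8, 30.3),
  fire       = factor(c("0", "0", "1", "0"), levels = c("0", "1"))
)

# Shortened, hypothetical formula; VPD is intentionally absent from `toy`.
form <- fire ~ amplitude + DEM_avg + VPD + tree_cover

# 1. Class of every column, collapsed so multi-class columns are visible
col_classes <- sapply(toy, function(x) paste(class(x), collapse = "/"))
print(col_classes)

# 2. Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# 3. Variables used in the formula that do not exist in the data
missing_vars <- setdiff(all.vars(form), names(toy))
print(missing_vars)

# 4. Levels and counts of the response
print(table(toy$fire))
```

Running the same four checks on the real training data would confirm whether every column has a single class of `numeric` or `factor`, whether any NA values remain, and whether every name in the formula matches a column.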
 
    