Update:
The following code should reproduce the problem (no seed is set, so exact counts will vary):
someFrameA = data.frame(label="A", amount=rnorm(10000, 100, 20))
someFrameB = data.frame(label="B", amount=rnorm(1000, 50000, 20))
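wholeFrame = rbind(someFrameA, someFrameB)
wholeFrame$label = factor(wholeFrame$label)  # needed on R >= 4.0, where strings are no longer factors by default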
fit <- e1071::naiveBayes(label ~ amount, wholeFrame)
wholeFrame$predicted = predict(fit, wholeFrame)
nrow(subset(wholeFrame, predicted != label))
In my case, this gave 243 misclassifications.
Note these rows (row number, label, amount, prediction):
10252     B 50024.81895         A
2955      A   100.55977         A
10678     B 50010.26213         B
While the amounts of the two "B" rows differ by only about 14.6, the classification changes. It's curious that the posterior probabilities for rows like this are so close:
> predict(fit, wholeFrame[10683, ], type="raw")
             A         B
[1,] 0.5332296 0.4667704
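For comparison, here is a minimal sketch of the posterior I would expect from a plain Gaussian naive Bayes for row 10252. It uses only base R and assumes the fitted means and standard deviations are close to the parameters passed to rnorm() above; the names prior, mu and sigma are mine, not anything from e1071.

```r
# Hand-computed Gaussian naive Bayes posterior for one "B" row of the toy data.
# Assumption: the generating parameters stand in for the fitted means/sds.
amount <- 50024.81895                        # row 10252 above

prior <- c(A = 10000, B = 1000) / 11000      # class frequencies in wholeFrame
mu    <- c(A = 100,   B = 50000)             # means passed to rnorm()
sigma <- c(A = 20,    B = 20)                # sds passed to rnorm()

# Class-conditional normal densities at the observed amount
likelihood <- dnorm(amount, mean = mu, sd = sigma)

# Bayes' rule: prior times likelihood, normalised over the classes
posterior <- prior * likelihood / sum(prior * likelihood)

print(likelihood)   # the "A" density underflows to exactly 0 in double precision
print(posterior)
```

Under these assumptions the "A" term contributes nothing, so I would not expect a posterior anywhere near 0.53 for "A".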
Original Question:
I am trying to classify some bank transactions using the transaction amount. I had many other text-based features in my original model, but noticed something fishy when using just the numeric one.
> head(trainingSet)
                 category amount
1                   check 688.00
2 non-businesstransaction   2.50
3 non-businesstransaction  36.00
4 non-businesstransaction 243.22
5                 payroll 302.22
6 non-businesstransaction  16.18
fit <- e1071::naiveBayes(category ~ amount, data=trainingSet)
fit
Naive Bayes Classifier for Discrete Predictors
Call: naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
                bankfee                   check       creditcardpayment       e-commercedeposit               insurance 
            0.029798103             0.189613233             0.054001459             0.018973486             0.008270494 
      intrabanktransfer             loanpayment              mcapayment non-businesstransaction                     nsf 
            0.045001216             0.015689613             0.011432741             0.563853077             0.023351982 
                  other                 payroll              taxpayment          utilitypayment 
            0.003405497             0.014838239             0.005716371             0.016054488 
Conditional probabilities:
                         amount
Y                               [,1]        [,2]
  bankfee                  103.58490   533.67098
  check                    803.44668  2172.12515
  creditcardpayment        819.27502  2683.43571
  e-commercedeposit         42.15026    59.24806
  insurance                302.16500   727.52321
  intrabanktransfer       1795.54065 11080.73658
  loanpayment              308.43233   387.71165
  mcapayment               356.62755   508.02412
  non-businesstransaction  162.41626   951.65934
  nsf                       44.92198    78.70680
  other                   9374.81071 18074.36629
  payroll                 1192.79639  2155.32633
  taxpayment              1170.74340  1164.08019
  utilitypayment           362.13409  1064.16875
According to the e1071 docs, the first column under "Conditional probabilities" is the mean of the numeric variable, and the second is the standard deviation. These means and standard deviations are correct, as are the a-priori probabilities.
So, it's troubling that this row:
> thatRow
   category   amount
40    other 11268.53
receives these posteriors:
> predict(fit, newdata=thatRow, type="raw")
          bankfee       check creditcardpayment e-commercedeposit    insurance intrabanktransfer   loanpayment    mcapayment
[1,] 4.634535e-96 7.28883e-06      9.401975e-05         0.4358822 4.778703e-51        0.02582751 1.103762e-174 1.358662e-101
     non-businesstransaction       nsf       other      payroll   taxpayment utilitypayment
[1,]            1.446923e-29 0.5364704 0.001717378 1.133719e-06 2.059156e-18   2.149142e-24
Note that "nsf" scores about 300 times higher than "other". This transaction has an amount of about 11.3k dollars; if it followed the "nsf" distribution, it would be over 100 standard deviations from the mean. Meanwhile, "other" transactions have a sample mean of about 9k dollars with a large standard deviation, so I would think this transaction is much more probable as an "other". While "nsf" is favored by the prior probabilities, the priors aren't so different as to outweigh that tail observation, and there are plenty of viable candidates besides "other" as well.
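To make the tail argument concrete, here is a rough sketch that recomputes the z-scores and the unnormalised Gaussian scores by hand from the printed table. It uses only base R; the data frame tab and its column names are my own, and the numbers are simply copied from the printout above.

```r
# Priors, means and sds copied from the printed naiveBayes fit above.
tab <- data.frame(
  prior = c(0.029798103, 0.189613233, 0.054001459, 0.018973486, 0.008270494,
            0.045001216, 0.015689613, 0.011432741, 0.563853077, 0.023351982,
            0.003405497, 0.014838239, 0.005716371, 0.016054488),
  mean  = c(103.58490, 803.44668, 819.27502, 42.15026, 302.16500,
            1795.54065, 308.43233, 356.62755, 162.41626, 44.92198,
            9374.81071, 1192.79639, 1170.74340, 362.13409),
  sd    = c(533.67098, 2172.12515, 2683.43571, 59.24806, 727.52321,
            11080.73658, 387.71165, 508.02412, 951.65934, 78.70680,
            18074.36629, 2155.32633, 1164.08019, 1064.16875),
  row.names = c("bankfee", "check", "creditcardpayment", "e-commercedeposit",
                "insurance", "intrabanktransfer", "loanpayment", "mcapayment",
                "non-businesstransaction", "nsf", "other", "payroll",
                "taxpayment", "utilitypayment")
)

amount <- 11268.53   # thatRow$amount

# Distance from each class mean, in units of that class's standard deviation
z <- setNames((amount - tab$mean) / tab$sd, rownames(tab))

# Unnormalised score: prior times normal density; then normalise to a posterior
score     <- setNames(tab$prior * dnorm(amount, tab$mean, tab$sd), rownames(tab))
posterior <- score / sum(score)

print(round(z[c("nsf", "other")], 2))
print(sort(posterior, decreasing = TRUE))
```

The sorted output makes it easy to compare a plain Gaussian calculation against the predict() posteriors above.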
I was assuming that this package just evaluated the normal(mu=samplemean, sd=samplestdev) pdf and multiplied by that value, but is that not the case? I can't quite figure out how to view the source.
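On viewing the source: since naiveBayes() dispatches to S3 methods, something along these lines should print the relevant functions. This is a sketch that assumes e1071 is installed; it only uses standard R introspection tools.

```r
# Ways to read the source of the S3 methods behind naiveBayes().
# Assumes the e1071 package is installed.
library(e1071)

methods(class = "naiveBayes")        # lists the S3 methods registered for the class
getAnywhere("predict.naiveBayes")    # prints the unexported predict method
e1071:::naiveBayes.default           # the fitting routine named in the printout's Call

# The formal arguments of the predict method may be worth a look as well
args(e1071:::predict.naiveBayes)
```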
Datatypes seem to be fine too:
> class(trainingSet$amount)
[1] "numeric"
> class(trainingSet$category)
[1] "factor"
The "Naive Bayes Classifier for Discrete Predictors" header in the printout seems odd, since this is a continuous predictor, but I assume this package can handle continuous predictors.
I had similar results with the klaR package. Maybe I need to set the kernel option there?
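For reference, this is roughly how I understand the kernel option would be set in klaR, sketched on the toy data from the update. It assumes klaR is installed; the seed is arbitrary, and the object names (toy, fitGauss, fitKernel) are just for illustration.

```r
# Sketch: klaR's naive Bayes with Gaussian vs. kernel density estimates.
# Assumes klaR is installed; the toy data mirrors the update at the top.
library(klaR)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
toy <- rbind(
  data.frame(label = "A", amount = rnorm(10000, 100, 20)),
  data.frame(label = "B", amount = rnorm(1000, 50000, 20))
)
toy$label <- factor(toy$label)

fitGauss  <- NaiveBayes(label ~ amount, data = toy)                    # normal densities
fitKernel <- NaiveBayes(label ~ amount, data = toy, usekernel = TRUE)  # kernel densities

# predict() returns a list with $class and $posterior
predGauss  <- predict(fitGauss,  toy["amount"])
predKernel <- predict(fitKernel, toy["amount"])

# Misclassification counts under each density choice
sum(predGauss$class  != toy$label)
sum(predKernel$class != toy$label)
```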
