I am trying to predict sales for a set of retail stores. Here are my variables (apart from ZipZone, the specific values are largely irrelevant to this question):
storeId    sales    meanTemperature     meanHumidity    ZipZone
1          1350     56.78               61.12           0
2          1230     59.90               45.67           3
3          8476     63.54               49.87           3
4          4357     62.12               65.09           4
5          2314     69.78               68.99           4
6          7812     74.90               59.78           4
7          1350     56.78               61.12           6
8          1230     59.90               45.67           6
9          8476     63.54               49.87           6
10         4357     62.12               65.09           7
11         2314     69.78               68.99           7
12         7812     74.90               59.78           8
...
There are 50 unique storeId values, i.e. fifty stores. I built a regression model of the form:
model <- lm(sales ~ meanTemperature * meanHumidity + ZipZone, data = inSample)
I'm currently testing this model's in-sample and out-of-sample predictive performance, so I've split the data into inSample and outSample data frames (the former has 40 stores; the latter has 10). The issue is that several ZipZone values contain only one store. For example, inSample has store 1 (the only store in ZipZone 0), while outSample has store 12 (the only store in ZipZone 8). When I run the following:
pred <- predict(model, newdata = outSample)
I get the following error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
factor ZIPzone has new levels 8
I assume this is because inSample doesn't have a store in ZipZone 8, while outSample does. How can I avoid this problem?
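
One way to check that assumption is to compare the factor levels stored in the fitted model with the levels present in outSample. The following is only a minimal sketch built from the twelve rows shown above: the 11/1 split is a stand-in for the real 40/10 split, ZipZone is assumed to be a factor, and the names stores and newLevels are illustrative.

```r
# Rebuild the twelve example rows from the table above (ZipZone assumed to be a factor).
stores <- data.frame(
  storeId         = 1:12,
  sales           = c(1350, 1230, 8476, 4357, 2314, 7812,
                      1350, 1230, 8476, 4357, 2314, 7812),
  meanTemperature = c(56.78, 59.90, 63.54, 62.12, 69.78, 74.90,
                      56.78, 59.90, 63.54, 62.12, 69.78, 74.90),
  meanHumidity    = c(61.12, 45.67, 49.87, 65.09, 68.99, 59.78,
                      61.12, 45.67, 49.87, 65.09, 68.99, 59.78),
  ZipZone         = factor(c(0, 3, 3, 4, 4, 4, 6, 6, 6, 7, 7, 8))
)

# Stand-in split: store 12 (the only store in ZipZone 8) is held out.
inSample  <- stores[stores$storeId != 12, ]
outSample <- stores[stores$storeId == 12, ]

model <- lm(sales ~ meanTemperature * meanHumidity + ZipZone, data = inSample)

# lm() records the factor levels it actually saw during fitting in model$xlevels.
# Any level in outSample that is missing from that list triggers the error.
newLevels <- setdiff(unique(as.character(outSample$ZipZone)),
                     model$xlevels$ZipZone)
print(newLevels)  # "8"

# Stores in outSample whose ZipZone the model has never seen.
print(outSample$storeId[outSample$ZipZone %in% newLevels])  # 12
```

On this toy split the only unseen level is "8", which matches the level named in the error message.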
