SAS' MISSOVER for data input in R

Question

I've got a data file A with 7 columns, no missing values, to which I've unix-joined a data file B that has 28 fields. The result file is C. If no match is found in B, then the output row in C only has 7 columns. If there is a match in B, then the output row in C has 35 columns. I've kicked around join's -e option to fill the 28 missing fields, but without success.

What I'm trying to do is duplicate SAS's MISSOVER input statement in R. For example the following code works perfectly:

 dat <- textConnection('x1,x2,x3,x4
 1,2,"present","present"
 3,4
 5,6')

 df <- read.csv(dat, sep=',' , header=T , 
     colClasses = c("numeric" , "numeric", "character", "character"))

 > df
   x1 x2      x3      x4
 1  1  2 present present
 2  3  4                
 3  5  6

But when I try to load my C file, I get the following error (this is with fill=TRUE rather than fill=T):

 df <- read.table( 'C.tab' , header=T , sep='\t', fill=TRUE,
                   colClasses = c(rep('numeric',7),rep('character',28)))


 Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
   line 1 did not have 35 elements

The first line (second row in C, after the header), does indeed have only those 7 fields from A. In SAS I'd use the MISSOVER statement to set all those trailing missing fields to some missing value. How can I do that in R? Thanks.

I can't replicate this, which is made particularly difficult because you haven't supplied an example of a `C.tab` file that creates this error. — Joshua Ulrich, Sep 18 '13 at 18:03
A vague description doesn't constitute a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — Joshua Ulrich, Sep 18 '13 at 18:15
Probably one of those names or addresses has `M'Cusik` or `O'Toole` or somesuch. — IRTFM, Sep 18 '13 at 18:18
I'd be glad to copy-paste an excerpt of the data, but that's probably not a good idea in this case. — user2105469, Sep 18 '13 at 18:25
It doesn't _have_ to be real customer names and addresses... it just has to represent your use case. — Joshua Ulrich, Sep 18 '13 at 18:30

IRTFM · Accepted Answer · 2015-11-24T19:49:06.753

The fill=TRUE setting of read.table (or its derivative cousin read.csv) is probably what you are looking for.

df <- read.table(dat, sep=',' , header=T , fill=TRUE,
                 colClasses = c("numeric" , "numeric", "character", "character"))
df
#   x1 x2      x3      x4
# 1  1  2 present present
# 2  3  4                
# 3  5  6

The default for fill is TRUE in read.csv. Your original code used fill=T, which is not necessarily the same thing: T is an ordinary variable that can be reassigned, unlike the reserved word TRUE, so the error suggests you may have an object named T in your workspace. The default for read.table is fill=!blank.lines.skip, and since blank.lines.skip defaults to TRUE, fill normally defaults to FALSE in read.table.
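
The difference between the two defaults can be checked on the sample data from the question. This is only a minimal sketch: the object names (`txt`, `csv_df`, `no_fill`, `filled_df`) are made up for illustration, and the data is passed through the `text` argument rather than a textConnection.

```r
# Same sample data as in the question; the last two rows are short.
txt <- 'x1,x2,x3,x4
1,2,"present","present"
3,4
5,6'

# read.csv defaults to fill = TRUE, so the short rows are padded.
csv_df <- read.csv(text = txt)

# read.table leaves fill at FALSE by default, so the short rows raise an error.
# tryCatch captures the error message instead of stopping the script.
no_fill <- tryCatch(
  read.table(text = txt, sep = ",", header = TRUE),
  error = function(e) conditionMessage(e)
)

# With fill = TRUE set explicitly, read.table pads the short rows as well.
filled_df <- read.table(text = txt, sep = ",", header = TRUE, fill = TRUE)

print(no_fill)
print(filled_df)
```

If `no_fill` comes back as a character string, the call without fill failed, which is the behaviour described above.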

Your edited question suggests you have other problems in your character fields. The usual suspects are unmatched quotes, or octothorpes (#), which read.table treats as comment characters and which therefore effectively terminate the line. Try this instead:

df <- read.table( 'C.tab' , header=T , sep='\t', fill=TRUE, 
              quote="",
              comment.char="",
              colClasses = c(rep('numeric',7),rep('character',28)))
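
To see what the comment character does to a field, here is a small sketch on a throwaway tab-separated file. The file contents, column names and object names are invented for illustration and are not taken from the real C.tab.

```r
# Write a tiny tab-separated file; the address in row 1 contains a '#'.
tab_file <- tempfile(fileext = ".tab")
writeLines(c(
  "id\tname\taddr",
  "1\tSmith\tApt #4, 12 Main St",
  "2\tJones"                        # short row, to be padded by fill = TRUE
), tab_file)

col_types <- c("numeric", "character", "character")

# Default comment.char = "#": everything from the '#' onwards is discarded.
truncated <- read.table(tab_file, header = TRUE, sep = "\t", fill = TRUE,
                        quote = "", colClasses = col_types)

# comment.char = "": the '#' is kept as ordinary text.
literal <- read.table(tab_file, header = TRUE, sep = "\t", fill = TRUE,
                      quote = "", comment.char = "", colClasses = col_types)

print(truncated$addr)
print(literal$addr)
```

Comparing the two addr columns shows whether a '#' is silently shortening lines, which is one way a row can end up with fewer fields than expected.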

If you are having difficulty with errors related to varying numbers of items per line, count.fields can be very useful. It accepts several of the same parameters as read.table (sep, quote, skip, blank.lines.skip and comment.char), though not header, so the header line is counted like any other line. If you have a large number of input lines, it helps to wrap the call to count.fields in a table call:

length_tbl <- table( count.fields( 'C.tab' , sep='\t', 
                                    quote="",
                                    comment.char="")
                     )

You can then experiment with different options. Once you know what you are looking for, you can also identify the line numbers that are causing problems by wrapping a which call around count.fields:

bad_lines <- which( count.fields( 'C.tab' , sep='\t', 
                                    quote="",
                                    comment.char="")
                     != 7  # or whatever is the "correct" length
                     )
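
The two count.fields calls can be tried end to end on a small stand-in file. This is a sketch under assumptions: the file below has 2 or 5 fields per line in place of the real 7 or 35, and all names are illustrative.

```r
# Stand-in for C.tab: unmatched rows have 2 fields, matched rows have 5.
tab_file <- tempfile(fileext = ".tab")
writeLines(c(
  "a\tb\tc\td\te",        # header line, 5 fields
  "1\t2",                 # unmatched row, 2 fields
  "3\t4\tx\ty\tz",        # matched row, 5 fields
  "5\t6\tx #1\ty"         # malformed row, 4 fields
), tab_file)

# Count fields per line, treating quotes and '#' as ordinary characters.
n_fields <- count.fields(tab_file, sep = "\t", quote = "", comment.char = "")

# Tabulate how many lines have each field count.
length_tbl <- table(n_fields)

# Line numbers (counting the header as line 1) with an unexpected count.
bad_lines <- which(!n_fields %in% c(2, 5))

print(length_tbl)
print(bad_lines)
```

Using `%in%` with the set of acceptable lengths, instead of a single `!=` comparison, flags only the lines that match neither the short nor the full row width.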

They have `fill =T`, which (as I know you do know) is not the same. — IRTFM, Sep 18 '13 at 18:04
I've put TRUE in the code, same result. I was expecting `fill=TRUE` to do the trick, set `comment.char=""` just in case addresses contained '#'. If the consensus is that fill is the option to use, then it seems like I'm looking for special characters. — user2105469, Sep 18 '13 at 18:12
The "special characters" to worry about are `'`, `"`, and `#`. Look at the count.fields function to get a handle on that. I find that `table(count.fields(...))` is a compact way of determining the effect of various combinations of read-parameters. — IRTFM, Sep 18 '13 at 18:16
@DWin You could put `count.fields` suggestion into your answer. — Marek, Sep 19 '13 at 07:53
