I am reading a UTF-8 encoded XML file in R using xmlParse and xpathSApply from Duncan Temple Lang's XML package. Text in several languages does not come through correctly when I read it into a data frame. I am currently on Windows, but this R script will be run on different machines, so I need a solution that works on all of them. A sample XML file:
<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
    <L1 lang="zh-TW">使用者識別碼</L1>
    <L2 lang="vi-VN">ID người dùng</L2>
</CATALOG>
The text values are displayed in an escaped or degraded form:
<U+4F7F><U+7528><U+8005><U+8B58><U+5225><U+78BC> and ID nguo`i du`ng, respectively. Note that this is just a sample; the actual XML file contains text in different languages.
Code Snippet:
library(XML)
library(plyr)
# Return a list holding the node's name and its own text (text of child elements excluded).
getValues <- function(x) {
  List <- list()
  if (inherits(x, "XMLInternalElementNode")) {
    if (length(xmlValue(x, recursive = FALSE)) != 0) {
      List[[length(List) + 1]] <- c(node = xmlName(x), value = xmlValue(x, recursive = FALSE))
    }
  }
  return(List)
}
visitNode <- function(node, xpath = "//node()") {
  if (is.null(node)) {
    return()
  }
  result <- xpathSApply(node, path = xpath, getValues)
  if (is.list(result)) {
    # Bind the name/value pairs into one data frame and assign it to the global dt.
    dt <<- rbind.fill(lapply(result, function(y) {
      as.data.frame(do.call(rbind, y), stringsAsFactors = FALSE)
    }))
  }
}
xtree <- xmlParse("C:/Users/I308232/Desktop/test.xml")
root <- xmlRoot(xtree)
dt <- data.frame(node = NA, value = NA)
visitNode(root)
dt
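To tell whether the strings in dt are actually damaged or are only being printed with escapes, a check along these lines might help. It is a minimal sketch using base R only; inspect_strings is a hypothetical helper name (not part of the XML package), and the two sample values are rebuilt from \u escapes as an assumption about what the file contains, so the script itself stays ASCII-only.

```r
# Hypothetical helper: summarise how R stores each string.
inspect_strings <- function(x) {
  data.frame(
    declared = Encoding(x),               # encoding mark R attached to the string
    chars    = nchar(x, type = "chars"),  # number of characters
    bytes    = nchar(x, type = "bytes"),  # number of bytes in memory
    stringsAsFactors = FALSE
  )
}

# Sample values from the XML above, written as \u escapes (assumed contents).
zh <- "\u4f7f\u7528\u8005\u8b58\u5225\u78bc"
vi <- "ID ng\u01b0\u1eddi d\u00f9ng"

info <- inspect_strings(c(zh, vi))
print(info)

# Code points of the first string, formatted like the <U+XXXX> escapes in the printed output.
codepoints <- sprintf("U+%04X", utf8ToInt(zh))
print(codepoints)
```

Running inspect_strings(dt$value) on the real data and comparing it with the result for the sample values should show whether the characters survived parsing. If the code points match the <U+XXXX> escapes, the data are probably intact and only the console display is affected.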
sessionInfo() output:
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_Australia.1252        LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] RODBC_1.3-10 plyr_1.8.1   XML_3.98-1.1
loaded via a namespace (and not attached):
[1] Rcpp_0.11.3 tools_3.1.2
Any help will be appreciated. Thanks.