Identifying unique values is straightforward when the data is well behaved. Here I am looking for an approach to get a list of approximately unique values from a character vector.
Let x be a vector containing slightly different names for the same entity, e.g. "Kentucky loader" may appear as "Kentucky load", as "Kentucky loader (additional info)", or as something similar.
x <- c("Kentucky load",
       "Kentucky loader (additional info)",
       "CarPark Gifhorn (EAP)",
       "Car Park  Gifhorn (EAP) new 1.5.2012",
       "Center Kassel (neu 01.01.2014)",
       "HLLS Bremen (EAP)",
       "HLLS Bremen (EAP) new 06.2013",
       "Hamburg total sum (abc + TBL)",
       "Hamburg total (abc + TBL) new 2012")
What I want to get out is something like:
c("Kentucky loader",
  "Car Park Gifhorn (EAP)",
  "Center Kassel (neu 01.01.2014)",
  "HLLS Bremen (EAP)",
  "Hamburg total (abc + TBL)")
Idea
- Calculate some similarity measure between all strings (e.g. Levenshtein distance)
- Use a longest common substring method
- Somehow :( decide which strings belong together based on this information.
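To make the idea concrete, here is a rough sketch of what I have in mind, using only base R (adist for the generalized Levenshtein distance, hclust and cutree for the grouping). The normalization by the longer string, the average linkage, the cut height of 0.5 and the rule of keeping the shortest string per group are all arbitrary assumptions of mine, not an established method:

```r
x <- c("Kentucky load",
       "Kentucky loader (additional info)",
       "CarPark Gifhorn (EAP)",
       "Car Park  Gifhorn (EAP) new 1.5.2012",
       "Center Kassel (neu 01.01.2014)",
       "HLLS Bremen (EAP)",
       "HLLS Bremen (EAP) new 06.2013",
       "Hamburg total sum (abc + TBL)",
       "Hamburg total (abc + TBL) new 2012")

# Pairwise generalized Levenshtein distances (utils::adist, base R)
d <- adist(x, ignore.case = TRUE)

# Divide by the length of the longer string of each pair,
# so that every distance lies between 0 and 1
len <- nchar(x)
d_norm <- d / outer(len, len, pmax)

# Hierarchical clustering on the normalized distances
hc <- hclust(as.dist(d_norm), method = "average")

# Cut the tree at an arbitrary height (assumption: 0.5, needs tuning)
groups <- cutree(hc, h = 0.5)
clusters <- split(x, groups)

# Pick one representative per group (assumption: the shortest string)
representatives <- vapply(clusters,
                          function(s) s[which.min(nchar(s))],
                          character(1))
representatives
```

The result depends heavily on the cut height: the two Kentucky strings, for example, differ by 20 edits out of 33 characters (about 0.61), so they are unlikely to be grouped at a height of 0.5, while a larger height risks merging unrelated names. Choosing this threshold is exactly the part I am unsure about.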
But I guess this is a standard task (for R users who regularly work with "dirty" data), so I assume there is a set of standard approaches to it.
Does anyone have a hint, or is there a package that does this?
 
     
    