De-duplication is the process of removing duplicated or redundant data from a database.
Questions tagged [deduplication]
139 questions
                    
32 votes, 1 answer

        Remove duplicate documents from a search in Elasticsearch
I have an index with many documents that share the same value for a given field. I want to deduplicate on this field.
Aggregations only give me counts. I would like a list of documents.
My index :
Doc 1 {domain: 'domain1.fr', name: 'name1',…
        
Bastien D (1,395 reputation)

20 votes, 3 answers

        Java 8 String deduplication vs. String.intern()
I am reading about the feature in Java 8 update 20 for String deduplication (more info) but I am not sure if this basically makes String.intern() obsolete. 
I know that this JVM feature needs the G1 garbage collector, which might not be an option…
        
Hilikus (9,954 reputation)

12 votes, 1 answer

        Remove duplicates from list based on multiple fields or columns
I have a list of type MyClass
public class MyClass
{
   public string prop1 { get; set; }
   public int prop2 { get; set; }
   public string prop3 { get; set; }
   public int prop4 { get; set; }
   public string prop5 { get; set; }
   public string prop6 { get; set; }
   ....
}
This list will have…
        
user20358 (14,182 reputation)

12 votes, 3 answers

        sbt assembly error - deduplicate: different file contents found in the following
I get the following error when I run ./sbt assembly on my Scala project. I first saw it after adding these dependencies to my build.sbt. I can compile and run my code.
libraryDependencies  ++= Seq(
  "org.scalanlp" % "breeze_2.10" % "0.7",
 …
        
Soumya Simanta (11,523 reputation)

10 votes, 3 answers

        What are some of the best hashing algorithms to use for data integrity and deduplication?
I'm trying to hash a large number of files with binary data inside of them in order to:
(1) check for corruption in the future, and
(2) eliminate duplicate files (which might have completely different names and other metadata).
I know about md5 and…
        
King Spook (381 reputation)

6 votes, 2 answers

        Java: a time-delayed queue that de-dupes
G'day everyone,
I have a system (the source) that needs to notify another system (the target) asynchronously whenever certain objects change.  The twist is that the source system may mutate a single object many times in a short interval (updates are…
        
Peter (519 reputation)

6 votes, 3 answers

        How to store bidirectional relationships
I am writing some code to find duplicate customer details in a database. I'll be using Levenshtein distance. 
However, I am not sure how to store the relationships. I use databases all the time but have never come across this situation and wondered…
        
alj (2,839 reputation)

6 votes, 3 answers

        Email deduplication
Is it true that e-mails can be deduplicated using just some of their headers, since according to the RFC their Message-ID should be unique?
Is there any way to calculate the chance of a single email being missed by the deduplication method below (sha512…
        
Floris (299 reputation)

6 votes, 3 answers

        Deduping database records comparing values in numerous fields
So I'm trying to clean some phone records in a database table.
I've found out how to find exact matches in 2 fields using:
/* DUPLICATE first & last names */
SELECT 
    `First Name`, 
    `Last Name`, 
     COUNT(*) c 
FROM phone.contacts  
GROUP…
        
Still_Learning (63 reputation)

5 votes, 1 answer

        Data deduplication framework?
I want to integrate data deduplication into software that I am writing to back up VMware images. I haven't been able to find anything suitable for what I think I need. There seem to be a LOT of complete solutions that include one form of…
        
stifin (1,390 reputation)

5 votes, 5 answers

        Bad Performance for Dedupe of 2 million records using mapreduce on Appengine
I have about 2 million records, each with about 4 string fields that need to be checked for duplicates. To be more specific, I have name, phone, address and fathername as fields, and I must check for duplicates using all these fields with rest of…
        
charming30 (171 reputation)

5 votes, 4 answers

        bash scripting de-dupe
I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.
This file doesn't change very frequently though, so I want to…
        
aidan (9,310 reputation)

5 votes, 4 answers

        Deduplicate this java code duplication
I have about 10+ classes, and each one has a LUMP_INDEX and SIZE static constant.
I want an array of each of these classes, where the size of the array is calculated using those two constants.
At the moment I have a function for each class to create…
        
terryhau (549 reputation)

5 votes, 1 answer

        Java Set with multiple equality criteria
I have a particular requirement where I need to dedupe a list of objects based on a combination of equality criteria.
E.g., two Student objects are equal if:
1. firstName and id are the same, OR
2. lastName, class, and emailId are the same
I was planning to…
        
Suraj Bajaj (6,630 reputation)

5 votes, 2 answers

        mysql efficient join of 2 tables to the same 2 tables
I have 2 tables that can be simplified to this structure:
Table 1:
+----+----------+---------------------+-------+
| id | descr_id |        date         | value |
+----+----------+---------------------+-------+
| 1  |        1 | 2013-09-20 16:39:06…
        
Eric Fitting (160 reputation)