I want to build an application that can determine whether text was copied between two documents by reading the text from both and comparing it. Has anyone tried this, and what is the best way to approach it? If machine learning and natural language processing are involved, to what extent?
    2 Answers
1
            
            
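There are techniques that rely purely on set-theoretic concepts: each document is reduced to a set of short overlapping word sequences, and the two sets are compared.
W-shingling is a good place to start: http://en.wikipedia.org/wiki/W-shingling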
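To make the idea concrete, here is a minimal sketch of w-shingling in Python. It is not taken from any particular library: the function names `shingles` and `resemblance`, the regex tokenizer, and the default `w=4` are all assumptions for illustration. It builds the set of w-word shingles for each document and scores the pair by the size of the intersection over the size of the union (the Jaccard coefficient).

```python
import re

def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word sequences) in text.

    Tokenization here is an assumption: lowercase, split on non-word characters.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard resemblance of two texts: |S(a) & S(b)| / |S(a) | S(b)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0  # nothing to compare
    return len(sa & sb) / len(sa | sb)

if __name__ == "__main__":
    doc1 = "the quick brown fox jumps"
    doc2 = "the quick brown fox sleeps"
    print(resemblance(doc1, doc2, w=2))
```

A score near 1 suggests heavy overlap and a score near 0 suggests little shared wording. Where to put the "copied" threshold depends on your documents, so you would need to tune it on real examples.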
 
    
    
        Viktor Latypov
        
0
I believe Copyscape uses 4-grams to help determine uniqueness.
Such short sequences of consecutive items (words or characters) taken from the text are called n-grams; a 4-gram is a run of four.
However, another SO answer linked to a language-independent algorithm that compares bigrams on a character basis. It's already implemented in Java, which would save you time.
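For a rough idea of what a character-bigram comparison can look like, here is a short Python sketch. I am assuming the Sørensen-Dice coefficient as the scoring rule, which is one common choice for this kind of comparison; it may not be exactly what the linked answer does. The names `char_bigrams` and `dice_similarity` are made up for this example.

```python
from collections import Counter

def char_bigrams(text):
    """Count adjacent character pairs inside each whitespace-separated word."""
    pairs = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs

def dice_similarity(a, b):
    """Sørensen-Dice score: 2 * shared bigrams / total bigrams in both texts."""
    pa, pb = char_bigrams(a), char_bigrams(b)
    total = sum(pa.values()) + sum(pb.values())
    if total == 0:
        return 0.0  # no bigrams in either text
    # Counter intersection keeps the minimum count of each shared bigram.
    shared = sum((pa & pb).values())
    return 2.0 * shared / total

if __name__ == "__main__":
    print(dice_similarity("france", "french"))
```

Because it works on characters rather than words, this needs no dictionary or stemming, which is what makes the approach language independent.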
 
    
    
        Community
        
 
    
    
        HappyTimeGopher
        
