I was given a list that contains over 90,000 names. I need to find every pair of names with a similarity of at least 50% and write the results to a file in the format:
ID 1, ID 2, Similarity percent.
I already have an algorithm that computes the similarity, but iterating through the whole list takes a lot of time. Can someone suggest a faster way to compare the names?
Below is the code:
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public static void main(String[] args) throws IOException {
    List<String> list = new ArrayList<>();
    StringBuilder str = new StringBuilder();
    // try-with-resources closes the scanner and the writer even if an exception is thrown.
    try (Scanner scanner = new Scanner(new File("name.csv"));
         FileWriter f = new FileWriter(new File("output.txt"))) {
        // Read one record per line; the list index serves as the ID.
        while (scanner.hasNextLine()) {
            list.add(scanner.nextLine());
        }
        long start = System.currentTimeMillis();
        // Compare every pair (i, j) with i < j. This is the slow part.
        for (int i = 0; i < list.size(); i++) {
            for (int j = i + 1; j < list.size(); j++) {
                int percent = StringSimilarity.simi(list.get(i), list.get(j));
                if (percent >= 50) {
                    str.append("ID ").append(i).append(",ID ").append(j)
                       .append(",").append(percent).append(" percent\n");
                }
            }
        }
        long end = System.currentTimeMillis();
        f.write(str.toString());
        System.out.println((end - start) / 1000 + " second(s)");
    }
}
// Matches every character that is not a letter or a space; compiled once and reused.
private static final Pattern NON_LETTERS = Pattern.compile("[^a-z A-Z]");

public static String getString(String s) {
    return NON_LETTERS.matcher(s).replaceAll("");
}
This is a sample of the data. The names are stored in a .csv file, so I read the file and store each line in the list.
FIRST NAME,SURNAME,OTHER NAME,MOTHER's MAIDEN NAME
Kingsley, eze, Ben, cici
Eze, Daniel, Ben, julie
Jon, Smith, kelly, Joe
Joseph, tan, , chellie
Joseph,tan,jese,chellie
...and so on. A person has at least 3 names. As I stated earlier, the program checks how similar the names are. When comparing ID 1 and ID 2, "ben" and "eze" are common to both, so they have 50 percent similarity (2 of the 4 name fields). Comparing ID 4 and ID 5, the similarity is 75 percent, because they have 3 names in common even though ID 4 doesn't have a third name.
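To make the rule concrete, here is a minimal sketch of how such a similarity could be computed. It is not the actual StringSimilarity.simi, which is not shown here. It assumes that names are compared case-insensitively and regardless of which column they appear in (as with "eze" in ID 1 and ID 2), that blank fields never count as a match, and that the percentage is the number of shared names out of the 4 fields. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the similarity rule described in the text.
final class NameSimilaritySketch {
    // Assumption: every record has exactly 4 name fields, as in the sample header.
    static final int FIELDS_PER_RECORD = 4;

    // Splits one CSV line into a set of lower-cased names, dropping blank fields.
    // Assumption: everything except letters is stripped, including inner spaces.
    static Set<String> tokens(String csvLine) {
        Set<String> result = new HashSet<>();
        for (String field : csvLine.split(",", -1)) {
            String name = field.replaceAll("[^a-zA-Z]", "").toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    // Percentage of the 4 fields that the two records have in common.
    static int similarity(Set<String> a, Set<String> b) {
        int common = 0;
        for (String name : a) {
            if (b.contains(name)) {
                common++;
            }
        }
        return 100 * common / FIELDS_PER_RECORD;
    }

    public static void main(String[] args) {
        Set<String> id1 = tokens("Kingsley, eze, Ben, cici");
        Set<String> id2 = tokens("Eze, Daniel, Ben, julie");
        System.out.println(similarity(id1, id2) + " percent");
    }
}
```

With the sample rows, this gives 50 for ID 1 and ID 2, and 75 for ID 4 and ID 5. Because the names are held in a set, a name repeated within one record is counted once, which may or may not match the intended rule.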
So the problem is this: during the similarity check with the two for loops, I start with the first ID, compare it against all of the remaining names, and save the IDs with which it has >= 50 percent similarity. Then I take ID 2 and do the same, and so on. For 90,000 names that is roughly 4 billion comparisons.
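One possible way to avoid comparing every pair is sketched below. It swaps the nested loops for an inverted index from each name to the IDs that contain it, so a record is only ever matched against records it shares at least one name with. This relies on the same assumptions as the description above: similarity is the count of shared names out of 4 fields, compared case-insensitively and ignoring blanks. The class and method names are hypothetical, and IDs are list positions as in the loops above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: find similar pairs through an inverted index of names.
final class SharedNameIndexSketch {
    // Assumption: every record has exactly 4 name fields.
    static final int FIELDS_PER_RECORD = 4;

    // Splits one CSV line into a set of lower-cased names, dropping blank fields.
    static Set<String> tokens(String csvLine) {
        Set<String> result = new HashSet<>();
        for (String field : csvLine.split(",", -1)) {
            String name = field.replaceAll("[^a-zA-Z]", "").toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    static List<String> similarPairs(List<String> lines, int minPercent) {
        List<Set<String>> records = new ArrayList<>();
        Map<String, List<Integer>> index = new HashMap<>();
        // Pass 1: map every name to the IDs of the records that contain it.
        for (int id = 0; id < lines.size(); id++) {
            Set<String> names = tokens(lines.get(id));
            records.add(names);
            for (String name : names) {
                index.computeIfAbsent(name, k -> new ArrayList<>()).add(id);
            }
        }
        // Pass 2: for each record, count shared names only with records
        // that appear in the same index entries (and have a larger ID).
        List<String> out = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            Map<Integer, Integer> shared = new HashMap<>();
            for (String name : records.get(i)) {
                for (int j : index.get(name)) {
                    if (j > i) {
                        shared.merge(j, 1, Integer::sum);
                    }
                }
            }
            for (Map.Entry<Integer, Integer> e : shared.entrySet()) {
                int percent = 100 * e.getValue() / FIELDS_PER_RECORD;
                if (percent >= minPercent) {
                    out.add("ID " + i + ",ID " + e.getKey() + "," + percent + " percent");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "FIRST NAME,SURNAME,OTHER NAME,MOTHER's MAIDEN NAME",
                "Kingsley, eze, Ben, cici",
                "Eze, Daniel, Ben, julie",
                "Jon, Smith, kelly, Joe",
                "Joseph, tan, , chellie",
                "Joseph,tan,jese,chellie");
        similarPairs(lines, 50).forEach(System.out::println);
    }
}
```

The work now depends on how many records share each name rather than on the square of the list size. Very common names still produce long index entries, so the gain depends on the data. If the similarity function does fuzzy matching between names rather than exact matching, this index would miss pairs and a different candidate filter would be needed.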