In my database I have a DataFrame of hundreds of thousands of company names, and I need to match them against another DataFrame that contains all existing companies.
To do that, I use PySpark:
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH
from pyspark.sql.functions import col

def match_names(df_1, df_2):
    pipeline = Pipeline(stages=[
        # Split each name into individual characters
        RegexTokenizer(
            pattern="", inputCol="name", outputCol="tokens", minTokenLength=1
        ),
        # Build character trigrams from the tokens
        NGram(n=3, inputCol="tokens", outputCol="ngrams"),
        # Hash the trigrams into sparse feature vectors
        HashingTF(inputCol="ngrams", outputCol="vectors"),
        # MinHash LSH for an approximate Jaccard-distance join
        MinHashLSH(inputCol="vectors", outputCol="lsh")
    ])
    model = pipeline.fit(df_1)
    stored_hashed = model.transform(df_1)
    landed_hashed = model.transform(df_2)
    landed_hashed = landed_hashed.withColumnRenamed('name', 'name2')
    # Approximate join with distance threshold 1.0; "confidence" is a Jaccard distance, so lower means more similar
    matched_df = model.stages[-1].approxSimilarityJoin(stored_hashed, landed_hashed, 1, "confidence").select(
            col("datasetA.name"), col("datasetB.name2"), col("confidence"))
    return matched_df
I then also calculate the Levenshtein distance for each matched pair, roughly as sketched below.
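For reference, here is a minimal sketch of how that step might look, assuming the pairs come from match_names above and that pyspark.sql.functions.levenshtein is used for the edit distance (the exact column names are the ones selected in the function):

from pyspark.sql.functions import col, levenshtein

# Hypothetical usage: df_1 and df_2 are the two name DataFrames described above
matched_df = match_names(df_1, df_2)

# Add the Levenshtein edit distance between each candidate pair of names
matched_with_dist = matched_df.withColumn(
    "levenshtein", levenshtein(col("name"), col("name2"))
)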
This works for a few hundred rows, but for hundreds of thousands it takes far too long, and I really need to make it faster. I think the work could be parallelized, but I don't know how to do it.
Thanks in advance!