I like working with pandas because of my familiarity with the tidyverse in R for handling tables. I have a table of about 200,000 rows and need to remove punctuation, extract the non-English words, and put them in another column named non_english in the same table. I prefer the enchant library because I found it more accurate than nltk. My dummy table df has a dundee column, which is the one I am working on. The dummy data is as follows:
df = pd.DataFrame({'dundee': ["I love:Marae", "My Whanau is everything", "I love Matauranga", "Tāmaki Makaurau is Whare", "AOD problem is common"]})
My idea is to remove punctuation first, write a function to extract non-English words, and then apply the function to the dataframe, but I got this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Here is my code:
import pandas as pd
import enchant
import re
import string
# remove punctuation (regex=True is needed on recent pandas, where the default is a literal match)
df['dundee1'] = df['dundee'].str.replace(r'[^\w\s]+', ' ', regex=True)
# change words to lower case
df['dundee1'] = df['dundee1'].str.lower()
# Function to check if a word is English
def check_eng(word):
    
    # use all available English dictionaries
    en_ls = ['en_NZ', 'en_US', 'en_AU', 'en_GB']
    en_bool = False
            
    # check all common dictionaries if word is English 
    for en in en_ls:
        dic = enchant.Dict(en)
        if word != '':
            if dic.check(word) == True:
                en_bool = True
                break
    disp_non_en = ""
    word = word.str.split(' ')
    if len(word) != 0:
        if en_bool == False:
             disp_non_en = disp_non_en + word + ', '
    return disp_non_en
df['non_english'] = check_eng(df['dundee1'])
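As far as I can tell, the error comes from passing the whole column into check_eng: inside the function, word is a Series rather than a string, so word != '' produces a Series of booleans and the if statement cannot reduce it to a single True or False. A minimal sketch that reproduces this on a toy column (the names s, mask and lengths are only illustrative):

```python
import pandas as pd

s = pd.Series(["i love marae", ""])

# Comparing a Series gives an element-wise boolean Series, not one bool.
mask = s != ""
print(mask.tolist())  # [True, False]

# Using that Series in an `if` raises the ValueError from the question.
try:
    if s != "":
        pass
except ValueError as err:
    message = str(err)
    print(message)

# Applying a function per element hands it one plain string at a time.
lengths = s.apply(len)
print(lengths.tolist())  # [12, 0]
```

So a function written for a single string probably needs to go through .apply (or an equivalent per-row loop) instead of being called on the column directly.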
The desired table is this:
    dundee                          non_english
0   I love:Marae                    Marae
1   My Whanau is everything         Whanau
2   I love Matauranga               Matauranga
3   Tāmaki Makaurau is Whare        Tāmaki Makaurau, Whare
4   AOD problem is common           AOD
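One possible way to get close to this table is a function that handles a single string and is applied per row. The sketch below is only an illustration under assumptions: TOY_ENGLISH and is_english are hypothetical stand-ins for a real dictionary lookup so that it runs without pyenchant, and non_english_words is a made-up helper name.

```python
import re
import pandas as pd

# Hypothetical stand-in for a real dictionary lookup, so the sketch runs
# without pyenchant or any installed dictionaries.
TOY_ENGLISH = {"i", "love", "my", "is", "everything", "problem", "common"}


def is_english(word):
    """Return True if the word counts as English (toy word-set lookup)."""
    return word.lower() in TOY_ENGLISH


def non_english_words(text, checker=is_english):
    """Return the words of one string that the checker rejects, comma-separated."""
    # Replace punctuation with spaces, then split on whitespace.
    words = re.sub(r"[^\w\s]+", " ", text).split()
    return ", ".join(w for w in words if not checker(w))


df = pd.DataFrame({"dundee": ["I love:Marae", "My Whanau is everything",
                              "I love Matauranga", "Tāmaki Makaurau is Whare",
                              "AOD problem is common"]})

# apply() calls the function once per cell, so it only ever sees a string.
df["non_english"] = df["dundee"].apply(non_english_words)
print(df)
```

To use enchant instead of the toy set, the dictionaries could be built once outside the function, for example dicts = [enchant.Dict(tag) for tag in ('en_NZ', 'en_US', 'en_AU', 'en_GB')], and passed in as checker=lambda w: any(d.check(w) for d in dicts), assuming those dictionaries are installed. Note that this sketch separates every rejected word with a comma, so row 3 comes out as "Tāmaki, Makaurau, Whare" rather than grouping "Tāmaki Makaurau".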
 
     
     
    