In one cell of our Azure Databricks notebook, the code below loads about 20 million records from an Azure SQL database into a Python pandas dataframe and then transforms a column by applying a function to it (see the code snippet below). After running for about half an hour, Databricks throws the following error:
Error:
ConnectException: Connection refused (Connection refused)
Error while obtaining a new communication channel
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
Remarks: The table has about 150 columns. The Databricks cluster configuration is as follows:
Cluster: 128 GB memory, 16 cores, DBR 8.3 (Spark 3.1.1, Scala 2.12)
Question: What could be causing this error, and how can we fix it? (Two possible workarounds we are considering are sketched after the code.)
import sqlalchemy as sq
import pandas as pd
def fn_myFunction(lastname):
    # Normalize the value: lower-case it and strip surrounding whitespace
    testvar = lastname.lower()
    testvar = testvar.strip()

    return testvar
pw = dbutils.secrets.get(scope='SomeScope',key='sql')
engine = sq.create_engine('mssql+pymssql://SERVICE.Databricks.NONPUBLICETL:' + pw + '@MyAzureSQL.database.windows.net:1433/TEST', isolation_level="AUTOCOMMIT")
app_df = pd.read_sql('select * from MyTable', con=engine)
# create new column by applying the normalization function
app_df['NewColumn'] = app_df['TestColumn'].apply(fn_myFunction)
.............
.............
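
One direction we are considering is reading the table in chunks so that the driver only holds part of the result at a time while the transformation runs. A rough, untested sketch (the chunksize of 100,000 is an arbitrary guess; concatenating at the end would still materialize all 20 million rows, so we would probably write each chunk out instead):

processed_chunks = []
for chunk in pd.read_sql('select * from MyTable', con=engine, chunksize=100_000):
    # apply the same transformation chunk by chunk
    chunk['NewColumn'] = chunk['TestColumn'].apply(fn_myFunction)
    processed_chunks.append(chunk)

# Caveat: this concat still brings every row into driver memory at once;
# writing each chunk to storage instead would keep peak memory low.
app_df = pd.concat(processed_chunks, ignore_index=True)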
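The other option we are weighing is to skip pandas on the driver entirely and let Spark read the table in parallel over JDBC, doing the same lower/strip transformation as a Spark expression on the executors. Rough sketch (the partition column Id and the bounds are placeholders, not real values from our table):

from pyspark.sql import functions as F

jdbc_url = "jdbc:sqlserver://MyAzureSQL.database.windows.net:1433;database=TEST"

spark_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "MyTable")
    .option("user", "SERVICE.Databricks.NONPUBLICETL")
    .option("password", pw)
    .option("numPartitions", 16)        # spread the read across the cluster
    .option("partitionColumn", "Id")    # placeholder: needs an indexed numeric column
    .option("lowerBound", 1)
    .option("upperBound", 20_000_000)
    .load()
)

# same transformation as fn_myFunction, executed on the executors
spark_df = spark_df.withColumn("NewColumn", F.trim(F.lower(F.col("TestColumn"))))

Would that be the recommended direction, or is there a way to make the pandas approach work on this cluster?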