How to find the size (in MB) of a dataframe in pyspark
I have df = spark.read.json("/Filestore/tables/test.json") and I want to find the size of df, or of test.json.
 
    
    Late answer, but since Google brought me here first I figure I'll add this answer based on the comment by user @hiryu here.
This is tested and working for me. It requires caching, so it is probably best kept to notebook development.
# Need to cache the table (and force the cache to happen)
df.cache()
df.count() # force caching
# need to access hidden parameters from the `SparkSession` and `DataFrame`
catalyst_plan = df._jdf.queryExecution().logical()
size_bytes = spark._jsparkSession.sessionState().executePlan(catalyst_plan).optimizedPlan().stats().sizeInBytes()
# always try to remember to free cached data once finished
df.unpersist()
print("Total table size: ", convert_size_bytes(size_bytes))
You need to access the hidden _jdf and _jsparkSession variables. Since the Python objects do not expose these attributes directly, they won't be shown by IntelliSense.
My convert_size_bytes function looks like:
def convert_size_bytes(size_bytes):
    """
    Converts a size in bytes to a human-readable string (using 1024-based units).
    """
    import math
    import sys
    if not isinstance(size_bytes, int):
        size_bytes = sys.getsizeof(size_bytes)
    if size_bytes == 0:
        return "0B"
    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])
 
    
    In general this is not easy. You can use org.apache.spark.util.SizeEstimator, or call df.inputFiles() and use another API to get the file size directly (I did so using the Hadoop FileSystem API, see How to get file size). Note that the latter only works if the dataframe was not filtered/aggregated.
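A rough sketch of the "file size directly" route, summing the sizes of the files backing the dataframe via df.inputFiles() and the Hadoop FileSystem API (the helper name input_files_size_bytes is mine; as noted, this reflects only the source files, not any filtering/aggregation):
def input_files_size_bytes(df, spark):
    """Sum the on-disk sizes of the files the dataframe was read from."""
    jvm = spark.sparkContext._jvm
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    total = 0
    for file_uri in df.inputFiles():
        path = jvm.org.apache.hadoop.fs.Path(file_uri)
        fs = path.getFileSystem(hadoop_conf)
        total += fs.getFileStatus(path).getLen()
    return total

# usage:
# df = spark.read.json("/Filestore/tables/test.json")
# print("On-disk size (MB):", input_files_size_bytes(df, spark) / (1024 * 1024))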
    
    My running version:
# Need to cache the table (and force the cache to happen)
df.cache()
nrows = df.count() # force caching
    
# estimate the size of the underlying Java DataFrame object with Spark's SizeEstimator (through the JVM gateway)
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
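To report the result in MB, you can reuse the convert_size_bytes helper from the first answer (assuming it is defined in the same session), or simply divide by 1024**2:
print("Estimated in-memory size:", convert_size_bytes(int(size_bytes)))
# or just
print("Estimated in-memory size (MB):", size_bytes / (1024 * 1024))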
