I have 100,000+ records in a dataframe. I want to create files dynamically and push 1000 records per file. Can anyone help me solve this? Thanks in advance.
    2 Answers
You can use the maxRecordsPerFile option while writing the dataframe.
Method-1:
- If you need the whole dataframe written with 1000 records per file, bring it into a single partition with repartition(1) (or coalesce(1)); maxRecordsPerFile then splits each partition into files of at most 1000 records.
Example:
# coalesce to one partition, then write at most 1000 records per file
df.coalesce(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)
# repartition to one partition: 100 files are created for 100,000 records
df.repartition(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)
#or set the config on the spark session
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000)
#or
spark.sql("set spark.sql.files.maxRecordsPerFile=1000").show()
df.coalesce(1).write.mode("overwrite").parquet(<path>)
df.repartition(1).write.mode("overwrite").parquet(<path>)
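To sanity-check the result you can count the part-files Spark produced. A quick sketch below; the local directory /tmp/out is only a hypothetical stand-in for <path>:
import glob
# roughly 100 part-files are expected for 100,000 rows at 1000 records per file
print(len(glob.glob("/tmp/out/part-*.parquet")))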
Method-2:
Calculate the number of partitions, then repartition the dataframe:
df = spark.range(10000)
#calculate the number of partitions (repartition needs an int, so use ceiling division)
no_partitions = -(-df.count() // 1000)
from pyspark.sql.functions import spark_partition_id, col, count
#repartition and check the number of records in each partition
df.repartition(no_partitions).\
withColumn("partition_id",spark_partition_id()).\
groupBy(col("partition_id")).\
agg(count("*")).\
show()
#+------------+--------+
#|partition_id|count(1)|
#+------------+--------+
#|           1|    1001|
#|           6|    1000|
#|           3|     999|
#|           5|    1000|
#|           9|    1000|
#|           4|     999|
#|           8|    1000|
#|           7|    1000|
#|           2|    1001|
#|           0|    1000|
#+------------+--------+
df.repartition(no_partitions).write.mode("overwrite").parquet(<path>)
        notNull
        
- How can I give a custom file name instead of the auto-generated one? – tsuresh97 Apr 24 '20 at 18:13
- @tsuresh97, by default we can't `control filenames` in Spark, only the **directory** name; to change filenames you need to use the hadoop.fs API for this case (a sketch follows these comments): https://stackoverflow.com/questions/41990086/specifying-the-filename-when-saving-a-dataframe-as-a-csv – notNull Apr 24 '20 at 18:16
- Thank you so much @Shu – tsuresh97 Apr 24 '20 at 18:20
- I need to write the records of a dataframe to a JSON file. If I write the dataframe to the file it is stored like {"a":1} {"b":2}, but I want it written like [{"a":1}, {"b":2}]. Can you please help, @Shu? Thanks in advance. – tsuresh97 Apr 25 '20 at 10:56
- @tsuresh97, please check my answer https://stackoverflow.com/a/61425542/7632695 – notNull Apr 25 '20 at 12:21
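As the comment above notes, renaming has to go through the Hadoop FileSystem API. A minimal sketch of that (not part of the original answer) using the JVM gateway available in PySpark; the output directory /tmp/out and the records-<n>.parquet naming are assumptions for illustration:
sc = spark.sparkContext
# Hadoop FileSystem handle for the default filesystem
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
Path = sc._jvm.org.apache.hadoop.fs.Path

out_dir = Path("/tmp/out")  # hypothetical directory the dataframe was written to
# rename each part-file Spark produced to a friendlier name
for i, status in enumerate(fs.listStatus(out_dir)):
    name = status.getPath().getName()
    if name.startswith("part-"):
        fs.rename(status.getPath(), Path(out_dir, "records-{}.parquet".format(i)))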
            
            
Firstly, create a row number column:
from pyspark.sql import functions as F, Window

df = df.withColumn('row_num', F.row_number().over(Window.orderBy('any_column')))
Now, run a loop and keep saving the records, 1000 at a time:
# row_number() starts at 1, so step through 1, 1001, 2001, ...
for i in range(1, df.count() + 1, 1000):
    records = df.where(F.col("row_num").between(i, i + 999))
    records.toPandas().to_csv("file-{}.csv".format(i))
        Prateek Jain
        
- I need to write the records of a dataframe to a JSON file. If I write the dataframe to the file it is stored like {"a":1} {"b":2}, but I want it written like [{"a":1}, {"b":2}]. Can you please help me? Thanks in advance. @Prateek Jain – tsuresh97 Apr 25 '20 at 10:56
- Using pandas we can also store in the JSON format, which is what you require; just change .to_csv to .to_json (see the sketch below). Refer to this link for details: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html – Prateek Jain Apr 25 '20 at 14:05
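Building on that suggestion, a minimal sketch of the same chunking loop from the answer with the one-line change; the .json filenames are an assumption, and orient="records" is the pandas option that emits a JSON array like [{...}, {...}] instead of one object per line:
# same loop as above, but each 1000-record chunk is written as a JSON array
for i in range(1, df.count() + 1, 1000):
    records = df.where(F.col("row_num").between(i, i + 999))
    records.toPandas().to_json("file-{}.json".format(i), orient="records")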