I am trying to partition a customer sample set by region and marketplace using Spark SQL, and from each partition I would like to select a random sample of 100,000 users.
Is there a way to use random() in the ORDER BY of a window's PARTITION BY clause in Spark SQL? The code below, which uses random(), always returns an error, but without random() it works perfectly. Thank you in advance for your answers!
customer_panel_s3_location = f"s3://my-bucket/region_id={region_id}/marketplace_id={marketplace_id}/"
customer_panel_table = spark.read.parquet(customer_panel_s3_location)
customer_panel_table.createOrReplaceTempView("customer_panel")
dataset_date = '2023-03-16'
customer_sample_size = 100000
partition_customers_by = 'region_id, marketplace_id'
# The code below returns an error
customer_panel_df = spark.sql(f"""
SELECT *
FROM (
SELECT *
, row_number() over (partition by {partition_customers_by} order by random()) AS rn
FROM
customer_panel AS c
WHERE
c.target_date < CAST('{dataset_date}' AS DATE)
AND c.target_date >= date_sub(CAST('{dataset_date}' AS DATE), 7)
) t
WHERE t.rn <= bigint({customer_sample_size})
""")
# But after removing 'random()', it works
customer_panel_df = spark.sql(f"""
SELECT *
, row_number() over (partition by {partition_customers_by} order by {partition_customers_by}) AS rn
FROM
customer_panel AS c
WHERE
c.target_date < CAST('{dataset_date}' AS DATE)
AND c.target_date >= date_sub(CAST('{dataset_date}' AS DATE), 7)
""")
print(f"Row count of {table_name}: {customer_panel_df.count():,}")