What is the difference between the .mean() and the .avg() methods?

Question

Currently, I am working with PySpark to analyze some data. I have a CSV file with Payroll data in it. I want to know what Job has the best pay. To do that I need the median() because I want to know the average.

The methods for groupBy in Pyspark are these: agg, avg, count, max, mean, min, pivot, sum

When I try the .mean() method it looks like this:

mean_pay_data = reduced_data.groupBy("JOB_TITLE").mean("REGULAR_PAY")
mean_pay_data.show(3)

# +--------------------+-----------------+
# |           JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+

Here is what it looks like with the .avg() method:

average_pay_data = reduced_data.groupBy("JOB_TITLE").avg("REGULAR_PAY")
average_pay_data.show(3)

# +--------------------+-----------------+
# |           JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+

They return the exact same values. What's the difference between mean() and avg()?

I also want to find the median, so that one person doesn't have too much of an impact. Since there is no median() method in PySpark I don't know what to do here.

there *is* a median method in pyspark. see [`percentile_approx`](https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.sql/api/pyspark.sql.functions.percentile_approx.html#pyspark-sql-functions-percentile-approx). as for `mean` and `avg` - they're same. see [func list](https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.sql/functions.html) — samkart, Oct 11 '22 at 09:57

score 1 · Accepted Answer · edited Oct 12 '22 at 09:57

Both avg and mean documentation tell this:

mean() is an alias for avg()

Both of these functions are identical. Both names are needed, so that developers coming from different backgrounds would feel comfortable.

Regarding the median:

Approximate (efficient) median: F.expr('percentile_approx(col_name, .5) over()')
Accurate (inefficient) median: F.expr('percentile(col_name, .5) over()')

What is the difference between the .mean() and the .avg() methods?

1 Answers1