Currently, I am working with PySpark to analyze some data. I have a CSV file with payroll data in it, and I want to know which job has the best pay. To do that I need the median(), because that is the kind of average I am after.
The aggregation methods available on a PySpark groupBy are these:
agg, avg, count, max, mean, min, pivot, sum
When I try the .mean() method, it looks like this:
mean_pay_data = reduced_data.groupBy("JOB_TITLE").mean("REGULAR_PAY")
mean_pay_data.show(3)
# +--------------------+-----------------+
# | JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+
Here is what it looks like with the .avg() method:
average_pay_data = reduced_data.groupBy("JOB_TITLE").avg("REGULAR_PAY")
average_pay_data.show(3)
# +--------------------+-----------------+
# | JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+
They return the exact same values. What's the difference between mean() and avg()?
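For what it's worth, I also tried spelling both aggregations through agg() with pyspark.sql.functions, and the numbers and even the output column name stay identical, which makes me suspect one is just an alias for the other. A minimal sketch against the same reduced_data DataFrame:

from pyspark.sql import functions as F

# Both spellings come back as a column named avg(REGULAR_PAY)
via_mean = reduced_data.groupBy("JOB_TITLE").agg(F.mean("REGULAR_PAY"))
via_avg = reduced_data.groupBy("JOB_TITLE").agg(F.avg("REGULAR_PAY"))

via_mean.show(3)
via_avg.show(3)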
I also want to find the median, so that one person doesn't have too much of an impact. Since there is no median() method on groupBy, I don't know what to do here.
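The closest thing I have found is approximate percentiles. Here is a minimal sketch using percentile_approx (available in pyspark.sql.functions since Spark 3.1; the 0.5 is the 50th percentile, and the column names match my data above):

from pyspark.sql import functions as F

# Approximate median per job title: 0.5 = 50th percentile.
# An optional accuracy argument can trade memory for precision.
median_pay_data = reduced_data.groupBy("JOB_TITLE").agg(
    F.percentile_approx("REGULAR_PAY", 0.5).alias("median_pay")
)
median_pay_data.show(3)

Is that the intended way to get a median, or is there something more direct? (I have read that Spark 3.4+ ships a native pyspark.sql.functions.median, but I am not sure that applies to my version.)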