It's as easy as using a User Defined Function. By creating a UDF dedicated to averaging many columns, you will be able to reuse it as many times as you want.
Python
In this snippet, I'm creating a UDF that takes an array of columns and calculates their average.
from pyspark.sql.functions import udf, array
from pyspark.sql.types import DoubleType

# UDF that receives the selected column values as a list and returns their average
avg_cols = udf(lambda cols: sum(cols) / len(cols), DoubleType())

df.withColumn("average", avg_cols(array("marks1", "marks2"))).show()
Output:
+-----+------+------+--------+
| name|marks1|marks2| average|
+-----+------+------+--------+
|Alice| 10| 20| 15.0|
| Bob| 20| 30| 25.0|
+-----+------+------+--------+
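Because the UDF only sees the list of values it is given, the same avg_cols can be reused with any set of numeric columns. A minimal sketch, assuming a hypothetical marks3 column also exists in your DataFrame:

# Reusing the same UDF over a different, hypothetical set of columns
df.withColumn("average_all", avg_cols(array("marks1", "marks2", "marks3"))).show()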
Scala
With the Scala API, the UDF receives the selected columns as a single Row; you just have to wrap the columns with Spark's struct function.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import spark.implicits._
import scala.util.Try

// UDF that receives the selected columns as a Row, keeps the values that can be
// converted to Double and averages them (returns 0.0 when none can be converted)
def average = udf((row: Row) => {
  val values = row.toSeq.map(x => Try(x.toString.toDouble).toOption).filter(_.isDefined).map(_.get)
  if (values.nonEmpty) values.sum / values.length else 0.0
})

df.withColumn("average", average(struct($"marks1", $"marks2"))).show()
As you can see, I'm converting every value to Double with Try, so a value that cannot be cast doesn't throw an exception, and the average is computed only over the columns whose values could be converted.
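For example, here's a minimal sketch with a made-up DataFrame where one of the selected values is not numeric; that value is simply skipped when computing the average:

// Hypothetical DataFrame: Alice's marks2 cannot be converted to Double
val dfWithText = Seq(("Alice", 10, "n/a"), ("Bob", 20, "30")).toDF("name", "marks1", "marks2")
// Alice's average is computed from marks1 alone (10.0); Bob's from both columns (25.0)
dfWithText.withColumn("average", average(struct($"marks1", $"marks2"))).show()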
And that's all :)