How can I use bitwise OR as an aggregation function in PySpark DataFrame.groupBy? Is there a built-in function, like sum, that can do this for me?
- What version of pyspark? – pault Aug 22 '19 at 14:25
- @pault I am looking at 2.4.0. – Arjun Ahuja Aug 22 '19 at 15:58
- Spark 2.4 allows for higher-order functions like [`aggregate`](https://spark.apache.org/docs/latest/api/sql/index.html#aggregate). You should be able to do this without a `udf` (see the sketch after these comments). – pault Aug 22 '19 at 15:59
- @pault I understand, but is there a performance overhead of using a udf? – Arjun Ahuja Aug 23 '19 at 08:45
- Yes. [Spark functions vs UDF performance?](https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance) – pault Aug 23 '19 at 11:27
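A minimal sketch of pault's suggestion, assuming Spark 2.4+, a DataFrame named df with an integer column colA, and a hypothetical grouping column colB: the SQL higher-order function `aggregate` folds a collected array with Spark SQL's bitwise `|` operator, so no Python udf is needed.

from pyspark.sql import functions as F

# Collect each group's values into an array, then fold it with `|`
# (bitwise OR); the accumulator starts at 0, the identity for OR.
df.groupBy('colB') \
  .agg(F.expr("aggregate(collect_list(colA), 0, (acc, x) -> acc | x)").alias('bitwise_or')) \
  .show()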
1 Answer
There is no built-in bitwise OR aggregation function in PySpark.
If your column is Boolean, the OR of a group is simply its maximum, so df.agg(F.max('colA')) does the job.
Otherwise, you'll have to make a custom aggregation function.
There are three ways:
1 - The fastest is to implement a custom aggregation function in Scala and call it from PySpark.
2 - Use a UDF:
from functools import reduce  # reduce is no longer a builtin in Python 3
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def bitwiseOr(l):
    # Fold the collected list of values with bitwise OR
    return reduce(lambda x, y: x | y, l)

udf_bitwiseOr = F.udf(bitwiseOr, IntegerType())
df.agg(udf_bitwiseOr(F.collect_list('colA'))).show()
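Since the question asks about groupBy, the same udf also works per group; a sketch assuming a hypothetical grouping column colB:

df.groupBy('colB').agg(udf_bitwiseOr(F.collect_list('colA'))).show()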
3 - Use an RDD:
# seqOp folds the rows of one partition into a local result;
# combOp merges the partial results from different partitions.
seqOp = lambda local_result, row: local_result | row['colA']
combOp = lambda local_result1, local_result2: local_result1 | local_result2

rdd = df.rdd
rdd.aggregate(0, seqOp, combOp)
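For per-group aggregation on the RDD side, a minimal sketch (again assuming a hypothetical grouping column colB) keys each row and reduces with operator.or_, Python's bitwise OR:

import operator

# (key, value) pairs reduced per key with bitwise OR
df.rdd.map(lambda row: (row['colB'], row['colA'])) \
      .reduceByKey(operator.or_) \
      .collect()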
Methods 2 and 3 have similar performance.
 
    
    
Pierre Gourseaud