I am looking into implementing a UserDefinedAggregateFunction in Spark and see that a bufferSchema is needed. I understand how to create it, but my issue is why it requires a bufferSchema at all. Shouldn't it only need a size (the number of elements used in the aggregation), an inputSchema, and a dataType? Doesn't a bufferSchema constrain the intermediate steps to SQL/UserDefinedTypes?
        Ghastone
        
- 75
 - 4
 
1 Answer
This is needed because the buffer schema can differ from the input type. For example, if you want to calculate the average (arithmetic mean) of doubles, the buffer needs both a count and a sum. See e.g. the Databricks example of how to calculate the geometric mean: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
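As a minimal sketch (modelled on the average example from the Spark documentation, not the geometric-mean code linked above), here the inputSchema is a single double and the result dataType is a double, yet the bufferSchema needs two fields, a running sum and a running count:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Average of doubles: the intermediate state (sum, count) has a different
// shape than both the input (one double) and the result (one double),
// which is why a separate bufferSchema has to be declared.
class MyAverage extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)

  def bufferSchema: StructType =
    StructType(StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)

  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0 // running sum
    buffer(1) = 0L  // running count
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }

  def evaluate(buffer: Row): Double =
    if (buffer.getLong(1) == 0) Double.NaN
    else buffer.getDouble(0) / buffer.getLong(1)
}
```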
        Raphael Roth
        
- 26,751
 - 15
 - 88
 - 145
 
- 
But why does it need to be specified in the schema? Why can a zero element not be specified as a tuple of (count, sum), as in regular RDD aggregation? Why is it constrained to a SQL schema? For instance, let's say we want to have a mutable.Seq as our aggregator: if we constrain it to a schema, that means we will have to recreate the array every time, as it's immutable (I believe the schema wraps it in an immutable WrappedArray). – Ghastone Aug 13 '19 at 18:37
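To illustrate the concern in this comment: in a hypothetical collect-style UDAF whose buffer holds an ArrayType field, update cannot mutate the sequence in place. It reads the current value (materialized as a WrappedArray) and writes back a new Seq on every row, which is the recreation overhead the comment refers to:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical UDAF that collects all input doubles into an array.
class CollectDoubles extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("values", ArrayType(DoubleType)) :: Nil)
  def dataType: DataType = ArrayType(DoubleType)
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Seq.empty[Double]
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      // The buffer value comes back as an immutable Seq (WrappedArray),
      // so a new Seq has to be built and written back each time.
      buffer(0) = buffer.getSeq[Double](0) :+ input.getDouble(0)
    }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getSeq[Double](0) ++ buffer2.getSeq[Double](0)
  }

  def evaluate(buffer: Row): Seq[Double] = buffer.getSeq[Double](0)
}
```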