We are working to get off of SAS and onto Python/Pandas.  However, one thing we are having trouble with is creating a replacement for PROC SUMMARY (AKA PROC MEANS) that has the SAS routine's flexibility. For non-SAS users: PROC SUMMARY is just a routine that produces a table containing "descriptive statistics for variables across all observations or within groups of observations" in a dataset, to paraphrase the SAS documentation. Our requirements are just a small subset of the full functionality - outputting a table where we have:
- Ability to apply different stats to different columns (for now just count, sum, mean, weighted mean)
- Ability to handle zero to many grouping variables
- Ability to specify a weight variable for weighted mean
We are not trying to do anything else (anything graphical, etc.)
Here is what we have so far:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

def wmean_ungrouped(d, w):
    # Series.dot() both multiplies and reduces, so no extra .sum() is needed
    return d.dot(w) / w.sum()
def wmean_grouped(group, var_name_in, var_name_weight):
    # `group` is the sub-DataFrame handed to GroupBy.apply()
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()
FUNCS = {
    "mean":  np.mean,
    "sum":   np.sum,
    "count": np.count_nonzero,
}
def my_summary(
        data,
        var_names_in,
        var_names_out,
        var_functions,
        var_name_weight=None,
        var_names_group=None,
):
    result = DataFrame()
    if var_names_group is not None:
        grouped = data.groupby(var_names_group)
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in, var_names_out, var_functions):
            if var_function == "wmean":
                # wmean reads two columns, so apply over the whole group
                func = lambda x: wmean_grouped(x, var_name_in, var_name_weight)
                result[var_name_out] = Series(grouped.apply(func))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = grouped[var_name_in].apply(func)
    else:
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in, var_names_out, var_functions):
            if var_function == "wmean":
                result[var_name_out] = \
                    Series(wmean_ungrouped(data[var_name_in], data[var_name_weight]))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = Series(func(data[var_name_in]))
    return result
Here is a sample call to the my_summary() function:
    my_summary(
        data=df,
        var_names_in=["x_1", "x_1", "x_1", "x_1"],
        var_names_out=["x_1_c", "x_1_s", "x_1_m", "x_1_wm"],
        var_functions=["count", "sum", "mean", "wmean"],
        var_name_weight="val_1",
        var_names_group=["Region", "Category"],
    )
my_summary() works, but as you can see, its implementation is not the prettiest. Here are the main issues:
- Two different code paths depending on grouped or ungrouped - this stems entirely from the fact that `DataFrame` and `DataFrameGroupBy` have different ways of applying a programmatically selected reducing function to a single column. For `DataFrame`, the only way I've found is invoking `func(data[var_name_in])` directly; `data[var_name_in].apply(func)` doesn't work because `apply()` on a `Series` does not reduce (unlike `apply()` on a `DataFrame`). For `DataFrameGroupBy`, on the other hand, I have to use that very approach: `grouped[var_name_in].apply(func)`, because something like `func(grouped[var_name_in])` will not work (and there is no reason it should).
- Special treatment for weighted mean - this is because it operates on two columns, unlike all the other calculations, which operate on just one; I don't know if this can be helped.
- Two different weighted mean functions - this is a consequence of the first issue. The ungrouped function takes `Series` parameters and needs `dot()` to multiply and reduce them; the grouped function eventually deals with `SeriesGroupBy` objects and has to use the `*` operator (acknowledgements to the answer to this SO post for the weighted average function code).
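The first issue is easy to see in isolation. A minimal demonstration (column names are made up) of how `apply()` reduces on a grouped column but not on a bare `Series`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

# On a bare Series, apply() maps element-wise, so a reducer never reduces:
print(s.apply(np.sum).tolist())  # still three elements: [1, 2, 3]

# Calling the reducer directly does work:
print(np.sum(s))  # 6

# On a grouped column, apply() receives each sub-Series whole and reduces:
df = pd.DataFrame({"k": ["a", "a", "b"], "v": [1, 2, 3]})
print(df.groupby("k")["v"].apply(np.sum).to_dict())  # {'a': 3, 'b': 3}
```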
So my questions are:
- Is there something native to pandas that can do all of this (i.e. throw out the above and use that instead)?
- If not, are there any fixes to any of the issues mentioned above?
- By any chance, is there some way to group by nothing - that is, to obtain a `DataFrameGroupBy` object from a `DataFrame` without grouping on any variable? Then the code paths would be merged, as we would be dealing with the `DataFrameGroupBy` interface exclusively.
Update (old - scroll down for current)
@JohnE's answer provides a way to group by nothing: `groupby(lambda x: True)`. This is a workaround that he spotted in this SO post (which, incidentally, features an answer from Wes himself speaking of the need for a `DataFrame.agg()`, which would serve the same purpose). @JohnE's excellent solution lets us deal exclusively with objects of type `DataFrameGroupBy` and immediately collapses most of the duplicated code paths. I was able to reduce further with some functional gimmickry that becomes possible once we have only `DataFrameGroupBy` instances: all functions are generated as needed. The "generators" (in quotes so as not to be confused with Python generator expressions) take two parameters - the value column name and the weight column name - the second of which is ignored in all cases except `wmean`. The generated functions are always applied over the entire `DataFrameGroupBy`, as was originally the case just with `wmean`, with the parameters supplying the correct column name(s). I also replaced all the `np.*` implementations with pandas calculations, to better deal with NaN values.
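For reference, the workaround itself fits in a few lines; a constant grouping key puts every row in one group (toy data below is made up):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# The callable receives each index label; a constant return value
# places all rows in a single group keyed True.
g = df.groupby(lambda _: True)
print(g["x"].sum())  # one-row Series indexed [True], value 6.0
```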
Unless there is something native to pandas that can do this, this is our solution:
FUNC_GENS = {
    "mean":  lambda y, z: lambda x: x[y].mean(),
    "sum":   lambda y, z: lambda x: x[y].sum(),
    "count": lambda y, z: lambda x: x[y].count(),
    "wmean": lambda y, z: lambda x: (x[y] * x[z]).sum() / x[z].sum(),
}
def my_summary(
        data,
        var_names_in,
        var_names_out,
        var_functions,
        var_name_weight=None,
        var_names_group=None,
):
    result = pd.DataFrame()
    if var_names_group is None:
        # "group by nothing": a constant key puts every row in one group,
        # so both paths produce a DataFrameGroupBy
        grouped = data.groupby(lambda x: True)
    else:
        grouped = data.groupby(var_names_group)
    for var_name_in, var_name_out, var_function in \
            zip(var_names_in, var_names_out, var_functions):
        func_gen = FUNC_GENS[var_function]
        func = func_gen(var_name_in, var_name_weight)
        result[var_name_out] = grouped.apply(func)
    return result
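A quick end-to-end check of the generator-based approach on a toy frame (the column names `g`, `x`, `w` and the data are made up for illustration; the function is restated compactly so the snippet runs on its own):

```python
import pandas as pd

FUNC_GENS = {
    "mean":  lambda y, z: lambda x: x[y].mean(),
    "sum":   lambda y, z: lambda x: x[y].sum(),
    "count": lambda y, z: lambda x: x[y].count(),
    "wmean": lambda y, z: lambda x: (x[y] * x[z]).sum() / x[z].sum(),
}

def my_summary(data, var_names_in, var_names_out, var_functions,
               var_name_weight=None, var_names_group=None):
    result = pd.DataFrame()
    # constant key = "group by nothing" when no grouping vars are given
    grouped = data.groupby(var_names_group if var_names_group is not None
                           else lambda _: True)
    for var_in, var_out, var_fn in zip(var_names_in, var_names_out,
                                       var_functions):
        result[var_out] = grouped.apply(FUNC_GENS[var_fn](var_in,
                                                          var_name_weight))
    return result

df = pd.DataFrame({
    "g": ["a", "a", "b"],
    "x": [1.0, 2.0, 3.0],
    "w": [1.0, 3.0, 5.0],
})

out = my_summary(df, ["x", "x"], ["x_sum", "x_wm"], ["sum", "wmean"],
                 var_name_weight="w", var_names_group=["g"])
print(out)
# group "a": sum = 3.0, weighted mean = (1*1 + 2*3) / (1 + 3) = 1.75
```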
Update 2019 / Current Solution
Versions of pandas released subsequent to my original post now implement most of this functionality:
- Grouping by nothing - Wes M. had spoken in the past of the need for a `DataFrame.agg()`, and it was indeed added in version 0.20.0, along with a `Series.agg()`.
- Multiple aggregations of multiple columns, with specified names for output columns - this is now part of pandas as of version 0.25+, in the form of `NamedAgg` inputs to the `agg()` function.
So, basically everything except the weighted average. A good current solution for that is here.
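To make this concrete, here is a sketch of the original sample call redone with `NamedAgg` (pandas 0.25+), with the weighted mean bolted on via `apply()` since `agg()` only sees one column at a time. The toy data is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "Region":   ["N", "N", "S"],
    "Category": ["a", "b", "a"],
    "x_1":      [1.0, 2.0, 3.0],
    "val_1":    [1.0, 3.0, 5.0],
})

gb = df.groupby(["Region", "Category"])

# Named aggregation: output column name -> (input column, function)
out = gb.agg(
    x_1_c=pd.NamedAgg(column="x_1", aggfunc="count"),
    x_1_s=pd.NamedAgg(column="x_1", aggfunc="sum"),
    x_1_m=pd.NamedAgg(column="x_1", aggfunc="mean"),
)

# The weighted mean still needs two columns at once,
# so it goes through apply() on the whole group:
out["x_1_wm"] = gb.apply(lambda g: (g["x_1"] * g["val_1"]).sum()
                                   / g["val_1"].sum())
print(out)
```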