I'm trying to port some code from R to Scala to perform customer analysis. I have already computed the Recency, Frequency and Monetary factors with Spark and stored them in a DataFrame.
Here is the schema of the DataFrame:
df.printSchema 
root
 |-- customerId: integer (nullable = false)
 |-- recency: long (nullable = false)
 |-- frequency: long (nullable = false)
 |-- monetary: double (nullable = false)
And here is a data sample as well:
df.orderBy($"customerId").show
+----------+-------+---------+------------------+
|customerId|recency|frequency|          monetary|
+----------+-------+---------+------------------+
|         1|    297|      114|            733.27|
|         2|    564|       11|            867.66|
|         3|   1304|        1|             35.89|
|         4|    287|       25|            153.08|
|         6|    290|       94|           316.772|
|         8|   1186|        3|            440.21|
|        11|    561|        5|            489.70|
|        14|    333|       57|            123.94|
For each column, I'm trying to find which interval of a quantile vector each value falls into, where the quantile vector is computed from a given probability segment.
In other words, given a vector v of non-decreasing breakpoints (in my case, the quantile vector), find the interval containing each element of x;
i.e. (pseudo-code),
if i <- findInterval(x,v), 
for each index j in x 
    v[i[j]] ≤ x[j] < v[i[j] + 1] where v[0] := - Inf, v[N+1] := + Inf, and N <- length(v). 
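To make these semantics concrete, here is a minimal plain-Scala sketch of the same rule. It is an illustration only, not Spark code; the object name `FindIntervalSketch` is hypothetical, and it assumes `v` is already sorted in non-decreasing order and that the half-open intervals of the pseudo-code above are wanted.

```scala
object FindIntervalSketch {
  // Returns i such that v(i) <= x < v(i + 1) using 1-based breakpoint indices,
  // with v(0) = -Inf and v(N + 1) = +Inf. For a sorted v this is simply the
  // number of breakpoints that are <= x.
  def findInterval(x: Double, v: Seq[Double]): Int =
    v.count(_ <= x)

  // Vectorised form: one interval index per element of xs.
  def findInterval(xs: Seq[Double], v: Seq[Double]): Seq[Int] =
    xs.map(x => findInterval(x, v))
}
```

A value below the first breakpoint gets index 0, and a value at or above the last breakpoint gets index N.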
In R, this translates to the following code :
probSegment <- c(0.0, 0.25, 0.50, 0.75, 1.0)
RFM_table$Rsegment <- findInterval(RFM_table$Recency, quantile(RFM_table$Recency, probSegment)) 
RFM_table$Fsegment <- findInterval(RFM_table$Frequency, quantile(RFM_table$Frequency, probSegment)) 
RFM_table$Msegment <- findInterval(RFM_table$Monetary, quantile(RFM_table$Monetary, probSegment))
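For reference, this is roughly what the R code computes, written as a local plain-Scala sketch over values already collected into memory. It is not a distributed Spark solution. It assumes R's default quantile definition (type 7, linear interpolation between order statistics) and probabilities in [0, 1]; the names `QuantileSketch`, `quantile` and `segments` are hypothetical.

```scala
object QuantileSketch {
  // Sample quantiles using linear interpolation between order statistics
  // (assumed to match R's default, type 7).
  def quantile(xs: Seq[Double], probs: Seq[Double]): Seq[Double] = {
    require(xs.nonEmpty, "xs must not be empty")
    require(probs.forall(p => p >= 0.0 && p <= 1.0), "probs must lie in [0, 1]")
    val sorted = xs.sorted.toIndexedSeq
    val n = sorted.length
    probs.map { p =>
      val h = (n - 1) * p                 // fractional 0-based position
      val lo = math.floor(h).toInt        // order statistic just below h
      val hi = math.min(lo + 1, n - 1)    // guard the upper end when p == 1
      sorted(lo) + (h - lo) * (sorted(hi) - sorted(lo))
    }
  }

  // Segment of each value: the number of quantile breakpoints <= the value,
  // which is the findInterval rule with half-open intervals.
  def segments(xs: Seq[Double], probs: Seq[Double]): Seq[Int] = {
    val breaks = quantile(xs, probs)
    xs.map(x => breaks.count(_ <= x))
  }
}
```

With probSegment as above, the minimum of a column lands in segment 1 and the maximum in segment 5, because the last breakpoint equals the maximum.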
I'm kind of stuck with the quantile function though.
In an earlier discussion with @zero323, he suggested that I use the percentRank window function as a shortcut. I'm not sure that I can apply the percentRank function in this case.
How can I apply a quantile function to a DataFrame column with Scala Spark? If this is not possible, can I use the percentRank function instead?
Thanks.