How to do discretization of continuous attributes in sklearn?

Question

My data consists of a mix of continuous and categorical features. Below is a small snippet of what my data looks like in CSV format (consider it data collected by a superstore chain that operates stores in different cities).

city,avg_income_in_city,population,square_feet_of_store_area,  store_type ,avg_revenue
NY  ,54504            , 3506908   ,3006                       ,INDOOR    , 8000091
CH  ,44504            , 2505901   ,4098                       ,INDOOR    , 4000091
HS  ,50134            , 3206911   ,1800                       ,KIOSK     , 7004567
NY  ,54504            , 3506908   ,1000                       ,KIOSK     , 2000091

Here you can see that avg_income_in_city, square_feet_of_store_area and avg_revenue are continuous values, whereas city, store_type, etc. are categorical classes (there are a few more that I have not shown here, to keep the data brief).

I wish to model the data in order to predict the revenue. The question is how to discretize the continuous values using sklearn. Does sklearn provide any ready-made class/method for discretization of continuous values? (Like the one in Orange, e.g. Orange.Preprocessor_discretize(data, method=orange.EntropyDiscretization()).)

Thanks !

I don't see why you should bin/discretize the continuous variables. That's throwing away information. — Fred Foo, Apr 26 '14 at 21:21
I guess it depends on the type of data you are working with and how good subsequent mechanisms in your pipeline are at exploiting this information. Sometimes vector quantization or generally clustering as preprocessing can make representations a lot more stable. — eickenberg, Apr 27 '14 at 12:04

score 12 · Answer 1 · edited May 26 '21 at 06:53

Update (Sep 2018): As of version 0.20.0, there is a class, sklearn.preprocessing.KBinsDiscretizer, which discretizes continuous features using one of a few different strategies:

Uniformly-sized bins
Bins with "equal" numbers of samples inside (as much as possible)
Bins based on K-means clustering
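To make the three strategies concrete, here is a minimal sketch that runs each of them on the same skewed column. The data values and variable names are made up for illustration; only KBinsDiscretizer and its n_bins, encode, strategy parameters and bin_edges_ attribute come from scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative data (made up): one skewed feature, shape (n_samples, 1).
X = np.array([[1.0], [2.0], [3.0], [4.0], [50.0], [100.0]])

results = {}
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    # fit_transform returns a float array of bin indices, one column per feature.
    codes = disc.fit_transform(X).ravel().astype(int)
    results[strategy] = codes
    # bin_edges_[0] holds the learned edges for the first (only) feature.
    print(strategy, codes, disc.bin_edges_[0])
```

With skewed data like this, the uniform strategy puts most samples into the first bin, while the other two adapt their edges to the data.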

Unfortunately, at the moment, the class does not accept custom intervals (which is a bummer for me, as that is what I wanted and the reason I ended up here). If you want custom intervals, you can use the pandas function cut:

import numpy as np
import pandas as pd
n_samples = 10
a = np.random.randint(0, 10, n_samples)

# say you want to split at 1 and 3
boundaries = [1, 3]
# add min and max values of your data
boundaries = sorted({a.min(), a.max() + 1} | set(boundaries))

a_discretized_1 = pd.cut(a, bins=boundaries, right=False)
a_discretized_2 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False)
a_discretized_3 = pd.cut(a, bins=boundaries, labels=range(len(boundaries) - 1), right=False).astype(float)
print(a, '\n')
print(a_discretized_1, '\n', a_discretized_1.dtype, '\n')
print(a_discretized_2, '\n', a_discretized_2.dtype, '\n')
print(a_discretized_3, '\n', a_discretized_3.dtype, '\n')

which produces:

[2 2 9 7 2 9 3 0 4 0]

[[1, 3), [1, 3), [3, 10), [3, 10), [1, 3), [3, 10), [3, 10), [0, 1), [3, 10), [0, 1)]
Categories (3, interval[int64]): [[0, 1) < [1, 3) < [3, 10)]
 category

[1, 1, 2, 2, 1, 2, 2, 0, 2, 0]
Categories (3, int64): [0 < 1 < 2]
 category

[1. 1. 2. 2. 1. 2. 2. 0. 2. 0.]
 float64

Note that when given a NumPy array, as here, pd.cut returns a pd.Categorical (a pd.Series of dtype category if you pass a Series) whose categories are of type interval[int64]. If you specify your own labels, the result is still categorical, but the categories are of type int64. If you want a plain numeric array, convert it with .astype(float), as in the third example, or with .astype(np.int64).

My example uses integer data, but it should work just as well with floats.
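For what it's worth, here is a small float sketch of the same idea. The values and edges are made up; labels=False asks pd.cut for plain integer bin codes, and np.digitize on the inner edges is shown as an equivalent NumPy-only route.

```python
import numpy as np
import pandas as pd

# Illustrative float data and hand-picked edges (both made up for this sketch).
a = np.array([0.5, 1.0, 2.9, 3.0, 7.25])
edges = [0.0, 1.0, 3.0, 10.0]  # outer edges must cover the whole data range

# labels=False returns integer bin codes instead of a Categorical;
# right=False makes the bins closed on the left: [0, 1), [1, 3), [3, 10).
codes_cut = pd.cut(a, bins=edges, labels=False, right=False)

# np.digitize with only the inner edges gives the same left-closed codes.
codes_np = np.digitize(a, edges[1:-1])

print(codes_cut)  # [0 1 1 2 2]
print(codes_np)   # [0 1 1 2 2]
```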

score 10 · Answer 2 · answered Apr 26 '14 at 07:59

The answer is no. There is no binning in scikit-learn. As eickenberg said, you might want to use np.histogram. Features in scikit-learn are assumed to be continuous, not discrete. The main reason there is no binning is probably that most of sklearn was developed on text, image features, or datasets from the scientific community. In these settings, binning is rarely helpful. Do you know of a freely available dataset where binning is really beneficial?

Andreas Mueller

This sounds like the actual answer to the question, i.e. "No". – eickenberg Apr 27 '14 at 12:01
Hi Andreas, I have not tried binning on any freely available datasets myself, but you can check the Titanic dataset, where Sex, Class, etc. are categorical features and Age is a real number. The target, survived=yes/no, is again categorical. In this case, if you have to run a classification algorithm (such as a decision tree), it makes sense to _bin_ the feature 'Age'. Does that help? – data_learner May 01 '14 at 09:48
Actually I don't think binning makes sense with trees, but it might help with linear classifiers on this dataset. – Andreas Mueller May 01 '14 at 12:32
There have been some discretization algorithms built for discretization of continuous attributes, with the application to trees. Check my scikit-learn issue for more information: https://github.com/scikit-learn/scikit-learn/issues/4468 – hlin117 May 31 '15 at 22:57

score 5 · Answer 3 · eickenberg · 2014-04-27T12:08:25.513

You may also consider rendering the categorical variables numerical, e.g. via indicator variables, a procedure also known as one-hot encoding.

Try

from sklearn.preprocessing import OneHotEncoder

and fit it to your categorical data, followed by a numerical estimation method such as linear regression. As long as there aren't too many categories (city may be a little too much), this can work well.
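A minimal sketch of that idea on the four rows from the question, assuming pandas and scikit-learn 0.20 or later. The ColumnTransformer wiring and the variable names are my own illustration, not something the question prescribes, and four rows are far too few for a meaningful model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# The sample rows from the question.
df = pd.DataFrame({
    "city": ["NY", "CH", "HS", "NY"],
    "avg_income_in_city": [54504, 44504, 50134, 54504],
    "population": [3506908, 2505901, 3206911, 3506908],
    "square_feet_of_store_area": [3006, 4098, 1800, 1000],
    "store_type": ["INDOOR", "INDOOR", "KIOSK", "KIOSK"],
    "avg_revenue": [8000091, 4000091, 7004567, 2000091],
})
X = df.drop(columns="avg_revenue")
y = df["avg_revenue"]

# One-hot encode the categorical columns; pass the numeric ones through as-is.
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["city", "store_type"])],
    remainder="passthrough",
)
encoded = encode.fit_transform(X)
print(encoded.shape)  # 3 city + 2 store_type + 3 numeric columns

# Chain the encoding with a linear model.
model = make_pipeline(encode, LinearRegression())
model.fit(X, y)
preds = model.predict(X)
print(preds)
```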

As for discretization of continuous variables, you may consider binning with an adaptive bin size or, equivalently, uniform binning after histogram normalization. numpy.histogram may be helpful here. Also, while Fayyad-Irani discretization isn't implemented in sklearn, feel free to check out sklearn.cluster for adaptive discretizations of your data (even if it is only 1D), e.g. via KMeans.
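Both routes can be sketched roughly as follows. The first part bins the store areas from the question into equal-width bins using the edges from numpy.histogram; the second clusters a made-up 1D feature with KMeans and relabels the clusters so that bin 0 is the lowest. The relabelling step and all variable names are my own assumptions, not a scikit-learn convention.

```python
import numpy as np
from sklearn.cluster import KMeans

# --- Equal-width bins from numpy.histogram ---
area = np.array([1000.0, 1800.0, 3006.0, 4098.0])  # store areas from the question
counts, edges = np.histogram(area, bins=3)
# Digitizing on the inner edges yields bin indices 0..2 for every sample.
hist_codes = np.digitize(area, edges[1:-1])
print(hist_codes)  # [0 0 1 2]

# --- Adaptive bins from 1D k-means (values made up for illustration) ---
values = np.array([1.0, 1.2, 0.9, 7.9, 8.1, 8.0])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values.reshape(-1, 1))

# Cluster labels are arbitrary, so rank the clusters by their centre
# to get ordered bin codes (0 = lowest centre).
order = np.argsort(km.cluster_centers_.ravel())
rank = np.empty_like(order)
rank[order] = np.arange(len(order))
km_codes = rank[km.labels_]
print(km_codes)  # [0 0 0 1 1 1]
```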

edited Apr 27 '14 at 12:08

answered Apr 24 '14 at 11:56

eickenberg


I was looking for some ready made class in sklearn but seems like there isn't a ready made one. I will do some _binning_ of continuous values (with some clustering algo) and fit the data, but again that's some work ! – data_learner Apr 24 '14 at 12:22
You may want to try [`numpy.histogram`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html). I am not sure if there are smart/efficient ways for calculating histograms that work well for information retrieval. – eickenberg Apr 24 '14 at 12:28

score 3 · Answer 4 · answered Sep 05 '17 at 08:23

You could use the pandas.cut method, like this:

bins = [0, 4, 10, 30, 45, 99999]
labels = ['Very_Low_Fare', 'Low_Fare', 'Med_Fare', 'High_Fare','Very_High_Fare']
train_orig.Fare[:10]
Out[0]: 
0     7.2500
1    71.2833
2     7.9250
3    53.1000
4     8.0500
5     8.4583
6    51.8625
7    21.0750
8    11.1333
9    30.0708
Name: Fare, dtype: float64

pd.cut(train_orig.Fare, bins=bins, labels=labels)[:10]
Out[50]: 
0          Low_Fare
1    Very_High_Fare
2          Low_Fare
3    Very_High_Fare
4          Low_Fare
5          Low_Fare
6    Very_High_Fare
7          Med_Fare
8          Med_Fare
9         High_Fare
Name: Fare, dtype: category
Categories (5, object): [High_Fare < Low_Fare < Med_Fare < Very_High_Fare < Very_Low_Fare]
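Since train_orig is not defined in the snippet above, here is a self-contained sketch that rebuilds the same ten Fare values as a Series and applies the same bins and labels. The Series construction is my own stand-in for the original DataFrame column.

```python
import pandas as pd

# The ten Fare values shown above, rebuilt as a standalone Series.
fare = pd.Series(
    [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708],
    name="Fare",
)

bins = [0, 4, 10, 30, 45, 99999]
labels = ["Very_Low_Fare", "Low_Fare", "Med_Fare", "High_Fare", "Very_High_Fare"]

# Bins are right-closed by default: (0, 4], (4, 10], (10, 30], (30, 45], (45, 99999].
fare_binned = pd.cut(fare, bins=bins, labels=labels)
print(fare_binned)
```

One thing to watch: because the bins are right-closed, a fare of exactly 0 falls outside (0, 4] and becomes NaN unless you pass include_lowest=True.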

score 1 · Answer 5 · answered Aug 19 '20 at 08:49

Thanks to the ideas above;

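To discretize continuous values, you can use:

the pandas cut or qcut functions (the input array must be 1-dimensional)

or

sklearn's KBinsDiscretizer class (with the encode parameter set to 'ordinal')
- strategy='uniform' produces equal-width bins, much like pd.cut with an integer number of bins
- strategy='quantile' produces equal-frequency bins, much like pd.qcut

The results are not guaranteed to be identical: KBinsDiscretizer bins are closed on the left, while pd.cut and pd.qcut bins are closed on the right by default, so values that fall exactly on a bin edge can land in different bins.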

Since examples for cut/qcut are provided in previous answers, here is a clean example of KBinsDiscretizer:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

A = np.array([[24,0.2],[35,0.3],[74,0.4], [96,0.5],[2,0.6],[39,0.8]])
print(A)
# [[24.   0.2]
#  [35.   0.3]
#  [74.   0.4]
#  [96.   0.5]
#  [ 2.   0.6]
#  [39.   0.8]]


enc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
enc.fit(A)
print(enc.transform(A))
# [[0. 0.]
#  [1. 0.]
#  [2. 1.]
#  [2. 1.]
#  [0. 2.]
#  [1. 2.]]

As shown in the output, each feature has been discretized into 3 bins. Hope this helped :)

Final notes:

To compare cut versus qcut, see this post
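KBinsDiscretizer(encode='onehot') one-hot encodes the bins it creates from a continuous feature; to one-hot encode features that are already categorical (such as city or store_type), use sklearn.preprocessing.OneHotEncoder instead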
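On how closely KBinsDiscretizer matches pd.cut: the sketch below compares the two on a small made-up array whose values sit exactly on the bin edges. Both use the same equal-width edges here, but samples on an inner edge go to different bins because of the left-closed versus right-closed convention. I would expect the same kind of edge effect between strategy='quantile' and pd.qcut, though I have not shown that here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Made-up data chosen so that 2.0 and 4.0 fall exactly on the inner bin edges.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# pandas: three equal-width, right-closed bins, roughly (.., 2], (2, 4], (4, 6].
cut_codes = pd.cut(x, bins=3, labels=False)

# scikit-learn: three equal-width, left-closed bins [0, 2), [2, 4), [4, 6].
kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
kbd_codes = kbd.fit_transform(x.reshape(-1, 1)).ravel().astype(int)

print(cut_codes)  # [0 0 0 1 1 2 2]
print(kbd_codes)  # [0 0 1 1 2 2 2]
```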

I'm not getting the same results between `pd.qcut` and `KBinsDiscretizer` `quantile` for a simple 1D array (e.g. `np.random.rand(10,1)`). Are you sure they should be identical? — m_power, Jun 20 '21 at 15:43
