for loop to find cluster centers

Question

I have clustered a DataFrame and then used groupby to group it by the resulting 'clusters' value:

clusterGroup = df1.groupby('clusters')

Each group in clusterGroup has multiple rows (and ~30 columns), and I need to create a new dataframe with a single row per group that represents the cluster center of that group. I'm using KMeans to do this, specifically ".cluster_centers_". The idea was to loop through each group, calculate the cluster center, and then append it to a new dataframe called logCenters.

df1.head()

9367    13575   13577   13578   13580   13585   13587   13588   13589   13707   13708   13719   13722   13725   13817   13819   14894   20326   20379   20384   20431   20433   22337   22346   22386   22388   22391   clusters
493 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 105.0   0.0 0.0 0.0 0.0 0.0 0.0 112.0   0.0 107.0   0.0 0.0 0.0 14
510 0.0 0.0 0.0 113.0   0.0 0.0 111.0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 105.0   0.0 0.0 0.0 0.0 0.0 26
513 0.0 0.0 0.0 114.0   0.0 0.0 106.0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 106.0   0.0 0.0 0.0 0.0 0.0 26
516 0.0 0.0 0.0 114.0   0.0 0.0 111.0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 108.0   0.0 0.0 0.0 0.0 0.0 26
519 0.0 0.0 0.0 113.0   0.0 0.0 113.0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 109.0   0.0 0.0 0.0 0.0 0.0 26


from sklearn.cluster import KMeans
K = 1
logCenters = []
for x in clusterGroup:
    kmeans_model = KMeans(n_clusters=K).fit(x)
    centers = np.array(kmeans_model.cluster_centers_)
    logCenters.append(centers)

The error I get when running this loop is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-108-148e4053f5fb> in <module>()
      3 logCenters = []
      4 for x in clusterGroup:
----> 5     kmeans_model = KMeans(n_clusters=K).fit(x)
      6     centers = np.array(kmeans_model.cluster_centers_)
      7     logCenters.append(centers)

/home/nbuser/anaconda3_23/lib/python3.4/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y)
    878         """
    879         random_state = check_random_state(self.random_state)
--> 880         X = self._check_fit_data(X)
    881 
    882         self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_ = \

/home/nbuser/anaconda3_23/lib/python3.4/site-packages/sklearn/cluster/k_means_.py in _check_fit_data(self, X)
    852     def _check_fit_data(self, X):
    853         """Verify that the number of samples given is larger than k"""
--> 854         X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
    855         if X.shape[0] < self.n_clusters:
    856             raise ValueError("n_samples=%d should be >= n_clusters=%d" % (

/home/nbuser/anaconda3_23/lib/python3.4/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: setting an array element with a sequence.

Post the output of `df1.head()` so we can get to know your data. – Sociopath, Feb 03 '18 at 09:01
What is `x` here? Is it a list or a numpy array? If it is an array, what is `x.dtype`? – Pratik Kumar, Feb 03 '18 at 09:31
Added df1.head(). Spent 30 mins trying to get the formatting better. Not easy! – Mat, Feb 03 '18 at 09:42
`x.dtype` raises `AttributeError: 'tuple' object has no attribute 'dtype'` – Mat, Feb 03 '18 at 10:09

Pratik Kumar · Answer 1 · 2018-02-03T11:17:58.607

-1

clusterGroup = df1.groupby('clusters') returns a GroupBy object, not a DataFrame (see here).

sklearn works with numpy arrays or pandas dataframes,

but iterating over the GroupBy object yields (key, group) tuples, and you're feeding those tuples to fit. Hence the error: ValueError: setting an array element with a sequence. (refer to this)
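A small sketch of what the loop variable actually holds may help. The frame `toy` and its column names `a` and `b` are made up for illustration and only stand in for df1:

```python
import pandas as pd

# Hypothetical toy frame standing in for df1 (two feature columns instead of ~30).
toy = pd.DataFrame({
    "a": [0.0, 0.0, 113.0, 114.0],
    "b": [105.0, 107.0, 0.0, 0.0],
    "clusters": [14, 14, 26, 26],
})

# Iterating over a GroupBy object yields (group_key, sub_DataFrame) pairs.
items = list(toy.groupby("clusters"))
first = items[0]

print(type(first))     # a tuple, which is what KMeans.fit received in the question
print(first[0])        # the group key
print(type(first[1]))  # the rows of that group as a DataFrame
```

So `x` in the question's loop is the whole pair, not the group's rows.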

Try passing each group to fit as a dataframe again; this link may help you debug.
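One way this could look, as a rough sketch rather than a tested fix for the original data: unpack the tuple in the loop header and fit on the group's rows only. The frame `toy`, its columns `a` and `b`, and the choice to drop the 'clusters' column before fitting are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data standing in for df1.
toy = pd.DataFrame({
    "a": [0.0, 0.0, 113.0, 114.0, 114.0],
    "b": [105.0, 107.0, 0.0, 0.0, 0.0],
    "clusters": [14, 14, 26, 26, 26],
})
feature_cols = toy.columns.drop("clusters")

K = 1
labels, rows = [], []
for label, group in toy.groupby("clusters"):  # unpack the (key, DataFrame) tuple
    features = group[feature_cols]            # assumption: keep the label out of the fit
    model = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)
    labels.append(label)
    rows.append(model.cluster_centers_[0])    # the single center of this group

# One row per cluster, with the same columns as the features.
logCenters = pd.DataFrame(rows, index=labels, columns=feature_cols)

# With K = 1 the k-means center is the mean of the group's rows,
# so a plain groupby mean should give the same values.
groupMeans = toy.groupby("clusters").mean()
print(logCenters)
print(groupMeans)
```

Since K is 1 here, `toy.groupby("clusters").mean()` alone may be all that is needed; the KMeans loop only becomes necessary if more than one center per group is wanted.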

edited Feb 03 '18 at 11:17

answered Feb 03 '18 at 11:03

Pratik Kumar


Do you mean that I should create an array or dataframe on each pass of the loop, do the .cluster_centers_ work on that small array or dataframe, then move on to the next cluster, create a new array or dataframe from it, and do it all again? – Mat Feb 04 '18 at 18:57
@Mat yes, try something like that; the last link may be of some help. – Pratik Kumar Feb 04 '18 at 19:10
