Python module providing a bridge between Scikit-Learn’s Machine Learning methods and pandas-style DataFrames
Questions tagged [sklearn-pandas]
1336 questions
                    
98 votes, 6 answers
            
        How to one-hot-encode from a pandas column containing a list?
I would like to break down a pandas column consisting of a list of elements into as many columns as there are unique elements i.e. one-hot-encode them (with value 1 representing a given element existing in a row and 0 in the case of absence). 
For…
         
    
    
        Melsauce
        
- 2,535
- 2
- 19
- 39
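A minimal sketch of one way to approach the one-hot-encoding question above, assuming scikit-learn's MultiLabelBinarizer is acceptable. The column names and data are hypothetical, since the excerpt is truncated before the asker's example.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical data: "Col3" holds a list of labels per row.
df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col3": [["Apple", "Orange"], ["Apple", "Grape"], ["Banana"]],
})

# One indicator column per unique label: 1 if present in the row, else 0.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df["Col3"]),
    columns=mlb.classes_,
    index=df.index,
)

# Replace the list column with the indicator columns.
result = df.drop(columns="Col3").join(encoded)
print(result)
```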
42 votes, 4 answers
            
        Sklearn plot_tree plot is too small
I have this simple code:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
tree.plot_tree(clf.fit(X, y))
plt.show()
And the result I get is this graph:
How do I make this graph legible? I'm using PyCharm Professional 2019.3 as my IDE.
         
    
    
        Artur
        
- 614
- 1
- 6
- 9
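One possible way to get a larger plot for the question above is to create the figure explicitly and pass its axes to plot_tree. The sketch below assumes the iris dataset as stand-in data and a non-interactive backend; the figure size, dpi, and font size are arbitrary choices, not values from the question.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; drop this in an IDE
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

# Stand-in data; the question's X and y are not shown.
X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# A larger canvas and an explicit font size keep the node text readable.
fig, ax = plt.subplots(figsize=(16, 10), dpi=100)
annotations = tree.plot_tree(clf, ax=ax, filled=True, fontsize=10)
# plt.show() would display the figure in an interactive session.
```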
28 votes, 4 answers
            
        sklearn stratified sampling based on a column
I have a fairly large CSV file containing Amazon review data which I read into a pandas data frame. I want to split the data 80-20 (train-test), but while doing so I want to ensure that the split data proportionally represents the values of one…
         
    
    
        Azee.
        
- 703
- 1
- 5
- 12
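A sketch of a stratified 80-20 split for the question above, assuming the stratify parameter of train_test_split fits the use case. The DataFrame and the "rating" column are hypothetical; the excerpt does not name the column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical review data with an imbalanced "rating" column.
df = pd.DataFrame({
    "review": [f"text {i}" for i in range(100)],
    "rating": [1] * 20 + [5] * 80,
})

# stratify keeps the rating proportions the same in both parts.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["rating"], random_state=42
)

print(train["rating"].value_counts(normalize=True))
print(test["rating"].value_counts(normalize=True))
```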
26 votes, 2 answers
            
        python sklearn multiple linear regression display r-squared
I calculated my multiple linear regression equation and I want to see the adjusted R-squared. I know that the score function allows me to see r-squared, but it is not adjusted.
import pandas as pd #import the pandas module
import numpy as np
df =…
         
    
    
        jeangelj
        
- 4,338
- 16
- 54
- 98
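For the adjusted R-squared question above, one option is to compute it by hand from score() using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors. The data below is synthetic and only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 samples, 3 predictors, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)  # plain R-squared

# Adjusted R-squared penalizes the number of predictors p.
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, adj_r2)
```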
23 votes, 3 answers
            
        Using K-means with cosine similarity - Python
I am trying to implement the K-means algorithm in Python using cosine distance instead of Euclidean distance as the distance metric.
I understand that using a different distance function can be fatal and should be done carefully. Using cosine distance…
         
    
    
        ise372
        
- 231
- 1
- 2
- 5
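scikit-learn's KMeans only supports Euclidean distance, so a common workaround for the question above is to L2-normalize the rows first: for unit vectors, squared Euclidean distance equals 2(1 - cosine similarity). The sketch below is an approximation under that assumption, not a true spherical K-means (the centroids are not re-normalized), and the data is made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Made-up vectors: rows 0-1 point one way, rows 2-3 another,
# with very different magnitudes.
X = np.array([[1.0, 0.1], [10.0, 1.0], [0.1, 1.0], [1.0, 10.0]])

# Scale every row to unit length so only direction matters.
X_unit = normalize(X)  # L2 norm by default

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_unit)
print(labels)
```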
18 votes, 2 answers
            
        Multivariable/Multiple Linear Regression in Scikit Learn?
I have a dataset (dataTrain.csv & dataTest.csv) stored as .csv files in this format:
Temperature(K),Pressure(ATM),CompressibilityFactor(Z)
273.1,24.675,0.806677258
313.1,24.675,0.888394713
...,...,...
I am able to build a regression model and prediction…
         
    
    
        Drizzer Silverberg
        
- 193
- 1
- 1
- 7
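A sketch of a two-predictor regression for the question above. It reuses the column names from the excerpt, but the rows are synthetic (generated from an arbitrary linear rule), since only two data rows are shown; in practice the frame would come from pd.read_csv on the asker's files.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for dataTrain.csv; the coefficients are arbitrary.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "Temperature(K)": rng.uniform(270, 320, 40),
    "Pressure(ATM)": rng.uniform(20, 30, 40),
})
train["CompressibilityFactor(Z)"] = (
    0.002 * train["Temperature(K)"] - 0.01 * train["Pressure(ATM)"] + 0.5
)

# LinearRegression handles several predictors: pass them as columns of X.
features = ["Temperature(K)", "Pressure(ATM)"]
model = LinearRegression().fit(train[features], train["CompressibilityFactor(Z)"])

new_point = pd.DataFrame({"Temperature(K)": [300.0], "Pressure(ATM)": [25.0]})
pred = model.predict(new_point)
print(model.coef_, model.intercept_, pred)
```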
17 votes, 4 answers
            
        Scikit K-means clustering performance measure
I'm trying to do a clustering with K-means method but I would like to measure the performance of my clustering.
I'm not an expert but I am eager to learn more about clustering.
Here is my code:
import pandas as pd
from sklearn import…
         
    
    
        Viphone Rathikoun
        
- 187
- 1
- 1
- 5
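When no ground-truth labels are available, one commonly used internal measure for the question above is the silhouette score (range -1 to 1, higher is better). The sketch below assumes synthetic, well-separated blobs and compares a few values of k; it is one option among several, not the only valid metric.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Score each candidate number of clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```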
17 votes, 6 answers
            
        ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0.0
I have applied Logistic Regression on the train set after splitting the data set into test and train sets, but I got the above error. I tried to work it out, and when I tried to print my response vector y_train in the console it prints integer values…
         
    
    
        Amey Kumar Samala
        
- 904
- 1
- 7
- 20
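The excerpt is cut off before the cause is shown, so the following is only one plausible scenario for the error above: the labels are sorted and the training part is taken by slicing, leaving a single class. The sketch shows how np.unique can confirm this and how a shuffled, stratified split avoids it. All data is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels sorted by class, as can happen with an ordered file.
y = np.array([0.0] * 80 + [1.0] * 20)
X = np.arange(100).reshape(-1, 1)

# Taking the first rows as the training set yields only class 0.0.
y_head = y[:70]
print(np.unique(y_head))

# A shuffled, stratified split keeps both classes in the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
print(np.unique(y_train, return_counts=True))
```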
17 votes, 4 answers
            
        No module named 'pandas' in Pycharm
I have read all the topics about this, but I cannot solve my problem:
 Traceback (most recent call last):
 File "/home/.../.../.../reading_data.py", line 1, in <module>
 import pandas as pd
 ImportError: No module named pandas     
This is my… 
         
    
    
        ElenaPhys
        
- 443
- 2
- 5
- 16
16 votes, 2 answers
            
        How to normalize the Train and Test data using MinMaxScaler sklearn
I have a question that I have been trying to find an answer to. When I use,
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
df =…
         
    
    
        Tia
        
- 521
- 2
- 6
- 18
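A sketch of the usual pattern for the scaling question above: fit the scaler on the training data only, then reuse the same fitted scaler on the test data. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data.
X_train = np.array([[1.0], [5.0], [9.0]])
X_test = np.array([[3.0], [11.0]])

scaler = MinMaxScaler()
# Learn min and max from the training set only.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the training min and max to the test set; values outside
# the training range can fall outside [0, 1].
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.ravel(), X_test_scaled.ravel())
```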
16 votes, 1 answer
            
        'DataFrame' object has no attribute 'ravel' when transforming target variable?
I was fitting a logistic regression with a subset dataset. After splitting the dataset and fitting the model, I got the following error message:
/Users/Eddie/anaconda/lib/python3.4/site-packages/sklearn/utils/validation.py:526:…
         
    
    
        Edward Lin
        
- 609
- 1
- 9
- 16
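The error in the title above typically appears when ravel() is called on a DataFrame, which has no such method. A minimal sketch, assuming the target is held in a one-column DataFrame: convert it to a NumPy array first, or select the column as a Series. The column names are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data with a single feature and a binary target.
df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "target": [0, 0, 1, 1]})

X = df[["x"]]
y_frame = df[["target"]]      # DataFrame of shape (n, 1); has no .ravel()

# .values gives a NumPy array, which does support ravel().
y = y_frame.values.ravel()    # 1-D array of shape (n,)
# Selecting df["target"] directly would also give a 1-D target.

model = LogisticRegression().fit(X, y)
print(y.shape)
```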
16 votes, 1 answer
            
        Use FeatureUnion in scikit-learn to combine two pandas columns for tfidf
While using this as a model for spam classification, I'd like to add an additional feature of the Subject plus the body.
I have all of my features in a pandas dataframe. For example, the subject is df['Subject'], the body is df['body_text'] and the…
         
    
    
        BLodge
        
- 163
- 1
- 1
- 4
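The question above asks about FeatureUnion; the sketch below swaps in ColumnTransformer instead, which can route each DataFrame column to its own TfidfVectorizer and concatenate the results. The column names follow the excerpt, while the rows are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented rows using the column names from the question.
df = pd.DataFrame({
    "Subject": ["win money now", "meeting agenda"],
    "body_text": ["claim your prize today", "see attached notes"],
})

# A string (not a list) selector passes each column as 1-D text,
# which is what TfidfVectorizer expects.
combined = ColumnTransformer([
    ("subject", TfidfVectorizer(), "Subject"),
    ("body", TfidfVectorizer(), "body_text"),
])

features = combined.fit_transform(df)
print(features.shape)
```

The resulting matrix has one block of columns per text field, so a downstream classifier sees subject and body terms as separate features.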
15 votes, 4 answers
            
        What is the difference between X_test, X_train, y_test, y_train in sklearn?
I'm learning sklearn and I don't fully understand the difference between them, or why the function train_test_split() returns 4 outputs.
In the documentation, I found some examples, but they weren't enough to clear up my doubts.
Does the code use the X_train to…
         
    
    
        Jancer Lima
        
- 744
- 2
- 10
- 19
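A small sketch of how the four outputs asked about above are typically used: X_train and y_train fit the model, while X_test and y_test are held back to evaluate it on unseen data. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: features X and the target y to predict.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

# X_* hold features, y_* hold targets; *_train is for fitting,
# *_test is kept aside for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)  # learn from the train part
test_score = model.score(X_test, y_test)          # R-squared on unseen rows
print(X_train.shape, X_test.shape, test_score)
```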
14 votes, 3 answers
            
        Append tfidf to pandas dataframe
I have the following pandas structure:
col1 col2 col3 text
1    1    0    meaningful text
5    9    7    trees
7    8    2    text
I'd like to vectorise it using a tfidf vectoriser. This, however, returns a sparse matrix, which I can actually turn…
         
    
    
        lte__
        
- 7,175
- 25
- 74
- 131
14 votes, 2 answers
            
        How to load only column names from csv file (Pandas)?
I have a large csv file and don't want to load it fully into memory; I only need the column names from this csv file. How can I load them cleanly?
         
    
    
        Ivan Shelonik
        
- 1,958
- 5
- 25
- 49
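For the column-names question above, a minimal sketch using the nrows parameter of pd.read_csv, which reads only the header when set to 0. An in-memory buffer stands in for the large file here; a file path would be passed the same way.

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk.
csv_source = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

# nrows=0 parses the header line and no data rows.
columns = pd.read_csv(csv_source, nrows=0).columns.tolist()
print(columns)
```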