Questions tagged [dask-dataframe]
403 questions
                    
6 votes, 1 answer
Read group of rows from Parquet file in Python Pandas / Dask?
I have a Pandas dataframe that looks similar to this:
datetime                 data1  data2
2021-01-23 00:00:31.140     a1     a2
2021-01-23 00:00:31.140     b1     b2       
2021-01-23 00:00:31.140     c1     c2
2021-01-23 00:01:29.021     d1    …
         
    
    
Mike (reputation 155; badges 2, 8)

6 votes, 2 answers
How to create unique index in Dask DataFrame?
Imagine I have a Dask DataFrame from read_csv or created another way.
How can I make a unique index for the dask dataframe?
Note:
reset_index builds a monotonically ascending index in each partition. That means (0,1,2,3,4,5,... ) for Partition…
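The note above describes every partition restarting its index at 0. One common way around that is to shift each partition's local index by the number of rows in all earlier partitions. The following is only a sketch of that idea on plain Python lists standing in for partitions; it does not call the Dask API, and the names `partitions` and `global_index` are hypothetical.

```python
from itertools import accumulate


def global_index(partitions):
    """Return, for each partition, a list of indices that are unique overall."""
    lengths = [len(p) for p in partitions]
    # Offset of a partition = total number of rows in all earlier partitions.
    offsets = [0] + list(accumulate(lengths))[:-1]
    return [list(range(off, off + n)) for off, n in zip(offsets, lengths)]


# Hypothetical data: three "partitions" of different sizes.
partitions = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
print(global_index(partitions))  # [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
```

The offsets require knowing every partition's length first, which is why a globally unique index cannot be built by looking at each partition in isolation.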
         
    
    
Spar (reputation 463; badges 1, 5, 23)

5 votes, 1 answer
Apply a function over the columns of a Dask array
What is the most efficient way to apply a function to each column of a Dask array? As documented below, I've tried a number of things but I still suspect that my use of Dask is rather amateurish.
I have a quite wide and quite long array, in the…
         
    
    
chameau13 (reputation 626; badges 7, 24)

5 votes, 1 answer
Implement Equal-Width Intervals feature engineering in Dask
In equal-width discretization, the variable values are assigned to intervals of the same width. The number of intervals is user-defined and the width is determined by the minimum/maximum values and the number of intervals.
For example, given the…
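As a rough illustration of the definition above, the width can be computed as (max - min) / number of intervals, and each value mapped to the interval it falls in. This is a minimal plain-Python sketch rather than a Dask implementation; the function name `equal_width_bins` and the sample values are assumptions made for the example.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width (0-based)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in the first interval.
        return [0 for _ in values]
    # The maximum would land in interval n_bins, so clamp it into the last one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical data: min 1, max 31, 3 intervals, so each interval is 10 wide.
values = [1, 5, 10, 15, 20, 25, 31]
print(equal_width_bins(values, 3))  # [0, 0, 0, 1, 1, 2, 2]
```

On a distributed dataframe the minimum and maximum would have to be computed over the whole column first, since the interval edges depend on them.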
         
    
    
ps0604 (reputation 1,227; badges 23, 133, 330)

5 votes, 0 answers
Dask distributed KeyError
I am trying to learn Dask using a small example. Basically I read in a file and calculate row means.
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(cores=4, memory='24 GB')
cluster.scale(4)
from dask.distributed import Client
client…
         
    
    
Phoenix Mu (reputation 648; badges 7, 12)

5 votes, 1 answer
Efficiently read big csv file by parts using Dask
I'm currently reading a big csv file using Dask and doing some postprocessing on it (for example, doing some math, then predicting with an ML model and writing the results to a database).
To avoid loading all the data into memory, I want to read it in chunks of a certain size: read first…
         
    
    
Mikhail_Sam (reputation 10,602; badges 11, 66, 102)

4 votes, 1 answer
Setting maximum number of workers in Dask map function
I have a Dask process that triggers 100 workers with a map function:
worker_args = .... # array with 100 elements with worker parameters 
futures = client.map(function_in_worker, worker_args) 
worker_responses = client.gather(futures)
I use docker…
         
    
    
ps0604 (reputation 1,227; badges 23, 133, 330)

4 votes, 2 answers
Is there a way to traverse through a dask dataframe backwards?
I want to read_parquet but read backwards from where you start (assuming a sorted index). I don't want to read the entire parquet into memory because that defeats the whole point of using it. Is there a nice way to do this?
         
    
    
Anina Hitt (reputation 61; badges 3)

4 votes, 1 answer
Get column value after searching for row in dask
I have a pandas dataframe that I converted to a dask dataframe using the from_pandas function of dask. It has 3 columns namely col1, col2 and col3.
Now I am searching for a specific row using daskdf[(daskdf.col1 == v1) & (daskdf.col2 == v2)] where…
         
    
    
Tanmay Bhatnagar (reputation 2,330; badges 4, 30, 50)

4 votes, 1 answer
Merging on columns with dask
I have a simple script currently written with pandas that I want to convert to dask dataframes.
In this script, I am executing a merge on two dataframes on user-specified columns and I am trying to convert it into dask.
def merge_dfs(df1, df2,…
         
    
    
Eliran Turgeman (reputation 1,526; badges 2, 16, 34)

4 votes, 1 answer
Dask Cluster: AttributeError: 'DataFrame' object has no attribute '_data'
I'm working with a Dask Cluster on GCP. I'm using this code to deploy it:
from dask_cloudprovider.gcp import GCPCluster
from dask.distributed import Client
enviroment_vars = {
    'EXTRA_PIP_PACKAGES': '"gcsfs"'
}
cluster = GCPCluster(
   …
         
    
    
Paula Vallejo (reputation 43; badges 6)

4 votes, 2 answers
How to read in csv to a DASK dataframe so it will not have “Unnamed: 0” column?
Goal
I want to read in a csv to a DASK dataframe without getting “Unnamed: 0” column.
CODE
mydtype = {'col1': 'object',
           'col2': 'object',
           'col3': 'object',
           'col4': 'float32',}
do =…
         
    
    
sogu (reputation 2,738; badges 5, 31, 90)

4 votes, 1 answer
Dask crashing when saving to file?
I'm trying to one-hot encode a dataset, then groupby a specific column so I can get one row for each item in that column with an aggregated view of which one-hot columns are true for that specific row. It seems to be working on small data and using…
         
    
    
Lostsoul (reputation 25,013; badges 48, 144, 239)

4 votes, 0 answers
Dask Dataframe from parquet files: OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
I'm generating a Dask dataframe to be used downstream in a clustering algorithm supplied by dask-ml. In a previous step in my pipeline I read a dataframe from disk using the dask.dataframe.read_parquet, apply a transformation to add columns using…
         
    
    
Michael Wheeler (reputation 849; badges 1, 10, 29)

4 votes, 3 answers
Dask: convert a dask.DataFrame to an xarray.Dataset
This is possible in pandas.
I would like to do it with dask.
Edit: raised on dask here
FYI you can go from an xarray.Dataset to a dask.DataFrame
Pandas solution using .to_xarray:
import pandas as pd
import numpy as np
df = pd.DataFrame([('falcon',…
         
    
    
Ray Bell (reputation 1,508; badges 4, 18, 45)