Data partitioning is the division of a collection of data into smaller collections, for the purpose of faster processing, easier statistics gathering, and a smaller memory or persistence footprint.
Questions tagged [data-partitioning]
337 questions
                    
80 votes, 14 answers
python equivalent of filter() getting two output lists (i.e. partition of a list)
Let's say I have a list, and a filtering function. Using something like
>>> filter(lambda x: x > 10, [1,4,12,7,42])
[12, 42]
I can get the elements matching the criterion. Is there a function I could use that would output two lists, one of elements…
F'x

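The two-list split asked about above can be sketched in a few lines. This is only an illustrative sketch, not an answer taken from the thread; the helper name `partition` is hypothetical and is not a Python built-in.

```python
def partition(predicate, items):
    """Split items into (matching, non_matching) in one pass, keeping order."""
    matching, non_matching = [], []
    for item in items:
        # Append to whichever list the predicate selects for this item.
        (matching if predicate(item) else non_matching).append(item)
    return matching, non_matching


big, small = partition(lambda x: x > 10, [1, 4, 12, 7, 42])
print(big)    # [12, 42]
print(small)  # [1, 4, 7]
```

A single loop evaluates the predicate once per element, whereas calling filter() twice would evaluate it twice.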
71 votes, 3 answers
Difference between df.repartition and DataFrameWriter partitionBy?
What is the difference between DataFrame repartition() and DataFrameWriter partitionBy() methods?
I hope both are used to "partition data based on dataframe column"? Or is there any difference?
Shankar

49 votes, 11 answers
C# - elegant way of partitioning a list?
I'd like to partition a list into a list of lists, by specifying the number of elements in each partition.
For instance, suppose I have the list {1, 2, ... 11}, and would like to partition it such that each set has 4 elements, with the last set…
David Hodgson

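The fixed-size partitioning described above might look like the following. The question concerns C#; this sketch uses Python purely for illustration, and the function name `chunk` is a hypothetical helper. It assumes the last partition simply holds whatever elements remain.

```python
def chunk(items, size):
    """Split a list into consecutive sublists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Step through the list in strides of `size`; the final slice may be shorter.
    return [items[i:i + size] for i in range(0, len(items), size)]


print(chunk(list(range(1, 12)), 4))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]
```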
35 votes, 6 answers
What is the best way to divide a collection into 2 different collections?
I have a Set of numbers:
 Set mySet = [ 1,2,3,4,5,6,7,8,9]
I want to divide it into 2 sets of odds and evens.
My approach was to use filter twice:
Set set1 = mySet.stream().filter(y -> y % 2 ==…
user1386966

21 votes, 5 answers
Create grouping variable for consecutive sequences and split vector
I have a vector, such as c(1, 3, 4, 5, 9, 10, 17, 29, 30) and I would like to group together the 'neighboring' elements that form a regular, consecutive sequence, i.e. an increase by 1, in a ragged vector resulting in: 
L1: 1
L2: 3,4,5
L3: 9,10
L4:…
letsrock

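One way to group neighboring values into consecutive runs is sketched below. The question is about R; Python is used here only for illustration, and `consecutive_runs` is a hypothetical name. The sketch relies on the observation that, within a run that increases by 1, the difference between each value and its position stays constant.

```python
from itertools import groupby


def consecutive_runs(values):
    """Group adjacent values that increase by exactly 1 into sublists."""
    runs = []
    # value - index is constant inside a run of consecutive integers,
    # so groupby starts a new group whenever that difference changes.
    for _, group in groupby(enumerate(values), key=lambda pair: pair[1] - pair[0]):
        runs.append([value for _, value in group])
    return runs


print(consecutive_runs([1, 3, 4, 5, 9, 10, 17, 29, 30]))
# [[1], [3, 4, 5], [9, 10], [17], [29, 30]]
```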
21 votes, 2 answers
Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?
I have a large JSON file with, I'm guessing, 4 million objects. Each top level has a few levels nested inside. I want to split that into multiple files of 10000 top level objects each (retaining the structure inside each). jq should be able to do…
Chaz

18 votes, 7 answers
QuickSort and Hoare Partition
I have a hard time translating QuickSort with Hoare partitioning into C code, and can't find out why.  The code I'm using is shown below:
void QuickSort(int a[],int start,int end) {
    int q=HoarePartition(a,start,end);
    if (end<=start) return;
…
Ofek Ron

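For reference, a common textbook formulation of quicksort with Hoare partitioning is sketched below. It is written in Python rather than C and is not the asker's code or an answer from the thread; the middle-element pivot and the function names are assumptions of this sketch. The base-case check runs before the partition call, and the recursion covers `[lo, p]` and `[p + 1, hi]`.

```python
def hoare_partition(a, lo, hi):
    """Hoare partition of a[lo..hi]; returns j such that every element of
    a[lo..j] is <= every element of a[j+1..hi]."""
    pivot = a[lo + (hi - lo) // 2]  # middle element as pivot (an assumption)
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i >= j:
            return j
        a[i], a[j] = a[j], a[i]


def quicksort(a, lo=0, hi=None):
    """Sort the list a in place."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:  # ranges of length 0 or 1 are already sorted
        return
    p = hoare_partition(a, lo, hi)
    quicksort(a, lo, p)       # note: p is included in the left half
    quicksort(a, p + 1, hi)


data = [5, 3, 8, 1, 9, 2, 7, 3]
quicksort(data)
print(data)  # [1, 2, 3, 3, 5, 7, 8, 9]
```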
17 votes, 2 answers
Querying Windows Azure Table Storage with multiple query criteria
I'm trying to query a table in Windows Azure storage and was initially using the TableQuery.CombineFilters in the TableQuery().Where function as follows:
TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey",…
Captain John

13 votes, 5 answers
How to sort an integer array into negative, zero, positive part without changing relative position?
Give an O(n) algorithm which takes as input an array S, then divides S into three sets: negatives, zeros, and positives. Show how to implement this in place, that is, without allocating new memory. And you have to keep the number's relative…
Gin

11 votes, 1 answer
What is the difference between partitioning and bucketing in Spark?
I am trying to optimize a join query between two Spark dataframes, let's call them df1 and df2 (joined on the common column "SaleId").
df1 is very small (5M) so I broadcast it among the nodes of the spark cluster.
df2 is very large (200M rows) so I tried to…
nofar mishraki

11 votes, 4 answers
How to write SQL query that selects distinct pair values for specific criteria?
I'm having trouble formulating a query for the following problem:
For pair values that have a certain score, how do you group them in a way that will only return distinct pair values with the best respective scores?
For example, let's say I have a…
Stephen Tableau

10 votes, 5 answers
3D clustering Algorithm
Problem Statement:
I have the following problem:
There are more than a billion points in 3D space. The goal is to find the top N points that have the largest number of neighbors within a given distance R. Another condition is that the distance between any…
Teng Lin

10 votes, 2 answers
Hashing VS Indexing
Both hashing and indexing are used to partition data based on some predefined formula, but I am unable to understand the key difference between the two.
As in hashing we are dividing the data on the basis of some key value pair, similarly in Indexing…
coolDude

10 votes, 2 answers
partitioning a float array into similar segments (clustering)
I have an array of floats like this:
[1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200]
Now, I want to partition the array like this:
[[1.91, 2.87, 3.61] , [10.91, 11.91, 12.82] , [100.73, 100.71, 101.89] , [200]]
// [200] will…
alessandro

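One simple way to produce the segments shown above is to start a new segment whenever the jump from the previous value exceeds a threshold. The sketch below is an illustration only, not the approach from the thread; the function name `split_on_gaps` and the threshold value 5.0 are assumptions chosen to suit this sample data, and real data would need its own threshold or a proper clustering method.

```python
def split_on_gaps(values, max_gap):
    """Start a new segment whenever a value differs from the previous
    one by more than max_gap (input order is preserved)."""
    segments = []
    for value in values:
        if segments and abs(value - segments[-1][-1]) <= max_gap:
            segments[-1].append(value)   # close enough: extend current segment
        else:
            segments.append([value])     # first value or large jump: new segment
    return segments


data = [1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200]
print(split_on_gaps(data, max_gap=5.0))
# [[1.91, 2.87, 3.61], [10.91, 11.91, 12.82], [100.73, 100.71, 101.89], [200]]
```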
9 votes, 4 answers
python: Generating integer partitions
I need to generate all the partitions of a given integer.
I found this algorithm by Jerome Kelleher, which is stated to be the most efficient one:
def accelAsc(n):
    a = [0 for i in range(n + 1)]
    k = 1
    a[0] = 0
    y = n - 1
   …
etuardu

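For comparison, a plain recursive generator of integer partitions is sketched below. This is not the accelAsc algorithm quoted above and makes no claim about efficiency; the name `integer_partitions` and the `max_part` parameter are assumptions of this sketch. Each partition is produced as a non-increasing list of parts.

```python
def integer_partitions(n, max_part=None):
    """Yield every partition of n as a non-increasing list of positive integers."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield []  # the empty partition completes a finished branch
        return
    # Choose the largest part first, then partition the remainder
    # using parts no larger than the one just chosen.
    for first in range(min(n, max_part), 0, -1):
        for rest in integer_partitions(n - first, first):
            yield [first] + rest


print(list(integer_partitions(4)))
# [[4], [3, 1], [2, 2], [2, 1, 1], [1, 1, 1, 1]]
```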