Hadoop partitioning concerns how Hadoop decides which key/value pairs are sent to which reducer (partition).
Questions tagged [hadoop-partitioning]
339 questions
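The default rule behind this tag is Hadoop's HashPartitioner, which assigns a pair to reducer `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. A simplified pure-Python model (Python's `hash` stands in for Java's `hashCode`; not Hadoop's actual code):

```python
def hash_partition(key, num_reducers):
    # Model of Hadoop's default HashPartitioner:
    # (hash & MAX_INT) % numReduceTasks; the mask keeps the result non-negative.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# Every pair sharing a key lands on the same reducer.
pairs = [("apple", 1), ("banana", 2), ("apple", 3)]
partitions = [hash_partition(k, 4) for k, _ in pairs]
assert partitions[0] == partitions[2]  # same key, same reducer
```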
27 votes · 2 answers
In Apache Spark, why does RDD.union not preserve the partitioner?
As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operation, so they are usually customized. I was experimenting with the following code:
val rdd1 =
  sc.parallelize(1 to 50).keyBy(_ % 10)
…
asked by tribbloid (4,026 rep)
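A pure-Python model (a conceptual sketch, not Spark's implementation) of why union cannot keep a partitioner: the result concatenates the two RDDs' partition lists, so the same key can end up in two different partitions, and no single partitioner describes the output.

```python
def partition(pairs, n):
    # Model an RDD hash-partitioned into n partitions.
    parts = [[] for _ in range(n)]
    for k, v in pairs:
        parts[k % n].append((k, v))
    return parts

a = partition([(1, "a"), (11, "b")], 10)
b = partition([(1, "c")], 10)

# The union simply concatenates the partition lists (now 20 partitions),
# so key 1 lives in partition 1 AND partition 11: no single partitioner
# describes the result, hence Spark drops it.
union = a + b
locations = [i for i, p in enumerate(union) if any(k == 1 for k, _ in p)]
print(locations)  # [1, 11]
```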
23 votes · 4 answers
What is the use of the grouping comparator in Hadoop MapReduce?
I would like to know why a grouping comparator is used in the secondary sort of MapReduce. According to the Definitive Guide example of secondary sorting:
We want the sort order for keys to be by year (ascending) and then by temperature (descending):
1900…
asked by Pramod (493 rep)
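The mechanism this question asks about can be simulated outside Hadoop. The sketch below (pure Python, with hypothetical sample data) mimics what a composite key plus a grouping comparator achieve: sort by (year, -temperature) so values arrive in descending temperature order, but group by year alone so one reduce call sees all of a year's values.

```python
from itertools import groupby

# Hypothetical (year, temperature) records, as in the Definitive Guide example.
records = [(1900, 35), (1901, 28), (1900, 34), (1901, 30), (1900, 36)]

# 1. Sort on the composite key: year ascending, temperature descending.
records.sort(key=lambda kv: (kv[0], -kv[1]))

# 2. "Grouping comparator": group on year only, ignoring temperature,
#    so each reduce call sees every temperature for one year.
for year, group in groupby(records, key=lambda kv: kv[0]):
    temps = [t for _, t in group]
    print(year, temps)  # the first element is that year's maximum
```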
22 votes · 5 answers
Hadoop MapReduce secondary sorting
Can anyone explain how secondary sorting works in Hadoop? Why must one use a GroupingComparator, and how does it work? I was going through the link given below and have doubts about how the grouping comparator works. Can anyone explain how…
asked by user1585111 (1,019 rep)
17 votes · 3 answers
Secondary sort in Hadoop
I am working on a Hadoop project, and after many visits to various blogs and reading the documentation, I realized I need to use the secondary sort feature provided by the Hadoop framework. My input format is of the form:
DESC(String) Price(Integer) and some…
asked by Abhishek Singh (275 rep)
12 votes · 5 answers
How the data is split in Hadoop
Does Hadoop split the data based on the number of mappers set in the program? That is, for a data set of size 500 MB, if the number of mappers is 200 (assuming the Hadoop cluster allows 200 mappers simultaneously), is each mapper given…
asked by HHH (6,085 rep)
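As a point of comparison for the question above: Hadoop derives the number of map tasks from the input splits (by default one per HDFS block), not from a requested mapper count. A quick sketch of the arithmetic, assuming the default 128 MB block size:

```python
import math

def num_splits(file_size_mb, split_size_mb=128):
    # Number of input splits (and hence map tasks) for one file,
    # assuming the split size equals the HDFS block size.
    return math.ceil(file_size_mb / split_size_mb)

# A 500 MB file yields 4 splits -> 4 mappers, regardless of how many
# mapper slots the cluster could run simultaneously.
print(num_splits(500))  # 4
```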
11 votes · 5 answers
Hadoop fs -du -h sorting by size for M, G, T, P, E, Z, Y
I am running this command:
sudo -u hdfs hadoop fs -du -h /user | sort -nr
and the output is not sorted in terms of megabytes, gigabytes, or terabytes. I found this command:
hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{ suffix="KMGT";…
asked by Mayur Narang (111 rep)
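One way to see why plain `sort -nr` fails on `-du -h` output: it compares only the leading number, so `9.5 G` outranks `1.2 T`. (On systems with GNU coreutils, `sort -h` handles human-readable sizes directly.) A pure-Python sketch of suffix-aware sorting, using made-up sample lines:

```python
# Byte multipliers for the suffixes `hadoop fs -du -h` emits.
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40,
         "P": 2**50, "E": 2**60, "Z": 2**70, "Y": 2**80}

def size_in_bytes(line):
    # Sort key for a du -h line of the form '<number> <suffix> <path>'.
    number, suffix = line.split()[:2]
    return float(number) * UNITS[suffix]

# Hypothetical output lines.
lines = ["9.5 G /user/a", "1.2 T /user/b", "800 M /user/c"]
print(sorted(lines, key=size_in_bytes, reverse=True))
# largest first: 1.2 T, then 9.5 G, then 800 M
```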
9 votes · 1 answer
Spark: can you include partition columns in output files?
I am using Spark to write out data into partitions. Given a dataset with two columns (foo, bar), if I do df.write.mode("overwrite").format("csv").partitionBy("foo").save("/tmp/output"), I get an output…
asked by erwaman (3,307 rep)
8 votes · 1 answer
Hive: when inserting into a partitioned table, Hive double URL-encodes the partition key column in most rows
I created a partitioned table:
create table t1 (amount double) partitioned by (events_partition_key string) stored as parquet;
added some data to tmp_table, where column 'events_partition_key' contains a timestamp (string type) in the following…
asked by marnun (808 rep)
8 votes · 4 answers
Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT….." in Hive?
I am trying to create a table in Hive:
CREATE TABLE BUCKET_TABLE AS
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NULL
CLUSTERED BY (key) INTO 1000 BUCKETS;
This syntax is failing, but I am not sure if it is even…
asked by Andrew (6,295 rep)
7 votes · 1 answer
How to check specific partition data from Spark partitions in PySpark
I have created two dataframes in PySpark from my Hive table as:
data1 = spark.sql("""
   SELECT ID, MODEL_NUMBER, MODEL_YEAR ,COUNTRY_CODE
   from MODEL_TABLE1 where COUNTRY_CODE in ('IND','CHN','USA','RUS','AUS')
""");
Each country has…
asked by vikrant rana (4,509 rep)
7 votes · 2 answers
Hive partition recovery
How can partitions be recovered easily? Here is the scenario:
Have 'n' partitions on existing external table 't'
Dropped table 't'
Recreated table 't' // Note: same table but excluding some columns
How to recover the 'n' partitions that…
asked by Balaji Boggaram Ramanarayan (7,731 rep)
6 votes · 2 answers
Efficient way of joining multiple tables in Spark - No space left on device
A similar question has been asked here, but it does not address my question properly. I have nearly 100 DataFrames, each with at least 200,000 rows, and I need to join them by doing a full join based on the column ID, thereby creating a…
asked by cph_sto (7,189 rep)
6 votes · 1 answer
HDInsight Hive: MSCK REPAIR TABLE table_name throwing error
I have an external partitioned table named employee with partition (year, month, day); every day a new file arrives and sits at that day's location, e.g. for today's date it will be at 2016/10/13.
TABLE SCHEMA:
create External table employee(EMPID…
asked by anand (503 rep)
6 votes · 1 answer
Hadoop - produce multiple values for a single key
I was able to successfully change the wordcount program in Hadoop to suit my requirement. However, I have another situation where I use the same key for 3 values. Let's say my input file is as below.
A Uppercase 1 firstnumber  I  romannumber a…
asked by Ramesh (765 rep)
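The pattern this question is reaching for can be simulated in-process: the mapper emits several (key, value) pairs, and the shuffle delivers them to the reducer grouped under one key. A toy pure-Python sketch with invented sample input (the real input format in the question is only partially shown):

```python
from collections import defaultdict

# Invented one-line sample, loosely shaped like the question's input.
line = "A Uppercase 1 firstnumber I romannumber"

def map_line(line):
    # Toy mapper: emit (key, value) once for every token after the key.
    key, *values = line.split()
    for value in values:
        yield key, value

# Shuffle: Hadoop groups all emitted values by key between map and reduce.
groups = defaultdict(list)
for k, v in map_line(line):
    groups[k].append(v)

# Reduce: the single key "A" arrives with all of its values at once.
print(dict(groups))
# {'A': ['Uppercase', '1', 'firstnumber', 'I', 'romannumber']}
```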
6 votes · 1 answer
Understanding a MapReduce algorithm for overlap calculation
I would like help understanding the algorithm. I've pasted the algorithm explanation first and then my doubts.
Algorithm (for calculating the overlap between record pairs): given a user-defined parameter K, the file DR (*format: record_id, data*) is split…
asked by Mahalakshmi Lakshminarayanan (714 rep)