Questions tagged [apache-spark-1.6]
111 questions

Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark, use the tag [apache-spark].
                    
36 votes, 3 answers
PySpark serialization EOFError
I am reading a CSV into a Spark DataFrame and performing machine learning operations on it. I keep getting a Python serialization EOFError - any idea why? I thought it might be a memory issue - i.e. the file exceeding available RAM - but drastically…
asked by Tom Wallace (383)
25 votes, 2 answers
Reading CSV into a Spark DataFrame with timestamp and date types
It's CDH with Spark 1.6. I am trying to import this hypothetical CSV into an Apache Spark DataFrame:
$ hadoop fs -cat test.csv
a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a
I use the databricks-csv jar.
val…
asked by Mihir Shinde (657)
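In Spark 1.6 the usual route is the spark-csv package with an explicit schema, since the schema is what drives the date and timestamp casting. A minimal sketch, assuming hypothetical column names and spark-csv 1.5+ (which, as far as I know, can cast to DateType and TimestampType, optionally via its dateFormat option):

import org.apache.spark.sql.types._

// Hypothetical column names; positions 4 and 6 match the sample rows above.
val schema = StructType(Seq(
  StructField("c1", StringType),
  StructField("c2", StringType),
  StructField("c3", StringType),
  StructField("event_date", DateType),
  StructField("c5", StringType),
  StructField("event_ts", TimestampType),
  StructField("c7", StringType)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .schema(schema)
  .load("hdfs:///path/to/test.csv") // hypothetical path
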
19 votes, 1 answer
Where is the reference for options for writing or reading per format?
I use Spark 1.6.1. We are trying to write an ORC file to HDFS using HiveContext and DataFrameWriter. While we can use
df.write().orc()
we would rather do something like
df.write().options(Map("format" -> "orc", "path" -> "/some_path"))
This…
asked by Satyam (645)
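In 1.6 the DataFrameWriter already takes the format and path as generic settings, just not bundled into one options map the way the question sketches it. A hedged equivalent of df.write().orc("/some_path"); available option keys are source-specific, and this assumes the writer resolves the path option, as it does when save(path) sets it:

df.write
  .format("orc")                 // pick the source by name instead of a dedicated method
  .mode("overwrite")             // SaveMode given as a string
  .option("path", "/some_path")
  .save()
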
15 votes, 2 answers
How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?
In Spark 1.6.0 / Scala, is there a way to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))?
asked by Dzmitry Haikov (199)
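To my knowledge, Spark 1.6 cannot evaluate collect_list or collect_set through the DataFrame window API (they are Hive-backed UDAFs there). One hedged workaround is to aggregate per group and join the result back, which yields the whole per-group set rather than a running, order-sensitive window:

import org.apache.spark.sql.functions.collect_set

// Requires a HiveContext in 1.6, since collect_set is a Hive UDAF there.
// df is the hypothetical frame from the question, with colA, colB, colC.
val sets = df.groupBy("colA").agg(collect_set("colC").as("colC_set"))
val result = df.join(sets, "colA")
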
14 votes, 1 answer
What to do with "WARN TaskSetManager: Stage contains a task of very large size"?
I use Spark 1.6.1. My Spark application reads more than 10000 Parquet files stored in S3.
val df = sqlContext.read.option("mergeSchema", "true").parquet(myPaths: _*)
myPaths is an Array[String] that contains the paths of the 10000 Parquet files.…
asked by reapasisow (275)
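That warning usually means the serialized task closure itself is large, often because a big driver-side object gets captured by a transformation. An illustrative sketch of the standard remedy, broadcasting the object instead (all names here are hypothetical):

// Hypothetical: a large driver-side lookup table used inside tasks.
val bigLookup: Map[String, Int] = loadLookup() // hypothetical helper

// Broadcast once, so tasks reference the shared copy instead of
// dragging the whole map along in every serialized task closure.
val lookupBc = sc.broadcast(bigLookup)
val resolved = keysRdd.map(k => lookupBc.value.getOrElse(k, 0)) // hypothetical RDD
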
11 votes, 2 answers
Spark CrossValidatorModel: how to access models other than the bestModel?
I am using Spark 1.6.1. Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the Model that performed best during the…
asked by MeiSign (1,487)
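In 1.6 the CrossValidatorModel keeps only bestModel, but avgMetrics is index-aligned with the estimator's param maps, so the per-combination scores survive. A sketch, assuming cv, cvModel, pipeline, and training from the surrounding pipeline code:

// One average metric per parameter combination, in param-map order.
cvModel.avgMetrics.zip(cv.getEstimatorParamMaps).foreach {
  case (metric, params) => println(s"$metric for $params")
}

// A non-best model has to be refit explicitly with its param map:
val secondModel = pipeline.fit(training, cv.getEstimatorParamMaps(1))
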
10 votes, 3 answers
Get first non-null values in group by (Spark 1.6)
How can I get the first non-null values from a group by? I tried using first with coalesce, F.first(F.coalesce("code")), but I don't get the desired behavior (I seem to get the first row).
from pyspark import SparkContext
from pyspark.sql import…
asked by Kamil Sindi (21,782)
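F.coalesce("code") with a single column is a no-op, which is why plain first() semantics show through. A hedged Scala sketch of one workaround in 1.6 (the ignoreNulls variant of first is not available there, as far as I know): drop the nulls before grouping, bearing in mind that first is not order-deterministic without a sort:

import org.apache.spark.sql.functions.first

// df is a hypothetical frame with columns id and code.
val firstCodes = df.filter(df("code").isNotNull)
  .groupBy("id")
  .agg(first("code").as("first_code"))
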
8 votes, 2 answers
Why does a Spark application on YARN fail with FetchFailedException due to Connection refused?
I am using Spark version 1.6.3 and YARN version 2.7.1.2.3, which comes with HDP-2.3.0.0-2557. Because the Spark version is too old in the HDP version that I use, I prefer to use another Spark in YARN mode remotely.
Here is how I run the Spark shell:
./spark-shell…
asked by Ahmet DAL (4,445)
7 votes, 2 answers
How to enable or disable Hive support in spark-shell through a Spark property (Spark 1.6)?
Is there any configuration property we can set to explicitly disable or enable Hive support through spark-shell in Spark 1.6? I tried to get all the sqlContext configuration properties with
sqlContext.getAllConfs.foreach(println)
But I am not…
asked by Krishna Reddy (1,069)
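I am not aware of a single 1.6 property that toggles this; in the 1.6 spark-shell, where sqlContext is a HiveContext whenever Hive classes are on the classpath, the practical workaround is to build a plain SQLContext yourself:

// A Hive-free SQL context alongside the shell's default one.
val plainSqlContext = new org.apache.spark.sql.SQLContext(sc)
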
7 votes, 1 answer
Dynamic Allocation for Spark Streaming
I have a Spark Streaming job running on our cluster alongside other jobs (Spark Core jobs). I want to use Dynamic Resource Allocation for these jobs, including Spark Streaming. According to the JIRA issue below, Dynamic Allocation is not supported for Spark…
asked by Akhila Lankala (193)
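Streaming-aware dynamic allocation (SPARK-12133) landed after 1.6, so on 1.6 only the core mechanism is available, with the caveats the JIRA describes for streaming jobs. A sketch of the core settings, assuming the external shuffle service is running on the cluster nodes (executor counts are hypothetical sizing):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")       // required by dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "2")   // hypothetical
  .set("spark.dynamicAllocation.maxExecutors", "20")  // hypothetical
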
7 votes, 3 answers
How to replace NULL with 0 in a left outer join in a Spark DataFrame (v1.6)
I am working with Spark v1.6. I have the following two DataFrames, and I want to convert the null to 0 in my left outer join result set. Any suggestions?
DataFrames
val x: Array[Int] = Array(1,2,3)
val df_sample_x = sc.parallelize(x).toDF("x")
val y:…
asked by Prasan (111)
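A sketch using the question's hypothetical frames: na.fill(0) zeroes every numeric null the outer join produces (coalesce(col, lit(0)) would do the same per column). The contents of y are guessed here, since the excerpt is truncated:

import sqlContext.implicits._

val df_sample_x = sc.parallelize(Array(1, 2, 3)).toDF("x")
val df_sample_y = sc.parallelize(Array(2, 3, 4)).toDF("y") // hypothetical values

// Unmatched left rows get null in y; fill them with 0.
val joined = df_sample_x.join(df_sample_y, df_sample_x("x") === df_sample_y("y"), "left_outer")
val zeroed = joined.na.fill(0)
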
7 votes, 2 answers
How to dynamically choose spark.sql.shuffle.partitions
I am currently processing data using Spark and, for each partition, opening a connection to MySQL and inserting into the database in batches of 1000. As mentioned in the Spark documentation, the default value of spark.sql.shuffle.partitions is 200, but I want…
asked by Naresh (5,073)
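The setting can be changed per job at runtime, before the shuffle is triggered. A sketch, with the target count as a hypothetical function of input size:

// Hypothetical sizing: one partition per ~1000 rows to match the batch size.
val desired = math.max(1, (inputRowCount / 1000).toInt) // inputRowCount is hypothetical
sqlContext.setConf("spark.sql.shuffle.partitions", desired.toString)

// Or repartition a single DataFrame explicitly instead:
val repartitioned = df.repartition(desired)
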
6 votes, 3 answers
How to register S3 Parquet files in a Hive Metastore using Spark on EMR
I am using Amazon Elastic MapReduce 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1.
Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names…
asked by Sam King (2,068)
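One hedged route in 1.6 is SQLContext.createExternalTable, which registers the existing S3 location in the metastore without copying data; this assumes a HiveContext so the definition is actually persisted (the names below are hypothetical):

sqlContext.createExternalTable(
  "my_table",                      // hypothetical table name
  "s3://my-bucket/path/to/data/",  // hypothetical S3 location
  "parquet")
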
6 votes, 2 answers
Running Spark job not shown in the UI
I have submitted my Spark job as mentioned here: bin/spark-submit --class DataSet BasicSparkJob-assembly-1.0.jar, without mentioning the --master parameter or the spark.master parameter. Instead of that, the job gets submitted to my 3-node Spark cluster. But I…
asked by Naresh (5,073)
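A common cause: when neither --master nor spark.master is set, spark-submit falls back to local mode, so the job never appears in the cluster UI. A sketch of pinning the master in code instead (the URL is a hypothetical standalone master):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DataSet")
  .setMaster("spark://master-host:7077") // hypothetical master URL
val sc = new SparkContext(conf)
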
5 votes, 2 answers
PySpark: How to use a row value from one column to access another column which has the same name as the row value
I have a PySpark df:
+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|
+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1|
|  1|  2| 43|  8| 10| 20| 43| e1|
|  2|  3| 15|  0|  1| 23|  7| b1|
|  3|  4|  2|  6| 11| …
asked by Mia21 (119)
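A Scala sketch of one common approach: build a when(...) per candidate column that fires only when ref names it, then coalesce to keep the single non-null hit per row (df and the column list mirror the question's hypothetical frame):

import org.apache.spark.sql.functions.{coalesce, col, lit, when}

val candidates = Seq("a1", "b1", "c1", "d1", "e1", "f1")
val picked = coalesce(candidates.map(c => when(col("ref") === lit(c), col(c))): _*)
val result = df.withColumn("ref_value", picked)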