I am new to AWS and I'm trying to create a parameterized AWS Glue job which should have input parameters:
- Datasource
- Data size
- Count
- Variable List
 
Has anyone done something similar before?
First of all, I am not sure that you will be able to limit the data by size; instead, I suggest limiting it by the number of rows. The first two variables can be passed into your job as I described in AWS Glue Job Input Parameters. As for the variable list, if the number of variables is large, I am afraid you will not be able to provide these inputs the standard way. In that case I suggest providing the variables the same way as the data, i.e. in a flat file. For example:
var1;var2;var3
1;2;3 
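To illustrate, the job could read that flat file and treat its header row as the list of columns to select. This is only a sketch; the S3 path below is a placeholder, not something from your setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the semicolon-delimited variable file (placeholder path)
var_df = spark.read.options(delimiter=';', header=True).csv('s3://my-bucket/config/variables.csv')
variable_list = var_df.columns  # e.g. ['var1', 'var2', 'var3']

# The list can then drive the column selection, e.g.:
# choice_data = df_0.select(*variable_list)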
Summarizing, I suggest defining the following input variables:
- SOURCE_DB - the source database in the Glue Data Catalog
- SOURCE_TAB - the source table name
- NUM_ROWS - the number of rows to extract
- DEST_FOLDER - the destination S3 folder
This is an example of the code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME, SOURCE_DB, SOURCE_TAB, NUM_ROWS, DEST_FOLDER]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'SOURCE_DB', 'SOURCE_TAB', 'NUM_ROWS', 'DEST_FOLDER'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Load the source table from the Glue Data Catalog
df_new = glueContext.create_dynamic_frame.from_catalog(database=args['SOURCE_DB'], table_name=args['SOURCE_TAB'], transformation_ctx="full_data")
df_0 = df_new.toDF()
df_0.createOrReplaceTempView("spark_dataframe")

# Select the required columns and keep only NUM_ROWS rows;
# getResolvedOptions returns strings, hence the int() cast
choice_data = spark.sql("SELECT x, y, z FROM spark_dataframe")
choice_data = choice_data.limit(int(args['NUM_ROWS']))

# Write the result as a single CSV file to the destination folder
choice_data.repartition(1).write.format('csv').mode('overwrite').options(delimiter=',', header=True).save("s3://" + args['DEST_FOLDER'] + "/")

job.commit()
Of course, you also have to provide the proper input variables in the Glue job configuration.
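For example, if you start the job programmatically, the same parameters can be passed as job arguments. A minimal sketch with boto3 (the job name and values are placeholders; note that Glue passes all argument values as strings and expects a '--' prefix on each key):

import boto3

glue = boto3.client('glue')

# Placeholder job name and argument values
glue.start_job_run(
    JobName='my-parameterized-job',
    Arguments={
        '--SOURCE_DB': 'my_database',
        '--SOURCE_TAB': 'my_table',
        '--NUM_ROWS': '100',
        '--DEST_FOLDER': 'my-bucket/output',
    },
)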
I tried the following code:
args = getResolvedOptions(sys.argv, ['JOB_NAME','source_db','source_table','count','dest_folder'])
sc = SparkContext() 
glueContext = GlueContext(sc) 
spark = glueContext.spark_session 
job = Job(glueContext) 
job.init(args['JOB_NAME'], args) 
df_new = glueContext.create_dynamic_frame.from_catalog(database = args['source_db'], table_name = args['source_table'], transformation_ctx = "sample_data") 
df_0 = df_new.toDF() 
df_0.registerTempTable("spark_dataframe") 
new_data = spark.sql("Select * from spark_dataframe") 
sample = new_data.limit(args['count'])
sample.repartition(1).write.format('csv').options(delimiter=',',header=True).save("s3://"+ args['dest_folder'] +"/")
job.commit()
I am getting an error for this line:
sample = new_data.limit(args['count'])
The error is:
py4j.Py4JException: Method limit([class java.lang.String]) does not exist
but the argument I am passing is not a string.
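For reference, getResolvedOptions returns every job argument as a string, so limit() does receive a java.lang.String here. Casting the value to an integer, as the earlier example does with NUM_ROWS, resolves the error:

# getResolvedOptions returns all arguments as strings, so the count
# must be cast to an integer before it is passed to limit()
sample = new_data.limit(int(args['count']))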