Mouthful of a title, but the point is this: I have some Python-based data science pipelines with the following requirements:
- are orchestrated by an "internal" orchestrator running on a server
- are run across a number of users/products/etc., where N could be relatively high
- the "load" of this jobs I want to distribute and not be tethered by the orchestrator server
- these jobs are backed by a Docker image
- these jobs are relatively fast to run (from 1 second to 20 seconds, post data load)
- these jobs most often involve considerable I/O, both in and out
- no Spark required
- I want minimal hassle with scaling/provisioning/etc.
- data (in/out) would be stored either in an HDFS space on a cluster or in AWS S3 (see the sketch after this list)
- the Docker image would be relatively large (it encompasses a data science stack)
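To make the shape of each job concrete, here is a minimal sketch of what the per-job entry point baked into the Docker image might look like, assuming the data sits in S3. The bucket/key layout, `run_model`, and the Lambda-style handler are illustrative placeholders, not my actual code:

```python
import boto3

s3 = boto3.client("s3")


def run_model(local_path: str) -> str:
    """Placeholder for the actual 1-20 s data science step."""
    # ... real feature engineering / scoring would go here ...
    return local_path  # pretend the result was written back to this path


def handler(event, context):
    # The orchestrator passes only S3 references, never the data itself,
    # so the invocation payload stays tiny regardless of data size.
    bucket = event["bucket"]
    s3.download_file(bucket, event["input_key"], "/tmp/input.parquet")
    result_path = run_model("/tmp/input.parquet")
    s3.upload_file(result_path, bucket, event["output_key"])
    return {"status": "ok", "output_key": event["output_key"]}
```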
I am trying to work out the most (a) cost-efficient and (b) fast way to parallelize this. Candidates so far:
- AWS ECS
- AWS Lambda with Container Image Support
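For the Lambda candidate, the fan-out from the orchestrator could be as thin as a loop of asynchronous invocations, with only S3 references in the payload so the bulk data never flows through the orchestrator. A rough sketch, where the function name and key layout are assumptions:

```python
import json

import boto3

lam = boto3.client("lambda")


def fan_out(job_ids, bucket="my-data-bucket", function_name="my-scoring-job"):
    """Kick off one async Lambda invocation per job id (names are hypothetical)."""
    for job_id in job_ids:
        payload = {
            "bucket": bucket,
            "input_key": f"inputs/{job_id}.parquet",
            "output_key": f"outputs/{job_id}.parquet",
        }
        # InvocationType="Event" returns immediately (HTTP 202); Lambda handles
        # scaling the containers, so the orchestrator only ships small payloads.
        lam.invoke(
            FunctionName=function_name,
            InvocationType="Event",
            Payload=json.dumps(payload).encode(),
        )
```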
Please note that, for all intents and purposes, scaling/computing within the cluster is not feasible.
My issue is that I worry about the tradeoffs:
- huge data transfers (in aggregate terms)
- huge costs from invoking the Docker image a bunch of times (rough numbers below)
- time spent setting up containers on servers versus the very short time spent doing actual work
- serverless management, and debugging when things go wrong, in the Lambda case
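For scale on the cost point, a back-of-envelope with made-up job counts and assumed us-east-1 Lambda rates (roughly $0.0000167 per GB-second and $0.20 per million requests; actual current pricing should be checked):

```python
# All numbers below are assumptions, not measurements.
n_jobs = 100_000        # hypothetical number of users/products per run
avg_seconds = 10        # jobs take 1-20 s, so take the middle
memory_gb = 2           # a large data science image usually needs 1-2 GB+

gb_seconds = n_jobs * avg_seconds * memory_gb
compute_cost = gb_seconds * 0.0000166667   # assumed per GB-second rate
request_cost = n_jobs / 1_000_000 * 0.20   # assumed per-request rate

print(f"compute ~${compute_cost:,.2f}, requests ~${request_cost:,.2f}")
# -> compute ~$33.33, requests ~$0.02 for this particular guess
```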
How are these kinds of cases generally handled?