Overview
Currently, my product maintains a DAL (data access layer) that is separated from the business logic and exposed via a set of services, where each service generally corresponds to an entity, e.g. Car objects are accessed via the CarService. These services are backed by Spring Data repositories and access data (models) stored in both PostgreSQL and Elasticsearch.
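For context, the layering looks roughly like this (a simplified sketch, not the real code; the Car fields and the findByManufacturer query method are just illustrative):

```java
import java.util.List;
import org.springframework.data.repository.CrudRepository;
import org.springframework.stereotype.Service;

// Simplified model; real models live in PostgreSQL and Elasticsearch.
class Car {
    Long id;
    String manufacturer;
}

// Spring Data repository; the derived query method is illustrative.
interface CarRepository extends CrudRepository<Car, Long> {
    List<Car> findByManufacturer(String manufacturer);
}

@Service
class CarService {
    private final CarRepository carRepository;

    CarService(CarRepository carRepository) {
        this.carRepository = carRepository;
    }

    // Business logic only ever goes through the service,
    // never through the datastore directly.
    public List<Car> findByManufacturer(String manufacturer) {
        return carRepository.findByManufacturer(manufacturer);
    }
}
```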
We are now processing more and more data (documents in, models out, sometimes with a clustering step in between) and have realized that computation has become a bottleneck. To overcome this, we are evaluating Apache Spark and Apache Beam to scale the computation out horizontally.
Problem
After looking into Spark (and Beam), I have found that each framework generally provides its own integrations (connectors or plugins) for reading from and writing to datasources, which in and of itself is great. The problem for me is that I can't find any way for these frameworks to do distributed reading/writing through our current set of services. Spark requires RDDs (or Datasets) and Beam requires PCollections, as the sketch below illustrates, and I'd rather not support two methods of reading/writing from our datastores just to accommodate them.
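To illustrate the mismatch, the distributed reads look roughly like this (a sketch only; connection strings, table and column names are placeholders). Both frameworks pull data through their own IO connectors, bypassing the CarService entirely:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.PCollection;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class DistributedReadSketch {

    // Spark: reads straight from PostgreSQL into a Dataset (an RDD underneath),
    // via Spark's own JDBC connector rather than the service layer.
    static Dataset<Row> carsWithSpark(SparkSession spark) {
        return spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://db-host/app")
                .option("dbtable", "cars")
                .load();
    }

    // Beam: reads the same data into a PCollection via Beam's JdbcIO connector,
    // again going directly to the database.
    static PCollection<String> manufacturersWithBeam(Pipeline pipeline) {
        return pipeline.apply(
                JdbcIO.<String>read()
                        .withDataSourceConfiguration(
                                JdbcIO.DataSourceConfiguration.create(
                                        "org.postgresql.Driver",
                                        "jdbc:postgresql://db-host/app"))
                        .withQuery("SELECT manufacturer FROM cars")
                        .withRowMapper(rs -> rs.getString("manufacturer"))
                        .withCoder(StringUtf8Coder.of()));
    }
}
```

So each framework ends up with its own read/write path, neither of which reuses the existing services.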
My Question
Has anyone encountered this before? What was your strategy?
Did you go ahead and support two types of DAL? If so, were there any caveats, especially with regard to the ongoing maintenance of the code?

