In a Python GAE application that I'm working on, we need to retrieve n rows from storage, and we are running into performance issues for n > 100. We expect n to be less than 10000 in most cases.
So let's consider a simple model:
class MyEntity(ndb.Model):
    field1 = ndb.StringProperty()
    field2 = ndb.StringProperty()
    # ...
    fieldm = ndb.StringProperty()
    # m is quite large, maybe ~30. Stored strings are short, on the order of 30 characters or less.
I've populated the datastore with some data and got really bad performance using plain fetch(). I've since removed all filters, and just retrieving a number of entities still shows very poor performance compared with what I would expect from, say, any common SQL deployment. I know we shouldn't compare GAE to SQL, but for just getting flat rows down, I would expect GAE to be more performant, not less. Here's what I've tried:
- The simplest approach: `MyEntity.all().fetch(n)`. This scales linearly with `n`, which is expected, although I didn't expect it to take 7 s for `n = 1000`.
- Trying to coerce `fetch()` with any reasonable `batch_size` degrades performance further. I've tried values ranging from 1 to 1000.
- Doing `keys_only` gives an order-of-magnitude improvement.
- Doing a query manually (through `ndb.Query`) and getting out just a single field gives a small improvement, on the order of 1.2x.
- Doing a `fetch_async(n)` and waiting gives exactly the same performance.
- Splitting the job into `p` parts, then doing `fetch_async(n/p, offset=...)`, then waiting on and joining all the futures, gives at best the same performance, at worst much worse.
- Similar story with `fetch_page()`.
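For concreteness, the split-and-join variant looks roughly like this. The chunking helper is plain Python; the ndb calls themselves are sketched in comments since they only run inside the GAE runtime, and `n`/`p` mirror the description above:

```python
def chunk_offsets(n, p):
    """Split a fetch of n entities into up to p (offset, limit) chunks."""
    size = (n + p - 1) // p  # ceiling division so all n rows are covered
    return [(i * size, min(size, n - i * size))
            for i in range(p) if i * size < n]

# Sketch of the parallel fetch itself (requires the GAE runtime):
# futures = [MyEntity.query().fetch_async(limit, offset=off)
#            for off, limit in chunk_offsets(n, p)]
# rows = [e for f in futures for e in f.get_result()]
```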
I've also tried using db instead of ndb, and the results are pretty much the same. So now I'm not sure what to do. Is there a way to get half-decent performance for n on the order of 10000? Even after simplifying my entities to a single field, the performance is too poor. I expect the entire payload, uncompressed, to be roughly 1 MB. Taking over a minute to download 1 MB is clearly unacceptable.
I am seeing this issue live, but for performance testing I'm using the remote API. My question is similar to this question on SO: Best practice to query large number of ndb entities from datastore. They didn't seem to find a solution, but it was asked 4 years ago; maybe there is one now.
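For reference, my timings come from a trivial harness like the one below; only the `fetch_fn` callable is assumed here (over remote_api it wraps the actual query):

```python
import time

def time_fetch(fetch_fn, n):
    """Run fetch_fn(n) once and report (row count, elapsed seconds)."""
    start = time.time()
    rows = fetch_fn(n)
    return len(rows), time.time() - start

# Against the live datastore (via remote_api) this would be, e.g.:
# count, secs = time_fetch(lambda n: MyEntity.query().fetch(n), 1000)
```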