I have a Redshift cluster that I use for an analytics application. I have incoming data that I would like to add to a clicks table. Let's say I have ~10 new 'clicks' that I want to store each second. If possible, I would like the data to be available in Redshift as soon as possible.
From what I understand, because of the columnar storage, insert performance is bad, so you have to insert in batches. My workflow is to store the clicks in Redis, and every minute I insert the ~600 buffered clicks from Redis into Redshift as a batch.
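To make the workflow concrete, the buffering side looks roughly like this (a simplified sketch using redis-py; the `clicks_buffer` key name and the click format are placeholders):

```python
import json
import redis

r = redis.Redis()

# Producer side: every incoming click is appended to a Redis list.
def record_click(click: dict) -> None:
    r.rpush("clicks_buffer", json.dumps(click))

# Consumer side: once a minute, drain the whole buffer and hand the
# batch to whichever load strategy is used (see the two options below).
def drain_buffer() -> list:
    pipe = r.pipeline()  # transaction=True by default, so LRANGE + DEL run atomically
    pipe.lrange("clicks_buffer", 0, -1)
    pipe.delete("clicks_buffer")
    raw, _ = pipe.execute()
    return [json.loads(item) for item in raw]
```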
I have two ways of inserting a batch of clicks into Redshift (both sketched below):
- Multi-row insert strategy: I use a regular `insert` query for inserting multiple rows. Multi-row insert documentation here
- S3 Copy strategy: I copy the rows to S3 as `clicks_1408736038.csv`. Then I run a `COPY` to load this into the `clicks` table. COPY documentation here
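For reference, here is roughly what the two strategies look like from the application side (a sketch using psycopg2; the column names, bucket and credentials are placeholders, and an IAM role could replace the access keys):

```python
import psycopg2

# Placeholder connection string; adjust to the real cluster and credentials.
conn = psycopg2.connect("dbname=analytics host=my-cluster.example.com user=loader password=...")

def multi_row_insert(clicks):
    """Strategy 1: one multi-row INSERT (a single VALUES list) per batch."""
    with conn, conn.cursor() as cur:
        values = ",".join(
            cur.mogrify("(%s, %s, %s)", (c["user_id"], c["url"], c["ts"])).decode()
            for c in clicks
        )
        cur.execute("INSERT INTO clicks (user_id, url, ts) VALUES " + values)

def copy_from_s3(s3_key):
    """Strategy 2: COPY a CSV file that was already uploaded to S3."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "COPY clicks (user_id, url, ts) FROM %s CREDENTIALS %s CSV;",
            ("s3://my-clicks-bucket/" + s3_key,
             "aws_access_key_id=...;aws_secret_access_key=..."),
        )
```

The S3 variant of course also needs the CSV upload itself (e.g. a boto3 `put_object`) before the `COPY`; that upload is the "upload to s3" column in the timings below.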
I've done some tests (on a `clicks` table that already had 2 million rows):
             | multi-row insert strategy |       S3 Copy strategy    |
             |---------------------------+---------------------------+
             |       insert query        | upload to s3 | COPY query |
-------------+---------------------------+--------------+------------+
1 record     |           0.25s           |     0.20s    |   0.50s    |
1k records   |           0.30s           |     0.20s    |   0.50s    |
10k records  |           1.90s           |     1.29s    |   0.70s    |
100k records |           9.10s           |     7.70s    |   1.50s    |
As you can see, in terms of performance, it looks like I gain nothing by first copying the data to S3: the upload + COPY time is about equal to the insert time.
Questions:
What are the advantages and drawbacks of each approach? What is the best practice? Did I miss anything?
And a side question: is it possible for Redshift to COPY the data automatically from S3 via a manifest? I mean, COPYing the data as soon as new .csv files are added to S3? Doc here and here. Or do I have to create a background worker myself to trigger the COPY commands?
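In case the answer is that I need my own worker, here is roughly what I have in mind (a sketch that polls S3 with boto3 and issues the `COPY` through psycopg2; the bucket, prefix and "already loaded" bookkeeping are placeholders):

```python
import time
import boto3
import psycopg2

s3 = boto3.client("s3")
conn = psycopg2.connect("dbname=analytics host=my-cluster.example.com user=loader password=...")
loaded_keys = set()  # placeholder: in reality this would be persisted (e.g. in a tracking table)

def poll_and_copy(bucket="my-clicks-bucket", prefix="clicks/"):
    """List .csv objects under the prefix and COPY any that have not been loaded yet."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in resp.get("Contents", []):
        key = obj["Key"]
        if key in loaded_keys or not key.endswith(".csv"):
            continue
        with conn, conn.cursor() as cur:
            cur.execute(
                "COPY clicks FROM %s CREDENTIALS %s CSV;",
                ("s3://{}/{}".format(bucket, key),
                 "aws_access_key_id=...;aws_secret_access_key=..."),
            )
        loaded_keys.add(key)

while True:  # naive polling loop; a cron entry every minute would do the same job
    poll_and_copy()
    time.sleep(60)
```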
My quick analysis:
In the documentation about consistency, there is no mention of loading the data via multi-row inserts. It looks like the preferred way is COPYing from S3 with unique object keys (each .csv on S3 has its own unique name)...
- S3 Copy strategy:
    - PROS: looks like the recommended practice according to the docs.
    - CONS: More work (I have to manage buckets and manifests and a cron that triggers the `COPY` commands...)
- Multi-row insert strategy:
    - PROS: Less work. I can call an `insert` query from my application code.
    - CONS: doesn't look like a standard way of importing data. Am I missing something?