I have a Redshift cluster that I use for an analytics application. I have incoming data that I would like to add to a clicks table. Let's say I have ~10 new 'clicks' that I want to store each second. If possible, I would like the data to be available in Redshift as soon as possible.
From what I understand, because of the columnar storage, insert performance is bad, so you have to insert in batches. My workflow is to store the clicks in Redis, and every minute I insert the ~600 buffered clicks from Redis into Redshift as a batch.
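To make the workflow concrete, the buffering side looks roughly like this (a simplified sketch using redis-py; the `clicks_buffer` key name and the click format are placeholders):

```python
import json
import redis

r = redis.Redis()

# Producer side: every incoming click is appended to a Redis list.
def record_click(click: dict) -> None:
    r.rpush("clicks_buffer", json.dumps(click))

# Consumer side: once a minute, drain the whole buffer and hand the
# batch to whichever load strategy is used (see the two options below).
def drain_buffer() -> list:
    pipe = r.pipeline()  # transaction=True by default, so LRANGE + DEL run atomically
    pipe.lrange("clicks_buffer", 0, -1)
    pipe.delete("clicks_buffer")
    raw, _ = pipe.execute()
    return [json.loads(item) for item in raw]
```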
I have two ways of inserting a batch of clicks into Redshift (both sketched below):
- Multi-row insert strategy: I use a regular `insert` query for inserting multiple rows. Multi-row insert documentation here
- S3 Copy strategy: I copy the rows to S3 as `clicks_1408736038.csv`. Then I run a `COPY` to load this into the `clicks` table. COPY documentation here
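For reference, here is roughly what the two strategies look like from the application side (a sketch using psycopg2; the column names, bucket and credentials are placeholders, and an IAM role could replace the access keys):

```python
import psycopg2

# Placeholder connection string; adjust to the real cluster and credentials.
conn = psycopg2.connect("dbname=analytics host=my-cluster.example.com user=loader password=...")

def multi_row_insert(clicks):
    """Strategy 1: one multi-row INSERT (a single VALUES list) per batch."""
    with conn, conn.cursor() as cur:
        values = ",".join(
            cur.mogrify("(%s, %s, %s)", (c["user_id"], c["url"], c["ts"])).decode()
            for c in clicks
        )
        cur.execute("INSERT INTO clicks (user_id, url, ts) VALUES " + values)

def copy_from_s3(s3_key):
    """Strategy 2: COPY a CSV file that was already uploaded to S3."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "COPY clicks (user_id, url, ts) FROM %s CREDENTIALS %s CSV;",
            ("s3://my-clicks-bucket/" + s3_key,
             "aws_access_key_id=...;aws_secret_access_key=..."),
        )
```

The S3 variant of course also needs the CSV upload itself (e.g. a boto3 `put_object`) before the `COPY`; that upload is the "upload to s3" column in the timings below.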
I've done some tests (on a `clicks` table that already had 2 million rows):
             | multi-row insert strategy |       S3 Copy strategy    |
             |---------------------------+---------------------------+
             |       insert query        | upload to s3 | COPY query |
-------------+---------------------------+--------------+------------+
1 record     |           0.25s           |     0.20s    |   0.50s    |
1k records   |           0.30s           |     0.20s    |   0.50s    |
10k records  |           1.90s           |     1.29s    |   0.70s    |
100k records |           9.10s           |     7.70s    |   1.50s    |
As you can see, in terms of performance, it looks like I gain nothing by first copying the data to S3: the upload + COPY time is about equal to the insert time.
Questions:
What are the advantages and drawbacks of each approach? What is the best practice? Did I miss anything?
And a side question: is it possible for Redshift to COPY the data automatically from S3 via a manifest? I mean, COPYing the data as soon as new .csv files are added to S3? Doc here and here. Or do I have to create a background worker myself to trigger the COPY commands?
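In case the answer is that I need my own worker, here is roughly what I have in mind (a sketch that polls S3 with boto3 and issues the `COPY` through psycopg2; the bucket, prefix and "already loaded" bookkeeping are placeholders):

```python
import time
import boto3
import psycopg2

s3 = boto3.client("s3")
conn = psycopg2.connect("dbname=analytics host=my-cluster.example.com user=loader password=...")
loaded_keys = set()  # placeholder: in reality this would be persisted (e.g. in a tracking table)

def poll_and_copy(bucket="my-clicks-bucket", prefix="clicks/"):
    """List .csv objects under the prefix and COPY any that have not been loaded yet."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in resp.get("Contents", []):
        key = obj["Key"]
        if key in loaded_keys or not key.endswith(".csv"):
            continue
        with conn, conn.cursor() as cur:
            cur.execute(
                "COPY clicks FROM %s CREDENTIALS %s CSV;",
                ("s3://{}/{}".format(bucket, key),
                 "aws_access_key_id=...;aws_secret_access_key=..."),
            )
        loaded_keys.add(key)

while True:  # naive polling loop; a cron entry every minute would do the same job
    poll_and_copy()
    time.sleep(60)
```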
My quick analysis:
In the documentation about consistency, there is no mention of loading the data via multi-row inserts. It looks like the preferred way is COPYing from S3 with unique object keys (each .csv on S3 has its own unique name)...
- S3 Copy strategy:
    - PROS: looks like the recommended practice according to the docs.
    - CONS: More work (I have to manage buckets and manifests and a cron that triggers the `COPY` commands...)
- Multi-row insert strategy:
    - PROS: Less work. I can call an `insert` query from my application code.
    - CONS: doesn't look like a standard way of importing data. Am I missing something?