1

Background: A content production team shoots and records content in digital media formats. This can be a mix of raw footage, converted videos, and images.

This content is stored on a shared folder (Linux Samba) of 21 TB, which is almost fully used. I would prefer to have the content team re-organize and clear out the data, but, setting aside the need for discipline, I am asked to simply archive. It makes sense--as the years pile on, disk space will run thin no matter how much discipline is maintained.

We used to archive to tape drives under the previous leadership. The new leadership has discontinued that process and recommended archiving older content to Amazon Glacier.

The archive would be around 2 TB. There may be a need to pull out old content now and then. How frequently?--That we do not know as of now.

No matter how much bandwidth Amazon can offer, my line can do a maximum of 40 Mbit/s. Moreover, I am asked to limit the transfer speed by some means so that others on the same Internet connection are not affected.

What considerations should I take into account to decide whether Glacier fits the bill for this task?

Also, is there any Bash command-line tool that can push 2 TB+ archives to a Glacier vault?

Anup Nair

2 Answers

5

Glacier is designed and priced for data you don't expect to need.

Glacier is designed with the expectation that retrievals are infrequent and unusual, and data will be stored for extended periods of time.

https://aws.amazon.com/glacier/pricing/

I have several dozen terabytes stored there at the moment, and I highly recommend it -- where appropriate -- so my observations should not be taken as negative, only as emphasizing the point that you need to be sure you understand the product and its intended application.

The native Glacier interface is very low-level. It behaves quite a bit like a backup tape or a big tarball: you put an "archive" into a "vault," and it's a black box from then on. You have to maintain your own records of what you put in each archive, because Glacier can't tell you, any more than physically looking at a backup tape can.
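
If you do use the raw interface, the AWS CLI covers it. A minimal sketch (the vault and file names are made up; note that a single upload-archive call is limited to 4 GB, so a 2 TB archive would need Glacier's multipart-upload operations instead):

    # Create a vault and push one archive into it (names are hypothetical)
    aws glacier create-vault --account-id - --vault-name media-archive
    aws glacier upload-archive --account-id - --vault-name media-archive \
        --body footage-2014.tar
    # The response contains only an opaque archiveId -- Glacier stores no
    # filenames or metadata, so record what each archiveId holds yourself.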

The alternate -- and, I would assert, far better -- way of using Glacier is through S3. Upload your files into an S3 bucket, and set the bucket's lifecycle policy to archive the files to Glacier after a few days. With this model, S3 hides the complexity of the raw Glacier API, and the individual files and their metadata remain visible through the S3 console and API. The storage cost is the same.
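
A minimal sketch of such a lifecycle rule with the AWS CLI, assuming a hypothetical bucket named media-archive and an arbitrary 7-day window:

    # Transition every object in the bucket to the GLACIER storage class
    # seven days after it is uploaded
    aws s3api put-bucket-lifecycle-configuration --bucket media-archive \
        --lifecycle-configuration '{
            "Rules": [{
                "ID": "archive-to-glacier",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 7, "StorageClass": "GLACIER"}]
            }]
        }'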

Understand, though, that with Glacier (whether through S3 or not) you pay a charge for recovering more than a small amount of data at a time.

Crunch the numbers and you will find that the free allowance for restores is small -- and restores beyond it potentially expensive -- until you have a lot of data stored.

Say I have 180 TB (180,000 GB) stored. I can only restore 50 GB in any 4-hour window if I don't want to pay additional charges for data retrieval.

180000 × 0.05 ÷ 30 ÷ 6 = 50

That is 180,000 GB, a 5% monthly allowance, 30 days/mo, and 6 periods of 4 hours in each day. This works great for me, since my files are typically < 20 GB and it is very rare that I need them; when I do, it's usually for research that isn't pressing, so I can spread out the recovery. With a smaller total -- say 18 TB -- my no-charge restore allowance would be a mere 5 GB every 4 hours. So, as I say, consider the restore pricing model carefully.
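
To run the same arithmetic against your own numbers, a quick shell check:

    TOTAL_GB=180000   # total data stored in Glacier, in GB
    # 5% monthly allowance / 30 days / six 4-hour windows per day
    echo "$TOTAL_GB * 0.05 / 30 / 6" | bc -l   # prints free GB per window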

Possibly a better fit is the relatively new "Infrequent Access" storage class offered by S3. At $0.0125/GB/month it is still pretty reasonable, and although there is a $0.01/GB charge for downloads, there's no sharp increase in cost if you need to restore a lot of data, and no 4-hour wait, as there is for Glacier restores.

https://aws.amazon.com/blogs/aws/aws-storage-update-new-lower-cost-s3-storage-option-glacier-price-reduction/
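
Choosing that class is just a flag on the upload; a sketch with made-up names:

    # Upload straight into the Infrequent Access storage class
    aws s3 cp footage-2014.tar s3://media-archive/footage-2014.tar \
        --storage-class STANDARD_IA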

0

I'd start with a pricing estimate first. The base rate is $0.007/GB/month, not including transfer fees.

Then look at how you get your data back from Glacier. Retrieval job requests can take several hours to complete, and the retrieved data is then only available for download for a limited time.

AWS Glacier FAQ
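
Roughly, that retrieval flow looks like this with the AWS CLI (the vault name and IDs are placeholders):

    # Ask Glacier to stage an archive; this is the multi-hour part
    aws glacier initiate-job --account-id - --vault-name media-archive \
        --job-parameters '{"Type": "archive-retrieval", "ArchiveId": "<archive-id>"}'
    # Poll until the job reports Succeeded, then download the staged data
    aws glacier describe-job --account-id - --vault-name media-archive \
        --job-id <job-id>
    aws glacier get-job-output --account-id - --vault-name media-archive \
        --job-id <job-id> retrieved.tar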

Here is something I found while searching for "glacier data bash."

Example Script for Uploading to Glacier/S3
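
One more note on the question's bandwidth constraint: newer versions of the AWS CLI can cap their own S3 transfer rate, so the upload won't saturate a shared link (the 2 MB/s cap and the paths below are only examples):

    # Limit the CLI's S3 bandwidth; adjust the cap to suit the shared link
    aws configure set default.s3.max_bandwidth 2MB/s
    # aws s3 cp switches to multipart uploads automatically, so files over
    # 2 TB are fine (S3's per-object ceiling is 5 TB)
    aws s3 cp /srv/samba/archive-2016.tar s3://media-archive/archive-2016.tar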

I use S3 for my clients' (over 100 of them) off-site backups. I had looked into Glacier since it was cheaper, but the retrieval time was something I couldn't deal with. If one of my sites has a problem and I need to grab a file from S3, I need it now, not in 4 hours.

N. Greene