Using Amazon S3 from Amazon EC2 with Ruby

Amazon S3 provides a good distributed storage solution to pair with the distributed computing power of Amazon EC2. Using Ruby scripts such as s3sync and s3cmd at the command line, you can move data to and from the EC2 instances in your computing cloud.

AWS Products Used: Amazon EC2, Amazon S3
Language(s): Ruby
Date Published: 2007-07-31

By Jack Herrington (jherr@pobox.com), Senior Software Engineer, Leverage Software Inc.
July 31, 2007

Introducing Amazon S3 and Amazon EC2

The distributed computing power of the Amazon Elastic Compute Cloud (Beta) (Amazon EC2™) isn't going to do you much good unless you can get data to and from each of the Amazon EC2 instances in your computing cloud. Amazon's Simple Storage Service (Amazon S3) provides a good distributed storage solution for doing this. Plus, Amazon does not charge for Amazon EC2 instances to read and write data from Amazon S3 buckets.

That's all well and good, but how do we get Amazon EC2 instances to read and write files to Amazon S3? I tried several different approaches while writing this article, and the easiest turned out to be a set of Ruby command-line scripts called s3sync. In this article, I'll show you how to set up an Amazon EC2 image using one of Amazon's Fedora Core 4 images, then show you how to install and use the s3sync code to access files from Amazon S3.

Before you begin, make sure you have Amazon EC2 command-line utilities installed. Instructions on how to do this are available on Amazon’s EC2 site. Amazon also has a complete Getting Started Guide for its EC2 web service, which is what I used to help me as I was writing this article. It clearly describes how to install the tools, generate a key set, and build Amazon EC2 instances.

Next, you have to determine the operating system image you are going to run on the Amazon EC2 instances. For the purposes of this article we are going to use one of the images provided by Amazon. Let's have a look at what those are by using the ec2-describe-images command-line utility:

% ec2-describe-images -o amazon
IMAGE   ami-20b65349    ec2-public-images/fedora-core4-base.manifest.xml     amazon  available       public
IMAGE   ami-22b6534b    ec2-public-images/fedora-core4-mysql.manifest.xml     amazon  available       public
IMAGE   ami-23b6534a    ec2-public-images/fedora-core4-apache.manifest.xml    amazon  available       public
IMAGE   ami-25b6534c    ec2-public-images/fedora-core4-apache-mysql.manifest.xml        amazon  available       public
IMAGE   ami-26b6534f    ec2-public-images/developer-image.manifest.xml  	amazon  available       public
IMAGE   ami-2bb65342    ec2-public-images/getting-started.manifest.xml  amazon  available       public
IMAGE   ami-bd9d78d4    ec2-public-images/demo-paid-AMI.manifest.xml    amazon  available       public

I'll choose the fedora-core4-apache-mysql operating system image, because that's the kind of thing I would get from a hosting company, and it's sure to be full of useful utilities. I'll run an instance of that image using the following commands at the command line:

% ec2-run-instances ami-25b6534c -k gsg-keypair
RESERVATION     r-e349af8a      961421114855    default
INSTANCE        i-59c02230      ami-25b6534c                    pending gsg-keypair     0

After the image has booted, the Amazon EC2 command-line utility will give me a hostname. I'll check the name by using the ec2-describe-instances command:

% ec2-describe-instances
RESERVATION     r-e349af8a      961421114855    default
INSTANCE        i-59c02230      ami-25b6534c    ec2-72-44-57-99.z-1.compute-1.amazonaws.com     domU-12-31-36-00-3D-83.z-1.compute-1.internal   running gsg-keypair     0
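
If you want to pick the hostname out of that output from a script rather than by eye, the following is a minimal Ruby sketch. The column positions (public hostname in the fourth field, state in the sixth) are an assumption read off the sample output above, and the helper name running_hostnames is hypothetical, not part of the EC2 tools.

```ruby
# Extract the public hostnames of running instances from the text printed by
# ec2-describe-instances. Column positions are assumed from the sample output
# in this article and may differ in other versions of the tools.
def running_hostnames(describe_output)
  describe_output.each_line.map do |line|
    fields = line.split
    # Only INSTANCE lines whose sixth field is "running" carry a hostname.
    next unless fields[0] == "INSTANCE" && fields[5] == "running"
    fields[3] # public DNS name
  end.compact
end

sample = <<~OUT
  RESERVATION     r-e349af8a      961421114855    default
  INSTANCE        i-59c02230      ami-25b6534c    ec2-72-44-57-99.z-1.compute-1.amazonaws.com     domU-12-31-36-00-3D-83.z-1.compute-1.internal   running gsg-keypair     0
OUT

puts running_hostnames(sample)
```

A pending instance has no hostname columns yet, so its line is skipped until the state changes to running.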

Now I have a machine running Fedora Core 4 with a lot of handy stuff installed on it. The next step is to log in to the Amazon EC2 instance I just created, using the hostname provided by ec2-describe-instances:

% ssh -i ~/.ec2/id_rsa-gsg-keypair root@ec2-72-44-57-99.z-1.compute-1.amazonaws.com     
The authenticity of host 'ec2-72-44-57-99.z-1.compute-1.amazonaws.com (72.44.57.99)' can't be established.
RSA key fingerprint is f1:4e:d1:14:87:f0:57:71:89:6e:ed:b5:1c:14:84:b5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-72-44-57-99.z-1.compute-1.amazonaws.com,72.44.57.99' (RSA) to the list of known hosts.

         __|  __|_  )  Rev: 2
         _|  (     / 
        ___|\___|___|

 Welcome to an EC2 Public Image
                       :-)

    Apache2+MySQL4


    __ c __ /etc/ec2/release-notes.txt

[root@domU-12-31-36-00-3D-83 ~]#

Now you are logged in to the Amazon EC2 instance as root and ready to install the tools and try the example commands.

Installing S3Sync

S3sync is the Ruby package I will use to add, update, remove, and list files on the Amazon S3 servers. To do that I will first need to ensure that Ruby is installed, then get the s3sync package and set it up.

To check the Ruby version, I use the following command line:

[root@domU-12-31-36-00-3D-83 ~]# ruby -v
ruby 1.8.4 (2005-12-24) [i386-linux]
[root@domU-12-31-36-00-3D-83 ~]# 

This tells me the instance is running Ruby 1.8.4, which is recent enough to run the s3sync scripts. This should do nicely.

There are two ways that I can get the s3sync code. The first is to go to the s3sync web site (http://s3sync.net/wiki) and download it to my local computer. I would then copy it to the Amazon EC2 instance. To do that I would use this command:

% scp -i ~/.ec2/id_rsa-gsg-keypair s3sync.tar.gz root@ec2-72-44-57-99.z-1.compute-1.amazonaws.com:/root

The s3sync.tar.gz file would then be located in my home directory on the Amazon EC2 machine.

I can also do this directly from the Amazon EC2 instance using the following commands:

[root@domU-12-31-36-00-3D-83 ~]# wget http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz
--18:31:18--  http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz
           => `s3sync.tar.gz'
Resolving s3.amazonaws.com... 72.21.206.171
Connecting to s3.amazonaws.com|72.21.206.171|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26,667 (26K) []

100%[============================================================================>] 26,667        --.--K/s             

18:31:19 (3.21 MB/s) - `s3sync.tar.gz' saved [26667/26667]

[root@domU-12-31-36-00-3D-83 ~]#

Either way, after s3sync.tar.gz has been copied to the Amazon EC2 instance, the next thing to do is unpack it.

[root@domU-12-31-36-00-3D-83 ~]# tar -xzvf s3sync.tar.gz 
s3sync/
s3sync/HTTPStreaming.rb
s3sync/README.txt
s3sync/README_s3cmd.txt
s3sync/S3.rb
s3sync/s3cmd.rb
s3sync/s3config.rb
s3sync/s3config.yml.example
s3sync/S3encoder.rb
s3sync/s3sync.rb
s3sync/s3try.rb
s3sync/S3_s3sync_mod.rb
s3sync/thread_generator.rb
[root@domU-12-31-36-00-3D-83 ~]# 

Now all the s3sync files are in the s3sync subdirectory. After changing into that directory (cd s3sync), I can start moving files to and from my Amazon S3 bucket. But before I do that, I have to set two environment variables:

[root@domU-12-31-36-00-3D-83 s3sync]# AWS_ACCESS_KEY_ID=xxxx
[root@domU-12-31-36-00-3D-83 s3sync]# export AWS_ACCESS_KEY_ID
[root@domU-12-31-36-00-3D-83 s3sync]# AWS_SECRET_ACCESS_KEY=xxxx
[root@domU-12-31-36-00-3D-83 s3sync]# export AWS_SECRET_ACCESS_KEY

Change the xxxx in lines 1 and 3 to your own Access Key ID and Secret Access Key (available from the Amazon Web Services site).
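
A forgotten export is an easy mistake to make here. As a minimal sketch, a few lines of Ruby can confirm that both variables are visible before you run the scripts. The helper name missing_aws_vars is hypothetical; only the two variable names come from the commands above.

```ruby
# The two environment variables set in the commands above.
REQUIRED_AWS_VARS = %w[AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY].freeze

# Returns the names of the required variables that are unset or blank.
# Accepts ENV or any Hash-like object so it can be checked in isolation.
def missing_aws_vars(env = ENV)
  REQUIRED_AWS_VARS.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_aws_vars
if missing.empty?
  puts "AWS credentials are set."
else
  warn "Missing environment variables: #{missing.join(', ')}"
end
```

This checks only that the variables exist, not that the keys are valid; the listbuckets call is still the real test.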

If everything is working properly, you should be able to use the s3cmd.rb script to list all available buckets:

[root@domU-12-31-36-00-3D-83 s3sync]# ./s3cmd.rb listbuckets
jherr_video
[root@domU-12-31-36-00-3D-83 s3sync]# 

To test this I'm going to create a test bucket. If you aren't familiar with Amazon S3 buckets, a bucket is similar to a disk drive. You can have multiple buckets (Amazon S3 limits the number per account), each with a name that is unique across all of Amazon S3 and each containing its own set of directories and files.

I'll create a bucket for this article using the following command:

# ./s3cmd.rb createbucket art072407
#

Then, I check to see whether it worked by using the listbuckets command again:

# ./s3cmd.rb listbuckets           
art072407
jherr_video
#

Now I can list the contents of the bucket using the list command.

# ./s3cmd.rb list art072407
--------------------
#

The output tells me there is nothing in the bucket. So let's put something in it. Just to test it, I'll put the Readme.txt file that comes with the s3sync code into the bucket.

# ./s3cmd.rb put art072407:Readme.txt Readme.txt 
#

The put command copies the file to the Amazon S3 bucket. The first parameter after the put command is the bucket and the key name: the bucket name comes before the colon, and the key name comes after it. In Amazon S3 terms, stored items are identified by "keys" rather than file names because Amazon S3 can store any kind of data, not just files. Normally, though, your key will be the same as your file name. The last parameter is the name of the local file to copy.
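
To make the bucket:key convention concrete, here is a small Ruby sketch that splits such an argument at the first colon. It illustrates the convention as described above and is not taken from s3cmd.rb itself; the helper name split_bucket_key is hypothetical.

```ruby
# Split a "bucket:key" argument into its two parts at the first colon.
# Illustration only; s3cmd.rb may parse its arguments differently.
def split_bucket_key(spec)
  bucket, key = spec.split(":", 2)
  if bucket.nil? || bucket.empty? || key.nil?
    raise ArgumentError, "expected bucket:key, got #{spec.inspect}"
  end
  [bucket, key]
end

bucket, key = split_bucket_key("art072407:Readme.txt")
puts "bucket=#{bucket} key=#{key}"
```

Splitting with a limit of 2 means any later colons stay inside the key.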

I can then use the list command to see that the file is now in the bucket:

# ./s3cmd.rb list art072407
--------------------
Readme.txt
#

One great thing about Amazon S3 is that every uploaded file can be addressed by a URL from a web browser (or any application that can read a URL). The format of the URL is as follows:

http://<bucketname>.s3.amazonaws.com/<key>

In the case of this example, the URL is:

http://art072407.s3.amazonaws.com/Readme.txt
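
If you need to generate these URLs from code, a minimal Ruby sketch follows. The helper name s3_public_url is hypothetical, and percent-encoding each path segment of the key is an added precaution for keys with spaces or other special characters, not something the URL format above spells out.

```ruby
require "erb"

# Build the URL for a key using the http://<bucketname>.s3.amazonaws.com/<key>
# format shown above. Each path segment of the key is percent-encoded so that
# characters such as spaces survive; slashes are kept as separators.
def s3_public_url(bucket, key)
  escaped = key.split("/").map { |part| ERB::Util.url_encode(part) }.join("/")
  "http://#{bucket}.s3.amazonaws.com/#{escaped}"
end

puts s3_public_url("art072407", "Readme.txt")
```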

But if I go to the URL at this point, I'll get a message telling me that access to that resource is denied because, by default, uploaded data is not publicly accessible. To make it publicly accessible, I have to add the x-amz-acl:public-read header to the put command:

# ./s3cmd.rb put art072407:Readme.txt Readme.txt x-amz-acl:public-read
#

Now, if I go back to that URL in my web browser, Amazon S3 will happily show me the Readme.txt file.

To remove the file from the bucket, I run the delete command:

# ./s3cmd.rb delete art072407:Readme.txt
#

Or, to delete everything in the bucket, I run the deleteall command:

# ./s3cmd.rb deleteall art072407
#

As noted above, you can use a URL to get to the data if the Amazon S3 key (the file) is designated as public. To get public data, you can use the following command:

# wget http://art072407.s3.amazonaws.com/Readme.txt
...

But what if the data is private? In that case I use the handy get command that comes with s3cmd.rb, which authenticates with the access keys I set earlier.

# ./s3cmd.rb get art072407:Readme.txt Out.txt
#

This command takes the Readme.txt file from the Amazon S3 bucket and copies it to the local file Out.txt.

S3Sync

So far I've worked only with reading and writing a single file from the Amazon S3 bucket. What about entire directories of files, with nested subdirectories, and so on? The s3sync package has a solution for that as well: the s3sync.rb command synchronizes whole directory structures with Amazon S3 buckets.

To begin I'll create a new directory called /root/data and copy the contents of the s3sync code to it, just as an example:

# mkdir /root/data
# cp /root/s3sync/* /root/data
#

Now, I'll clear out the Amazon S3 bucket and copy the directory to it using s3sync:

# ./s3cmd.rb deleteall art072407
# ./s3sync.rb -r /root/data/ art072407:/
#

When I list the article bucket now, I can see all the original files:

# ./s3cmd.rb list art072407
--------------------
HTTPStreaming.rb
Readme.txt
Readme_s3cmd.txt
S3.rb
S3_s3sync_mod.rb
S3encoder.rb
s3cmd.rb
...
#

Next, I can remove all of the files from the /root/data directory and re-sync them using s3sync. First, to remove them, I use the following command:

# rm /root/data/*
#

Now, to re-sync from the Amazon S3 bucket I run:

# ./s3sync.rb -r art072407: /root/data
# ls -la /root/data/
total 120
drwxr-xr-x  2 root root  4096 Jul 24 11:59 .
drwxr-x---  5 root root  4096 Jul 24 11:48 ..
-rwxr-xr-x  1 root root  3427 Jul 24 11:59 HTTPStreaming.rb
-rwxr-xr-x  1 root root 12775 Jul 24 11:59 Readme.txt
-rwxr-xr-x  1 root root  4525 Jul 24 11:59 Readme_s3cmd.txt
...
#

Now I can get and put whole directories of data using Amazon S3 from my Amazon EC2 instance.
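
To preview which keys a recursive upload of a directory should produce, the following Ruby sketch walks a local directory and prints each file's path relative to it. It is based only on the bucket listing shown above, where the keys match the file names under /root/data; it is not how s3sync.rb works internally, and the helper name expected_keys is hypothetical.

```ruby
require "find"

# List the keys a recursive upload of local_dir would be expected to create:
# each regular file's path relative to local_dir, with an optional prefix.
# This mirrors the listing in this article, not s3sync.rb's own logic.
def expected_keys(local_dir, prefix = "")
  root = File.expand_path(local_dir)
  keys = []
  Find.find(root) do |path|
    next unless File.file?(path)
    keys << prefix + path.delete_prefix(root + "/")
  end
  keys.sort
end

# Example (on the instance used in this article):
#   puts expected_keys("/root/data")
```

Comparing this list with the output of ./s3cmd.rb list is a quick way to confirm that a sync copied everything.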

To finish, I'm going to delete the contents of the bucket, and then delete the bucket itself:

# ./s3cmd.rb deleteall art072407   
# ./s3cmd.rb deletebucket art072407

Finally, to clean up completely, I'm going to terminate the Amazon EC2 instance that I was testing with:

% ec2-terminate-instances i-59c02230
%

And there you have it: Amazon Simple Storage Service (Amazon S3) access, direct from one of the standard Fedora Core 4 Amazon images with just some simple Ruby scripts and a few environment variables.

Conclusion

Amazon S3 provides a powerful mechanism for moving data between Amazon EC2 instances, and for moving data to and from Amazon EC2 instances for distributed processing. Because all of the languages supported by the Amazon Fedora Core 4 images can access the command line, it's easy to invoke these commands from within page code or batch processors to get and put data from the Amazon S3 buckets.
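
As a minimal sketch of invoking these commands from Ruby code, the wrapper below builds the argument list for s3cmd.rb and runs it with system. The install path /root/s3sync/s3cmd.rb matches where the package was unpacked in this article; the helper names are hypothetical, and treating a false or nil return from system as a failure assumes that s3cmd.rb exits with a non-zero status when a command fails, which is not verified here.

```ruby
# Assumed location of the script, matching where it was unpacked above.
S3CMD = "/root/s3sync/s3cmd.rb"

# Build the argument list for an s3cmd.rb call, for example:
#   s3cmd_args("put", "art072407:Readme.txt", "Readme.txt")
def s3cmd_args(command, *params)
  [S3CMD, command, *params]
end

# Run s3cmd.rb without going through a shell, so bucket and file names are
# passed as-is. Raises if the script cannot be run or reports failure.
def run_s3cmd(command, *params)
  system(*s3cmd_args(command, *params)) or
    raise "s3cmd.rb #{command} failed"
end

# Example (requires the AWS environment variables to be set):
#   run_s3cmd("put", "art072407:Readme.txt", "Readme.txt", "x-amz-acl:public-read")
#   run_s3cmd("get", "art072407:Readme.txt", "Out.txt")
```

Passing the arguments to system as separate strings avoids shell quoting problems with unusual file names.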

Jack Herrington is the author of several books, including Code Generation in Action, Podcasting Hacks, and PHP Hacks. He has also written over 50 articles on technical topics, many of which use PHP. Jack is a PHP and AJAX columnist for IBM developerWorks, and the editor of the AJAX Forum on the IBM developerWorks web site.




Reviews

Nice step-by-step, Aug 24, 2007 3:15 PM
Reviewer: greg13070
I'm glad to see someone take the time to put all these great technologies together in such a clear way.

Excellent, May 22, 2008 12:04 AM
Reviewer: secularscope
To the point, no bull - simple to follow :)

Works!, Jul 2, 2008 3:14 PM
Reviewer: Kevin R Haas
This was great and not painful at all to set up. One question: is there any way to monitor the progress of the sync as it is going on? It would be great to see some kind of progress stats, like x% done or y of z files transferred.