SETTING UP AND USING AN AMAZON EC2 GPU INSTANCE

ADDITIONAL NOTES

Public key authentication

If you set up your own instance, you will need to use public key authentication (which is much preferred over passwords anyway). I have included a private/public key pair (gpu_workshop.pem and gpu_workshop.pub) in the config directory. The key is authorized for access to the guest accounts used in this workshop, so you can experiment with it if you like. When setting up your own server, you can either generate a new key pair (and download the private key) or import the public key of a pair that you already have.

If you are using PuTTY, you will need to convert the private key into PuTTY format before using it. See the ssh notes for details.
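
For a command-line ssh client (rather than PuTTY), logging in with the included key might look like the sketch below. The account name "guest" and the host name are placeholders, not values taken from the workshop setup, so substitute your own. ssh refuses to use a private key file that other users can read, which is why the sketch tightens the permissions first.

```shell
#!/bin/sh
# Placeholders (assumptions): edit these before use.
EC2_KEY="config/gpu_workshop.pem"   # private key from the config directory
EC2_USER="guest"                    # hypothetical account name
EC2_HOST="ec2-host.example.com"     # hypothetical; use your instance's public DNS name

# Make the private key readable and writable by its owner only.
prepare_key() {
    chmod 600 "$1"
}

# Open an ssh session on the instance, authenticating with the key.
ec2_login() {
    prepare_key "$EC2_KEY" && ssh -i "$EC2_KEY" "$EC2_USER@$EC2_HOST"
}

# Usage:
#   ec2_login
```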

Spot Instances

Spot instances are nice because you can get time on a GPU server for as little as $0.40/hour (the usual price for a GPU server is $2.10/hour). The drawback to spot instances is that they can be terminated at any time if the spot price rises above whatever limit you have set. If you have a long-running job, you will need to save data at "checkpoints," from which you can restart the job to continue it. It is typically straightforward to do this (and is a good idea for long-running jobs anyway).

The best way to save the checkpoint data is by saving to an attached Amazon storage volume with the persistent option checked (see below). You could also, e.g., set up a script to email checkpoint data to yourself or rsync it to an accessible server.
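
As a rough illustration of checkpointing, the sketch below keeps a step counter in a checkpoint file and resumes from it on restart. The function name, file path, and step count are all hypothetical, and a real job would save its working data along with the counter.

```shell
#!/bin/sh
# Checkpoint/restart sketch. The "checkpoint" here is only a step counter.

# run_job CHECKPOINT_FILE TOTAL_STEPS
run_job() {
    ckpt=$1
    total=$2
    step=0

    # Resume from the last completed step if a checkpoint exists.
    if [ -f "$ckpt" ]; then
        step=$(cat "$ckpt")
        echo "resuming after step $step"
    fi

    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        # ... one unit of real work for this step would go here ...

        # Write to a temporary file, then rename it, so that termination in
        # the middle of a write never leaves a truncated checkpoint behind.
        printf '%s\n' "$step" > "$ckpt.tmp"
        mv "$ckpt.tmp" "$ckpt"
    done
    echo "finished at step $step"
}

# Usage (the path on an attached volume is an example):
#   run_job /mnt/q/checkpoint.txt 1000
```

If the instance is terminated partway through, running the same command again on a new instance with the same volume attached picks up from the last completed step.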

Storage

To attach a persistent Amazon storage volume, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Storage.html. Such volumes can be mounted in the usual way (e.g., using fstab). One option might be to attach the volume at /home.

A volume can be attached when the instance is initially launched (using the launch dialog), or from the "Volumes" page of the EC2 management console. If an empty volume is attached, it must be set up before being used, e.g. (replace /dev/sdb with your device name; change mount points as desired):

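        # run as root
        mkfs -t ext4 /dev/sdb        # create a filesystem (erases the volume)
        mkdir /mnt/q
        mount /dev/sdb /mnt/q        # mount temporarily
        rsync -avh /home/ /mnt/q     # copy the existing home directories
        echo "/dev/sdb /home ext4 defaults,noatime 0 0" >> /etc/fstab   # mount at /home on boot
        parted -l                    # check the device and its filesystem
        <reboot>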

Workflow

For software development, it is best to work on a local copy of the code and upload it to the server to run (although it is also possible to edit code remotely on the server).

The easiest way to do this is to set up an rsync script. I like to use a makefile with a target called, e.g., 'put-ec2'. An alternative would be to use git or something similar (e.g., using a GitHub repository). This would have some real advantages at the cost of a slight increase in complexity. In the simplest case, one could just use sftp.
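
A minimal sketch of such an rsync script follows (a makefile target could simply call it). The user, host, destination path, and exclude patterns are placeholders chosen for illustration, so adjust them to your project.

```shell
#!/bin/sh
# Hypothetical settings: edit for your project and instance.
EC2_KEY="config/gpu_workshop.pem"
EC2_DEST="guest@ec2-host.example.com:project/"   # placeholder user, host, and path

# put_ec2 SRC_DIR DEST
# Copy the contents of SRC_DIR to DEST, skipping the .git directory and
# object files. For a remote DEST, rsync runs over ssh using EC2_KEY.
put_ec2() {
    rsync -avz -e "ssh -i $EC2_KEY" \
        --exclude '.git/' --exclude '*.o' \
        "$1/" "$2"
}

# Usage (upload the current directory to the instance):
#   put_ec2 . "$EC2_DEST"
```

Only files that have changed are transferred, so re-running the script after each edit is quick.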

I usually keep an editor window and two terminal tabs open on my local machine. One terminal tab is for running the rsync script. The other is for an ssh session on the EC2 instance. It is often convenient to run jobs on the server using either 'tmux' or 'screen' so that they continue to run when you log out. An alternative would be to run jobs in the background and redirect their output to a file.
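
The background-job alternative can be sketched as below, with the tmux equivalent shown in comments for comparison. The helper name run_bg and the program ./train are hypothetical.

```shell
#!/bin/sh
# run_bg LOGFILE COMMAND [ARGS...]
# Start COMMAND in the background, immune to hangup on logout, with stdout
# and stderr redirected to LOGFILE. Prints the process ID of the job.
run_bg() {
    log=$1
    shift
    nohup "$@" > "$log" 2>&1 &
    echo $!
}

# Usage (./train is a hypothetical program):
#   run_bg train.log ./train
#   tail -f train.log          # follow the output
#
# The tmux equivalent:
#   tmux new -s train          # start a session named "train"; run the job in it
#   (detach with Ctrl-b d, log out, log back in later)
#   tmux attach -t train       # reconnect to the running session
```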

Remember that spot instances can be terminated at any time and the root storage volume is not persistent (so either work on a local copy of your software or keep it on a persistent attached volume).

Security groups

The "Default" security group disables all outside access to the server and is almost certainly NOT what you want. The "Quick-start" group enables ssh access and IS almost certainly what you want. The simplest thing to do is to edit the default group so that it is identical to the quick-start group.

Shutdown behavior

There are two ways that an instance can be stopped or terminated: from the management console; or from within the instance (using the Linux 'shutdown' command). If an instance is "terminated", any changes made to it since the time it was started will be lost. If an instance is "stopped", it can be restarted from where it was.

It is possible to protect against inadvertent termination by enabling termination protection for the instance.

Spot instances cannot be "stopped", only "terminated". The most useful way to avoid data loss in this case is to keep your work on an attached volume with the "persistent" option checked.

Snapshots

It is possible to take snapshots of a storage volume at various points in time. These are essentially "backups". An AMI can be created from any snapshot, serving as a "system restore".

GPU clusters

It is possible to launch a cluster of GPU servers connected by high-speed Ethernet. The usual approach is to use MPI to communicate across the cluster and CUDA on each node. See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using_cluster_computing.html for details.

Log in from your phone!

On Android, use, e.g., ConnectBot or JuiceSSH. Both support public key authentication.

Command line tools

In addition to the web-based management console, Amazon also makes a set of command line tools available (see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/SettingUp_CommandLine.html). These tools will make life easier for heavy users; for occasional use, the web console is simpler. A list of commands is at http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/command-reference.html.

Samples:

	ec2-run-instances ami-cf3758a6 -t cg1.4xlarge -k <key> -g <security-group>
	ec2-request-spot-instances ami-cf3758a6 -t cg1.4xlarge -k <key> -g <security-group> -p <price>
	ec2-describe-instances
	ec2-describe-spot-instance-requests
	ec2-stop-instances      <instance-id>
	ec2-terminate-instances <instance-id>

The output of some of these commands can be "challenging" to read. There is a Perl interface that could be useful (Net::Amazon::EC2). Alternatively, one could write a simple Perl script to parse the output and print it in a readable format with little effort. One could also use Python (or Ruby or whatever...).
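
In the "or whatever" category, a few lines of awk in a shell function may be enough. The sketch below assumes that ec2-describe-instances prints tab-separated records and that INSTANCE records carry the instance ID in field 2, the public DNS name in field 4, the state in field 6, and the instance type in field 10. Treat those positions as assumptions and check them against the output of your version of the tools. The function name fmt_instances is made up for this example.

```shell
#!/bin/sh
# fmt_instances: read ec2-describe-instances output on stdin and print one
# aligned line per instance.
# ASSUMPTION: tab-separated INSTANCE records with ID in field 2, public DNS
# name in field 4, state in field 6, and instance type in field 10.
fmt_instances() {
    awk -F '\t' '
        BEGIN { printf "%-12s %-12s %-12s %s\n", "ID", "STATE", "TYPE", "PUBLIC DNS" }
        $1 == "INSTANCE" {
            # Show "-" when the instance has no public DNS name.
            printf "%-12s %-12s %-12s %s\n", $2, $6, $10, ($4 == "" ? "-" : $4)
        }
    '
}

# Usage:
#   ec2-describe-instances | fmt_instances
```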