AWS Automated ECS Deployment Guide

This document contains instructions and CloudFormation templates to automatically build an AWS ECS cluster running the Private AI CPU container. It is intended to help you quickly set up a clustered environment to shorten prototyping time, and to serve as a guide for production deployments. Before any production deployment, review and tune capacity as appropriate, and have your security team review the setup against your existing guidelines and policies.

Quick Start

Replace <key-name> and <ip-address> in the command below with your own values, then run it. Read on for more detail.

aws cloudformation create-stack --stack-name private-ai --template-url https://privateai-infrastructure.s3.amazonaws.com/ecs/main.yaml --capabilities CAPABILITY_NAMED_IAM --parameters ParameterKey=AdminIpAddress,ParameterValue=<ip-address>/32 ParameterKey=SSHKeyName,ParameterValue=<key-name>
Note:
  • SSH Key pair creation for key-name in the command above is beyond the scope of this document. Check Create key pairs for more information.
  • The ip-address for the command above should refer to the administrator's computer or subnet. This restricts who can SSH into your cluster host via an EC2 Security Group. Check Authorize inbound traffic for your Linux instances for more information.

Overview

The main CloudFormation template, main.yaml, builds a CloudFormation stack with all the necessary nested stacks. The stack creates two public and two private subnets, an EC2 instance to host the ECS cluster, service, and tasks, and all the corresponding networking and security configuration. In this deployment the ECS hosts and tasks run in a private subnet, so the template also launches a bastion host in a public subnet to enable SSH for troubleshooting.

Prerequisites

An AWS IAM account with access to the AWS console and permission to create and manage CloudFormation stacks is required.

You must subscribe to the AWS Marketplace Private AI product. Follow the instructions on the Private AI AWS Marketplace Deployment Guide to create your subscription.

If you prefer to use the AWS CLI, it must also be installed. Check out the Install or update the latest version of the AWS CLI guide for more information.

If you plan to SSH to the ECS hosts (recommended for troubleshooting) you must generate an SSH key. Check out the Amazon EC2 key pairs and Linux instances guide for more information.

Deployment Steps

Optional - Customize deployment

If you would like to change the CloudFormation templates, download each file locally and make your modifications.

Create an S3 bucket, in the region where you plan to deploy ECS, that is accessible by the user creating the CloudFormation stacks. Upload each of the customized CloudFormation templates to the S3 bucket except for main.yaml. Once the files are uploaded, copy the URL of each file and update the corresponding nested stack resources in main.yaml with those URLs. Then upload the completed main.yaml file to the S3 bucket.
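The URL bookkeeping in this step can be sketched in Python. This is a minimal sketch and not part of the Private AI templates: the bucket name, region, and "ecs" key prefix are hypothetical placeholders, and the file list is taken from the Infrastructure Files section of this guide. It only builds the virtual-hosted-style S3 URLs you would paste into main.yaml; it does not upload anything.

```python
# Nested templates listed in the Infrastructure Files section of this guide.
NESTED_TEMPLATES = [
    "vpc.yaml",
    "security-groups.yaml",
    "bastion.yaml",
    "load-balancer.yaml",
    "cluster.yaml",
    "service.yaml",
]


def template_urls(bucket: str, region: str, prefix: str = "ecs") -> dict[str, str]:
    """Map each nested template file name to a virtual-hosted-style S3 URL."""
    base = f"https://{bucket}.s3.{region}.amazonaws.com"
    clean_prefix = prefix.strip("/")
    urls = {}
    for name in NESTED_TEMPLATES:
        # An empty prefix means the files sit at the bucket root.
        key = f"{clean_prefix}/{name}" if clean_prefix else name
        urls[name] = f"{base}/{key}"
    return urls


if __name__ == "__main__":
    # "my-template-bucket" and "us-east-1" are placeholders, not real resources.
    for name, url in template_urls("my-template-bucket", "us-east-1").items():
        print(f"{name}: {url}")
```

Copying the URL shown for each object in the S3 console achieves the same result; the sketch is only useful if you script the upload.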

Create the CloudFormation stack

Option 1 - AWS Console

  1. Log into the AWS Console and select your region from the top-right corner.
  2. Navigate to the CloudFormation dashboard and click "Create stack" > "With new resources (standard)".
  3. Select "Template is ready", enter the template URL https://privateai-infrastructure.s3.amazonaws.com/ecs/main.yaml (or the URL of main.yaml in your own S3 bucket if you chose to customize the deployment templates), and click "Next".
  4. Enter a stack name, your source IP address (use CIDR notation by adding /32 to the end of your IP address), and the name of your SSH key pair, then click "Next".
  5. For image and PDF processing we recommend changing the InstanceType parameter of the Cluster child stack to m5zn.3xlarge for better performance. Otherwise, leave all defaults and click "Next".
  6. Check the "Capabilities" boxes at the bottom of the page and click "Submit".

Option 2 - AWS CLI

  1. Open a terminal window with a valid path to the AWS CLI.
  2. Ensure you have authenticated with the AWS CLI via aws configure.
  3. Find your public IP address with curl https://checkip.amazonaws.com
  4. Run the following command to create the main stack, substituting the appropriate IP address, environment-name, and key-name (for image and PDF processing use cases, we also recommend setting the InstanceType parameter to m5zn.3xlarge):
    aws cloudformation create-stack --stack-name environment-name --template-url https://privateai-infrastructure.s3.amazonaws.com/ecs/main.yaml --capabilities CAPABILITY_NAMED_IAM --parameters ParameterKey=AdminIpAddress,ParameterValue=a.b.c.d/32 ParameterKey=SSHKeyName,ParameterValue=key-name
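If you script this step, the command line can be assembled programmatically so the parameter syntax stays consistent. The following is a minimal Python sketch, not an official tool: it only builds and prints the command. The stack name, key name, and IP address in the example are placeholders, and passing InstanceType at this level assumes main.yaml exposes a parameter with exactly that key.

```python
import shlex

TEMPLATE_URL = "https://privateai-infrastructure.s3.amazonaws.com/ecs/main.yaml"


def create_stack_command(stack_name, admin_cidr, ssh_key_name, instance_type=None):
    """Return the argv list for `aws cloudformation create-stack`."""
    params = {"AdminIpAddress": admin_cidr, "SSHKeyName": ssh_key_name}
    if instance_type:
        # Assumption: main.yaml accepts a parameter named "InstanceType".
        params["InstanceType"] = instance_type
    argv = [
        "aws", "cloudformation", "create-stack",
        "--stack-name", stack_name,
        "--template-url", TEMPLATE_URL,
        "--capabilities", "CAPABILITY_NAMED_IAM",
        "--parameters",
    ]
    # Each parameter uses the CLI shorthand ParameterKey=...,ParameterValue=...
    argv += [f"ParameterKey={k},ParameterValue={v}" for k, v in params.items()]
    return argv


if __name__ == "__main__":
    cmd = create_stack_command("private-ai", "203.0.113.7/32", "my-key", "m5zn.3xlarge")
    # Print a shell-safe version of the command instead of running it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string means the result can be handed to subprocess.run without shell quoting concerns.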

Testing

  1. Wait for the deployment to finish, and then check the main stack's Outputs for the load balancer URL.
  2. You should be able to open the URL in a new tab and see the Private AI container version.
  3. Check out the API Reference documentation for more information.
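If you prefer the command line to the console, the stack outputs can also be read from the JSON printed by aws cloudformation describe-stacks --stack-name <stack-name> --output json. The sketch below, in Python, flattens that JSON into a dictionary. It is illustrative only: the sample data is made up, and the output key "LoadBalancerUrl" is a hypothetical name, so check the actual key in your stack's Outputs.

```python
import json


def stack_outputs(describe_stacks_json: str) -> dict[str, str]:
    """Flatten the Outputs of the first stack into {OutputKey: OutputValue}."""
    stacks = json.loads(describe_stacks_json).get("Stacks", [])
    if not stacks:
        return {}
    # A stack that is still being created may have no Outputs yet.
    return {o["OutputKey"]: o["OutputValue"] for o in stacks[0].get("Outputs", [])}


if __name__ == "__main__":
    # Made-up sample in the shape returned by describe-stacks; the key
    # "LoadBalancerUrl" and the DNS name are hypothetical.
    sample = json.dumps({
        "Stacks": [{
            "StackName": "private-ai",
            "StackStatus": "CREATE_COMPLETE",
            "Outputs": [{
                "OutputKey": "LoadBalancerUrl",
                "OutputValue": "http://example-alb.us-east-1.elb.amazonaws.com",
            }],
        }]
    })
    for key, value in stack_outputs(sample).items():
        print(f"{key} = {value}")
```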

Infrastructure Files

Main

This stack wires all the other nested stacks together. Each nested stack defines outputs that are required for subsequent steps. Make sure to update the TemplateURL settings with the correct S3 bucket if you chose to make changes to the CloudFormation files and host them in your own environment.

main.yaml

VPC

This nested stack creates a separate VPC with two public subnets and two private subnets. The VPC has a corresponding Internet gateway, with public routes to direct traffic to the gateway. It also includes redundant NAT gateways to route traffic from the private subnets to the Internet gateway.

vpc.yaml

Security Groups

This nested stack creates the security groups required to allow traffic for the EC2 cluster hosts in the private subnet, the load balancer, and the bastion host.

security-groups.yaml

Bastion Host

This nested stack creates an EC2 instance in a public subnet with a public IP address to enable administration and support of the EC2 cluster hosts. The bastion acts as an SSH jump box between the public internet and the private subnet, and can be stopped or terminated to cut off terminal access and prevent unauthorized access.

bastion.yaml
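The jump-box pattern described above maps onto OpenSSH's -J (ProxyJump) option. The Python sketch below only assembles the command. It makes two assumptions that are not stated in this guide: the login user on both hosts is ec2-user (the Amazon Linux default; other AMIs differ), and your private key is already loaded into ssh-agent so that both hops can use it. The addresses in the example are placeholders.

```python
import shlex


def jump_ssh_command(bastion_host, private_host, user="ec2-user"):
    """Return the argv list for an SSH session that hops through the bastion."""
    return [
        "ssh",
        "-J", f"{user}@{bastion_host}",  # first hop: bastion in the public subnet
        f"{user}@{private_host}",        # final hop: cluster host in the private subnet
    ]


if __name__ == "__main__":
    # 198.51.100.10 and 10.0.2.15 are documentation/placeholder addresses.
    print(shlex.join(jump_ssh_command("198.51.100.10", "10.0.2.15")))
```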

Load Balancer

This nested stack creates the load balancer to route traffic from the public internet to the ECS tasks. The target group is created but not populated until a later nested stack.

load-balancer.yaml

ECS Cluster

This nested stack creates the ECS cluster and associated cluster resources. This includes an EC2 launch template which defines the EC2 hosts, an EC2 auto scaling group to manage the scaling configuration, and the ECS capacity provider to connect the ECS cluster to the auto scaling group. This nested stack also manages the various IAM roles and policies required for the hosts, service, ECS agents, and tasks to connect to AWS services.

Note:

The ECS agent in particular references the task execution role by name, so changing the role requires a new host (or a new role name) for the permissions to be picked up. The role is included in the cluster nested stack rather than the service nested stack to simplify deployment.

cluster.yaml

ECS Service

This nested stack defines the ECS service, the task definition, and the CloudWatch log group for logging. It also starts the ECS tasks and populates the load balancer target group with the task endpoints, which enables the load balancer to respond to requests.

service.yaml

ECS Deployments Without AWS Marketplace

You can also use this guide to deploy a non-Marketplace container. To do this, download the CloudFormation YAML files above and make the following changes:

  • Point to your non-Marketplace container (pulled from either Azure CR or your local container registry)
  • Mount your enterprise license file as an EFS storage volume to the container

Once you have made the appropriate changes, you can upload these templates to an S3 bucket to reference in your CloudFormation deployment.

Troubleshooting

CIDR block a.b.c.d is malformed

Sample error message:

CIDR block a.b.c.d is malformed (Service: AmazonEC2; Status Code: 400; Error Code: InvalidParameterValue; Request ID: request-id; Proxy: null)

Cause: The stack was created with an AdminIpAddress that is not in full CIDR notation.

Solution: If you plan to SSH from a single host, update the IP address to a.b.c.d/32 with your computer's IP address. Otherwise, specify the full CIDR notation, such as a.b.0.0/16 to allow access from the entire a.b.0.0 to a.b.255.255 range. Learn more about Classless Inter-Domain Routing on Wikipedia.
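To catch this error before creating the stack, the value can be checked locally. The Python sketch below uses the standard ipaddress module to turn a bare IPv4 address into a /32 block and to show the range a wider block covers. The function names are illustrative and the addresses are documentation placeholders.

```python
import ipaddress


def to_admin_cidr(value: str) -> str:
    """Return `value` in CIDR notation; a bare IPv4 address becomes a /32."""
    # strict=False tolerates host bits, e.g. 10.1.2.3/16 -> 10.1.0.0/16.
    # Malformed input such as "a.b.c.d" raises a ValueError subclass.
    return str(ipaddress.IPv4Network(value.strip(), strict=False))


def address_range(cidr: str) -> tuple[str, str]:
    """Return the first and last address covered by a CIDR block."""
    net = ipaddress.IPv4Network(cidr, strict=False)
    return str(net[0]), str(net[-1])


if __name__ == "__main__":
    print(to_admin_cidr("203.0.113.7"))   # single host
    print(address_range("10.1.0.0/16"))   # whole /16 range
```

Passing the result of to_admin_cidr as the AdminIpAddress parameter avoids the malformed-CIDR error for single hosts.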

IAM User permission issue

Sample error message:

User: user-arn is not authorized to perform: cloudformation:CreateUploadBucket because no identity-based policy allows the cloudformation:CreateUploadBucket action

Cause: The user you are using to create the stack does not have the appropriate permissions.

Solution: Log into the AWS Console as an administrator and update the user's permissions to include the necessary permissions. Learn more about Policies and permissions in IAM on the AWS documentation site.

Delete cluster failed

Sample error message:

Resource handler returned message: "Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsContainerInstancesException; Request ID: request-id; Proxy: null)'." (RequestToken: request-token, HandlerErrorCode: GeneralServiceException)

Cause:

The EC2 hosts require time to be terminated before the ECS cluster can be deleted. Since EC2 hosts are attached to the cluster via the auto scaling group, capacity provider, and capacity provider association, CloudFormation is unable to determine the order in which the resources must be deleted.

Solution:

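Ensure that the AWS::AutoScaling::AutoScalingGroup resource has the DependsOn: ECSCluster attribute. CloudFormation deletes resources in reverse dependency order, so with this attribute the auto scaling group (and its hosts) is removed before the cluster.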
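If you maintain customized templates, this check can be automated. The Python sketch below inspects a template that has already been loaded into a dictionary and reports auto scaling groups without an explicit DependsOn pointing at an ECS cluster resource. It is an assumption-laden sketch: loading the YAML (including CloudFormation's short-form tags such as !Ref) is left out, the logical IDs in the example are illustrative, and implicit dependencies created through Ref or GetAtt are not considered.

```python
def _as_list(value):
    """DependsOn may be absent, a single string, or a list of strings."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def asgs_missing_cluster_dependency(template: dict) -> list[str]:
    """List auto scaling groups with no explicit DependsOn on an ECS cluster."""
    resources = template.get("Resources", {})
    clusters = {
        name for name, res in resources.items()
        if res.get("Type") == "AWS::ECS::Cluster"
    }
    missing = []
    for name, res in resources.items():
        if res.get("Type") != "AWS::AutoScaling::AutoScalingGroup":
            continue
        if not clusters.intersection(_as_list(res.get("DependsOn"))):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Illustrative template fragment; "ECSAutoScalingGroup" is a made-up logical ID.
    template = {
        "Resources": {
            "ECSCluster": {"Type": "AWS::ECS::Cluster"},
            "ECSAutoScalingGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "DependsOn": "ECSCluster",
            },
        }
    }
    print(asgs_missing_cluster_dependency(template))
```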

No Container Instance

Sample error message:

You may experience very long service start times, with multiple failed tasks. In the ECS dashboard, under Service > Events you will see an error as below:

service environmentname-service was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your cluster. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide.

Check the Cluster > Infrastructure dashboard to confirm that no container instances are listed.

Cause:

If you SSH into the EC2 host that is running the ECS agent, you will see the following error message in the ECS agent log, /var/log/ecs/ecs-agent.log:

level=error time=2023-08-21T20:46:56Z msg="Unable to register as a container instance with ECS: ClientException: Cluster not found." module=client.go

You can double-check the cluster name that the EC2 launch template wrote into the agent configuration by checking the file /etc/ecs/ecs.config:

ECS_CLUSTER=EnvironmentName-cluster

Solution:

Double-check that the cluster name in the configuration file matches the cluster name in the ECS dashboard. If there is a mismatch, the service will not be able to allocate resources to start the ECS tasks.
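The comparison can be scripted once you have the contents of /etc/ecs/ecs.config. The Python sketch below parses the KEY=value lines and compares ECS_CLUSTER with the name shown in the ECS dashboard. The function names are illustrative, and the comparison is deliberately exact, so a difference in letter case counts as a mismatch.

```python
def read_ecs_cluster(config_text: str):
    """Return the ECS_CLUSTER value from ecs.config contents, or None."""
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "ECS_CLUSTER":
            return value.strip()
    return None


def cluster_name_matches(config_text: str, expected: str) -> bool:
    """True when the agent is configured for exactly the expected cluster."""
    return read_ecs_cluster(config_text) == expected


if __name__ == "__main__":
    config = "ECS_CLUSTER=EnvironmentName-cluster\n"
    # Compare against the cluster name shown in the ECS dashboard.
    print(cluster_name_matches(config, "EnvironmentName-cluster"))
    print(cluster_name_matches(config, "environmentname-cluster"))
```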

Task fails to start due to license failure

Sample error message:

If the container keeps restarting, check the CloudWatch logs, or SSH into the host and check the Docker container logs.

2023-08-21T21:33:24+0000 [INFO] text.license - reading license from [license/license.json] for validation
2023-08-21T21:33:24+0000 [ERROR] root - AWS register_usage exception: An error occurred (AccessDeniedException) when calling the RegisterUsage operation: User: arn:aws:sts::807274709480:assumed-role/ecsdocs-TaskRole-us-east-1/7b61801bc2364251901d770c7ec24129 is not authorized to perform: aws-marketplace:RegisterUsage because no identity-based policy allows the aws-marketplace:RegisterUsage action
2023-08-21T21:33:24+0000 [ERROR] text.license - License 142 requires an AWS Marketplace deployment.

Cause:

The license file for AWS Marketplace is signed and stored within the container. However, the container must still make a successful request to the AWS Marketplace register_usage API at startup. In this case, the application was started with a role that did not have the correct policy attached.

Solution:

Check the cluster.yaml file for the ECSTaskRole and ensure its policy allows the aws-marketplace:RegisterUsage action named in the error message.
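For reference, an identity-based policy that grants the action from the error message has the shape built below. This Python sketch is illustrative and is not the policy shipped in cluster.yaml: it constructs a minimal policy document and includes a deliberately naive checker that only looks for the action listed verbatim in an Allow statement. Wildcards, Deny statements, and conditions are ignored.

```python
import json


def register_usage_policy() -> dict:
    """Build an IAM policy document that allows aws-marketplace:RegisterUsage."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["aws-marketplace:RegisterUsage"],
            "Resource": "*",
        }],
    }


def allows_action(policy: dict, action: str) -> bool:
    """Naive check: is `action` listed verbatim in an Allow statement?"""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a single statement may be a bare object
    for stmt in statements:
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if stmt.get("Effect") == "Allow" and action in actions:
            return True
    return False


if __name__ == "__main__":
    policy = register_usage_policy()
    print(json.dumps(policy, indent=2))
    print(allows_action(policy, "aws-marketplace:RegisterUsage"))
```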

HTTP 404 Error on API calls

Sample error message:

HTTP/1.1 404 Resource Not Found
Content-Length: 54
Content-Type: application/json
Date: Tue, 22 Aug 2023 20:51:41 GMT

{ "statusCode": 404, "message": "Resource not found" }

Cause:

When requesting the default URL you can see the application version, but when calling another endpoint (such as process text) you get the error message above. This is caused by an incorrect URI for the endpoint: the public endpoint described in our API Reference uses a slightly different path from the container.

Solution:

Ensure the endpoint you are using is of the form https://load-balancer-url/v3/process/text rather than https://load-balancer-url/deid/v3/process/text.
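A small helper can guard against this mistake in client code. The Python sketch below joins the load balancer URL with an API route and drops a leading /deid segment if one was copied from the public API documentation. The function name is illustrative, and load-balancer-url is the same placeholder used above.

```python
from urllib.parse import urlsplit, urlunsplit


def container_endpoint(load_balancer_url: str, route: str = "/v3/process/text") -> str:
    """Join the load balancer URL with an API route, dropping any /deid prefix."""
    parts = urlsplit(load_balancer_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError("load balancer URL must include a scheme, e.g. https://")
    path = "/" + route.lstrip("/")
    # The container serves /v3/... directly; strip the public API's /deid prefix.
    if path == "/deid" or path.startswith("/deid/"):
        path = path[len("/deid"):] or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(container_endpoint("https://load-balancer-url", "/deid/v3/process/text"))
```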

© Copyright 2024 Private AI.