Automate default AWS SageMaker scaling policy

TL;DR: To set the target value for the default (built-in) scaling policy on a SageMaker endpoint, use SageMakerEndpointInvocationScalingPolicy as the policy name and SageMakerVariantInvocationsPerInstance as the target metric:

target_tracking_json='{"TargetValue": SCALING_VALUE, "PredefinedMetricSpecification": {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"}}'
echo "$target_tracking_json" > target_tracking_scaling_policy.json
aws application-autoscaling put-scaling-policy --region REGION \
      --policy-name SageMakerEndpointInvocationScalingPolicy \
      --service-namespace sagemaker \
      --resource-id endpoint/ENDPOINT_NAME/variant/VARIANT_NAME \
      --scalable-dimension sagemaker:variant:DesiredInstanceCount \
      --policy-type TargetTrackingScaling \
      --target-tracking-scaling-policy-configuration file://target_tracking_scaling_policy.json

Replace SCALING_VALUE, REGION, ENDPOINT_NAME, and VARIANT_NAME with the values for your endpoint. Read on for a more in-depth explanation.


We deploy our machine learning models for real-time inference using AWS SageMaker. For our use case, this involves creating a container, along with an ML model, and deploying it as an endpoint. In this post, I’d like to focus specifically on setting up a SageMaker endpoint so that it autoscales properly. Registering an autoscaling group (ASG) to your endpoint and setting its max/min instance counts will get you most of the way there. (Strictly speaking, this is a scalable target in Application Auto Scaling rather than an EC2 Auto Scaling group, but I’ll call it an ASG here.) However, one missing piece is automatically setting the target metric value that tells the ASG when to start scaling out or in.

SageMaker endpoints with autoscaling enabled scale on the SageMakerVariantInvocationsPerInstance metric by default. This metric is the average number of invocations (inference calls) per instance per minute. Given a target value, if the metric stays above that value for a sustained period, the ASG starts spinning up new instances to handle the increased load. The ASG will keep adding instances until either SageMakerVariantInvocationsPerInstance drops below the target value or it hits the max number of instances set when registering the ASG.
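
To make that rule concrete, here is a rough back-of-the-envelope sketch. It is not how AWS computes capacity internally. It only assumes that the policy tries to keep per-instance invocations at or below the target, so the instance count needed is roughly the total invocations per minute divided by the target, clamped to the registered min/max. The function name and the sample numbers are made up for illustration.

```shell
# Rough estimate of the instance count needed to keep
# invocations-per-instance at or below the target value.
# Usage: estimate_instances TOTAL_INVOCATIONS_PER_MIN TARGET MIN MAX
estimate_instances() {
  total=$1; target=$2; min=$3; max=$4
  # Integer ceiling division: ceil(total / target)
  needed=$(( (total + target - 1) / target ))
  # Clamp to the min/max capacity registered for the endpoint
  if [ "$needed" -lt "$min" ]; then needed=$min; fi
  if [ "$needed" -gt "$max" ]; then needed=$max; fi
  echo "$needed"
}

estimate_instances 1500 600 1 4   # prints 3 (ceil(1500/600) = 3)
estimate_instances 2500 600 1 4   # prints 4 (5 would be needed, capped at max)
```

The second call shows why the max capacity matters: once the cap is reached, per-instance load stays above the target no matter what the policy does.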

This is what we want, especially for endpoints that don’t have predictable load, or whose load changes greatly throughout the day. It can be a helpful cost-saving tactic, and it increases availability under heavy traffic. However, it’s not clear how to programmatically set this target value, hence this post.

We mostly use the AWS CLI to automate our endpoint deploys, but you could substitute any of these CLI commands with the analogous API calls. Once you have an endpoint live in SageMaker, you can enable autoscaling by running the following:

aws application-autoscaling register-scalable-target --region REGION \
    --service-namespace sagemaker \
    --resource-id endpoint/ENDPOINT_NAME/variant/VARIANT_NAME \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --min-capacity MIN_VALUE \
    --max-capacity MAX_VALUE

Replace REGION, ENDPOINT_NAME, VARIANT_NAME, MIN_VALUE, and MAX_VALUE with the values for your endpoint. This will register an autoscaling group to the SageMaker endpoint. In the SageMaker UI you should now see, under “Endpoints” -> YOUR_ENDPOINT -> “Endpoint runtime settings”, that automatic scaling is set to “Yes”. You’ll also see the max/min instance settings updated with the values you used in the command above.
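
If you would rather confirm the registration from the command line than from the UI, something like the following should work. Treat it as a sketch: the function name, the sample region/endpoint/variant names, and the AWS_CLI override (there only so the command can be dry-run with echo) are my own assumptions, while describe-scalable-targets and its flags come from the AWS CLI.

```shell
# Show the min/max capacity registered for one endpoint variant.
# AWS_CLI is a hypothetical override: set AWS_CLI=echo to print the
# command instead of calling AWS.
# Usage: show_scalable_target REGION ENDPOINT_NAME VARIANT_NAME
show_scalable_target() {
  "${AWS_CLI:-aws}" application-autoscaling describe-scalable-targets \
    --region "$1" \
    --service-namespace sagemaker \
    --resource-ids "endpoint/$2/variant/$3" \
    --query 'ScalableTargets[].{Resource:ResourceId,Min:MinCapacity,Max:MaxCapacity}'
}

# Dry run with placeholder names; drop AWS_CLI=echo to query AWS for real.
AWS_CLI=echo show_scalable_target us-east-1 my-endpoint AllTraffic
```

Note that this command takes --resource-ids (plural), unlike register-scalable-target, which takes --resource-id.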

Confusingly, registering the ASG does not specify when it actually triggers scaling; that takes a separate step. It can be done manually by going to your endpoint in the UI and selecting your variant under the endpoint runtime settings:

This allows you to select “Configure auto scaling” which will take you to a screen that allows you to edit your max/min instance counts, deregister/disable auto scaling, and, what we are most interested in, set your target scaling value for the built-in scaling policy:

There is a lot of useful information here. First, you’ll notice the policy name is SageMakerEndpointInvocationScalingPolicy, with a target metric called SageMakerVariantInvocationsPerInstance. This means that, by default, SageMaker endpoints scale based on the number of incoming requests per instance. You can get more information on this metric here. If you’re not sure what value to use, you can estimate it using the load testing techniques described in this post. We also have the option to set our scale-in and scale-out cooldown values. All of these values will need to be tweaked, experimented with, and tested for each of your endpoints, since most models will have different performance characteristics for the various problems they solve.
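
As a starting point for the target value, the AWS load testing guidance suggests taking the peak requests per second a single instance can sustain, applying a safety factor, and multiplying by 60, since the metric is counted per minute. The sketch below just does that arithmetic. The function name and the sample inputs (20 requests per second, safety factor 0.5) are illustrative assumptions, not measurements from our endpoints.

```shell
# Starting-point target for SageMakerVariantInvocationsPerInstance:
#   target = MAX_RPS * SAFETY_FACTOR * 60
# MAX_RPS: peak requests/second one instance handled in a load test.
# SAFETY_FACTOR: headroom multiplier (0.5 is a common first guess).
# The result is truncated to a whole number.
# Usage: target_invocations_per_instance MAX_RPS SAFETY_FACTOR
target_invocations_per_instance() {
  awk -v rps="$1" -v sf="$2" 'BEGIN { printf "%d\n", rps * sf * 60 }'
}

target_invocations_per_instance 20 0.5   # prints 600
```

Whatever number comes out is only a first guess; it still needs to be validated against real traffic for each endpoint.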

It’s not super clear from the documentation how we programmatically set these values. After reading the documentation, it looks like we will need to define a scaling policy. This makes sense given what we see in the UI. We can either use the predefined SageMakerVariantInvocationsPerInstance metric or define one ourselves using a custom metric. AWS “strongly” recommends using the default built-in metric, so we’ll stick with that for now.

Now it’s time for us to apply our scaling policy. We can do this through the AWS CLI:

target_tracking_json='{"TargetValue": SCALING_VALUE, "PredefinedMetricSpecification": {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"}}'
echo "$target_tracking_json" > target_tracking_scaling_policy.json
aws application-autoscaling put-scaling-policy --region REGION \
      --policy-name POLICY_NAME \
      --service-namespace sagemaker \
      --resource-id endpoint/ENDPOINT_NAME/variant/VARIANT_NAME \
      --scalable-dimension sagemaker:variant:DesiredInstanceCount \
      --policy-type TargetTrackingScaling \
      --target-tracking-scaling-policy-configuration file://target_tracking_scaling_policy.json

Here SCALING_VALUE, REGION, POLICY_NAME, ENDPOINT_NAME, and VARIANT_NAME are set to the values for our endpoint. That should solve our problem, right? Well, sort of. If you name this scaling policy anything other than SageMakerEndpointInvocationScalingPolicy (which the POLICY_NAME placeholder above invites, and which most examples do), this will actually create a custom scaling policy. When someone goes to view this endpoint they are greeted with this:

Okay, I guess I just need to trust that the values I set in the code are correct. There isn’t much information given in the UI about what that custom policy actually is. We can run:

aws application-autoscaling describe-scaling-policies --service-namespace sagemaker

This returns the settings for every autoscaling policy on our SageMaker endpoints. You’ll then have to search through the output to find the endpoint you care about and verify that things were set correctly. This can be confusing for end users taking a look at the endpoint, and it feels like a black box to ask them to just trust that this custom policy is what they wanted.
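
To avoid digging through the full output, the same command can be narrowed to a single variant with --resource-id and trimmed with --query. This is a sketch under a few assumptions: the function name, the sample endpoint and variant names, and the AWS_CLI override (only there so the command can be dry-run with echo) are mine; the flags and the response field names are from the AWS CLI.

```shell
# List policy name, metric, and target value for one endpoint variant.
# AWS_CLI is a hypothetical override: set AWS_CLI=echo to print the
# command instead of calling AWS.
# Usage: describe_variant_policies ENDPOINT_NAME VARIANT_NAME
describe_variant_policies() {
  "${AWS_CLI:-aws}" application-autoscaling describe-scaling-policies \
    --service-namespace sagemaker \
    --resource-id "endpoint/$1/variant/$2" \
    --query 'ScalingPolicies[].{Name:PolicyName,Metric:TargetTrackingScalingPolicyConfiguration.PredefinedMetricSpecification.PredefinedMetricType,Target:TargetTrackingScalingPolicyConfiguration.TargetValue}'
}

# Dry run with placeholder names; drop AWS_CLI=echo to query AWS for real.
AWS_CLI=echo describe_variant_policies my-endpoint AllTraffic
```

This makes the values easier to check from a terminal, but it does not help someone who only looks at the endpoint in the UI.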

We said earlier that we just want to use the built-in SageMakerVariantInvocationsPerInstance metric. We don’t want to set a custom scaling policy for this endpoint, just set the default built-in one. You may have already figured out where we went wrong. We need to use SageMakerEndpointInvocationScalingPolicy as our policy name when running the put-scaling-policy command. Here is the corrected command:

target_tracking_json='{"TargetValue": SCALING_VALUE, "PredefinedMetricSpecification": {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"}}'
echo "$target_tracking_json" > target_tracking_scaling_policy.json
aws application-autoscaling put-scaling-policy --region REGION \
      --policy-name SageMakerEndpointInvocationScalingPolicy \
      --service-namespace sagemaker \
      --resource-id endpoint/ENDPOINT_NAME/variant/VARIANT_NAME \
      --scalable-dimension sagemaker:variant:DesiredInstanceCount \
      --policy-type TargetTrackingScaling \
      --target-tracking-scaling-policy-configuration file://target_tracking_scaling_policy.json

This is almost identical to the last command; the only change is that we replaced POLICY_NAME with SageMakerEndpointInvocationScalingPolicy. Now we can go back to the UI and verify that we have programmatically set these values via the CLI. The end users should be happy since they can see what their endpoint is using as a target scaling value.
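
The same JSON file can also carry the scale-in and scale-out cooldowns mentioned earlier, via the ScaleInCooldown and ScaleOutCooldown fields (in seconds) of the target tracking configuration. The sketch below writes such a file with a heredoc instead of echo. The function name and the sample values (600, 300, 60) are placeholders I picked for illustration, not recommendations.

```shell
# Write a target tracking config for the built-in metric, with cooldowns.
# Usage: write_policy_config TARGET_VALUE SCALE_IN_SECS SCALE_OUT_SECS FILE
write_policy_config() {
  cat > "$4" <<EOF
{
  "TargetValue": $1,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
  },
  "ScaleInCooldown": $2,
  "ScaleOutCooldown": $3
}
EOF
}

# Placeholder values; tune these per endpoint.
write_policy_config 600 300 60 target_tracking_scaling_policy.json
```

The resulting file is passed to put-scaling-policy exactly as before, with --target-tracking-scaling-policy-configuration file://target_tracking_scaling_policy.json and SageMakerEndpointInvocationScalingPolicy as the policy name.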

Unfortunately, this isn’t called out anywhere in the documentation. I also didn’t see any examples online that point it out, which motivated me to write this post.

It might seem obvious to do this, but without explicit documentation from AWS, I was left guessing, and it took longer than I would have liked to figure out the right solution. Hopefully this post helps out anyone else who is in a similar position.