# clusters

Creates, updates, deletes, or gets a cluster resource, or lists clusters in a region.
## Overview

| Property | Value |
|---|---|
| Name | clusters |
| Type | Resource |
| Description | Resource Type definition for AWS::SageMaker::Cluster |
| Id | awscc.sagemaker.clusters |
## Fields

- get (all properties)
- list (identifiers only)

The following fields are returned by `get`:

| Name | Datatype | Description |
|---|---|---|
| cluster_arn | string | The Amazon Resource Name (ARN) of the HyperPod cluster. |
| vpc_config | object | Specifies an Amazon Virtual Private Cloud (VPC) that your SageMaker jobs, hosted models, and compute resources have access to. You can control access to and from your resources by configuring a VPC. For more information, see https://docs.aws.amazon.com/sagemaker/latest/dg/infrastructure-give-access.html. |
| node_recovery | string | If node auto-recovery is set to true, faulty nodes are replaced or rebooted when a failure is detected. If set to false, nodes are only labeled when a fault is detected. |
| instance_groups | array | The instance groups of the SageMaker HyperPod cluster. |
| restricted_instance_groups | array | The restricted instance groups of the SageMaker HyperPod cluster. |
| orchestrator | object | Specifies parameters specific to the orchestrator, for example the EKS cluster. |
| cluster_role | string | The cluster role for the autoscaler to assume. |
| node_provisioning_mode | string | Determines the scaling strategy for the SageMaker HyperPod cluster. When set to 'Continuous', enables continuous scaling, which dynamically manages node provisioning. If the parameter is omitted, the standard scaling approach from previous releases is used. |
| creation_time | string | The time at which the HyperPod cluster was created. |
| cluster_name | string | The name of the HyperPod cluster. |
| failure_message | string | The failure message of the HyperPod cluster. |
| auto_scaling | object | Configuration for cluster auto-scaling. |
| cluster_status | string | The status of the HyperPod cluster. |
| tags | array | Custom tags for managing the SageMaker HyperPod cluster as an AWS resource. You can add tags to your cluster in the same way you add them in other AWS services that support tagging. |
| region | string | AWS region. |
region | string | AWS region. |
The following fields are returned by `list`:

| Name | Datatype | Description |
|---|---|---|
| cluster_arn | string | The Amazon Resource Name (ARN) of the HyperPod cluster. |
| region | string | AWS region. |
For more information, see AWS::SageMaker::Cluster.
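The `Identifier` used in the `WHERE` clauses below is the `cluster_arn`. As a rough sketch (the exact ARN layout is an assumption based on the general SageMaker ARN convention, not something this page defines), the region and cluster id can be pulled out of it like this:

```python
def parse_cluster_arn(arn: str) -> dict:
    """Split a SageMaker HyperPod cluster ARN into its components.

    Assumed format (illustrative only):
    arn:aws:sagemaker:<region>:<account-id>:cluster/<cluster-id>
    """
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError(f"not a SageMaker ARN: {arn}")
    resource_type, _, resource_id = parts[5].partition("/")
    return {
        "region": parts[3],
        "account_id": parts[4],
        "resource_type": resource_type,
        "resource_id": resource_id,
    }

# Hypothetical example ARN, for illustration only
info = parse_cluster_arn(
    "arn:aws:sagemaker:us-east-1:123456789012:cluster/abc123example"
)
print(info["region"])       # us-east-1
print(info["resource_id"])  # abc123example
```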
## Methods
| Name | Resource | Accessible by | Required Params |
|---|---|---|---|
create_resource | clusters | INSERT | , region |
delete_resource | clusters | DELETE | Identifier, region |
update_resource | clusters | UPDATE | Identifier, PatchDocument, region |
list_resources | clusters_list_only | SELECT | region |
get_resource | clusters | SELECT | Identifier, region |
## SELECT examples

- get (all properties)
- list (identifiers only)
Gets all properties from an individual cluster.
```sql
SELECT
region,
cluster_arn,
vpc_config,
node_recovery,
instance_groups,
restricted_instance_groups,
orchestrator,
cluster_role,
node_provisioning_mode,
creation_time,
cluster_name,
failure_message,
auto_scaling,
cluster_status,
tags
FROM awscc.sagemaker.clusters
WHERE
region = 'us-east-1' AND
Identifier = '{{ cluster_arn }}';
```
Lists all clusters in a region.

```sql
SELECT
region,
cluster_arn
FROM awscc.sagemaker.clusters_list_only
WHERE
region = 'us-east-1';
```
## INSERT example

Use the following StackQL query and manifest file to create a new cluster resource, using stack-deploy.

- Required Properties
- All Properties
- Manifest
```sql
/*+ create */
INSERT INTO awscc.sagemaker.clusters (
,
region
)
SELECT
'{{ }}',
'{{ region }}';
```
```sql
/*+ create */
INSERT INTO awscc.sagemaker.clusters (
VpcConfig,
NodeRecovery,
InstanceGroups,
RestrictedInstanceGroups,
Orchestrator,
ClusterRole,
NodeProvisioningMode,
ClusterName,
AutoScaling,
Tags,
region
)
SELECT
'{{ vpc_config }}',
'{{ node_recovery }}',
'{{ instance_groups }}',
'{{ restricted_instance_groups }}',
'{{ orchestrator }}',
'{{ cluster_role }}',
'{{ node_provisioning_mode }}',
'{{ cluster_name }}',
'{{ auto_scaling }}',
'{{ tags }}',
'{{ region }}';
```
```yaml
version: 1
name: stack name
description: stack description
providers:
  - aws
globals:
  - name: region
    value: '{{ vars.AWS_REGION }}'
resources:
  - name: cluster
    props:
      - name: vpc_config
        value:
          security_group_ids:
            - '{{ security_group_ids[0] }}'
          subnets:
            - '{{ subnets[0] }}'
      - name: node_recovery
        value: '{{ node_recovery }}'
      - name: instance_groups
        value:
          - instance_group_name: '{{ instance_group_name }}'
            instance_storage_configs:
              - {}
            life_cycle_config:
              source_s3_uri: '{{ source_s3_uri }}'
              on_create: '{{ on_create }}'
            training_plan_arn: '{{ training_plan_arn }}'
            threads_per_core: '{{ threads_per_core }}'
            override_vpc_config: null
            instance_count: '{{ instance_count }}'
            on_start_deep_health_checks:
              - '{{ on_start_deep_health_checks[0] }}'
            image_id: '{{ image_id }}'
            current_count: '{{ current_count }}'
            scheduled_update_config:
              schedule_expression: '{{ schedule_expression }}'
              deployment_config:
                auto_rollback_configuration:
                  alarms:
                    - alarm_name: '{{ alarm_name }}'
                blue_green_update_policy:
                  maximum_execution_timeout_in_seconds: '{{ maximum_execution_timeout_in_seconds }}'
                  termination_wait_in_seconds: '{{ termination_wait_in_seconds }}'
                  traffic_routing_configuration:
                    canary_size:
                      type: '{{ type }}'
                      value: '{{ value }}'
                    linear_step_size: null
                    type: '{{ type }}'
                    wait_interval_in_seconds: '{{ wait_interval_in_seconds }}'
                rolling_update_policy:
                  maximum_batch_size: null
                  maximum_execution_timeout_in_seconds: '{{ maximum_execution_timeout_in_seconds }}'
                  rollback_maximum_batch_size: null
                  wait_interval_in_seconds: '{{ wait_interval_in_seconds }}'
            instance_type: '{{ instance_type }}'
            execution_role: '{{ execution_role }}'
      - name: restricted_instance_groups
        value:
          - override_vpc_config: null
            instance_count: '{{ instance_count }}'
            on_start_deep_health_checks: null
            environment_config:
              f_sx_lustre_config:
                size_in_gi_b: '{{ size_in_gi_b }}'
                per_unit_storage_throughput: '{{ per_unit_storage_throughput }}'
            instance_group_name: null
            instance_storage_configs: null
            current_count: '{{ current_count }}'
            training_plan_arn: '{{ training_plan_arn }}'
            instance_type: null
            threads_per_core: '{{ threads_per_core }}'
            execution_role: null
      - name: orchestrator
        value:
          eks:
            cluster_arn: '{{ cluster_arn }}'
      - name: cluster_role
        value: '{{ cluster_role }}'
      - name: node_provisioning_mode
        value: '{{ node_provisioning_mode }}'
      - name: cluster_name
        value: '{{ cluster_name }}'
      - name: auto_scaling
        value:
          mode: '{{ mode }}'
          auto_scaler_type: '{{ auto_scaler_type }}'
      - name: tags
        value:
          - value: '{{ value }}'
            key: '{{ key }}'
```
## UPDATE example

Use the following StackQL query and manifest file to update a cluster resource, using stack-deploy.
```sql
/*+ update */
UPDATE awscc.sagemaker.clusters
SET PatchDocument = string('{{ {
"NodeRecovery": node_recovery,
"ClusterRole": cluster_role,
"NodeProvisioningMode": node_provisioning_mode,
"AutoScaling": auto_scaling,
"Tags": tags
} | generate_patch_document }}')
WHERE
region = '{{ region }}' AND
Identifier = '{{ cluster_arn }}';
```
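Cloud Control-style updates apply an RFC 6902 JSON Patch document to the resource's desired state. As a rough sketch of the shape the `PatchDocument` above takes (this helper is illustrative only; StackQL's own `generate_patch_document` filter may behave differently):

```python
import json

def generate_patch_document(props: dict) -> str:
    """Illustrative sketch: turn a dict of updated properties into an
    RFC 6902 JSON Patch document of 'replace' operations."""
    patch = [
        {"op": "replace", "path": f"/{name}", "value": value}
        for name, value in props.items()
    ]
    return json.dumps(patch)

# Hypothetical property values, for illustration only
doc = generate_patch_document({
    "NodeRecovery": "Automatic",
    "NodeProvisioningMode": "Continuous",
})
print(doc)
```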
## DELETE example

```sql
/*+ delete */
DELETE FROM awscc.sagemaker.clusters
WHERE
Identifier = '{{ cluster_arn }}' AND
region = 'us-east-1';
```
## Permissions

To operate on the clusters resource, the following permissions are required:

### Read

```
sagemaker:DescribeCluster,
sagemaker:ListTags
```

### Create

```
sagemaker:CreateCluster,
sagemaker:DescribeCluster,
sagemaker:UpdateClusterSoftware,
sagemaker:AddTags,
sagemaker:ListTags,
sagemaker:BatchAddClusterNodes,
sagemaker:BatchDeleteClusterNodes,
eks:DescribeAccessEntry,
eks:DescribeCluster,
eks:CreateAccessEntry,
eks:DeleteAccessEntry,
eks:AssociateAccessPolicy,
iam:CreateServiceLinkedRole,
iam:PassRole,
kms:DescribeKey,
kms:CreateGrant,
ec2:DescribeImages,
ec2:DescribeSnapshots,
ec2:ModifyImageAttribute,
ec2:ModifySnapshotAttribute
```

### Update

```
sagemaker:UpdateCluster,
sagemaker:UpdateClusterSoftware,
sagemaker:DescribeCluster,
sagemaker:ListTags,
sagemaker:AddTags,
sagemaker:DeleteTags,
sagemaker:BatchAddClusterNodes,
sagemaker:BatchDeleteClusterNodes,
eks:DescribeAccessEntry,
eks:DescribeCluster,
eks:CreateAccessEntry,
eks:DeleteAccessEntry,
eks:AssociateAccessPolicy,
iam:PassRole,
kms:DescribeKey,
kms:CreateGrant,
ec2:DescribeImages,
ec2:DescribeSnapshots,
ec2:ModifyImageAttribute,
ec2:ModifySnapshotAttribute
```

### List

```
sagemaker:ListClusters
```

### Delete

```
sagemaker:DeleteCluster,
sagemaker:DescribeCluster,
eks:DescribeAccessEntry,
eks:DeleteAccessEntry
```
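The read-side actions above can be granted with a standard IAM policy document. A minimal sketch (the broad `Resource: "*"` is for illustration only; in practice you would scope it to specific cluster ARNs):

```python
import json

# IAM policy granting only the Read permissions listed above.
read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeCluster",
                "sagemaker:ListTags",
            ],
            # Illustrative only; scope to cluster ARNs in real policies.
            "Resource": "*",
        }
    ],
}

print(json.dumps(read_policy, indent=2))
```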