Policy Language

The policies/ directory in the configuration repository contains the c7n policies, one per file. Policies must have a name that matches their filename (NAME.yml) and should all have a “comment” or “comments” section that provides a human-readable summary of what the policy does (this comment is used to generate the Current Policies documentation).

All policies are built on top of defaults.yml; see Defaults merging for further information.

Policies are built via the policygen command (or the manheim-c7n-tools policygen step), which runs Policygen and generates per-region custodian_REGION.yml files.

Policy Repository Layout

The overall layout of the configuration repository must be as follows:

mailer-templates/  (optional)
manheim-c7n-tools.yml
policies/
├── all_accounts
│   └── common
│       └── policy-one.yml
├── defaults.yml
└── ACCOUNT-NAME
    ├── common
    │   ├── policy-three.yml
    │   └── policy-two.yml
    ├── us-east-1
    │   ├── policy-five-us-east-1.yml
    │   └── policy-four-us-east-1.yml
    ├── us-east-2
    │   └── policy-four-us-east-2.yml
    ├── us-west-1
    │   └── policy-four-us-west-1.yml
    └── us-west-2
        └── policy-four-us-west-2.yml

The policies/ directory contains:

  • defaults.yml, the defaults used for ALL policies in all accounts (see Defaults merging for further information).

  • An all_accounts/ directory of policies shared identically across all accounts.

  • A directory of account-specific policies for each account; the directory name must match the account_name value in manheim-c7n-tools.yml.

Within each subdirectory (all_accounts or an account name) is a directory called common and optionally directories for one or more specific regions. Policies in the common/ directory will be applied in all regions and policies in a region-specific directory will only be applied in that region.

When building the final configuration, policies from the account-specific directory will be layered on top of policies from the all_accounts/ directory. A policy with the exact same file name and policy name in a per-account directory will override the policy of that name from the all_accounts/ directory. Similarly, within the all_accounts/ or account-named directories, a region-specific policy will override a common/ policy with the same name and filename.

An example configuration repository can be seen at https://github.com/manheim/manheim-c7n-tools/tree/master/example_config_repo.

Multiple Repository Layout

In a multiple-repository setup, each policy source listed in the policy_source_paths configuration item has its own subdirectory under policies/ (app, common, and team in the example below):

mailer-templates/  (optional)
manheim-c7n-tools.yml
policies/
├── app
│   ├── defaults.yml
│   ├── mailer-templates/  (optional)
│   └── ACCOUNT-NAME
│       ├── us-east-1
│       │   ├── policy-five-us-east-1.yml
│       │   └── policy-six-us-east-1.yml
│       ├── us-east-2
│       │   └── policy-six-us-east-2.yml
│       ├── us-west-1
│       │   └── policy-six-us-west-1.yml
│       └── us-west-2
│           └── policy-six-us-west-2.yml
├── common
│   ├── all_accounts
│   │   └── common
│   │       └── policy-one.yml
│   ├── defaults.yml
│   ├── mailer-templates/  (optional)
│   └── ACCOUNT-NAME
│       └── common
│           ├── policy-three.yml
│           └── policy-two.yml
└── team
    ├── all_accounts
    │   └── common
    │       └── policy-six.yml
    ├── mailer-templates/  (optional)
    └── ACCOUNT-NAME
        ├── us-east-1
        │   ├── policy-five-us-east-1.yml
        │   └── policy-four-us-east-1.yml
        ├── us-east-2
        │   └── policy-four-us-east-2.yml
        ├── us-west-1
        │   └── policy-four-us-west-1.yml
        └── us-west-2
            └── policy-four-us-west-2.yml

An example configuration for a multiple repository setup can be seen at https://github.com/manheim/manheim-c7n-tools/tree/master/example_config_multi_repo.

If mailer-templates/ directories are present in one or more of the subdirectories, their contents will be combined into ./mailer-templates/, with later files of the same name overwriting earlier ones according to the order defined in the policy_source_paths configuration item.

Policy Interpolation

When Policygen generates configuration files for each AWS Region that we deploy into, it replaces all instances of the string %%AWS_REGION%% with the specific region name. As such, the %%AWS_REGION%% macro must be used in all policies, as well as in the mailer config, wherever the current region needs to be referenced.

The list of regions that we generate configs for is taken from the regions key of manheim-c7n-tools.yml.

There are also some other values from manheim-c7n-tools.yml (the ManheimConfig class) that can be interpolated in the policies:

String                 Config Value               Description
%%AWS_REGION%%         n/a                        Replaced with the current region name, for each per-region config
%%BUCKET_NAME%%        output_s3_bucket_name      Name of the S3 bucket used for cloud-custodian output
%%LOG_GROUP%%          custodian_log_group        Name of the CloudWatch Log Group for custodian to log to
%%DLQ_ARN%%            dead_letter_queue_arn      ARN of the Dead Letter Queue for Custodian Lambdas
%%ROLE_ARN%%           role_arn                   ARN of the IAM Role to run Custodian functions with
%%MAILER_QUEUE_URL%%   mailer_config.queue_url    c7n-mailer SQS queue URL
%%ACCOUNT_NAME%%       account_name               Configured name of the current AWS account
%%ACCOUNT_ID%%         account_id                 Configured ID of the current AWS account

In addition, any POLICYGEN_ENV_-prefixed environment variables present when policygen is run will be interpolated into the configuration. Running policygen with a POLICYGEN_ENV_foo environment variable set to bar will result in all occurrences of %%POLICYGEN_ENV_foo%% in the configuration being replaced with bar.
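
For example, a minimal sketch of a notify fragment that references the current account name, region, and mailer queue via these macros (the subject text is illustrative only):

actions:
  - type: notify
    subject: '[cloud-custodian %%ACCOUNT_NAME%%] Matched resources in %%AWS_REGION%%'
    transport:
      queue: '%%MAILER_QUEUE_URL%%'
      type: sqs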

Anatomy of a Policy

Policies in this repository are augmented with the contents of defaults.yml according to the rules described under Defaults merging.

As an example, our onhour-start-ec2 policy contains:

# REMINDER: defaults.yml will be merged in to this. See the README.
name: onhour-start-ec2
comments: Start tagged EC2 Instances daily at 06:00 Eastern, or per tag value
resource: ec2
filters:
  - type: onhour
    onhour: 6
    default_tz: America/New_York
    tag: custodian_downtime
actions:
  - start
  - type: notify
    violation_desc: The following EC2 Instance(s)
    action_desc: have been started per onhour configuration
    subject: '[cloud-custodian {{ account }}] Onhour Started EC2 Instances in {{ region }}'
mode:
  schedule: rate(1 hour)

And our defaults.yml contains:

actions:
  - type: notify
    questions_email: foo@example.com
    questions_slack: our-channel
    template: redefault.html
    to:
      - resource-owner
      - 'splunkhec://%%POLICYGEN_ENV_SPLUNK_INDEX%%'
    owner_absent_contact:
      - bar@example.com
      - baz@example.com
    transport:
      queue: 'https://sqs.us-east-1.amazonaws.com/111111111111/cloud-custodian-111111111111'
      type: sqs
mode:
  execution-options: {log_group: /cloud-custodian/111111111111/us-east-1, output_dir: 's3://c7n-logs-111111111111/logs'}
  role: arn:aws:iam::111111111111:role/cloud-custodian-111111111111
  schedule: rate(1 hour)
  tags: {Component: onhour-start-ec2, Environment: dev, OwnerEmail: foo@example.com,
    Project: cloud-custodian}
  timeout: 300
  type: periodic

After merging with defaults.yml, the policy for the us-east-1 region of a sample “dev” account becomes (this example has been manually sorted to look more like the original, above; the actual output will have keys sorted alphabetically):

name: onhour-start-ec2
comments: Start tagged EC2 Instances daily at 06:00 Eastern, or per tag value
resource: ec2
filters:
  - type: onhour
    onhour: 6
    default_tz: America/New_York
    tag: custodian_downtime
actions:
  - start
  - type: notify
    violation_desc: The following EC2 Instance(s)
    action_desc: have been started per onhour configuration
    subject: '[cloud-custodian {{ account }}] Onhour Started EC2 Instances in {{ region }}'
    questions_email: foo@example.com
    questions_slack: our-channel
    template: redefault.html
    to:
      - resource-owner
      - 'splunkhec://%%POLICYGEN_ENV_SPLUNK_INDEX%%'
    owner_absent_contact:
      - bar@example.com
      - baz@example.com
    transport:
      queue: 'https://sqs.us-east-1.amazonaws.com/111111111111/cloud-custodian-111111111111'
      type: sqs
mode:
  execution-options: {log_group: /cloud-custodian/111111111111/us-east-1, output_dir: 's3://c7n-logs-111111111111/logs'}
  role: arn:aws:iam::111111111111:role/cloud-custodian-111111111111
  schedule: rate(1 hour)
  tags: {Component: onhour-start-ec2, Environment: dev, OwnerEmail: foo@example.com,
    Project: cloud-custodian}
  timeout: 300
  type: periodic

The full list of top-level keys valid for a policy can be found by viewing the source code of c7n.schema.generate or via the custodian CLI schema command, but the above example illustrates the keys that most, if not all, of our policies will have.

  • name - The unique name of the policy. For this repo, the filename must be the policy name with a .yml suffix.

  • comments - A one- or two-sentence description of what the policy does. The Jenkins deployment job extracts all of these and uses them to build the generated documentation for the configuration repo.

  • resource - The AWS resource type that this policy acts on; e.g. ec2, asg, rds, etc. Supported resource types can be found in the upstream documentation; see the "type" attributes (strings) of the various c7n.resources classes.

  • filters - Filters tell a policy which resources it should match. The filters key here is an array/list of 0 or more filters to select resources that the policy should match. Multiple filters are and-ed together, unless you nest them under an or block (see the upstream documentation on collection operators). See the Filters section, below, for more information.

  • actions - Actions tell c7n what to do with or about resources that the filters matched. The actions key here is an array/list of 0 or more actions for this policy to take. See the Actions section, below, for more information.

  • mode - The mode key determines how the policy will be deployed and run. See the Mode section, below, for more information.

  • notify_only - This is a manheim-c7n-tools addition, which is used internally and removed from the policy before Policygen generates the final YAML files for custodian. See Notify-Only Option for Policies for further information.

  • disable - This is a manheim-c7n-tools addition, which is used internally and removed from the policy before Policygen generates the final YAML files for custodian. See Disabling a policy for further information.

Filters

Cloud-custodian has support for many different kinds of filters to match various resource attributes. Upstream documentation exists on both the Generic filters and the AWS-specific filters. In addition to that manually-curated documentation, there is also generated documentation for the generic and resource-specific filters, as well as the source code for each (which is linked from that documentation).

  • The Generic value filters can match any attribute of the resource instance, which is generally the return value of the Describe AWS API call for the resource type. There are also some transformations that can be performed on the values, such as type conversion, array counting, normalization (lower-casing), or calculating age from a date; see the short sketch after this list.

  • VPC filters for things like subnet, security groups, etc.

  • IAM filters to assist with finding cross-account or public access in policies.

  • Health filters to identify resources with associated AWS Health events.

  • Metric filters to retrieve and filter based on CloudWatch metrics for resources.

  • The offhours filters.
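
As a brief sketch of the generic value filter (the tag name and age threshold here are illustrative only, not taken from our policies):

filters:
  - {'tag:OwnerEmail': absent}      # shorthand tag filter: the tag is missing
  - type: value                     # generic value filter with an age transformation
    key: LaunchTime
    value_type: age
    op: greater-than
    value: 30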

Actions

Note

manheim-c7n-tools’ Notify-Only Option for Policies can affect the actions specified in a policy. See that section for more information.

Cloud-custodian has both generic/global actions (such as notify) and resource-specific actions (such as stop and start). Some actions are specified as only a string (e.g. stop or start), whereas others must be specified as a dictionary/hash/mapping that includes configuration options.
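
As a quick illustration of the two forms, a hypothetical actions list (the tag key and value are illustrative only):

actions:
  - stop                       # string form: a resource-specific action with no options
  - type: tag                  # mapping form: an action that takes configuration options
    key: c7n-stopped-by
    value: cloud-custodian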

Global actions include:

  • Notify - Send email to static addresses, or addresses from tags on the resource, via c7n_mailer. Our defaults include configuration required for using this action with our c7n_mailer instance. The only configuration needed to make this action work is as shown in the example above; specifically, the type: notify key and the subject, violation_desc and action_desc keys.

  • invoke-lambda - Invoke an arbitrary Lambda function, passing it details of the policy, action, triggering event, and matched resource(s).

  • modify-security-groups - Modify the security groups assigned to a resource.

  • put-metric - Send a custom metric to CloudWatch.

To identify available resource-specific actions, either find the appropriate resource type module in the cloud-custodian AWS documentation or the c7n source code and find all classes in it that are based on c7n.actions.Action, or use the custodian schema command line tool. There is also manually-curated documentation on resource-specific filters and actions that is helpful but incomplete.

In addition to notify, some of our most-used actions are the various resource-specific stop or suspend and start or resume actions, the terminate or delete actions, and the resource-specific actions to add/modify/delete tags and to tag (“mark”) a resource for later action.

Marking Resources for Later Action

IMPORTANT: See the Data Collection/Notification to Action Transition section, below.

c7n has built-in logic for using tags to “mark” resources for action at a future time. Note that these actions are actually resource-specific, and unfortunately some of them have different names on different resources.

The following snippet will mark matched resources with a c7n-foo tag, with a value of the specified message. In the message, {op} will be replaced with the operation (delete) and {action_date} will be replaced with the date when the action should occur (in this example, the current time plus 5 days).

filters:
  # not tagged for this policy; otherwise, we'd just keep pushing the mark date forward
  - {'tag:c7n-foo': absent}
actions:
  - type: mark-for-op
    tag: c7n-foo
    op: delete
    message: "asg-inactive-mark: {op}@{action_date}"
    days: 5

In a separate policy, we can then filter for resources which were marked for a specific action at or before the current date/time with the marked-for-op filter:

filters:
  - type: marked-for-op
    tag: c7n-asg-inactive
    op: delete

That example will filter all resources that were marked for deletion at or before the current time, with the c7n-asg-inactive tag.

The skew parameter on the marked-for-op filter skews the current date by adding a number of days to it. This allows us to filter for resources that are marked for an operation N days in the future, e.g. to send out a warning notification ahead of time. The following filter will match the same resources as the previous example, but two days earlier.

filters:
  - type: marked-for-op
    tag: c7n-asg-inactive
    op: delete
    skew: 2

The combination of these actions and filters is commonly used to build a “group” of four complementary policies (a sketch of the -unmark policy appears after this list):

  1. A -mark policy matches desired resources with a filter and uses the mark-for-op action to tag them for action at a later date. Note that it is extremely important to make sure the policy also includes a filter to exclude resources that already have the marking tag present; if not, the date to take action will continually move forward every time the policy runs, and the action will never be taken.

  2. An -unmark policy matches resources that have the mark tag present but no longer meet the desired criteria, and removes the mark tag from them. For example: if we’re writing a policy to identify and terminate EC2 instances lacking required tags, the -unmark policy would match resources that were previously marked by its counterpart (1) but now have the required tags, and would remove the marking tag from them.

  3. An early-action policy using skew that warns owners of impending action, and may take some preliminary action (e.g. stopping an EC2 instance a few days before it will be terminated).

  4. A termination/deletion policy that takes the final action.
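
For example, a rough sketch of the -unmark policy for the EC2 required-tags scenario described above (the policy name, marking tag, and required tag are all hypothetical):

name: ec2-required-tags-unmark
comments: Unmark EC2 Instances that were marked for termination but now have the required tags
resource: ec2
filters:
  - {'tag:c7n-ec2-required-tags': not-null}   # previously marked by the -mark policy...
  - {'tag:OwnerEmail': not-null}              # ...but the required tag is now present
actions:
  - type: remove-tag
    tags:
      - c7n-ec2-required-tags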

Mode

We have standardized on deploying our policies as Lambda functions, to take advantage of c7n’s excellent Lambda support (see the upstream Cloud Custodian Lambda documentation). The type key of the mode section of the policy defines how the policy will be deployed and executed. defaults.yml should specify everything needed to deploy a policy in periodic mode. If the mode section is completely omitted from a policy, the default periodic mode will be applied.

Supported mode type options for Lambda functions include:

  • periodic - (default for our policies) runs on a set schedule using timer-based CloudWatch Events as a trigger.

  • cloudtrail - runs every time a CloudTrail event of a certain type is received. Note that tags may not have been applied to resources yet when this triggers.

  • ec2-instance-state - runs every time an EC2 Instance enters the specified state (e.g. running, stopped, pending, etc). Note that tags may not have been applied to instances yet when this triggers.

  • config-rule - triggers via AWS Config rules. Note that not all resource types are supported by AWS Config; see the AWS Config - Supported Resources documentation for a list of which resource types are supported.

For full documentation on the required and optional configuration keys for each mode, see the upstream documentation.
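
As a sketch of a non-default mode, a hypothetical cloudtrail-mode mode section (the role ARN matches the earlier defaults example and is illustrative; RunInstances is assumed to be one of c7n’s shorthand event names):

mode:
  type: cloudtrail
  role: arn:aws:iam::111111111111:role/cloud-custodian-111111111111
  events:
    - RunInstances        # shorthand for the ec2.amazonaws.com RunInstances CloudTrail event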

Other keys under the mode section include:

  • role - The IAM role that the policy executes under; all policies should use the same terraform-managed role.

  • tags - Tags to apply to the Lambda function. policygen.py will add the policy name as the Component tag.

  • timeout - The timeout, in seconds, for the Lambda function. This should be left at the default (maximum) of 300.

  • execution-options - Internal options of the Lambda function. Our defaults send logs to a CloudWatch log group, send output to an S3 bucket, and set up the Dead Letter Queue.

Disabling a policy

It is possible to disable a policy. Simply setting the disable key in a policy to true will stop that policy from being deployed.

name: onhour-start-ec2
comments: Start tagged EC2 Instances daily at 06:00 Eastern, or per tag value
resource: ec2
disable: true

The disable key can be added to an existing policy to temporarily disable the policy, and can also be used to disable a policy inherited from a higher-level policy source location by creating a new policy with the same name as the inherited policy and adding the disable key. Only the name and disable keys are required in this case, though adding a comment can help explain why the policy is disabled.

name: onhour-start-ec2
comments: Disabled due to ...
disable: true

Data Collection/Notification to Action Transition

A common pattern that we use when testing new policies is to set them up to either only send email notifications or only collect data, analyze that data, and then enable real actions (e.g. stop, terminate, delete) after some data has been collected. However, it is very important to note that if a “testing only” policy used the mark-for-op action to tag a resource for later action, and actions are later enabled for the corresponding policies, those actions might be taken immediately upon enabling, as a result of the “notify only” policies having marked resources for action.

As of version 1.3.0, manheim-c7n-tools supports a Notify-Only Option for Policies flag to help simplify this transition. For older versions, or policies that existed prior to 1.3.0, see the following section on manual tag cleanup.

Manual Tag Cleanup

As a result, when adding actions to policies that have been running in data collection mode, it’s important to manually purge the relevant tags so the policies don’t take any action based on tags applied during data collection.

For example, if you’re adding a “delete” action to policies that were previously only collecting data and included a mark action like:

- type: mark-for-op
  tag: c7n-foo-policy
  op: delete
  message: "foo-mark {op}@{action_date}"
  days: 7

Before enabling the real delete action, you should purge all of those tags with something like (example for EC2 instances):

TAGNAME=c7n-foo-policy
for i in $(aws ec2 describe-instances --filters "Name=tag-key,Values=$TAGNAME" --output text --query 'Reservations[*].Instances[*].[InstanceId]')
do
  echo "removing tag from: $i"
  aws ec2 delete-tags --resources "$i" --tags "Key=$TAGNAME"
done

Notify-Only Option for Policies

As described above in Data Collection/Notification to Action Transition, it’s common to want to run new policies in a “notify only” mode that sends notifications (and collects data) but does not yet take action, to assess those notifications, and then to enable the real actions at a later date.

To support this, manheim-c7n-tools (specifically Policygen) supports the addition of a boolean notify_only option at the top level of policy files, or in defaults.yml for account- / repository-wide notify-only. Setting this flag will cause Policygen to pass the affected policies through NotifyOnlyPolicy for pre-processing. This will cause the following changes to the final YAML policy (an illustrative before-and-after example follows the list):

  • The comment / comments / description fields will be prefixed with the string NOTIFY ONLY:

  • If the policy has a tags list, a notify-only tag will be appended to it.

  • All tagging actions will have the string -notify-only appended to their tag names, to automate the above-described transition. Specifically:

    • Any mark or tag actions in the actions list will have the string -notify-only appended to their tag or key values (if present) or appended to every item in their tags list (if present). If none of the above are present, the tag item will be set to custodian’s DEFAULT_TAG value, with -notify-only appended.

    • Any mark-for-op actions will have the string -notify-only appended to their tag value. If they do not already have a tag value, it will be set to custodian’s DEFAULT_TAG value, with -notify-only appended.

    • Any remove-tag / unmark / untag actions will have the string -notify-only appended to all items in their tags list.

  • All notify actions will have their violation_desc, if present, prefixed with NOTIFY ONLY:. Their action_desc, if present, will be prefixed with in the future (currently notify-only).

  • Any filters items with tag:NAME keys, which match up with NAME tags used in mark-for-op actions, will be updated to tag:NAME-notify-only to retain their intended functionality.

  • All other action types, not listed above, will be removed from the policy. We enforce notify-only by only retaining specifically whitelisted actions in the policy.
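
As an illustration of these rules, a hypothetical before-and-after fragment (the tag name and descriptions are illustrative only; exact output formatting may differ):

# Before notify_only processing
actions:
  - type: mark-for-op
    tag: c7n-foo
    op: delete
    days: 7
  - type: notify
    violation_desc: The following resources
    action_desc: will be deleted

# After notify_only processing
actions:
  - type: mark-for-op
    tag: c7n-foo-notify-only
    op: delete
    days: 7
  - type: notify
    violation_desc: 'NOTIFY ONLY: The following resources'
    action_desc: in the future (currently notify-only) will be deleted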