Testing IaC with Azure Resource Graph - Part 1

Adham · 10 min read

This post is for teams that treat infrastructure as real software. If you like to do TDD for IaC, or you prefer writing your spec first and then proving the deployment matches it, this approach fits very well. It is also useful when you want regression tests for infrastructure, so changes do not quietly break security settings, tags, or naming rules over time. And if your goal is to verify standards across your codebase in a repeatable way, you can do that with simple query files.

This is Part 1, and we will do it the hard way on purpose to show the internals clearly. We will use raw KQL files plus shell scripts so you can see exactly what happens. In Part 2, we will take the easier path with a more streamlined setup.

Part 2 (easier path) will be linked here when it is published.

Azure Resource Graph lets you query the actual state of your resources using KQL, the same query language used in Log Analytics and Microsoft Sentinel. You write .kql files, run them with az graph query, and treat the results as test output. No test framework, no SDK, no Python glue code. Just queries in version control and a shell script in your pipeline.

What is Azure Resource Graph?

Azure Resource Graph is a read-only, queryable index of every Azure resource across your subscriptions. It updates in near-real-time, roughly one minute of lag after a deployment, and it supports rich filtering across properties, tags, and nested fields like security rules and SKU configuration.

The big advantage over calling individual resource APIs is scale. One KQL query can scan thousands of resources across multiple subscriptions in seconds, without you writing loops or managing rate limits. You can access Resource Graph through the Azure Portal (the Resource Graph Explorer), the REST API, the Azure CLI, and even as a data source inside a Terraform plan.

The query language is KQL. If you have used Log Analytics before, it is the same syntax. If you have not, the basics are simple: start with a table name, pipe into operators like where, project, and mv-expand, and shape the output. It is worth learning even if you only use it for this one purpose.
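As a minimal illustration of that shape (not one of the test queries later in this post, just a sketch), the following lists storage accounts with their locations:

```kql
// Start with a table, pipe into operators, shape the output.
Resources
| where type == "microsoft.storage/storageaccounts"
| project name, location
| order by name asc
```

Every query in this post follows the same pattern: a table, a filter, a projection.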

Setup

You need the resource-graph extension for the Azure CLI. It is not installed by default. You also need jq because the CI check script parses JSON output from az graph query.

az extension add --name resource-graph
# test
az graph query -q "Resources | count"

If that last command returns a JSON object with a count field, you are ready to go.

The Approach: Query Files and a Runner Script

The idea is simple. You keep your KQL queries as individual .kql files in a directory, one file per check. A shell script reads each file and runs it through az graph query. You check the results in CI.

This approach works well because the query files are just text. They live in version control. You review them in pull requests. You can share them across projects. There is no test runner to install and no language runtime to pin.

I am not including the full directory tree in this post. The scripts and query files are in the repo under tools/az-graph-testing/, and you can use that path as a reference.

And here is the runner script that iterates over all queries and prints the results as a table:

#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
QUERY_DIR="$SCRIPT_DIR/queries"

for kql_file in "$QUERY_DIR"/*.kql; do
  query="$(grep -v '^//' "$kql_file" | tr -s '\n' ' ')"   # Strip comment lines starting with //
  echo "=== $(basename "$kql_file" .kql) ==="
  az graph query --graph-query "$query" --output table
  echo ""
done
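To see what the comment-stripping pipeline in the runner actually does, you can run it on an inline sample (the query below is a made-up example, not a file from the repo):

```shell
# Demonstrate the runner's pipeline: drop lines starting with //
# and flatten the remaining lines into a single-line query string.
kql='// Returns all resources
Resources
| project name'
query="$(printf '%s\n' "$kql" | grep -v '^//' | tr -s '\n' ' ')"
echo "$query"
```

The flattening matters because az graph query expects the whole query as one argument; newlines inside .kql files are purely for readability.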

Run it locally after a deployment to get a quick table overview of what is actually in your subscription. Run it in CI to catch regressions before they reach production.

The Queries

Let me walk through each query and explain what it does and why it matters. For queries that function as tests, a passing result is always zero rows. If the query returns any rows, something is wrong and your CI should fail.

For reporting queries (like the provisioner breakdown), rows are expected and informative.

These queries lay the foundation for the streamlined setup in the next post.

Tag by provisioner: who created what?

If you mix Bicep and Terraform in the same subscription, or you have some legacy resources deployed by hand, tagging resources at creation time with a provisionedBy tag is very useful. This query shows you the breakdown.

// Requires a tag key "provisionedBy" with values like "bicep" or "terraform".
Resources
| where isnotempty(tags['provisionedBy'])
| project name, type, location, provisioner=tags['provisionedBy']
| order by provisioner asc, type asc

In your Bicep modules, set the tag at the resource level:

tags: {
  provisionedBy: 'bicep'
  environment: 'prod'
}

In Terraform, the azurerm provider does not offer an AWS-style default_tags block, so the usual approach is to define a shared tags variable or local and combine it with per-resource tags using merge(). Either way, once the tag is there, this query makes auditing straightforward.

Tag compliance: resources missing provisionedBy

The previous query shows who created what; this one returns the resources that lack the provisionedBy tag entirely. A passing result is zero rows.

// Returns resources missing the provisionedBy tag.
Resources
| where isempty(tags['provisionedBy'])
| project name, type, resourceGroup, tags
| order by type asc, name asc

You can run this in CI with the check-query.sh helper shown later in this post.

Missing required tags

This is the first query that functions as a real test. It returns resources that are missing at least one of your required tags. If it returns anything, something is wrong.

// Returns resources missing one or more required tags.
Resources
| where not(
    isnotempty(tags['environment']) and
    isnotempty(tags['owner']) and
    isnotempty(tags['costCenter'])
  )
| project name, type, resourceGroup, tags

Adjust the tag keys to match your organization’s standards. The logic is simple: if the not(…) block matches, the resource is out of compliance. Zero rows is a passing result.

Naming convention enforcement

Naming conventions are easy to document and easy to violate, especially when people are moving fast. This query checks that your resources follow the prefixes your team agreed on.

// Returns resources that violate naming convention prefixes.
Resources
| where (type == "microsoft.storage/storageaccounts" and name !startswith "st")
      or (type == "microsoft.compute/virtualmachines" and name !startswith "vm")
      or (type == "microsoft.network/virtualnetworks" and name !startswith "vnet")
| project name, type, resourceGroup

You can extend this with as many resource types and prefix rules as you need. The pattern stays the same.

Security: storage accounts not enforcing HTTPS

This one should be a hard failure in any pipeline that touches production. If a storage account allows HTTP traffic, that is a security risk.

// 05-storage-https-only.kql
// Returns storage accounts where HTTPS-only is not enforced.
Resources
| where type == "microsoft.storage/storageaccounts"
| where properties.supportsHttpsTrafficOnly != true
| project name, resourceGroup, httpsOnly=properties.supportsHttpsTrafficOnly

Notice that this query reaches into the properties bag. Resource Graph gives you access to the full resource payload, not just top-level metadata. That makes it possible to check security configuration without calling the storage API directly.
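The same pattern extends to any nested property. As one more illustrative sketch (assuming the standard storage account payload, where minimumTlsVersion carries values like "TLS1_2"), you could flag accounts not pinned to TLS 1.2:

```kql
// Illustrative: another properties-bag check.
// Returns storage accounts whose minimum TLS version is not TLS 1.2.
Resources
| where type == "microsoft.storage/storageaccounts"
| where properties.minimumTlsVersion != "TLS1_2"
| project name, resourceGroup, tls=properties.minimumTlsVersion
```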

Security: NSG rules exposing sensitive ports

This is the most complex query in the set. It uses mv-expand to unroll the array of security rules inside each NSG, then filters for rules that allow inbound access from the internet on ports you care about.

// 06-open-nsg-ports.kql
// Returns NSG rules that allow inbound access from the internet on sensitive ports.
Resources
| where type == "microsoft.network/networksecuritygroups"
| mv-expand rule=properties.securityRules
| where rule.properties.direction == "Inbound"
      and rule.properties.access == "Allow"
      and rule.properties.sourceAddressPrefix in ("*", "Internet", "0.0.0.0/0")
      and rule.properties.destinationPortRange in ("22", "3389", "1433", "*")
| project nsgName=name, ruleName=rule.name, port=rule.properties.destinationPortRange

mv-expand is one of the most useful operators in KQL for infrastructure work. Any time a resource has an array in its properties (rules, endpoints, configurations), mv-expand lets you filter across individual items instead of treating the array as an opaque blob.

Real example: AKS module spec compliance

This is a practical test from real infrastructure development. Suppose you are the developer or maintainer of an AKS module, and your specification requires every cluster to have at least one system pool and one user pool. More than two pools is fine, but you never want to deploy a cluster with fewer than two or without both types. This query can be part of your module tests and catch those mistakes downstream before they reach production.

// AKS clusters must have at least one system pool and one user pool.
// Returns clusters that do not meet this spec.
Resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand pool=properties.agentPoolProfiles
| summarize systemCount=countif(pool.mode == "System"), userCount=countif(pool.mode == "User"), poolCount=dcount(pool.name) by name, resourceGroup
| where systemCount < 1 or userCount < 1
| project name, resourceGroup, systemPools=systemCount, userPools=userCount, totalPools=poolCount

The summarize operator counts up how many pools of each type exist on each cluster, then the where clause fails if either count is missing. This is the kind of test you run every deployment to catch a mistake in your module before it reaches production.

The “Zero Result = Pass” Pattern in CI

Security and compliance queries are written so they only return rows when something is wrong. That makes them easy to use as CI tests. You run the query, check the count field in the JSON response, and fail the pipeline if it is greater than zero.

Here is a reusable script you can drop into any pipeline:

#!/usr/bin/env bash
set -euo pipefail

QUERY_FILE="$1"
DESCRIPTION="$2"

count=$(az graph query \
  --graph-query "$(grep -v '^//' "$QUERY_FILE" | tr -s '\n' ' ')" \
  --output json | jq '.count')

if [[ "$count" -gt 0 ]]; then
  echo "FAIL: $count $DESCRIPTION"
  exit 1
else
  echo "PASS: $DESCRIPTION"
fi
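You can sanity-check this pass/fail logic without touching Azure by feeding jq a canned response (the JSON below is a made-up sample shaped like the az graph query envelope, with a count field and a data array):

```shell
# Simulate the JSON envelope az graph query returns and apply
# the same count check the CI helper uses.
response='{"count": 2, "data": [{"name": "stbad1"}, {"name": "stbad2"}]}'
count=$(printf '%s' "$response" | jq '.count')
if [ "$count" -gt 0 ]; then
  echo "FAIL: $count storage accounts without HTTPS-only"
else
  echo "PASS: storage accounts without HTTPS-only"
fi
# prints: FAIL: 2 storage accounts without HTTPS-only
```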

In a GitHub Actions workflow you might use it like this:

- name: Check storage HTTPS enforcement
  run: |
    ./tools/az-graph-testing/check-query.sh \
      tools/az-graph-testing/queries/storage-https-only.kql \
      "storage accounts without HTTPS-only"

- name: Check open NSG ports
  run: |
    ./tools/az-graph-testing/check-query.sh \
      tools/az-graph-testing/queries/open-nsg-ports.kql \
      "NSG rules exposing sensitive ports to the internet"

Each step fails fast and gives you a clear message. No parsing, no framework. Just shell and KQL.

When Not to Use This Approach

Resource Graph has a propagation delay. After a deployment finishes, it typically takes about a minute for the index to reflect the new state. Do not use these queries for immediate post-deploy validation in the same pipeline step as the deployment itself. Add sleep 60 or, better, move the validation into a separate pipeline job that runs after a short wait.
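A separate job might look like this sketch (job names and the azure/login step, which is omitted here, are assumptions; adjust to your workflow):

```yaml
# Sketch: validate in a job that runs after deploy, giving the
# Resource Graph index time to catch up. Assumes the job is
# already authenticated against Azure.
validate:
  needs: deploy
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: sleep 60
    - run: |
        ./tools/az-graph-testing/check-query.sh \
          tools/az-graph-testing/queries/storage-https-only.kql \
          "storage accounts without HTTPS-only"
```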

It is also read-only. You can query resources, but you cannot use Resource Graph to fix anything. If a query fails, your pipeline fails and a human or a follow-up remediation job needs to act.

For catching problems before deployment rather than after, use az deployment what-if for Bicep or terraform plan with policy checks. Resource Graph is best for post-deploy validation and ongoing auditing, not pre-deploy gates.

Azure Policy is the right tool if you need enforcement at deploy time, where non-compliant resources are blocked from being created at all. Resource Graph queries complement Policy, they do not replace it. Use Policy for enforcement, use Resource Graph for visibility and testing.

Putting It Together

The core idea is simple. Infrastructure code should be tested like application code. KQL files checked into version control, reviewed in pull requests, and run automatically in CI are a better system than one-off portal queries or hoping the deployment output tells you everything you need to know.

Write the query once, run it every deployment, and catch the mistake before anyone else does.