Mastering a new tool can be challenging, so here I’m going to share my best practices for working with Azure Policy. These tips are based on my experience and are intended to complement my Azure Policy Starter Guide. Although most of them are about pretty simple things, it’s usually the basics that people overlook, only to suffer the consequences later. So, let’s start.
It might seem obvious that to make your code more reusable, you should use input parameters to control its behavior. Still, you can come across many examples, even with the built-in policies, when their authors didn't bother with parameters and simply hardcoded some property values in the policy code. In such cases, what could have been accomplished with a simple update of policy assignment parameters now requires updating your Azure Policy definition, probably removing the current policy assignments and the definition if it's incompatible with the new one, deploying the new policy definition into your target scope, and creating new assignments for it. Sounds like a ton of work, right? Now, imagine you need to update a dozen of those non-parametrized policies. On a second or third occasion, it's not fun at all, and you probably start thinking about making your policy definition more adjustable to changing business requirements.
Let me illustrate how parametrizing a single policy value can instantly make your custom Azure Policy twice as useful. If you have worked with Azure Policy before, you might know that policies can have different effects. The Audit effect is the most used and usually the simplest to implement. You might find many custom policy definitions where it's defined as follows:
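A minimal sketch of such a hardcoded policy rule, assuming a hypothetical check for a missing `costCenter` tag (the tag name is purely illustrative):

```json
{
  "if": {
    "field": "tags['costCenter']",
    "exists": "false"
  },
  "then": {
    "effect": "audit"
  }
}
```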
Basically, that's what the official documentation says.
However, if you go the extra mile and check a few Azure Policy definitions implementing the same effect, you might notice that policy developers tend to define the policy effect as an input parameter, like this:
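A minimal sketch of a parametrized effect, assuming a hypothetical rule that checks for a missing `costCenter` tag; the parameter name `effect`, its allowed values, and the default are illustrative choices rather than a fixed convention:

```json
{
  "parameters": {
    "effect": {
      "type": "String",
      "metadata": {
        "displayName": "Effect",
        "description": "Hypothetical example: audit or deny resources without the tag, or disable the policy."
      },
      "allowedValues": [
        "Audit",
        "Deny",
        "Disabled"
      ],
      "defaultValue": "Audit"
    }
  },
  "policyRule": {
    "if": {
      "field": "tags['costCenter']",
      "exists": "false"
    },
    "then": {
      "effect": "[parameters('effect')]"
    }
  }
}
```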
Now, a simple switch flip can change your custom policy behavior from simply auditing for non-compliant resources to preventing them from deploying. Same policy rule, same logic, but twice as helpful thanks to that simple change. When the business tells you that they are tired of chasing Azure resource owners or fixing non-compliant configurations and just want to block such deployments from happening, you can implement that control in the blink of an eye.
Keep it simple
Many ARM template functions can be used in policy rules to implement some really advanced scenarios. You can reference resource properties, evaluate arrays, define conditional logic, work with strings, and do many other things. Sample Azure Policies for Guest Configuration or Kubernetes clusters can change your view of how powerful and complex(!) that tool can be. However, the fact that you can write complex policy code doesn't mean you should do it all the time.
For instance, I occasionally find myself reading my own Azure Policy code from six or so months ago and having difficulty understanding how particular 'fancy' things work there. I often use my work notes and even read through my old blog posts to recall why I did it that way or how specific code works. Now, look at this from other people's perspective. Will it be easy for them to maintain and modify your custom policies in the future?
In my experience, 80% of using Azure Policy is all about simple stuff like auditing, modifying tags or creating guardrails for your cloud environments in Azure. So, why not keep its implementation simple too? If it's hard to do with Azure Policy, then it's probably not the right tool for your task. Look at other Azure services, many of which have integrations with the policy APIs.
The opposite extreme to not using input parameters at all is defining parameters for everything and trying to make a Swiss Army knife out of your policy. Of course, you can define default values for those parameters so that you don't have to provide them every time you create a policy assignment. Still, it makes your policy definition harder to read and understand. For example, if your policy logic is about Azure VMs, there is no need to supply that resource type as an input parameter. Or, if you want to audit Azure Hybrid Benefit usage across different eligible resource types, it might be better to define those controls in separate per-type policy definitions and combine their assignment with a policy initiative. That will make your solution cleaner and easier to understand for other people.
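The per-type approach could be sketched as a policy initiative (policy set definition) along these lines. The definition names, reference IDs, and the all-zero subscription ID are hypothetical placeholders, and each referenced definition is assumed to expose an `effect` parameter:

```json
{
  "properties": {
    "displayName": "Audit Azure Hybrid Benefit usage",
    "description": "Hypothetical example: groups per-resource-type audit policies into one initiative.",
    "policyType": "Custom",
    "policyDefinitions": [
      {
        "policyDefinitionReferenceId": "ahb-windows-vm",
        "policyDefinitionId": "/subscriptions/00000000-0000-0000-0000-000000000000/providers/Microsoft.Authorization/policyDefinitions/audit-ahb-windows-vm",
        "parameters": {
          "effect": { "value": "Audit" }
        }
      },
      {
        "policyDefinitionReferenceId": "ahb-sql-database",
        "policyDefinitionId": "/subscriptions/00000000-0000-0000-0000-000000000000/providers/Microsoft.Authorization/policyDefinitions/audit-ahb-sql-database",
        "parameters": {
          "effect": { "value": "Audit" }
        }
      }
    ]
  }
}
```

Each member policy stays small and readable on its own, while the initiative gives you a single object to assign and report on.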
Test for side effects
I might sound repetitive in my posts about Azure Policy, but do test the policies first before assigning them to a production scope. It’s especially important in the case of the policies that modify, deploy or deny resources. Because most Azure Policy assignments happen at the subscription or management group levels, the blast radius of such changes is usually huge. Also, cloud adoption at organizations tends to evolve from prototyping with no or few rules first to catching up with cloud governance later. So, by the time you start putting your policies in place, there might be hundreds or even thousands of existing resources that don’t comply with them.
Take, for example, the built-in policy that defines the list of allowed resource locations. It’s commonly mentioned in cloud governance practices that suggest limiting the deployment of your Azure services to specific regions. There might be multiple reasons for that, but my point here is not about cloud governance. At first glance, it seems quite easy to assign that policy and be done with that control. However, if you have existing resources that don’t comply with that policy, you will effectively block them from any further modification. That’s because, from the Azure Resource Manager API perspective, the same ‘write’ endpoints are used both for creating new resources and for updating(!) existing ones. Now, we have a problem here.
In the case of community-crafted Azure Policies, you should be even more watchful. The fact that somebody has already created a policy that seems to perfectly fit your needs doesn’t mean that you should go and assign it straight away. It’s no different from copy-pasting code from Stack Overflow without first understanding how it works. Many such policies might have errors in their logic or not be up to date with recent Azure API changes.
So, one more time: always test your policy in a lab environment first. Even if you’ve worked with it before and know it inside out, there are likely to be edge cases that you haven’t seen yet, as every new environment might be slightly different.
Do use custom non-compliance messages
In the past, if an Azure resource deployment was blocked by some Azure Policy, you would get a generic error message and a reference to the blocking policy at best. To call it frustrating would be an understatement. You would need to look into deployment and activity logs to figure out what that error was about and what exactly you needed to fix to pass the policy checks. Developers doing their deployments from a console or an automated pipeline usually needed to go to the engineers managing cloud infrastructure to troubleshoot such issues, as the console output was even less verbose.
That user experience improved significantly with the introduction of custom non-compliance messages for policy assignments and more verbose console output. Now, instead of a generic message that a resource deployment was blocked by some policy, you can provide a user with more useful information, such as instructions on what exactly was wrong and how to fix it, or a link to a knowledge base article with a detailed policy description and troubleshooting steps.
Custom non-compliance messages have proved to be extremely helpful, especially for Azure Policies with the Deny effect. Please don’t skip them. Moreover, I suggest putting extra effort into crafting your custom messages, so they provide other users with clear instructions on what to do to pass the policy validation. Every time another person comes to you with a question (or a support request) about a policy-blocked deployment, take it as a prompt to revisit the corresponding non-compliance message until it has enough details for self-service.
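As a rough illustration, a policy assignment can carry its message in the `nonComplianceMessages` property. The display name, definition ID, `effect` parameter, message wording, and knowledge base URL below are all hypothetical placeholders:

```json
{
  "properties": {
    "displayName": "Require a costCenter tag on resources",
    "description": "Hypothetical example assignment with a custom non-compliance message.",
    "policyDefinitionId": "/subscriptions/00000000-0000-0000-0000-000000000000/providers/Microsoft.Authorization/policyDefinitions/require-cost-center-tag",
    "parameters": {
      "effect": { "value": "Deny" }
    },
    "nonComplianceMessages": [
      {
        "message": "Deployment blocked: the resource has no 'costCenter' tag. Add the tag and redeploy. Details: https://example.com/kb/required-tags"
      }
    ]
  }
}
```

A message that names the missing setting and the fix, as in this sketch, is what lets a developer resolve the failure without opening a support request.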
Set up a CI/CD pipeline for Azure Policy
Although Azure Policy doesn’t introduce a separate programming language in a broad sense, policies are still defined in code. I’ve already authored a few articles on that, so feel free to check them out:
- How to deploy Azure Policy with Bicep
- How to deploy Azure Policies with ARM templates
- Using ARM templates to deploy Azure Policy initiatives
As with any code, all DevOps-proven techniques such as code versioning, automated testing, continuous integration and continuous deployment are also applicable to developing and maintaining your custom policy definitions and assignments. All the tooling for that is available and free to use. For instance, you can check my blog post on how to automate your policy deployment.
Investing your time in setting up your automated deployment process for Azure Policy will drastically improve your development speed and increase your confidence in its quality. Even if you have no intention to develop your custom policies and plan to use the built-in ones only, it’s still reasonable to embed the creation of policy assignments in your CI/CD pipeline and use staging environments, quality checks, and gated approvals before deploying in production.
Of course, these tips are just a small fraction of the best practices to follow when working with Azure Policy, and I’m planning to update this list with other relevant highlights from my experience. If you think that I missed something important here, please feel free to add your suggestion in the comment box below 👇