Andrew Matveychuk

Azure App Service Cost optimization strategies that you won’t get from Azure Advisor

Andrew Matveychuk — Mon, 10 Feb 2025 12:31:23 GMT

When optimizing cloud costs for large enterprise environments, its common to focus on compute-intensive cloud resources, such as virtual machines, clusters, and container-based workloads, as they are usually the primary cost driver and top contributor to monthly cloud invoices. Luckily for us, there are already plenty of well-known approaches, straight guidelines, and easy-to-use tools from Microsoft and other third-party vendors that can save money on those workloads.

On the contrary, optimizing costs for PaaS services is an entirely different story, as there are a lot of service-specific nuances you should be aware of that can impact your resource costs. So, today, lets explore how we can optimize our cloud spending on Azure App Service, which is one of the most used PaaS services in the Azure cloud.

App Service plans consolidation

The first service-specific cost nuance for Azure App Service is that App service resources dont incur costs independently. In other words, you dont pay for your App Service resources. You pay for App Service plans that host them. When an App Service represents your application, an App Service plan provides you with actual infrastructure resources to run them.

You can think of App Service plans as an abstraction for web server farms managed by Microsoft. With that abstraction, you dont need to manage virtual machines, configure web servers, do the networking, scaling, patching, etc. You just need to pay for the compute and storage resources you consume. That is a very simplified explanation of App Service plans, and they are much more under the hood, but those details are not so important in the context of our topic

Understanding that concept is essential, as separating the application part from the infrastructure allows you to host many App Services on a single App Service plan. You can be surprised that many engineers dont know about that. Plus, the Azure portal, by default, suggests you provision a new App Service plan each time when you create a new App Service, and thats what most people do. The result is that you have an environment with a handful of underutilized App Service plans costing you lots of money, each hosting a single or very few apps.

In many cases, you can reduce your spending on App Service applications tenfold by consolidating them on a single App Service plan per environment or application group. Apart from that, you can discover that hosting a dozen apps on a single more expensive App Service Premium tier is cheaper than having a dozen Basic or Standard service plan instances.

However, when consolidating multiple apps on a single App Service plan, you must understand that high load or errors in one application can impact the performance of other applications sharing the same App Service plan. So, its important to implement appropriate monitoring and autoscaling to mitigate the impact of such events.

Rightsizing and autoscaling

Rightsizing and autoscaling are probably the second most common topic for reducing your cloud spent on Azure App Services. As with Azure VM, overprovisioned resources for App Service plans are what you can see when you check their actual utilization. The usual argument is that we need some space capacity just in case there is a surge in requests. For some reason, many think resource scaling is only about adding CPU and memory to the existing resources. The horizontal scaling, when you load balance your traffic between multiple nodes and add or remove additional nodes when needed, is still overlooked. If you dont use it, you simply dont leverage the full flexibility of the cloud when you can pay only for what you consume.

Rightsizing and autoscaling work hand in hand. You can set up your scale-out rules for App Service and pay for additional resources only when you need them, which might significantly impact your App Service costs. Adding additional instances happens pretty fast, as Azure data centers usually have some pools of ready-to-use instances for that purpose. The only exclusion from that is the Isolated tier (aks App Service Environments), where scaling might take longer. So, if using them, please check if autoscaling speed is appropriate for your needs. If their scaling is too slow for you, you can consider moving to the Premium tier, as it now has many features that were available only on App Service Environments previously, and it might fully fit your requirements.

Deployment slots vs. separate environments

Another place to look into for optimizing your App Service costs is reconsidering your application development and testing approach. It is common to create separate environments for application development and running the same application in production. Despite the benefits of that approach, it comes with additional costs, as you need to have twice as many cloud resources for it. Regarding App Service, it means that you usually have two separate App Service plans to pay for one for production and one for development purposes.

What you can (and probably should) do is leverage App Service deployment slots, which are available starting from the Standard tier. Basically, they allow you to create and manage separate hosting containers for your development, testing and production versions of your application. Those containers are hosted on the same App Service plan, effectively reducing the need for additional plans. What is more, with App Service deployment slots, its much easier to push new application versions to production by performing slot swaps. Plus, you can implement A/B testing and split incoming traffic between production and staging slots to live-test your changes on a smaller scope.

As with consolidating multiple applications, App Service deployment slots share resources of the same App Service plan. So, please keep that in mind if you need to perform some load testing for your application. It might make sense to deploy a separate environment for that purpose and decommission it when you are done with your tests.

Network isolation and pricing tiers

This one is somewhat connected to rightsizing, as network isolation and private networking were previously achievable only on the Isolated tier when you deployed your app service environments into your Azure virtual networks. Those days are long gone; you can isolate your App Service environment on the network level even when running on the Basic tier. Azure App Service virtual network integration and Azure Private Link allow you to lock down network access to your App Service instances and services they depend on for work. The virtual network integration allows your App Services to communicate with other services deployed in your Azure Virtual network without using their public endpoints. Private endpoints provide private connectivity to your App Services from your private virtual network, so the inbound traffic to them never leaves your network.

For example, you can completely turn off the public endpoint on App Service and make it available only via a private endpoint, so its accessible from your private network only. If you need to securely publish your application for public access without exposing your App Service public endpoint, you can do it with Azure Front Door Premium, which can connect to it via private endpoints. Alternatively, you can configure access restrictions for your App Service public endpoint so its only accessible to your Azure Front Door Standard instance and not for direct access.

With all that being said, you might reconsider using more expensive Isolated and Premium tiers in favor of more affordable Standard and Basic ones, knowing that you can achieve comparable network isolation and protection for your App Services. Plus, implementing traffic offloading with such services as Azure Front Door might allow you to scale down your App Service plans, as few requests will be hitting them.

Reservations and Savings Plans for App Service

If the previous recommendations required some changes in your App Service deployments, this one is the most effortless. Suppose you know that you are likely to run your App Service instances as they are for a year or more. In that case, you can look into purchasing Azure Reservations if its a stable workload in a specific Azure region or Azure Savings plans if you need more flexibility in your service and region choice. Your savings from those commitments will vary depending on your App Service plan sizes and the length of the commitments. For maximum savings, its worth trying to refactor your App Service deployment according to the previously mentioned tactics first, then consider using reservation or savings plans for your already optimized environment.

The downside of that optimization is that it applies only to Premium v3 and Isolated v2 App Service plan tiers. However, as mentioned at the beginning of this article, its often possible to hit a cost reduction combo if you consolidate your multiple App Services on a few Premium v3 App Service plans and apply reservation or savings plans to them.

How to optimize Azure App Service costs with Turbo360

After going through all those optimization tactics for Azure App Service mentioned above, you might be thinking about implementing them in your practical scenarios. Depending on your specific use case of that cloud service and the scale of your infrastructure, you usually have the following options:

With small-scale App Service deployments, you can spend a few hours examining your configuration and optimizing it according to your constraints.
In large-scale scenarios, such as enterprise deployments with hundreds or thousands of App Service instances or managed service providers (MSP) managing hundreds of client environments, performing such cost optimizations manually is usually impractical and unfeasible.

For the latter option, it might make sense to look for third-party solutions like Turbo360 or similar that allow you to significantly reduce the time spent analyzing your cost optimization options and implementing them at scale.

For example, Turbo360s Cost Analyzer can enhance your Azure App Service cost optimization strategy. It provides insights and optimization recommendations that native cloud tools often lack. The Cost Analyzer features go beyond basic cost metrics to provide granular insights into resource utilization, including identifying the exact needs of your Azure App Service. With rightsizing recommendations that include upgrade, downgrade, idle, and no change options, you can allocate your App Service resources more efficiently, avoiding their overprovisioning and underutilization.

Plus, Turbo360 allows you to create optimization schedules to scale resources down during non-peak hours and up during high load by automating scaling based on real-time resource demand and business requirements.

According to the insights from their existing client base, implementing such scaling schedules can reduce App Service costs by up to 65% for non-production environments.

Moreover, the Cost Analyzers monitoring feature allows you to set a predefined budget so that you do not exceed your spending and incur any surprise costs.

Frankly speaking, you can optimize your App Service (and any other cloud service) costs without using any third-party tools. However, you should understand that it might take a lot of time and effort to implement and monitor for new optimization opportunities on your own, using cloud-native tooling only. Thats why you might want to evaluate ready-to-use solutions like Turbo360 (it has a 15-day free trial) for Azure App Service optimization so that you can free up your time for more impactful and profitable work.

In conclusion

To sum up, optimizing costs for your App Service deployments is a creative task. It might even seem counter-effective when upgrading to higher service tiers and adding more cloud services to reduce overall operational expenses. Apart from that, you shouldnt just blindly sacrifice your reliability, security and observability requirements to save a handful of coins. Azure cost optimization is not something you do in full isolation from other aspects of managing your applications and services. Moreover, its not something you do once and for all. Having good FinOps processes and tools to monitor and assess changes in your cloud expenses is equally important to one-time optimization, as changes are almost inevitable in the cloud, which can drive your cloud bill up as well as down.

]]>

Migrating from Ghost to Hashnode

Andrew Matveychuk — Fri, 20 Dec 2024 09:01:00 GMT

Announcement

Ive migrated my blog from Ghost(Pro) to Hashnode. All links should be working, but please let me know if you encounter something broken or missing.

Reasons

I used Ghost to run my blog for many years, self-hosting it first and becoming a paying customer of Ghost(Pro) later. I would say that I was mostly satisfied with it as a blogging platform. Nevertheless, I decided to try something new last year, and Hashnode was on my radar for a while.

Ghost(Pro) has its niche in various modern blogging platforms. Its feature-reach, reliable, SEO-optimized, and fast. However, the overall direction of its evolution and its pricing model push you toward monetizing your blog as your audience grows. Besides, its Stater pricing tier doesnt allow you to use custom themes, significantly reducing your ability to customize your blog. You need to go with the Creator tier for that, which increases your annual spending on blog hosting three-fold. Those aspects made me question if there are better alternatives for the Ghost(Pro) Stater tier.

I think Hashnode, being relatively younger, cannot be compared to Ghost head-to-head. Also, it has a different product placement, positioning itself as a community blogging solution, which is somewhat similar to Medium. At the same time, you can map a custom domain to your Hashnode blog or self-host using the Hashnode Headless CMS, which allows you not to be locked into using Hashnode as a managed platform only. Basically, you can have almost the same platform functionality on Hashnode for free as you could have it on the Ghost(Pro) Stater tier if you dont need to use a paywall for your subscribers.

Apart from that, Hashnode allows you to import and export (backup) your articles to a GitHub repository in the Markdown format. Even though I have the source documents for my articles and the related images stored in my OneDrive folder and backed up on the file level, manually restoring them would be a tedious task, considering the growing number of blog posts. Plus, I tend to edit many blog posts on the website after publishing them, like adding updates and fixing broken links, which makes my source Word documents somewhat outdated. Having all my blog posts up-to-date, versioned, in the Markdown format, and automatically syncing to my GitHub repository as a backup location makes it much easier to restore or transfer to another blogging platform if need be. Compared to Ghost, where your blog posts are stored in a database and can only be exported in a cumbersome JSON format on-demand, that GitHub integration was the most convincing reason to switch to Hashnode. Because, you know, having backups is what distinguishes good engineers from not-so-good ones.

Migration process

Unfortunately, there is very little information on the Internet about migrating your content from Ghost to Hashnode. Some people migrated their content programmatically, while others leveraged the Hahnode feature to import content from RSS feeds, pointing it to their current Ghost-based blogs.

I wanted to spend as little time as possible on the migration and chose the migration via the RSS import. Unfortunately, when I was ready to start the import process, that feature was turned off for unknown reasons. It was time to speak to Hashnode Support, and I was pleasantly surprised by their responsiveness and the workaround they provided.

It turned out that the Hashnode team (kudos to Favourite Jome) had already created a migration app to migrate your content from Ghost to Hashnode. You only need to provide it with the JSON export of your Ghost blog and the API token for your new Hashnode blog, and it will import most of your existing blog content, saving most of the source formatting and media, including images, GitHub gists, and other embeds. Your posts will be imported as drafts in Hashnode, so you can (and should) review them before republishing. Depending on the formatting of your original posts and the different embeds you might use, you might need to spend some time fixing some import errors, formatting, and checking for missing content.

In my case, it took me a couple of days to properly format my drafts on Hashnode, check that all images were in place and displayed correctly, and fix a few dozen broken links in some of my old posts.

Apart from importing your old posts from Ghost, you might also need to import your subscribers. The subscriber export in Ghost and their import in Hashnodev use the CSV file format for data, so migrating them wasnt a big deal. It took me just a few PowerShell commands to clean up unnecessary fields before the import, and I was all set to send out mail notifications to my subscribers from my Hashnode blog.

Currently, managed Hashnode hosting doesnt support custom themes, and your customization options are limited to three predefined layouts, custom logos, and a branding color. However, it wasnt an issue for me, as the resulting blog layout is very similar to my previous one in Ghost.

A few final touches were related to remapping my custom domain, transferring my custom page redirects, and repointing my user analytics projects to the new blog. Those hardly require mentioning, as their configuration was pretty straightforward and well-documented. I particularly liked that on Hashnode, you dont need to mess up with code injections for Google Analytics or other similar solutions to make them work, and you only need to provide your unique tracking tags or IDs for configuration.

Back to blogging

Now, Im finally set up to continue my blogging on Hashnode, and I shall see what my impression of it is after using that platform for a while. Ive already stopped drafting my articles in Microsoft Word and completely switched to the Hashnode editor. The ability to seamlessly switch between the WYSIWYG and raw Markdown editor modes is fantastic, and Im using it to learn more about Markdown formatting. Plus, having my articles automatically backed up to a GitHub repo makes me less worried about restoring them if needed.

Although the current blog can run on Hashnode with the custom domain entirely for free, I happily paid for the Hashnode Pro to support the team behind the great product. I hope that the Hashnode team will continue releasing new features and extending the functionality of the existing ones.

P.S. Please treat all of the above as my migration experience, which Im happy with, and not a Hashnode advertisement in any way. If you have questions about my migration to Hashnode, please feel free to ask them in the comments 👇

]]>

Ghost on Azure: Project Update (Ghost 5, MySQL Flexible Server, Private Link, RBAC for Key Vault, App Service access restrictions to Front Door)

Andrew Matveychuk — Tue, 12 Nov 2024 09:40:51 GMT

It has been a while since I last updated my Ghost on Azure project, and many changes have been introduced to Azure services during that time. I decided to use the break in my work to update the project deployment templates to include those changes and to use it as a learning opportunity to catch up on new cloud service features.

Ghost on Azure is a one-click Ghost deployment on Azure Web App for Containers. Its written as a Bicep template, spinning a custom Ghost Docker container on Azure App Service, which uses Azure Database for MySQL as a backend. It can be a good starting point for anyone wanting to self-host the Ghost platform on the Microsoft Azure cloud. Plus, it leverages a lot of Azure-native services and their features. Thus, it can be considered a showcase of their practical usage.

The project started as a simple multi-container deployment and later transformed into a comprehensive solution using PaaS services such as Azure Key Vault, Front Door, and Web Application Firewall. Now, we are focusing on further security improvements and migrating from deprecated Azure services.

New Ghost 5 container image

I use a custom Ghost Docker image in this project, which is based on the official Ghost Alpine image and extended to support Azure Monitor Application Insights. Ive updated it to Ghost 5 and removed explicit database connectivity check with wait-for-it, as it has not been very reliable in testing MySQL server readiness to accept connections. As a new instance of Azure Database for MySQL takes some time to be up and running, Azure App Service hosting the container restarts it a few times until the database backend is ready and the Ghost app can connect to the server to create its database.

Note. The initial deployment might take a few minutes before the Ghost container successfully starts and is ready to serve the content.

MySQL Flexible Server

The initial project configuration used a multi-container deployment, placing a MySQL container alongside the application container running Ghost using Docker Compose. Later, I switched to a single container deployment and used Azure Database for MySQL for the database, as the support for multi-container deployment on App Service degraded and became unstable. Recently, Microsoft deprecated Azure Database for MySQL Single Server, and I needed to migrate my project to MySQL Flexible Server.

As Ghost 5 supports MySQL version 8 as a primary database option, I also switched from previous version 5, which was used with the MySQL Single Server, to that version with the Flexible Server deployment. However, the most notable change in the context of this project is probably that Azure Database for MySQL - Flexible Server now enforces using encryption connections only by default.

I think using encryption in transit, like connecting to a database, is a good thing, even if the data transfer happens in the internal network. It fully reflects the core principles of Zero Trust Architecture, as a malicious actor might also operate inside your network. The tricky part was configuring the encrypted connection options in the MySQL client used by Ghost. Putting the public certificate used by MySQL - Flexible Server into the container image or onto a file share wasnt a good option and would introduce more unwanted dependencies. So, I ended up putting the content of that public certificate into an environment variable using multi-line string support in Bicep:

https://gist.github.com/andrewmatveychuk/7113bfa8d6f89fb558ef51baeadaccaa

Azure Private Link

Another notable change is that I decided to leverage Azure Private Link to further restrict access to the database server at the network level and completely block access to it over the public network.

Configuring Azure Private Link and private endpoints for Azure services might seem like a daunting task at first, but when you grasp the core concepts of how it works under the hood, it should become your no-brainer option to reduce the attack surface of your cloud infrastructure. Yes, it requires a good knowledge of core network concepts, understanding how DNS resolution works, and configuring corresponding private DNS zones using Azure Private DNS zones or your other existing DNS services. Plus, it adds complexity to your deployments and requires extra work to automate its configuration using Bicep or Terraform templates. However, from the security standpoint, I would name locking down network access to your cloud resources the number one recommendation after setting up proper authorization and encryption controls.

Here is how the project network topology looked like before using Azure Private Link:

Communication between the app service and dependencies happened over their public endpoints. The backend services enforced encrypted connections. Plus, you could put some restrictions using firewalls on their public endpoints to limit access from other Azure services only. Still, the information was traversing over the Internet, and any misconfiguration of service firewall rules could expose them to external attacks.

After configuring private endpoints for Azure Database for MySQL, Key Vault and Storage, and configuring virtual network integration for App Service, all traffic to backend services is isolated in a private Azure Virtual Network:

The connectivity to the public endpoints of backend services is completely locked down, making it much easier to control and enforce at the enterprise scale using Azure Policy. Now, you dont have to validate and control myriads of firewall rules on various Azure resources.

As you can now integrate App Service with a virtual network even on the Basic tier, Azure Private Link and private networking in Azure became even more accessible. When your App Service is integrated with a virtual network, it can connect to other services deployed in that network, like private endpoints, without going over the public network. Plus, you can apply all other security measures available in Azure Virtual Network to further segment and restrict network access using subnets, Network Security Groups (NSG), network policies for private endpoints, etc.

Azure Private DNS zones for hosting private DNS zones for Azure Private Link service also greatly simplifies its usage, as it automates the creation of corresponding DNS records for your private endpoints.

Azure Key Vault role-based access (RBAC)

The next improvement is related to configuring access to Azure Key Vault using Azure roles instead of legacy Key Vault access policies. In my opinion, it greatly simplifies access management at scale, as you no longer need to control access at two different levels, and you can configure access to both management and data plane using Azure built-in and custom roles.

From the Bicep code perspective, we now create a separate Azure role assignment resource:

https://gist.github.com/andrewmatveychuk/96bdaab91d7793b684befa4d8325321e

Although you can configure Microsoft Entra authentication for Azure Database for MySQL and leverage App Service managed identity for authorization, it seems that authentication using managed identities is not supported by the MySQL client library used by Ghost. So, I still rely on MySQL authentication and have to use Key Vault to securely store the database password used to connect to the Ghost database.

Also, when I tried to migrate the Storage account used to host the file share as persistent storage for the Ghost container to the role-based access model, I faced a similar issue, as that scenario is not supported by custom-mounted storage.

Azure Front Door Standard and App Service access restrictions

The initial project version used Azure CDN for traffic offloading. Later, I added an option to deploy the solution with Azure Front Door, which used a managed Web Application Firewall (WAF) policy for inbound traffic inspection and site protection. Those Azure services, now renamed as classic ones, are scheduled for retirement in 2027. Microsoft released a comprehensive guide on migrating from the Azure Front Door (classic) to the Standard/Premium tier. I would consider that service update a breaking change, as service features dont map one-to-one between the classic and new service offerings. Plus, the pricing model of the new offerings is quite different, which requires careful consideration and cost estimation for your specific use case.

Having said that, Ive removed the option to deploy the solution with deprecated Azure CDN (Microsoft CDN (classic)) and updated the configuration to deploy Azure Front Door Standard as a more reasonably priced service for such a project. Unfortunately, the managed WAF policies are now supported only by the Premium tier, so Azure Front Door Standard now works more like a CDN, but you can still enhance it with custom WAF policies.

Apart from that, the Azure App service now supports more targeted access restrictions. In addition to restricting access to your Azure App service using the Azure Front Door service tag, you can now narrow it to a specific Front Door instance with HTTP headers. The Bicep template part for configuring such restrictions might look like the following:

https://gist.github.com/andrewmatveychuk/8f95d0064ab33434c0f717cb0846b225

The master project template still contains conditional logic to deploy the solution with an Azure App Service as a public endpoint or use an additional Front Door profile to serve the incoming traffic to your Ghost on Azure deployment.

Other minor tweaks

In addition to updating Azure Resource Manager API versions for resources, removing discontinued pricing tiers for some services, and updating categories for Azure Monitor Logs diagnostic settings, Ive also reduced the number of output values passed between the Bicep modules in the master deployment template and used references to existing resources whenever possible, as it provides more flexibility in referencing their properties in other resources. For example, instead of passing (aka exposing) Storage account access keys through output values, you can just reference them using the corresponding function on the referenced resource:

https://gist.github.com/andrewmatveychuk/ba3046eaf9203f40a132e3acf20ca366

For the complete deployment configuration details, check the source code in my GitHub repo.

To be continued

As you might have noticed, a few design decisions in that project originated from overcoming Azure service limitations. I think hosting a containerized app on Azure App Service is still quite limited, not production-wise, as it probably creates more challenges than solves them. Im considering trying out Azure Container Apps, which will be explored.

Another planned modification is configuring the Azure Private link for Azure Monitor components used in the solution. Its different from private links to other Azure services, so I plan to explore its specifics in a separate post using my Ghost on Azure project as a playground for that implementation.

Have you used Azure Container Apps or Azure Monitor Private Link Scopes in your projects? What was your experience with them? Please share your thoughts in the comments 👇

]]>

How to authenticate to Azure with managed identities from non-Azure servers

Andrew Matveychuk — Tue, 02 Jul 2024 11:00:17 GMT

In the third post in my series about secure authentication to Azure services, we will explore how to access Azure resources from servers hosted on-premises or in other clouds without storing any credentials, like client secrets or certificates, on them.

For the previous posts in this series, please check the following articles:

A secure password is the one you dont know

As I mentioned in my first post, I encourage people to use managed identities to authenticate to Azure services whenever possible. Using managed identities greatly simplifies the management of communication credentials in your application. Plus, it helps you mitigate many security risks related to storing and using the apps secrets, passwords, keys, etc. You no longer need to worry about leaked passwords, rotating your credentials, or ensuring that different environments use different credentials.

When hosting your applications in Azure, managed identities should be your number one choice for most authentication scenarios between application components. However, what do you do in more common hybrid or multi-cloud setups when your application infrastructure is partially outside Azure?

In that case, you can extend Microsoft Entra IDs identity functionality beyond Azure using Azure Arc for free.

Azure Arc provides many more useful features other than utilizing managed identities, but they are not the focus of this article.

Technically, after you install the Azure Connected Machine agent on your non-Azure server, it will link the server with the connected machines Azure identity and create a local identity endpoint that can be used to request an access token for your application.

How to use managed identity on Azure Arc-enabled servers

As I explained in my post about using certificate credentials to authenticate to Azure services, you can configure your application to use specific identity types in a few ways.

The first and most preferable option is to rely on the built-in fallback mechanism of the DefaultAzureCredential class. If you dont set the environment variables, it will try to authenticate with a managed identity as the third option in the sequence. In practice, that means you dont need to change anything in your code, and the same code that worked with authentication using client secret or client certificate will continue to work provided that you configured the required permissions for the managed identity to access the target Azure resource 👇:

https://gist.github.com/andrewmatveychuk/45e9c50b0be362cb90a75de0c3262868

Note. The account used to run your application must be a member of a specific group on an Azure Arc-enabled server to access the identity endpoint and get access tokens. Otherwise, you might get an error like the following:
Authentication Failed. ManagedIdentityCredential authentication failed: Access to the path 'C:\ProgramData\AzureConnectedMachineAgent\Tokens\292daa9f-1794-43f4-a246-6f6cc6ca4e03.key' is denied.
Also, if you run your application interactively, you need to do it with elevated permissions as administrator.

The second option is to use application settings like configuration in .NET to explicitly tell your application to use a managed identity to connect to an Azure service. For example, the code below is the same that I used to connect from the sample .NET Worker service using certificate credentials configured in an appsettings.json file 👇:

https://gist.github.com/andrewmatveychuk/058213eac0d8ac301b8c29fde8404c19

Now, you can tell the app to specifically use the managed identity without changing your application code:

https://gist.github.com/andrewmatveychuk/5949b3497182f9a5aa466998b88e8cf7

As you can see, compared to using certificate-based authentication to Azure services, you no longer need to worry about configuring, securing and rotating connection credentials with managed identities. All of it is done for you. You can focus more on your application functionality rather than on non-customer-facing tasks. Moreover, you can easily switch between different authentication methods by following the recommended approaches to using Azure Identity libraries in your application.

You can also check the sample code in my GitHub repository: azure.authentication-samples.

Have you tried to use managed identities on Azure Arc-enabled servers? What was your experience with it? Share your thoughts in the comments below 👇

]]>

How to use certificate credentials to authenticate to Azure services

Andrew Matveychuk — Tue, 18 Jun 2024 10:00:10 GMT

In my previous blog post, I showed how you could authenticate to Azure services other than using a username and password. Now, lets explore some technical details for certificate-based authentication and how to implement it in your applications.

If you are wondering why you should prefer using certificates over client secrets, please check case #2 in the previous blog post in this series.

Overall solution design

From the overall design perspective, the required infrastructure configuration is mostly identical to what you might have when using app registration with client secrets to authenticate:

You register your application in Azure, assign permissions to that app, import the Azure Identity SDK in your code, and use the DefaultAzureCredential object or more specific credential types to authenticate to a target Azure resource.

Note. Here, I will use the Azure Identity client library for .NET to illustrate the concept. Using Azure Identity libraries for other programming languages should not be very different.

Unfortunately, most official tutorials focus on using client secrets specified in environment variables, and there are very few samples for using client certificates. So, lets see how we can use them.

Generating and installing certificates

Before adding a certificate to your app registration, first, you must get one. You can either create a self-signed certificate or get a valid certificate from a Certificate Authority. Remember that only the public key of that certificate is saved with your app registration. The certificate private key should be accessible to your application where you host it.

If you create a self-signed certificate, its private key will already be stored in your computers Windows Certificate store. If you obtained your certificate from a CA or need to use it on another machine, you need to retrieve the certificate file containing its private key and import (or save) it on your target machine.

A .cer file contains only a public key, while a .pfx file can contain both public and private keys. For more certificate file formats, check the documentation for X.509 certificates.

In a production environment, certificates with their private keys should be imported into a certificate store as non-exportable if its a Windows host or saved into a designated folder with restricted access if its a Linux machine.

Ideally, you should configure your production certificates to be TPM-protected or use another verified hardware-based solution that protects from exporting private keys. However, its a more advanced topic well beyond this demo scope.

For the sake of this demo, I will use a self-signed certificate on a Windows machine:

After you generate or import a certificate to your machine, you might want to configure permissions so the user account executing your application can access it in the certificate store. In my case, I just allowed all users on my machine to read the certificates private keys, so I dont need to run it with elevated permissions:

In a production setup, access to your certificates should be narrowed only to a specific user or local service identity used to run your application.

The next question is how to use that certificate in your application for authentication.

Using certificates in your code to authenticate to Azure app registrations

First, lets see how we can retrieve our certificate from the certificate store and use it with the specialized ClientCertificateCredential class to understand the low-level work with certificates. Later, I will show how to abstract those details and use the DefaultAzureCredential class as a suggested approach.

https://gist.github.com/andrewmatveychuk/fb4d21ae6bf5753f4e9dd43329d43226

So, what does that code do? First, it is trying to access the Local Machine certificate store, which we used to store our self-generated certificate. Next, its looking for the certificate in that store using its unique thumbprint. The Find method of the certificate store class (X509Store) returns a certificate collection rather than a single certificate object, as you might search by certificate name, and the store might contain a few certificates with the same name. If the collection is not empty, we just use the first collection item as our certificate (in our example, there will always be no more than one item, as we search by certificate thumbprint, which is unique for each certificate). Then, we use that certificate to initiate a ClientCertificateCredential object and read a Key Vault secret using the Azure Key Vault secret client library for .NET.

Here, I use Key Vault as a target service to access in Azure to demonstrate the concept and to have some values to output in the demo console app. In real life, Key Vaults might be used to store secrets, keys and other certificates required for an application to access other services. On how to cache Azure Key Vault secrets, take a look at the following article: Cache certain responses from Key Vault.

If you have configured everything correctly, you should see your Key Vault secret value in the console.

The drawback of that approach is that you explicitly rely on the certificate-based authentication method only in your application. If you later need to switch to another authentication method, you will need to modify your code, rebuild your app, and redeploy it. Plus, that approach wont work in Linux environments, as there is no single designated certificate store. A better way would be to extract authentication configuration options into config files or environment variables so you can later change your authentication method without touching the application code.

The DefaultAzureCredential class has a built-in fallback mechanism that attempts to use multiple authentication methods in a specific order. Lets see how we can simplify the code and make it more portable.

https://gist.github.com/andrewmatveychuk/45e9c50b0be362cb90a75de0c3262868

As you can see, you no longer need to do any low-level work fetching and using certificates in your code. You delegate that part to the DefaultAzureCredential class, which, first in sequence, tries to use the authentication method specified via environment variables.

In our case, we need to set the following environment variables to use certificate-based authentication:

AZURE_CLIENT_ID, which is an application ID of your application registration,
AZURE_TENANT_ID, which is your Azure tenant ID,
AZURE_CLIENT_CERTIFICATE_PATH, which is your local path to a certificate file containing its private key,
AZURE_CLIENT_CERTIFICATE_PASSWORD (optional), which is required to read the password-protected certificate file,
KEY_VAULT_NAME (code-specific) represents your Azure Key Vault name, so you dont have it hardcoded in your application.

Unfortunately, you can only reference certificates stored locally in files when using that approach. On the one hand, it makes your application more portable, as you can read a certificate from a file both on Windows and Linux. On the other hand, it makes protecting your certificates private key harder, as you need to handle the password protecting the key, similar to handling a client secret. Plus, you should restrict access to the certificate file.

What can be done if you dont want to lock to a specific credential type in your application and prefer storing your client certificate in a Windows Certificate Store? In that scenario, you can use Microsoft.Extensions.Azure library to create different credential types from key-value pairs defined in appsettings.json and other configuration files. The modified code might look like the following:

https://gist.github.com/andrewmatveychuk/058213eac0d8ac301b8c29fde8404c19

We register our target Azure service with the AddAzureClients method, and it will be automatically configured with a configured instance of a credential type. All credential configuration details will be specified in the following appsettings.json file:

https://gist.github.com/andrewmatveychuk/772bf1699c2ef695c53e1bff791aabfb

In the sample above, we authenticate using the ClientCertificateCredential type and retrieve the certificate from the local certificate store, as in the initial sample code.

Unfortunately, there seems to be no way to explicitly configure the retrieval of a certificate from a file using that approach. So, if you need to run your application on a Linux host, you can reduct your appsettings.json file to use the DefaultAzureCredential class provided that you define the environment variables for certificate-based authentication, as in the previous example:

https://gist.github.com/andrewmatveychuk/005feff80a475ad3dc532d1e3b41ecf2

The last approach is more flexible than the option with environment variables only, as you can choose where to store your certificate depending on your host environment.

Whats next

If you followed till the end of this post, you might conclude that using certificates to authenticate to Azure services is not easy. There are a lot of nuances about managing and securing access to certificates. Also, configuring your application to use certificate-based authorization is not as straightforward as using the DefaultAzureCredential class and putting client secrets in environment variables. However, I hope that provided samples will help you better understand how to implement certificate-based authentication in your applications.

You can also check the sample code for certificate-based authentication in my GitHub repository: azure.authentication-samples.

In the next post in this series, I will show how you can simplify such an authentication process in your apps using managed identities. So, stay tuned and hit the subscribe button! 👇

]]>

How to securely authenticate your applications to Azure services

Andrew Matveychuk — Tue, 28 May 2024 13:00:18 GMT

There are multiple ways to authenticate your applications when accessing Azure services, and to be honest, authentication on its own is a vast and complex area. However, depending on your context, requirements, and application location (Azure-hosted or hosted outside of Azure), your authentication options will usually be limited to only a subset of that variety. So, instead of reviewing and understanding all possible authentication approaches in detail, it might be more practical to explain possible authentication options using a few case studies.

My primary intent with this article is to show how to eliminate using plain text credentials in application configuration options and connection strings. In the follow-up posts, I will share some code samples and configuration details to help you get started.

Disclaimer. The topic of authentication and authorization in Azure is much more complex than the cases described in this article. The cases I present here dont cover all possible authentication scenarios and should be treated as examples to help you understand the topic.

Case 1. Use managed identity whenever possible

Imagine you have an Azure App Service, Azure Function, or Azure VM that needs to connect to some database like Azure SQL Database. The application somehow needs to authenticate to that data source. What you can usually see:

People create an SQL user because its easy, or they already did it when developing locally on their machine, or they followed tutorials showcasing how to connect to a database using username and password, or for any other reason.
In the best case, that user has limited access to the target database only with required permissions. In the worst case, which is more common, that user has full access to the database (or all databases) because, you know, developers dont want to spend their time troubleshooting permission issues.
The password complexity of that user is mostly far from what is considered to be a strong password. If you think of something like Secret123, you are close to guessing it.
That username and password are used in plain text in connection strings, configuration files, or environment variables, exposing them to anybody who can access the app environment.
Those credentials are shared with other applications that need access to the same database, making updating them painful and error-prone. So, forget about password rotation.
To make matters worse, the same username and password are used in both development and production environments.

It looks like a terrifying nightmare for a security-aware person, and it is a harsh reality of what can be encountered even in top business-critical systems.

Besides the risk of such a user (aka a service account) being easily compromised, timely detection of its malicious use is usually not the case. In my experience, if people go so far as to make their application authentication insecure, such organizations also have little to no capabilities for detecting compromised accounts. Otherwise, they would pay more attention to that risk.

Luckily for us, the application authentication to the database in the described case can be improved with little effort. If you havent heard of the concept of managed identities for Azure resources, I strongly recommend familiarizing yourself with it. Their use can greatly improve your application security and boost your development experience.

A modernized version of our sample application can look like the following:

Simply put, your application can utilize a managed identity provided by the underlying Azure resource on which its hosted. If you have ever configured a connection from an IIS-hosted application to an on-premises SQL Server, it is similar to using machine accounts to authenticate to target servers.

Back then, many administrators expressed their concerns that granting access via a machine account could expose that identity to any application running on that machine. However, cloud resources, especially serverless ones, are much more granular regarding the number of hosted applications. Plus, using a managed identity is much more secure than putting usernames and passwords in plain text in files or environment variables where they can be extracted from and then used elsewhere to connect to target resources.

The list of Azure services supporting managed identity and role-based control is constantly increasing. So, its always worth checking if your applications Azure resources can leverage that authentication approach.

By upgrading your application authentication from password-based to using managed identities, you:

No longer have a password that can be compromised or shared inappropriately.
Dont need to think about configuring credentials in your application.
Can forget about credential rotation, as its done for you.
Dont have to share the same credentials with other applications, as its easier to configure access for another application using its managed identity,
Dont need to worry about updating your connection credentials when pushing new application versions between environments.

Of course, using managed identities doesnt eliminate all security risks. You still need to grant permissions to them following the principle of least privilege to avoid excessive access. Plus, you should always consider other security risks as well. For example, if your application is vulnerable to SQL injection attacks, those attacks can be executed regardless of whether you use managed identities or not to connect to your database.

Case 2. Prefer certificate-based authentication over password-based

You might say that everything I described in the previous case looks good if all your application parts run in Azure. However, what to do in hybrid scenarios when connecting to some Azure service from on-premises? Lets look at what we can do in that case.

The preferred way, suggested by Microsoft, is to register your application in an Azure tenant, creating a logical representation of it, which can be used to assign permissions and access other Azure-hosted and Microsoft-provided services. Think of it as creating an identity for your application that is used to retrieve access tokens to various services.

Why that authentication approach is preferable to configuring access at the destination application level? For example, you can create an SQL-contained user in an Azure SQL Database and use it to connect from your on-premises server.
From the security perspective, it will be as bad as in the previous case. Firstly, you again have a username and password to take care of. Secondly, access management becomes more decentralized and complex as its delegated to the application (SQL Server) level. Lastly, you will lack all the identity protection features that an identity provider like Entra ID provides. If those credentials are stolen or misused, you might notice malicious activity in transaction logs long after it happens.

Unfortunately, most tutorials and official documentation describe (and promote in a way) how to authenticate to Azure resources using client secrets, which is similar to using passwords. Thanks to the application registration service design in Entra ID, client secrets are autogenerated complex passwords. Plus, you can apply policies to your tenants limiting their maximum validity period, forcing application owners to rotate them on a regular basis.

On the surface, it looks easy, as you just need to configure your environment variables, and you are good to go. The dark side of that approach is that those client IDs with their secrets can be shared as easily as usernames and passwords, making them vulnerable to misuse, unauthorized exposure, and credential leaks. The temptation and ability to generate never-expiring client secrets (many people dont want to be burdened with secret rotation) make such app registrations ideal candidates for identity attacks.

Certificate-based authentication is considered more secure than password-based (client secret) one. First of all, with certificates, you rely on asymmetric encryption, when only the public part of your certificate is stored along with an app registration, and the private part is secured on your application side. Apart from that, the certificates private key can be securely stored in certificate stores, where it can be read only by applications hosted on target machines without the possibility to export and transfer it somewhere else. Also, the complexity of working with certificates and encryption algorithms they use makes them less prominent targets for attacks compared to client secrets.

For example, in the diagram above, the client application uses a certificate stored in a Windows Certificate Store to authenticate to the related app registration and obtain an access token. Specialized administrative solutions can manage certificates, so developers dont even need to touch them. Certificate rotation can performed independently by system administrators. Developers might not even have access to those certificates and can only use them in a controlled manner on managed devices.

Case 3. Use managed identity with Azure Arc-enabled servers

The complexity of using certificate-based authentication also makes it harder to adapt, especially when you dont have the capacity or expertise to manage those certificates efficiently. So, what can we do in that case?

What if I told you that there is a way to use managed identities even for servers hosted outside of Azure? By onboarding your on-premises server, AWS EC2 instance, or any external machine to Azure Arc, you effectively create a logical representation of it in your tenant, which you can use to manage that resource from Azure and provide it access to other Azure resources using its service principal. The Azure Arc agent running on such machines is responsible for maintaining the connection to the Azure cloud and providing access to that machine authentication context. Applications running on an Azure Arc-enabled server can use that authentication context to access Azure resources the same way if that server was hosted in Azure and you used its managed identity for authentication.

Although onboarding your servers to Azure Arc has an initial overhead, after the servers are connected, you can leverage their managed identities the same way you do with Azure VMs. For organizations with many on-premises resources, this can be a lifesaver when planning integration with Azure resources.

Case 4. Use AWS Secrets Manager, AWS Certificate Manager, or analogs to store your credentials

You might ask, okay, what do I do if I need to connect to Azure resources from another cloud and managed identity is not an option for me? In that case, using cloud services specifically designed to store and retrieve sensitive information like secrets, keys, and certificates would be security-wise.

For example, consider your client is hosted in AWS. It uses AWS Lambda functions to retrieve information from an Azure-hosted API.

You can create application registration for your AWS-hosted client in your Azure tenant and generate client credentials. Next, you store those credentials in AWS Secret Manager if its a client secret or in AWS Certificate Manager if its a client certificate. You provide access to read those credentials to your Lambda function and then use them to authenticate to the target Azure resource.

The bottom line is that if you cannot get rid of credentials, you should ensure their safety on the client side. If your Lambda function references those secrets, you dont even need to know their values. In addition, those secrets can be rotated separately without touching the client application.

In some cases, like connecting from third-party applications or SaaS services, you might have no other option than using client secrets only due to application limitations. However, in such scenarios, you also entrust your credentials handling to that third party. Here, you should carefully consider what permissions you grant to that application in your Azure tenant. If it needs excessive write (or read) access, it definitely should be a red flag for you.

In conclusion

As you might see from the described use cases, there are plenty of more secure ways to authenticate to Azure services than just using usernames and passwords. Unfortunately, many developers overlook them when designing applications or configuring integration with Azure-hosted solutions. Old habits die hard. Despite the times when applications were mostly self-hosted and connecting to a database located on another server using database user and password are long gone, that legacy still lives in many organizations. The unwillingness to take that extra step to secure your application connections leads to leaked passwords, compromised accounts, and data breaches. So, next time when you deploy an application to production and use usernames and passwords to configure it, remember that the average cost of a data breach reached an all-time high in 2023 of USD 4.45 million and increases year over year.

Subscribe 👇 to stay tuned for follow-up posts in this series. Also, share your thoughts on authenticating to Azure services in the comments section!

]]>

How to send custom Azure Automation Runbook logs to Log Analytics

Andrew Matveychuk — Mon, 26 Feb 2024 13:00:03 GMT

In this blog post, I will explain why it might be helpful to implement custom logging in your Azure Automation runbooks, why to use Log Analytics as a log destination, and how to implement a logging framework that can be scaled and shared by your runbooks.

Note. I will discuss PowerShell runbooks here, but many concepts described below also apply to other Azure Automation runbook types.

Why use custom logging for Azure Automation runbooks?

Azure Automation accounts support extensive capabilities for tracing and logging runbook execution results. If you are familiar with PowerShell streams and how they work, you can also use them in your runbooks. All regular PowerShell logging practices with verbose, debugging, warning, and error outputs can be used in Automation runbooks without any modifications. The only thing to remember is that you need to explicitly enable their logging in the runbook job history for some streams.

However, the runbook job history is available by default only for the last 30 days. So, if you want to keep your logs longer or run some queries on them, you can send them for long-term storage to a Storage account, a Log Analytics workplace, or an external system using diagnostic settings. Just remember to include the JobStreams log category in your settings.

Those system options for runbook logging will be enough for most use cases. However, those default logging options can add too much irrelevant data to the logs. Plus, extracting application-specific details from those system logs might pose some challenges, as there are no technical restrictions on what application-specific information should be present in those logs and how it should be formatted. So, can we implement structured custom logs, and where should we store them?

Why send your custom runbook logs to Log Analytics?

Although there are no technical limitations on where to store your custom logs in Azure, some Azure services are better suited for that purpose than others. For instance, the available storage options in Azure resource diagnostic settings list Log Analytics workspaces and Storage accounts as the two most common places to store log data.

You can certainly use any form of storage or database for your logs, as you might have specific requirements for your solution. However, if you dont need to use Azure SQL Database, Azure Cosmos DB, or any other functional storage solution specifically, I suggest you start with one of the default options present in Diagnostic settings.

Log Analytics workspaces store data in a tabular format and provide extensive query capabilities. So, if you need to build some analytics on top of your logs or use them extensively for troubleshooting, definitely consider them first.

Storage accounts can store data in different forms, such as files, blobs, and tables, but they lack any reporting and analytical capabilities at the service level. Using data stored in them requires using some external solutions, which can make your overall design more complex. They are a suggested log storage destination in scenarios when you need to keep a log archive that is rarely accessible and you want to make it as cheap as possible.

How to send data from PowerShell to a Log Analytics workspace?

The current suggested approach for sending data to a Log Analytics workspace is to use the Logs Ingestion API in Azure Monitor. Compared to the deprecated Data Collector API, now you can create a Data Collection Endpoint (DCE) resource in your Azure subscription, which serves as an entry point for sending your data to a workspace. Plus, as a part of the required setup, you create Data Collection Rules (DCR) that group transformation logic you can apply to the incoming data before it is sent to the destination workspace.

Unfortunately, Microsoft hasnt yet provided a PowerShell SDK to abstract REST calls to the Log Ingestion API. However, the provided sample PowerShell code can be easily adapted for making calls to the API from a runbook like the following:

https://gist.github.com/andrewmatveychuk/3a747949610efb23fd39617395027e68

As you can see from the runbook, it authenticates to Azure using a system-assigned identity for the Automation account. That identity should be assigned the Monitoring Metrics Publisher role so the target DCR can push data to it. That way, you dont have to maintain a separate application registration in your tenant to access your collection rules.

For sure, you can authenticate to your DCR rules using custom application registrations, as presented in official samples, if you want to make your access model as granular as possible. However, it might be less practical than using managed identities, as you must maintain those registrations and rotate their access credentials.

After the authentication, it uses the obtained Azure context to acquire an access token for a specific API, which is Azure Monitor in our case. The token is immediately converted to a secure string to be used by the follow-up Invoke-RestMethod cmdlet. Next, it makes the preconfigured API call and outputs response results.

Provided that you constructed the correct JSON payload that your DCR expects from your PowerShell object, you will see new log entries in your custom workspace table in a moment.

Please note that the initial population of the destination custom table might take a while. After it is set up and indexed, all further entries can be queried much faster.

You can stop here or make your runbook logging solution more robust and reusable. For example, you can extract the logic for pushing your log entries into a separate function packed in a custom PowerShell module that you can import into your Automation accounts. That way, you can make your runbook code cleaner and reuse your logging function as a regular PowerShell cmdlet in your other runbooks.

https://gist.github.com/andrewmatveychuk/3c79a5a94a8254b43a185b05cd93d5a9

In addition to that, you can also look into defining custom PowerShell classes to improve the type safety of your log entry object and make your logs more structured, as with PSCustomObject, you can assign its properties any values, which may lead to running into errors on the DCR transformation step due to mismatch in expected input format:

https://gist.github.com/andrewmatveychuk/f66e81864f2a96c8ff3a772b38cdc009

As you can see from that sample code, you can define custom PowerShell enum types to limit the number of allowed values for specific log entry properties used for categorization. Also, explicitly defining class property types will help you correct formatting errors when creating log entry objects in your runbooks, not when a receiving DCR transformation stream rejects those entries.

Now, your updated runbook that sends custom logs to a Log Analytics workspace might look like this:

https://gist.github.com/andrewmatveychuk/29064cb51f7c126698bbd121ecc0d404

There are also some community solutions for that, like the AzLogDcrIngestPS module by Morten Knudsen, which I honestly recommend checking, that allows you to automate all the work around configuring the resources required for the Log Ingestion API. However, introducing an unofficial module dependency only for a small portion of its functionality, like only sending logs, might not always be desirable.

Have you implemented custom logging in your Azure Automation runbooks? Did you find it helpful, and why? Share your opinions in the comments below 👇

]]>

How to use change history in Azure Monitor Workbooks

Andrew Matveychuk — Mon, 07 Aug 2023 11:00:27 GMT

Anyone who has worked in IT Operations long enough knows that many incidents and service outages occur due to recent application configuration changes. If an application has an SLA, incident handling routines often explicitly instruct users to check for such changes as one of the first troubleshooting steps.

Despite Azure providing many options for tracking and analyzing resource configuration changes, those tools are spread across different services and sections of the portal. While some of them can be used out-of-the-box, others, like Change analysis in VM insights, Change Tracking in Azure Automation, or saving Activity logs to Log Analytics workspaces, require additional configuration before use.

Last year, Microsoft announced the availability of multiple Change Analysis features. What makes them great is that they use Azure Resource Graph under the hood, which has an extensive list of SDKs. From the practical perspective, that allows you to pull the data about changes from the graph into many Azure-native monitoring solutions and third-party ones. Here, I will explore how you can view the history of such configuration changes in Azure Monitor Workbooks.

Pull data about changes in Azure Monitor Workbooks

In Azure Monitor Workbooks, you can use the Azure Resource Graph data source to execute supported KQL queries and display their results in various formats. For example, you can use the following query to pull all changes for the last 14 days scoped to a resource group:

https://gist.github.com/andrewmatveychuk/9ab0b3d8a05e63e81600fe60c77dcc43

The resulting table might look like the following:

Unfortunately, the Change Analysis data source, which is in preview as of now, scopes its results to a single resource only. That might not be so helpful in cases when a change in another resource causes an alert linked to your target resource. For example, an error in your web app might be caused by a changed or deleted certificate in a Key Vault.

List active alerts in Azure Monitor Workbooks

Obviously, having a list of active alerts in the same workbook alongside the configuration changes would be helpful to correlate between them. Previously, there was a separate data source for pulling the information about Azure Monitor alerts, but now the alert info is available via Azure Resource Graph. The following query will return the list of all active Azure Monitor alerts in the target resource group:s

https://gist.github.com/andrewmatveychuk/dbe46caabe9b89df1cc6c338600b927b

A sample result of that query:

Another option to get the alert information is to pull it with an Azure Resource Manager query.

Improve troubleshooting experience with Azure Monitor Workbooks

Sometimes, you might need to get back to resolved alerts to review and analyze them. You can achieve that by using parameters in Azure Monitor Workbooks.

For example, you can add parameters to display alerts that fired during the last N hours/days and optionally include resolved alerts to your results. After that, you can reference those parameters in your query to make it more dynamic:

https://gist.github.com/andrewmatveychuk/6528ac35aa295ff215a132a09a81b5f8

Also, you can improve your user experience by formatting the results with colorful icons and using column formatting:

If you need to analyze changes preceding alerts fired beyond the last 14 days, you will need to schedule exporting those logs to some external store like Log Analytics and modify your workbook to pull the change info from a workspace instead of Azure Resource Graph directly.

Because, from the troubleshooting perspective, it makes sense to look into the changes that happened before the alert was fired and not after, you can take an extra step and make the results in your change history view depending on a specific alert you select in your alerts view. For that, you can define a workbook parameter that is populated when you click on a row in your alert grid view and use it in a follow-up change history query to filter the changes that happened earlier than the alert fired:

https://gist.github.com/andrewmatveychuk/d7722c8bdfc4a2ad5d82ad04e178adda

The improved change history view:

You can also extend your workbook with information about resource and workload health in the target scope to immediately check if an outage is related to resource availability.

Feel free to check my GitHub repository for the complete sample workbook template for correlating between alerts and configuration changes in Azure.

What is your experience analyzing changes that caused application performance or availability issues? What tools do you use the most to review configuration changes in Azure resources? Please share your thoughts in the comments 👇

]]>

Naming convention for Azure resources

Andrew Matveychuk — Tue, 25 Jul 2023 11:00:49 GMT

The primary intent of having a naming convention for Azure resources is to be able to identify essential information about a resource, for example:

related service or product for the resource;
resource type;
role of the resource or resource identifier used to differentiate between multiple instances of the same resource;
environment, etc.

Many organizations tend to adopt some variation of the Microsoft-suggested naming recommendations. If you havent reviewed them yet, I strongly encourage you to do so. However, please be mindful and treat those naming patterns as suggestions and not strict rules. Picking them up without a second thought might not reflect how your organization operates Azure services.

If you havent implemented a naming convention in your Azure environment, now is a good time to review the following naming patterns, which can be a good starting point on your cloud journey.

Generic Naming Rule

The general convention I suggest for naming your Azure resources when you dont have any specific requirements is the following:

<product name>-<type of service abbreviated> [-<environment>][-<identifier>]

Where:

product name: product, application, service or platform;
type of service abbreviated: an abbreviation identifying Azure service type (see the list of sample abbreviations below);
environment (optional): environment name, e.g., production (prd), development (dev), testing (qa) or staging (stg);
identifier (optional): role of the resource or resource identifier used to differentiate between multiple instances of the same resource.

Why put the application name first and not the resource type? From my practice, it improves the search experience on the Azure portal and when using Azure PowerShell or CLI. Usually, when you are looking for a particular Azure resource, you first think of an application you work with and then of the resource type. Besides, people remember name patterns and tend to type names from the start. For example, when you have some resource name like rg-myapp1 and start typing rg in the search box on the portal, you are likely to get the list of random resource groups first, which can be quite long. Then you need to continue typing to limit the results to a specific application. The console experience is similar you need to type *myapp1* or rg-myapp1* when you could have saved your keystrokes and just typed myapp1* to limit your search at the initial step to the resource of your application only. That change in the naming pattern might not seem like a big deal, but it can save your organization a lot of time, considering how often people search for a particular resource.

Use lowercase letters and numbers only in resource names. Spaces and special characters, except for hyphens, should not be used, as there are many cases when Azure resource restrictions limit their use. Also, because of those restrictions (resource name length), all the abbreviations and codes should be as short as possible to leave more room for using meaningful product/application names.

The kebab-case format should be used whenever possible. Hyphens can be removed for services where only alphanumeric characters are allowed, e.g., Storage accounts.

As most resource names are case-insensitive, using Camel case to save on delimiters like hyphens can lead to worse readability and more errors. Dots, underscores and spaces are not allowed for most of the resources. However, many second-level resources dont have the same limitations as parent ones, and you can use more human-readable names for them.

Some resources might need to use the same name in multiple resource groups, and those might not have the suffix, which is optional, appended at the end of the name. The same is true for the optional suffix.

Optional name parts should be consistent across an entire subscription. For example, if you choose to omit a specific name part in a subscription name, you should do the same for the resource groups and resources in it.

You might wonder why I dont include the resource location in its name, as suggested by Microsoft. Nowadays, many Azure resources can be moved between regions, which wasnt possible in the past. Unfortunately, renaming a resource to include a new region code in its name is impossible. Ive seen many cases when having the resource location name part becomes completely dysfunctional and misleading. One possible exclusion from this is having multi-regional deployments when you use the location name/code as the identifier part in the resource name to distinguish between instances deployed in different regions, as resource movements rarely happen in that deployment model.

Generally, your resource name should consist only of immutable properties. Any other property that can be changed, like location, SKU or service tier, should not be included in resource names. If such updatable properties are essential for your operation, it is best to use Azure tags for them.

If you are looking to enforce a naming convention in your environment, you can check on how to do it with Azure Policy.

Naming Convention

To start with, your naming convention for Azure resources might look like the following:

Subscriptions

Naming pattern:

<organization>-<portfolio>-sub[-<environment>]

Where:

organization: organization short name;
portfolio: department, team or product line/portfolio;
environment: (optional) environment name.

Examples:

contoso-digital-platform-sub-dev
contoso-digital-platform-sub-prd

Resource Groups

Naming pattern:

<product name>-rg[-<environment>][-<identifier>]

Where:

product name: product, solution, service or platform;
environment (optional): environment name;
identifier (optional): a unique identifier used to differentiate between multiple instances of the same application deployment to avoid naming collisions (more on this below). It should be scoped to the subscription containing your resource group.

Examples:

application-rg-dev
application-rg-prd

Resources

Naming pattern:

<product name>-<type of service abbreviated> [-<environment>][-<identifier>]

Where:

name parts follow the generic naming rule;
the unique identifier should be scoped to a resource group containing the resource.

Examples of naming for the most common abbreviations for Azure services

Resource type	Resource name abbreviation	Examples
API Management services	-apim-	application-apim-dev
App Service Functions apps	-func-	salesapp-func-qa
App Service Plans	-asp-	publicweb-asp-prd
App Service Web apps	-web-	publicweb-web-prd
Application Gateways	-agw-	publicweb-agw-prd
Application Insights	-ai-	publicweb-ai-qa
Application Security Groups	-asg-	connectapp-asg-qa
Application Service Environments	-ase-	landingpage-ase-prd
Automation Accounts	-aa-	monitoring-aa-prd
Availability Sets	-as-	partnertools-as-stg
Cache for Redis	-redis-	partnertools-redis-dev
CDN profiles	-cdn-	publicweb-cdn-prd
Cognitive Services accounts	-cogs-	salesapp-cogs-qa
Container Registry	creg	connectappcregcvhiyu5h2o5o
Cosmos DBs	-cos-	salesapp-cos-qa
Event Hubs namespaces	-eh-	connectapp-eh-qa
Gateway connections	-cn-	connectapp-cn-qa
Key Vaults	-kv-	publicweb-kv-prd
Load Balancers	-lb-	connectapp-lb-qa-frontend
Log Analytics workspaces	-la-	monitoring-la-prd
Logic apps	-lapp-	salesapp-lapp-qa
Machine Learning workspaces	-ml-	salesapp-ml-qa
Network Interfaces	-nic-	connectapp-nic-qa-01
Network Security Groups	-nsg-	connectapp-nsg-qa-backend
Notification Hub namespaces	-nh-	connectapp-nh-prd
Public IPs	-pip-	connectapp-pip-qa-02
Resource Groups	-rg-	application-rg-qa
Route tables	-rt-	connectapp-rt-qa-01
Search services	-srch-	bookingservice-srch-stg
Service Buses	-sbus-	bookingservice-sbus-stg
SQL Server Managed Instances	-sqlmi-	salesapp-sqlmi-qa
SQL Databases	-sql-	publicweb-sql-prd
Storage accounts	st	publicwebsttcvhiyu5h2o5o
Traffic Manager profiles	-tm-	applications-tm-prd
Virtual Machines	-vm-	connectapp-vm-qa-db
Virtual Network Gateways	-gw-	connectapp-gw-qa-01
Virtual Networks	-vnet-	connectapp-vnet-qa

Naming Collisions

Some Azure resources must be named uniquely at the subscription scope or across all of Azure. It is common to encounter naming collisions for these resources.

One solution to get around naming collisions is to use a unique string when creating a resource. A unique string is typically a short hash of one or more concatenated input strings, so the output is sufficiently random to avoid naming collisions. Appending a unique string to a resource will help ensure the name satisfies any uniqueness constraints a resource requires. Creating a resource via an ARM/Bicep template is one way to create such unique strings.

Unique string scoped to a subscription:

"[uniqueString(subscription().subscriptionId)]"

Unique string scoped to a resource group:

"[uniqueString(resourceGroup().id)]"

A unique string that is globally unique and different between resource groups:

"[uniqueString(subscription().subscriptionId,resourceGroup().id)]"

When you need to create a new unique name each time you deploy a template and dont intend to update the resource, you can use the utcNow function along with uniqueString, for example:

"[concat(uniqueString(resourceGroup().id), utcNow())]"

You can use this approach to create unique names when you employ immutable infrastructure.

In conclusion

Please be aware that no single best approach exists for naming your Azure resources. The critical point is that you need to have a naming convention and follow it for resource maintainability and operation. Consistency in naming your resources is the key to speaking the same language between the members of different teams in your organization. So, please treat this guide as lessons from experience rather than the ultimate truth. If something doesnt work for you, you must revisit it and adjust to your needs.

What naming patterns and anti-patterns have you encountered in your work with Azure? What parts of resource naming were the most debatable in your teams? Please share your experience in the comments section 👇

]]>

Azure Savings Plans vs. Azure Reservations

Andrew Matveychuk — Mon, 17 Jul 2023 11:00:24 GMT

Last fall, Microsoft announced a new cost-saving option for Azure workloads called Savings Plans. Many customers who already were using Reservations (aka Azure Reserved Instances) to save on their cloud costs mistakenly perceived Savings Plans as a replacement for them. So, lets try to demystify Savings Plans and whether you should migrate to them from Reservations.

Key differences between Azure Reservations and Savings Plans

To make informed decisions when choosing among different cost-saving options in Azure, we first need to understand their differences, summarized in the following table:

The first difference to pay attention to is how those two options are applied. While a reservation is bound to one specific Azure region you chose when purchasing it, the saving plan will automatically target eligible resources in all regions in the assigned scope, which can be a shared scope, a management group, a subscription, or a resource group for both saving options. For example, your resource group contains two virtual machines deployed in two different regions. With Azure Reservations, you must purchase two of them bound to the respective regions and scope them to your resource group. Suppose you migrate any of those VMs to a region different from the original one. In that case, the corresponding reservation wont be applied anymore, and you will be billed on a pay-as-you-go basis for them, plus still paying for the unused reservation(s). In the case of Savings Plans, the lower savings plan prices will still apply to those VMs regardless of the region.

Apart from that, while Reservations apply only to specific resource SKUs you define during their purchase, Savings Plans will apply their discount to all resource sizes (with some exceptions). For instance, if you resize your VM, you will likely lose the benefits from the reservation unless the new VM size is from the same SKU family and the reservation was configured for instance size flexibility.

Secondly, Azure Savings Plans target only a limited list of compute resources, which are Azure Virtual Machines, Azure Container Instances, Azure Functions Premium Plans, Azure App Service (Premium v3 and Isolated v2 only), and Azure Dedicated Host at the time of writing this. That list might be extended in the future, so please refer to the official docs for up-to-date information. In contrast, Azure Reservations, in addition to compute resources, can also target storage and a wide range of PaaS resources such as databases, various data and analytics solutions, etc.

Thirdly, while Reservations are partially refundable and exchangeable, Savings Plans are not. That difference will disappear in the future, as Reservations purchased starting from Jan 1, 2024, will practically no longer allow exchanges.

Lastly, the level of potential savings. Please dont be deceived by those numbers, as they are the maximum you can achieve with specific conditions like a certain VM size, region, commitment duration, etc. The actual numbers might vary significantly, so its always good to do your homework and understand your current savings and the potential ones in your specific case.

I intentionally omitted some similarities between those two options in my comparison to focus on the most critical factors that can affect your decision about going with Savings Plans, Reservations, or both. If you need full details, I encourage you to check the official documentation for them, as their usage policies might be updated, and new services might be added to the list of applicable savings:

Now, lets look at how to use those options in practice.

Practical application of Savings Plans

Despite Microsoft doing a great job presenting that new saving option to its customers, I still observe some confusion when people start talking about using Savings Plans in practice. Most of it originates from the misunderstanding that Savings Plans intend to replace Reservations. In fact, Savings Plans should be perceived as a complementary cost optimization tool. Choosing the proper tool depends on your Azure environment and workload run patterns. Lets review a few workload examples and see what the most appropriate cost-saving solution we can use for them. I will use compute workloads to compare apples to apples, as Savings Plans currently cover only them.

In the first case, your environment consists mostly of IaaS components deployed in a few Azure regions. Your applications run on VMs that are mostly persistent resources with stable capacity demand and a relatively long lifetime (years). The movement of VMs between regions and their vertical scaling is rare, which is performed manually when a need arises. That is a typical picture of long-running, persistent infrastructure with no sudden changes in resource utilization. Azure Reservations would be the first consideration, as they can help you achieve greater savings when applied to such stable workloads.

In the second scenario, you are in the migration phase when your organization moves its application from VMs to more scalable PaaS and CaaS solutions. You still have many VMs in your hands, but at the same time, their list and configuration constantly change as you migrate to App Services, Functions, containers, etc. The landscape of those new resources is also changing and evolving as product teams learn, adapt and adjust to new business needs. Azure Savings Plans would be a preferable saving option here, as they apply to all those compute resources regardless of their type, size and location. The common denominator is the hourly commitment to consume compute resources for X dollars. Whether that money was spent on running a VM or App Service doesnt matter. In any case, the usage less or equal to X will be billed at lower prices.

A combined scenario is more likely to be observed in large environments. You have different parts of your infrastructure at different levels of cloud adoption. Some pieces are old apps with specific technical and regulatory compliance requirements that run on VMs and shouldnt be touched. Others are all bells and whistles of the most recent cloud services autoscaling, serverless, microservices, etc. Plus, there are plenty of areas between those two edge cases. Here you are likely to weigh your cost-saving options and stack them to have the optimal solution. Reservations can cover stable compute resources. The dynamic ones with steady spending can benefit from Savings Plans. If the scope of a reservation and savings plan overlaps, the reservation discount will be applied first, and the savings plan reduced pricing second.

For example, achieving 100% utilization for Azure Reservations in a dynamic environment can be quite challenging you need to monitor their utilization, analyze their resource coverage, rebalance existing underutilized reservations, and perform those activities regularly. Besides, its most likely that your savings from using Reservations at scale will differ from the promised 72% maximum. If you analyze potential savings from using Savings Plans, you might find out that they can provide you with relatively the same level of savings in your environment, but you will need to spend far less time on administering them. Or, you can reduce the list of resources covered by Reservations to decrease the management overhead and use Savings Plans to cover the emerging difference.

Of course, those examples might look simplistic, and the real-life cost optimization strategy will likely account for more requirements. The important point is to understand your environment and always run the numbers before siding with specific cost optimization approaches. Just looking for higher potential savings without understanding the context may lead to an increase in FinOps management overhead, diminishing your returns.

Have you started using Azure Savings Plans in your environment? Were they beneficial, and how? Share your experience in the comments 👇

]]>

How to find unused Azure resources

Andrew Matveychuk — Fri, 23 Jun 2023 12:30:24 GMT

This is probably one of the top 10 questions I hear from customers running their workloads in Azure. The questions intent might vary, but its ultimate goal is the same to remove unused resources. Lets explore why we keep facing that question and why most people struggle to answer it.

Why find unused Azure resources

If no negative factors were associated with unused cloud resources, nobody would probably care about their existence and what to do with them. In other words, when you want to find such resources in your cloud environment, you usually pursue a specific goal. It can be:

save on the associated resource costs (aka cost optimization),
reduce a possible attack surface on your infrastructure (aka security hardening),
reduce management overhead associated with supporting resources (aka operational excellence),
scale down overprovisioned resources (aka cost optimization),
remove ambiguity when deploying or updating your applications so no incorrect resources are used in the process (reduce errors) and others.

Identifying your primary intent(s) for your search will help you to analyze the question from the proper perspective. Why Im saying that? Because the answer to the question of unused resources depends on your understanding of it. Let me illustrate this with an example.

Have you ever considered why we get unused resources in the cloud? Large enterprises might run hundreds of applications and have thousands of infrastructure components. Infrastructure changes happen constantly: applications get updated, new services are deployed, some are decommissioned, internal processes change, reorgs happen, etc. Changes in one area might not be well-coordinated with others, causing gaps in the existing processes and creating misconfigurations. The number of those misconfigurations, which included unused resources, might be estimated in hundreds and thousands of individual cloud services. The time and resources you can spend processing so much data are usually limited, so focusing on the most important parts makes perfect sense. You should identify and focus on the primary cost drivers to optimize your cloud costs. If your focus is security, you likely want to address the most critical security issues caused by unused resources (an Azure VM with public IP and open management ports is a more significant threat to your environment than an empty resource group or unattached disk).

Now that you know what specific issues you want to address by finding unused resources in your Azure environment, its time to do the assessment.

Assessment tools to identify orphan resources in Azure

Disclaimer. As I mentioned earlier, the exact definition of resource usedness depends on the issue you are trying to solve by looking for those resources. Most of the existing solutions for assessing your Azure environment for unused resources make their suggestions based primarily on the current resource configuration and look for the most obvious cases, like dangling resources that have no practical use in that state, e.g., unattached NSGs, or compute resource not associated with any workloads, e.g., App Service Plans running no apps. However, even those seemingly apparent cases of unused resources might be delusive. For example, an unattached disk might contain some data that is yet to be processed, an empty resource group might serve as a deployment boundary with preconfigured permissions for disposable deployments, etc. So, please remember to check with resource owners or responsible teams on such resources before planning them for removal.

Now, lets explore some starter tools that you can use to fund unused resources.

The Azure Orphan Resources Workbook is probably the best solution for quickly assessing your Azure subscriptions for unused resources. Under the hood, that workbook uses Azure Resource Graph queries to pull the data about resources in scope and can provide you with a solid starting point for further investigation. Its a solid choice when you must analyze a new environment and identify the most problematic areas to focus on.

Continuous Cloud Optimization Insights is a solution based on Power BI. It pulls data from different Azure services and combines it to give you more informative reports on different aspects of your cloud environments. Rather than focusing exclusively on tracking unused resources, it can help you to aggregate recommendations from different systems like Azure Advisor, Microsoft Defender for Cloud, Azure Monitor and so on, providing you with more meaningful insights into the actual resource usage.

Although Azure Advisor might not be so powerful in generating recommendations about unused resources as more specialized tools, it can still give you a list of top cloud waste sources and recommendations on addressing them, especially regarding underused resource capacity and possible security risks. Its free, part of the Azure portal, and doesnt require any configuration. Its recommendations are structured into several categories that explicitly define your search end goal.

Azure Resource Graph is a low-level tool for analyzing the configuration of Azure resources. Using Kusto (aka KQL) queries, you can quickly find unattached disks, NSGs not associated with any network or interface, etc. If the Azure Orphan Resources Workbook provides a good summary, the resource graph allows you to dig into details and tune your search.

Azure Monitor might not be an obvious choice for exploring unused resources, but it is the number one place to analyze your Azure resource utilization. There are many cases when Azure resource configuration tells you nothing about whether it is actually used or not. However, you can make some assumptions based on resource metrics. For example, you can check how many requests were served by an App Service, the resource utilization for a specific App Service plan, or how many reads and writes occurred from a Storage account over the last time. When you see low resource usage, you should add such resources to your analysis list to confirm their actuality and usefulness.

Still, using all those instruments to analyze your infrastructure for unused resources is a reactive action. What can we do to prevent or, better yet, minimize such a waste of cloud resources?

How to prevent cloud waste

To see how we can prevent the waste of cloud resources from happening, we first need to understand why it occurs.

Have you ever wondered why we store unused resources in the cloud? Unfortunately, there is no easy answer to that question.

Probably, the number one reason for cloud resource waste is the lack of lifecycle management for cloud resources, similar to Application lifecycle management (ALM). Its usually closely connected to absent or underperforming CMDB for cloud resources when you cannot identify resource owners, what applications or services they are related to, and if they are still in use. If you dont have a CMDB for your Azure resources, I encourage you to check my series on running CMDB for Azure to understand why keeping your infrastructure neat and clean is essential.

In the second place, I would put the lack of resource performance monitoring and established resource capacity management processes. When you dont monitor resource utilization and have no insights into how they are used, you cannot make any deliberate decisions about scaling your resources up or down to use them according to current demand. For example, your website might be in use and running just fine on a high-performant App Service Premium SKU. However, it utilizes just a tiny fraction of your plan capacity and can perform equally well on a far less powerful (and less expensive) App Service plan tier. The same applies to storage resources when your actual demand in throughput and data usage pattern can be satisfied using less performance storage tiers (e.g., HDD instead of SDD) and appropriate storage type (Hot, Cool, Archive, etc.). Although autoscaling and consumption-based (serverless) resources partially solve the issue of resource overprovisioning, those cloud-native patterns should be explicitly used when configuring your cloud resources.

The third and not-so-explicit reason for cloud waste is the lack of established security controls for cloud resources and apps running on them. Despite those two not being directly connected, there is often a clear correlation between how secure an environment is and how many wasted resources are in it. Let me explain.

Many security practices involve periodic assessments and automatic scanning for common security vulnerabilities, resource misconfiguration, possible unauthorized access, or data leaks. The more resources you have, the more actions you might be tasked to take to mitigate discovered security threats. Naturally, you would be eager to reduce the number of resources (and apps) in your responsibility to decrease the amount of work you need to perform during each security review. Besides, regular security assessments serve as a trigger to delete those resources Bob/Dave/Sarah, etc., deployed to do some tests N-years ago.

As you might already see from that non-inclusive list of reasons for cloud waste, completely preventing it might be quite problematic. Instead, I would suggest focusing on minimizing the negative impact of unused resources during both the design and operation phases of your cloud infrastructure management. When designing your cloud application, you should leverage cloud-native design patterns and prefer solutions that support autoscaling and consumption-based pricing. Whereas operating existing resources requires many infrastructure support processes to be present and executed efficiently. To name a few:

Configuration Management (CMDB)
Monitoring (performance, events, infrastructure specific and application-tailored)
Capacity Management
Security/Vulnerability scans

Also, other practices such as Continues Deployment, Infrastructure as Code, Disposable Environments, Immutable Infrastructure, Application Performance Management, etc., can help you minimize your cloud waste and bring the management of your cloud resources to a new level. Nevertheless, before deciding on what prevention measures to take, dont forget to clarify your definition of resource usefulness and assess your environment for the largest gaps in effective resource utilization.

What is your experience in eliminating cloud waste and tracking unused cloud resources in Azure or another cloud? Please share your thoughts and suggestions in the comments 👇

]]>

How to validate Azure tags

Andrew Matveychuk — Mon, 15 May 2023 12:00:33 GMT

This blog post follows up on my series on running CMDB for Azure resources. In the previous parts, I shared some guiding principles for organizing your CMDB using Azure tags, so here, I will cover some practical examples of keeping your Azure tags neat and structured at scale.

The issue with Azure tags from the CMDB perspective

Azure tag names and their values are just free text properties. You can choose whatever tag name and its value you want to apply to your resources, provided they are according to the Azure tag limitations. On the one hand, its great because you have much flexibility in defining your tagging convention. On the other hand, your tag list can quickly become a mess as there are few to no default controls to validate them.

For example, you want to tag your resources with an internal application name and a reference to a resource owner. By default, nothing prevents you from tagging resources like the following:

Resource 1 (tag name: tag value)
- App_name: MyApp1
- owner: Jhon Doe
Resource 2 (tag name: tag value)
- Application: MyApp1
- _Owner: john.doe@contoso.com (here underscore in the front stands for a space character)

As you can see from that example, those resources likely belong to the same application and to the same owner. However, tag names and value formats vary significantly. That might not be a problem if you manage a dozen resources manually, but it will become a huge automation challenge at the enterprise scale with thousands of resources, applications, and users involved when its impossible to logically organize your assets without technical means. So, lets recap first what we can do to enforce specific tags that are applied to Azure resources.

Enforce tags on your Azure resources

Azure policy definitions for tag compliance would definitely be a good place to start governing your tagging convention. Generally, those policies can be grouped by the control they implement. There are built-in policies to:

require a specific tag (they will deny resource deployment without it),
inherit a tag from a parent resource container,
add/append a tag.

From the enforcement perspective, the policies that require and/or inherit tags are the ones that can help you push your tagging requirements. The policies to add or append tags are usually helpful for remediation purposes when you want to apply specific tags to different scopes.

Those built-in policies lack examples for auditing your Azure resources for tag compliance, but its pretty easy to create custom ones using the require-tag policies as a starting point. You can find such policy samples in my Azure Policy GitHub repository.

When designing your tagging convention, you might come up with a list of mandatory tags and optional ones, where the mandatory tags are an absolute must for your inventory purposes, and resources cannot be deployed without them, whereas the optional tags are not strictly required but suggested for application teams to use according to their needs. It is also common for the optional tags to have some suggested naming and value format to follow to be consistent with your overall tagging structure. Therefore, for the mandatory tags, you would usually use require-tag policies, and for the optional tags audit-tag ones (to keep track of their usage).

As a rule of thumb, when assigning your Azure policies requiring specific tags (denying resource deployment without them), specify the corresponding custom non-compliance messages that clearly explain what went wrong and how to fix it. An error message like the following one can save you thousands of hours and empower users to correct their deployment without raising a support ticket:

Azure resources must be tagged with the mandatory <tag_name> in the following format: <tag_name:tag_value>. Please refer to KB#### (<knowledgebase_URL>) for additional details.

Having those controls in place can help you to keep your tag names consistent. Now, lets see what we can do to validate tag values.

Validate tag value formatting

Here, I will use some suggested tags from my previous blog post to make my examples more practical.

Make sure to check my list of Azure Policy best practices for crafting your custom policies like using parameters, testing, deploying, etc.

For example, the Application Name tag is usually used to mark resources related to specific applications in the context of your organization. Depending on your requirements, its value might be just a free text or a text in a specific format according to your internal app/system encoding pattern:

https://gist.github.com/andrewmatveychuk/e53b43ce42a3673868b4b5149d1c92ee

The Service Map and Documentation tags can be URL links pointing to specific resources containing a live service health map and internal documentation/wiki. To enforce the URL pattern for tag values, you can use the following policy rule condition:

https://gist.github.com/andrewmatveychuk/177625cfc131e4982335cd24a24ca925

Similarly to the mentioned URL pattern validation, you can put a control to require the Owner and Technical Contact tags value to be in the form of an Email address:

https://gist.github.com/andrewmatveychuk/7d8f587b20008d4432321caf8218f198

Such tags as Business Unit, Data Profile, Service Class, and Criticality usually represent predefined lists of allowed tag options. For example, you can define the Data Profile tag to use only specific profile values:

https://gist.github.com/andrewmatveychuk/4f83b621a1882163b5bfcd53f3f9b465

If you rate your Service Classes or application Criticality using numeric values, that validation approach will work, too. Just keep in mind that numeric values are treated as strings here:

https://gist.github.com/andrewmatveychuk/659666a1f2e6b11444c68a340a67ee58

For date-related tags such as Created Date, you can ensure proper formatting using the following policy rule condition:

https://gist.github.com/andrewmatveychuk/7bbcaa53d3c7fd60746e8759b2034367

Unfortunately, Azure Policy conditional operators dont support regular expressions (aka regex). The closest expression validation you can do is by using the (not) like and match operators, and you might need to additionally validate tag values for more complex string patterns. More on that is in the next section.

Please check my Azure Policy GitHub repository for the complete sample policy definitions.

Such Azure policies should help you keep your tags and values consistent across your resources. However, tag values are still just plain text properties with enforced formatting. So, what to do to ensure that a specific object your tag represents, like a user email address, exists in your systems of records?

Automate your Azure tag validation processes

As I mentioned earlier, tags like resource Owner, Technical Contact, or Budget Approver are usually specified as a user email address, which serves as a unique user identifier and provides an immediate contact point for communication with users using automation channels. However, enforcing the Email format validation with Azure Policy doesnt prevent an editor from putting a mistyped or non-existing email address. Although Azure doesnt offer any native capabilities for tag value validation, it has all the building blocks you can use to build your custom validation processes.

For example, you can subscribe to your subscription Activity logs with Event Grid and then use Azure Functions / Logic Apps to process those events by looking for specific tags. The automation workflow can query your Azure AD to check if an entity with a specific email address exists and its active. Additionally, it can check if that email address belongs to a user or group account and implement some conditional logic according to your needs. If the email address is invalid, the workflow can send out a notification or log the event in a Log Analytics workspace with a corresponding Log alert rule configured to create alerts according to your unified monitoring process.

As email addresses can become invalid due to changes in Azure AD, it also might be helpful to set up scheduled check-ups for Azure tags representing emails and run periodic scans for all such tags:

Tags representing cost centers can rely on a fixed list of allowed values, but that list also might require periodic updates if the cost center structure in your organization changes. Such an update process can be implemented as a part of your deployment pipeline when you manage your Azure Policy via code:

Or, it can be independent of the policy deployment and implemented with an automation helper that pulls the list of valid cost center values from an external data source and updates the corresponding Azure Policy assignment with the new list of allowed tag options via a policy parameter.

After the policy is updated with the new allowed cost center list, some of your recourse might become non-compliant, so make sure you check and update invalidated tags with correct values. You can take it even further from there and configure Azure Monitor alerting for non-compliant resources.

As with many IT systems, there is no single best implementation of those processes, and the final design heavily depends on your specific requirements and constraints. Nevertheless, I hope the described approaches to validating Azure tags can make your work easier and help you maintain your CMDB records consistent across the related systems.

How do you validate Azure tags in your environments? Please share your ideas and questions in the comments 👇

]]>

How to audit Azure Hybrid Benefit usage with Azure Workbooks

Andrew Matveychuk — Thu, 05 Jan 2023 15:30:19 GMT

UPDATED. It seems that Microsoft borrowed my idea of analyzing Azure Hybrid Benefit usage with Azure Monitor Workbooks and included the corresponding charts in the Cost Optimization workbook in Azure Advisor. Feel free to check it for AHUB numbers and recommendations in your environment. Its great to see that they listen to customer feedback!

In one of my previous blog posts, I wrote about enabling Azure Hybrid Benefit at scale using Azure Policy. Today, we will explore how you can keep track of actual license counts consumed by Azure services with that license benefit turned on. Moreover, we will try to analyze those resources for optimal benefit usage.

If you want to skip the details and just get the Workbook for auditing Azure Hybrid Benefit usage, check for its template in my GitHub repository.

What are Azure Workbooks?

Azure Workbooks (also known as Azure Monitor Workbooks) is an excellent tool for visualizing data in the Azure Portal and sharing prebuilt reports with other users. They allow you to pull data from different data sources, extract some meaningful insights from that data, and present them to users in various formats. If you are familiar with Azure Dashboards, you can think of Azure Workbooks as a more advanced reporting solution.

There is enough documentation to start working with Azure Workbooks, so I will focus more on the practical case rather than explaining Workbooks functionality. If you are not familiar with them, in addition to the official documentation, I recommend checking the following videos produced by the people who created that great Azure service:

Working with most supported Azure Workbook data sources requires using Kusto Query Language (KQL) to query data and JSON to parse API responses, so some knowledge of those technologies will also be desirable.

On a personal side, I must admit that I overlooked Azure Workbooks functionality for a long time because I could usually gather required data using PowerShell much faster. However, doing everything with a familiar tool only wont help you explore other options, so you understand their pros and cons and can use them effectively. Therefore, I just decided to use this practical case as a learning opportunity for myself.

What data are we going to use, and how can we bring it to a Workbook?

For the sake of simplicity, lets focus on Azure Hybrid Benefit for Windows Server VMs only. Technically, we need to get the list of all Azure VMs with Windows Server OS and Hybrid Benefit enabled. For that task, we can leverage the power of Azure Resource Graph. A sample Graph query might look like this:

https://gist.github.com/andrewmatveychuk/019106dd6c6a1f5232dd8b2c939750eb

That query looks up for all Azure resources where the resource type is Azure VM, the OS image publisher is Microsoft Windows Server and the corresponding VM property controlling the license benefit usage is equal to Windows_Server.

Please note that KQL is case-sensitive, and I use case-insensitive operators as value naming is not always consistent across all Graph tables.

As Azure Hybrid Benefit for Windows Server VMs entitles you to cover the software costs based on the number of core licenses you own, we will summarize query findings as the counts of discovered VM sizes:

Unfortunately, Azure VM size names dont always correctly reflect the number of vCPUs according to the Azure VM sizes naming convention. So, just parsing the VM size names for CPU counts wont work. A suggested approach to get such data is to use the Resource SKUs API. As that API is a part of Azure Resource Manager REST APIs, we can use an Azure Resource Manager query in our workbook to pull the data from it:

/subscriptions/{Subscription:id}/providers/Microsoft.Compute/skus?api-version=2021-07-01&$filter=location eq '{Location}'

Note that the API is subscription-scoped, and we provide the subscription ID via a Workbook parameter. We also scope the data to a particular Azure region, which is also a parameter, to limit the amount of data the query returns and avoid duplicates.

Theoretically, if your Azure VMs are deployed across many different regions, you might miss data about VM sizes unavailable in the region scoped by the query. However, in practice, the VMs are usually deployed to a limited number of customer-selected Azure regions that support the same set of required VM SKUs. If that is not the case, you might need to pull the list of supported VM sizes for all regions and remove duplicate data from the results.

As the mentioned API returns a JSON payload containing SKU data for different resource types, we can transform the result with JSONPath expressions:

Firstly, we filter items using Azure VM as the resource type. Secondly, we select only the VM size properties we want: a VM SKU name and the number of vCPUs. A sample query output might look like the following table:

Can we combine data from different data sources?

Now, we have two pieces of data the counts of used Azure VM sizes with Azure Hybrid Benefit enabled and the number of vCPUs for each VM size. It would be great to combine that data by joining tables in SQL. Fortunately, we can do that using Azure Workbook Merge functionality. It allows you to perform different types of joins, so here we will use the left outer join to combine all items from our Azure VMs grouped by size (right table) with the matching items from the Azure VM SKUs list (left table):

Also, we can select the columns we want to see in the resulting data set and remove duplicate ones. Additionally, the Merge data source allows you to extend your data set using calculated columns. For example, you can use expressions to calculate the total number of used vCPUs per VM size:

The resulting table might look like the following:

However, those results are intermediate only, as the number of CPU cores doesnt directly translate to the number of core licenses you need. So, how do we know how many core licenses are consumed by each VM size?

Do you really benefit from Azure Hybrid Benefit?

As it turns out, regardless of the number of vCPUs, each Windows Server Azure VM with less than 8 vCPUs always consumes 8 core licenses. Servers with more than 8 vCPUs can stack more licenses on top of that minimum, but the license count should be rounded up to the nearest bigger integer divisible by 8. So, a 12-CPU VM will still consume 16 core licenses and so on.

To determine the actual number of required core licenses for our Windows Server VM fleet, we can use an additional calculated column with the following conditional logic:

Those expressions round up the number of licenses to:

8 if an Azure VM has less than 8 vCPUs,
the nearest bigger integer divisible by 8 in all other cases.

So, now we can see precisely how many core licenses are needed for each Azure VM size group without doing any math manually.

On top of that, we can do another trick with calculated columns and add icons signaling if a specific VM size is using Azure Hybrid Benefit efficiently from the license point of view:

The condition here compares the licenses required to the number of vCPUs and produces a successful output if those numbers are equal. All other cases are considered suboptimal.

That visualization can help us quickly understand which VM sizes in use dont fully benefit from the license benefit. This can be especially helpful if you dont have enough core licenses to cover all your Azure VMs, and it makes sense to redistribute them for maximum savings enable Azure Hybrid Benefit only for VMs with 8 vCPUs and more.

Can we make our solution more user-friendly and reusable?

At this point, our resulting table contains all the details we might need to audit Azure Hybrid Benefit usage by Windows Server VMs in a subscription. However, what if we want top-level indicators based on that data to provide quick insights without digging into the records? For example, lets add the total number of consumed core licenses and the total number of vCPUs to see how effectively we use licenses in general.

Instead of running additional queries in our workbook to fetch that data, we can reuse the existing dataset using the Merge data source. It allows you to make a copy of another query results so that you can present them in a different format:

Then, we can remove unnecessary columns, choose a different visualization option, and select the category to focus on:

As the sample figures above show, using licenses in a lab environment is far from efficient.

Apart from those improvements, we can easily enable data export in Azure Workbooks, so other non-technical people can export the detailed results if they want to do some analysis with external tools:

Azure Workbooks can be exported as Workbook templates or ARM templates. Those templates can be imported into other environments(subscriptions) without any substantial preparation. This allows you to share them with other people as ready-to-consume solutions that dont require technical expertise to use, making them a perfect case for solution adoption.

What is next?

Finally, you can grab the ready-to-use Workbook for auditing Azure Hybrid Benefit usage from my GitHub repository, import it into your Azure subscription, and check the license numbers in your environment. Currently, it covers the cases for Azure Hybrid Benefit for Windows Server VMs, SQL Server VMs, and Windows Client VMs (aka Multitenant Hosting Rights). Feel free to adjust it to your needs and add missing functionality for other Azure Hybrid Benefit use cases.

Got a question? Post it in the comments below and let me know if that guide was helpful for you 👇

Update. The workbook was also updated to provide information about the license benefit usage by Azure SQL Database and SQL Managed Instance.

]]>

Moving to Ghost(Pro)

Andrew Matveychuk — Thu, 29 Dec 2022 18:00:44 GMT

A while ago, I wrote about how I run my blog with a self-hosted Ghost engine. Even though most processes were running on autopilot, they still required periodic attention and maintenance: the Ghost engine required updates, the blog theme needed to be fixed to be compatible with new engine versions, the DigitalOcean droplet hosting the blog demanded patching, and so on. The most annoying part was probably rebooting and troubleshooting the server when the site went down, which started to happen every two to four weeks.

A lot has changed since then, and I decided that it would be the right time to eliminate the administrative overhead of managing the underlying infrastructure. The pricing for Ghost(Pro) became more attractive and comparable with other cheaper self-hosting options. Plus, Im not very fond of spending time on something that is not of interest or a focus area for me.

Having said that, please expect the following changes:

No Facebook Comments anymore. I regret choosing them over other solutions back then, as they happened to be very clunky, error-prone and practically unsupported by Facebook.
Im sticking to the native Ghost Comments as of now.
Im abandoning the email newsletters via RSS, Zapier and Mailchimp. Although the related process worked very well, and the email campaigns worked just fine, I dont see much value in keeping such a complex setup. As most of my blog visitors come from other sources, the native email newsletters in Ghost should be enough to update my subscribers on new blog posts.

Please expect the changes in the newsletter design as displayed in the post image.

The new clean theme focuses more on the content and should improve overall readability and user experience on mobile devices.

Under the hood, the domain was migrated from DNSimple to Cloudflare, and the Cloudflare CDN was turned off as Ghost(Pro) manages the CDN. The hosting also handles the TLS encryption.

So, now I will spend less time running my blog and more time publishing new content 😊

Subscribe for new updates, and post your questions in the comments! 👇

]]>

Azure Policy Best Practices

Andrew Matveychuk — Tue, 13 Dec 2022 06:00:00 GMT

Mastering a new tool might be challenging, so here I will share my best practices for working with Azure Policy. Those tips are based on my experience and intended to complement my Azure Policy Starter Guide. Although most of them will be about pretty simple things, to my mind, its usually the basics that people tend to overlook and suffer the consequences of their neglect. So, lets start.

Use parameters

It might seem obvious that to make your code more reusable, you should use input parameters to control its behavior. Still, you can come across many examples, even with the built-in policies, when their authors didnt bother with parameters and hardcoded some property values in the policy code. In such cases, what could have been accomplished with a simple update of policy assignment parameters now requires updating your Azure Policy definition, probably removing the current policy assignments and the definition if its incompatible with the new one, deploying the new policy definition into your target scope, and creating new assignments for it. Sounds like a ton of work, right? Now, imagine you need to update a dozen of those non-parametrized policies. On a second or third occasion, its not fun at all, and you probably start thinking about making your policy definition more adjustable to changing business requirements.

Let me illustrate how parametrizing a single policy value can instantly make your custom Azure Policy twice as valuable. If you have worked with Azure Policy before, you might know that they can have different effects. The Audit effect is the most used and usually the simplest to implement. You might find many custom policy definitions where its defined as follows:

https://gist.github.com/andrewmatveychuk/1770e42e4f80104934bc326e9a79aa56

Thats what the official documentation says.

However, if you go the extra mile and check a few Azure Policy definitions implementing the same effect, you might notice that the policy developers tend to define the policy effect as an input policy parameter like that:

https://gist.github.com/andrewmatveychuk/02fc8cedf53e98730b02da8e98853f7b

Now, a simple switch flip can change your custom policy behavior from simply auditing for non-compliant resources to preventing them from deploying. Its the same policy rule and the same logic, but twice as helpful thanks to that simple change. Now, when a business tells you they are tired of chasing Azure resource owners or fixing non-compliant configurations and want to block any such deployments from happening, you can implement such a control in the blink of an eye.

Please keep it simple

Many ARM template functions can be used in policy rules to implement some advanced scenarios. You can reference resource properties, evaluate arrays, define conditional logic, work with strings, and do many other things. Sample Azure Policies for Guest Configuration or Kubernetes clusters can change your view of how powerful and complex(!) that tool can be. However, the complexity of your policy code doesnt mean its a good thing to do all the time.

For instance, I occasionally read my own Azure Policy code from six or so months ago and have difficulty understanding how particular fancy things work there. I often use my work notes and even read my old blog posts to recall why I liked them or how specific code works. Now, look at this from other peoples perspectives. Will it be easy for them to maintain and modify your custom policies in the future?

In my experience, 80% of using Azure Policy is all about simple stuff like auditing, modifying tags or creating guardrails for your cloud environments in Azure. So, why not keep its implementation simple, too? If its hard to do with Azure Policy, then its probably not the right tool for your task. Look at other Azure services, many of which have integrations with the policy APIs.

Another extreme, which is opposite to not using input parameters in Azure Policy, is defining parameters for everything and trying to make a Swiss knife out of your policy. Of course, you can specify the default values for those parameters so that you dont have to provide them whenever creating a policy assignment. Still, it makes your policy definition harder to read and understand. For example, if your policy logic is about Azure VMs, there is no need to supply that resource type as an input parameter. Or, if you want to audit Azure Hybrid Benefit usage across different eligible resource types, it might be better to define those controls in separate per-type policy definitions and combine their assignment with a policy initiative. That will make your solution cleaner and easier for other people to understand.

Test for side effects

I might sound repetitive in my posts about Azure Policy, but please do test the policies first before assigning them to a production scope. Its crucial in the case of the policies that modify, deploy or deny resources. Because most Azure Policy assignments happen at the subscription or management group levels, the blast radius of such changes is usually huge. Also, cloud adoption at organizations tends to evolve from prototyping with no or few rules first to catching up with cloud governance later. So, by the time you start putting your policies in place, hundreds or even thousands of existing resources might not comply with them.

Lets take, for example, the built-in policy that defines the list of allowed resource locations. Its commonly mentioned in cloud governance practices that suggest limiting the deployment of your Azure services to specific regions. There might be multiple reasons for that, but my point here is not about cloud governance. At first glance, it seems easy to assign that policy and be done with that control. However, if you have existing resources that dont comply with that policy, you will effectively block them from further modification. Thats because, from the Azure Resource Manager API perspective, the same write endpoints are used for creating and updating(!) existing resources. Now, we have a problem here.

In the case of community-crafted Azure Policies, you should be even more watchful. The fact that somebody already created a policy that seems to fit your needs ideally doesnt mean that you should go and assign it straight ahead. Its no different from copy-pasting a code from Stack Overflow without first understanding how it works. Many such policies might have errors in their logic or be not up to date with recent Azure API changes.

So, one more time always test your policy in a lab environment first. Even if youve worked with it before and know it inside out, there are likely to be edge cases that you havent seen yet, as every new environment might be slightly different.

Do use custom non-compliance messages

In the past, if some Azure Policy blocked an Azure resource deployment, you would get a generic error message and reference to the blocking policy at best. To say that it was frustrating is to say nothing. You need to look into deployment and activity logs to determine what that error is about and what you must fix to pass the policy checks. Developers doing their deployments from a console or an automated pipeline are usually required to go to engineers managing cloud infrastructure to troubleshoot such issues, as the console output was even less verbose.

That user experience improved significantly with the introduction of custom non-compliant messages for policy assignments and more verbose console output. Now, instead of a generic message that some policy blocked a resource deployment, you can provide a user with more helpful information like instructions on what exactly was wrong and how to fix that or a link to a knowledge base article with a detailed policy description and troubleshooting steps.

The functionality of custom non-compliance messages proved to be extremely helpful, especially for Azure Policies with the Deny effect. Please dont skip out on it. Moreover, I suggest putting double effort into crafting your custom messages so they provide other users with clear instructions on how to pass the policy validation. Another person coming to you with a question (or a support request) about a policy-blocked deployment should encourage you to revisit the corresponding non-compliance message until it has enough details for self-service.

Set up a CI/CD pipeline(s) for Azure Policy

Although Azure Policy doesnt introduce a separate programming language in a broad sense, they are still defined in code. Ive already authored a couple of articles on that, so feel free to check them:

As with any code, all DevOps-proven techniques, such as code versioning, automated testing, continuous integration, and continuous deployment, also apply to developing and maintaining your custom policy definitions and assignments. All the tooling for that is here and free to use. For instance, you can check my blog post on how to automate your policy deployment:

How to deploy Azure Policy from an Azure DevOps pipeline

Investing your time in setting up your automated deployment process for Azure Policy will drastically improve your development speed and increase your confidence in its quality. Even if you have no intention to develop your custom policies and plan to use the built-in ones only, its still reasonable to embed the creation of policy assignments in your CI/CD pipeline and use staging environments, quality checks, and gated approvals before deploying in production.

Of course, those tips are just a tiny fraction of the best practices to follow when working with Azure Policy, and I plan to update that list with other relevant highlights from my experience. If you think that I missed something important here, please feel free to comment and add your suggestion in the comment box below 👇

]]>

Audit and Enable Azure Hybrid Benefit using Azure Policy

Andrew Matveychuk — Tue, 29 Nov 2022 13:48:26 GMT

What is Azure Hybrid Benefit?

From the cost perspective, the resulting price for an Azure resource is calculated from multiple parts, software licenses being one of them. For example, when using the Azure Pricing Calculator and estimating your Azure VM cost, you might notice that the total row changes depending on the operating system and VM configuration type (OS only, SQL Server, etc.). For some of that software, you can apply Azure Hybrid Benefit, which basically allows you to bring your own license (BYOL, for short) and save on your cloud infrastructure spend in Azure.

Speaking of Microsoft software, its a great cost optimization case for enterprise customers that usually already have lots of Windows and SQL Server licenses as part of their Enterprise agreements. That is especially true for customers with a significant portion of their existing infrastructure running outside of Azure, presumably in their on-premises data centers, and migrating their workloads to Azure.

Different Microsoft products have different conditions and applicability logic for Azure Hybrid Benefit, and I will cover those details in the following sections.

What is Azure Policy?

Azure Policy is an Azure service that can be used to implement governance for resource consistency, regulatory compliance, security, cost, and management. In other words, its a framework that allows you to define rules for resource configuration, audit resource compliance with those rules, and enforce the rules by preventing the deployment of non-compliant resources or modifying existing resources so they become compliant with them.

That service is well-documented, so Im not going into the details about how it works here. If you need a quick overview, you can check my Azure Policy Starter Guide and Azure Policy-related content.

The question worth asking here is whether Azure Policy is a good tool for our purpose, which is auditing and enabling Azure Hybrid Benefit. Firstly, as you can see from the service definition, it is specifically designed for such tasks. Secondly, it implements a declarative approach for configuration description. Lastly, Azure Policy provides all necessary tools for reporting, decoupling policy definitions and assignments, and enforcing resource configuration. So, lets consider some practical examples.

Azure Hybrid Benefit for Windows Server VMs

Azure Hybrid Benefit applicability for Windows VMs is controlled by the licenseType property of an Azure VM. Basically, you can switch it on or off by updating the license type to Windows_Server or None. You dont even need to restart the VM for that, as it is a feature of Azure Compute service and not an operating system.

The Azure Policy rule that defines the applicability logic of that benefit to your Azure VMs may vary depending on how you define your eligible VM instances. Generally, you can use specific OS images as a filtering condition, as Windows Server licenses are usually related to specific OS versions. That way, you can define the list of the benefit-eligible OS images as an input parameter and update it dynamically as your license term changes over time. Sample policy rule to enumerate eligible Windows Server Azure VMs in a policy assignment scope might look as follows:

https://gist.github.com/andrewmatveychuk/90fb0568ca12fed0f9b8c0074ab98d6d

You might notice that the policy effect is also defined as a parameter here. I highly suggest using such an approach so you can change your policy assignment effect without editing the policy definition and go from auditing to preventing (denying) the deployment of eligible Azure VMs without Azure Hybrid Benefit enabled in a target scope.

Similarly, you can craft an Azure Policy to enroll the eligible Azure VMs into using the licensing benefit:

https://gist.github.com/andrewmatveychuk/80e2b2406bf963fad996f6c1704eb7d4

You can find the complete policy definitions in my Azure Policy GitHub repository. As I prefer using Bicep to write my Azure deployment templates, you might want to check my other post on how to deploy Azure Policy with Bicep.

Azure Hybrid Benefit for Windows Client VMs

The same Azure VM property controls Azure Hybrid Benefit for Windows Client VMs, but its called Multitenant Hosting Rights, allowing you to use your existing Windows 10/11 licenses in Virtual Desktop scenarios. To apply the benefit for Azure VMs running the desktop OS, the license type should be set to Windows_Client.

From the Azure Policy perspective, you can use the same filtering approach as with Windows Server Azure VMs but target Windows Client OS images specifically:

https://gist.github.com/andrewmatveychuk/7e4e102b84b30f416cdb886d4fa6cac2

Just pay attention to a different image publisher and another license type here.

With virtual desktop scenarios like Azure Virtual Desktop, especially in the personal host pools scenario, its even more important to automate the configuration of provisioned hosts with enabled Multitenant Hosting Rights to reduce your AVD costs if they are eligible to use that benefit, which is a typical case for customers having Microsoft 365 E3, E5 and other entitlement licenses. You can achieve this on the fly when a new personal AVD host is provisioned using the Modify effect of Azure Policy.

My GitHub repository contains the complete policy definitions to audit and enable Azure Hybrid Benefit for Windows Client VMs.

Azure Hybrid Benefit for SQL Server VMs

SQL virtual machines in Azure can utilize your SQL Server licensed cores to achieve a similar licensing benefit.

Note that Azure Hybrid Benefit applies to Standard and Enterprise SQL Server editions only.

From the technical point of view, your SQL Server VMs must be appropriately registered with the SQL IaaS Agent Extension. Also, the applicability of the benefit is controlled by the sqlServerLicenseType property of the SQL Server VM resource. It should be set to:

AHUB for the Azure Hybrid Benefit
PAYG for pay-as-you-go
DR to activate the free HA/DR replica

A sample Azure Policy rule to evaluate SQL Server VMs for Azure Hybrid Benefit usage can be as follows:

https://gist.github.com/andrewmatveychuk/7b0b3f3b922b1724985371d6330a0d5d

Interestingly, you can apply Azure Hybrid Benefit for an underlying Windows Server VM and SQL Server instance running on it, provided that you have all corresponding licenses, archiving even more substantial savings. So, consider that when estimating potential Windows/SQL hosting costs in Azure.

Suppose you want to logically combine your configuration rules (policies) for both Windows Server and SQL Server licensing benefits to apply them at once. In that case, I suggest using Policy Initiatives, aka Policy Definition Sets. Such a solution will be more flexible and modular to manage.

For a complete policy definition, check my GitHub repository.

In conclusion

Technically speaking, similar configurations can be enforced using the corresponding parameters in Bicep definitions for your resources or running an Azure PowerShell one-liner to modify resource properties. However, the primary disadvantage of those approaches is that they require additional effort to prevent configuration drift and ensure compliance at scale. Having all your infrastructure defined in code (IaC) is good, but being able to test and validate it for compliance is even better. In the case of Azure PowerShell (or Azure CLI), manually running a script is not enough to ensure resource compliance in an ever-changing environment with thousands of resources. You need to schedule regular jobs to validate resource configuration, report on non-compliant resources, and develop a mitigation strategy.

Similarly, Azure Policy is not a silver bullet, as it also has limitations. For example, to comply with your license agreements, you should keep track of the number of applied Azure Hybrid benefits and how they convert to the number of licensed OS and SQL Server cores. Implementing such logic with Azure Policy seems overcomplicated as it wasnt intended for such tasks. In one of the upcoming posts, I plan to cover how you can achieve that using Azure Monitor Workbooks. If you dont want to miss that update, please subscribe to my new posts using the form below 👇

]]>

How to monitor Azure Reservations utilization

Andrew Matveychuk — Mon, 03 Oct 2022 13:56:20 GMT

UPDATE. On May 23, 2023, Microsoft announced a built-in functionality for alerting about underused Azure Reservations. If you don't need your reporting in a specific format and are okay with a standard notification template from Azure, it's totally fine to use that option in your FinOps processes.

Using Azure Reservations is one of many techniques that can optimize your cloud spending in Azure. They are a great option to reduce costs for Azure resources with a long lifespan and predictable utilization. Have a VM, storage, or database that will run permanently? Buy a reservation for it, set it up to auto-renew upon expiration, and enjoy savings of up to 70% or so. Does it sound too good to be true? Lets find out.

For small setups with few cloud resources or stable infrastructure with little changes, the easiness of using the reservations is mostly true. However, for large enterprise-scale environments, the reality is a bit different. The larger the environment, the more changes in it you inevitably have. Apart from that, the flexible nature of cloud services, which can be provisioned and decommissioned in a matter of minutes and not weeks or months, provides even more opportunities for engineers to modify their solutions and change their infrastructure requirements. Something intended to run for the next few years can be gone in a few months or replaced with other infrastructure components.

All those changes in a cloud infrastructure create the challenge of managing Azure Reservations efficiently. As the reservations are implicit commitments to pay for a fixed amount of cloud resources over one or three years, its essential to utilize those resources fully. Otherwise, you will be paying for resources you actually dont consume, and all your savings from using those reservations will be diminished.

Luckily for us, Microsoft allows exchanging existing reservations for new ones. Of course, there are some limitations, but its still more optimal to return unused reservation units and buy new reservations for services that you do use. The obvious first question here is how you know what your reservations are underutilized and to what extent.

Azure Reservations utilization info

Microsoft provides various options to check reservation utilization after its purchase. You can check it on the Reservation blade of the Azure Portal, use the corresponding section in the Cost Management + Billing interface, or run some ready-to-use Power BI reports. Also, you can use Azure PowerShell and Azure CLI to get some reservation utilization details. Apart from that, you can utilize the Reservations insight preview feature of the Azure Cost Analysis.

However, all those options have a common disadvantage none of them has built-in functionality to automatically notify or alert when utilization for Azure Reservations drops below a certain value. Even though the platform collects data about reservation utilization, and you can check it in the reservation details on the portal, its not available as a metric in Azure Monitor.

One of the options to get the reservation utilization info programmatically is to use Microsoft Cost Management APIs. On the one hand, with such an API, you are not limited to vendor-provided monitoring options only, and you can use any custom or third-party monitoring solution that can query a REST API, parse a JSON payload, and extract the average utilization percentage value from it to be compared with some threshold. On the other hand, something that could be a simple Azure Monitor metric alert requires a whole monitoring infrastructure for that and overcomplicates the monitoring setup.

Monitoring Azure Reservations utilization with Power Automate

As a decrease in Azure Reservations utilization is not something that you usually need to act upon immediately, it might make more sense to build an automated reporting that would provide you with insights about underused reservations on a regular basis related to your FinOps processes. So, by the time you are reviewing your monthly cloud invoices and analyzing your cloud spend, you have the data about what Azure Reservations require assessment and possible exchange or cancellation.

To build such a solution, I decided to try using Power Automate cloud flow. Its user-friendly and usually available for most Microsoft customers as part of their Microsoft 365 plans or as Logic apps in their Azure subscriptions.

Alternatively, you can implement a similar logic using Automation Runbooks or Azure Functions. My choice of a Power Automate flow (aka Logic app) here is merely a matter of quickly prototyping a solution that can be maintained and modified by people with no specific programming knowledge.

For a start, we can use the mentioned Cost Management APIs to get the list of all reservations with their usage summary, and the HTTP action can help us with that:

Be sure to use the correct enrolment number in the URI, which you can check on the Azure Portal or the Enterprise Portal (for EA customers only). Also, the API authentication might be tricky here, as you need to generate/get the API Access Key on the Enterprise portal first.

Then, you should provide that key in the specific format in the request Authorization header.

The successful API call will result in a JSON array, provided that you have some Reserved Instance purchases within the target enrollment.

Depending on your needs, you can switch to the daily grain when querying the API. However, from the practical point of view, its rather suitable for an in-depth analysis than for regular monitoring/reporting.

To parse that JSON output into an array we can manipulate, we can use the Parse JSON action, which is a common data operation in Power Automate:

The parsing requires a schema for understanding the structure of a JSON payload. You can generate it by yourself from your sample Reserved Instance usage details payload or use the following schema, which I already prepared:

https://gist.github.com/andrewmatveychuk/88ec45ab135e28dadd76fc89d5a95574

The resulting array will contain the details for all your Reserved Instance purchases regardless of their utilization levels. Putting all that data into your utilization report might be unnecessary, as we are primarily interested in reservations that are not fully utilized in other words, reservations with less than 100% utilization. However, as I explain next, it is not always necessary to achieve full utilization to benefit from the savings Azure Reservations provides.

Bad or good, there is no one universal threshold for reservation utilization that can be considered a best practice. Thats because, depending on the reservation terms, the savings from its usage may vary greatly. For example, in one case, the savings from using Azure Reserved VM Instances purchased for one year for a specific Azure VM SKU or instance flexibility size can be approximately 40%. In another case, purchasing reserved instances for a three-year term for the same Azure VM size(s) can result in more than 60% savings if utilized to the full extent.

Technically, if the utilization of your reservation with 60% projected savings drops by 50%, you are still saving 10% using it compared to on-demand costs. In contrast, if the utilization of the 40% reservation drops by the same amount, you are in the red.

Of course, that doesnt mean you should target such low savings by applying Azure Reservations. For analysis purposes, you can rely on the guidelines provided by Microsoft on the Reservation blade of the Azure Portal:

For the sake of this demo, lets use the 90% utilization threshold to filter the reservations that require our attention. The Filter array action is to help us here:

The avgUtilizationPercentage property is provided as a float, so be sure to use the corresponding expression to compare it with your value.

Next, the resulting array will still contain data about reservation utilization details for past months that might have little value for us if we already took some action on them. It would be better to focus on the last (or current) month, depending on what day of the month you run this report.

In my case, I run the report on the first day of each month, so it makes sense to see the reservation utilization details for the previous month only:

You can achieve that by filtering the usageDate property and using the following expression to get last months value in the same format as in the API response:

addToTime(utcNow(), -1, 'month', 'yyyy-MM')

Optionally, if you want to make your report cleaner, you can remove unnecessary data and select only properties you want to display in the resulting set with the Select action:

Now, when your resulting list of Azure Reservations is ready, you can deliver it to your target audience.

To keep it simple, we will send the flow output as an email notification. For that, we need to convert the resulting list into an HTML table using the Create HTML table action and then insert the formatted table into the message body:

Lastly, to put the reporting on autopilot, you can set it to run on a schedule using the Recurrence trigger:

Unfortunately, there is no easy way to schedule the trigger for a specific day of a month yet, so you can set it to execute daily with the following trigger condition:

@equals(utcNow('dd'), '01')

The final flow structure:

The result of your efforts might look like in the following example:

In conclusion

Monitoring your Azure Reservations utilization is really important as it explicitly correlates with the numbers in your cloud invoices. It should be a regular part of your FinOps practices in Azure, along with analyzing opportunities to purchase new Azure Reservations for the services in use. While Azure Advisor can suggest which reservation to purchase, the rebalancing of existing ones is completely up to you. Having some food for thought, like the list of Azure Reservations you dont use fully, and the list of new reservations to purchase, is the first step in the optimization process.

If you have questions about using Azure Reservations or would like to see more content about Azure FinOps practices in general, let me know in the comments! 👇

]]>

How to update the OAuth2 permission grants for an Azure AD service principal

Andrew Matveychuk — Mon, 24 Jan 2022 12:35:28 GMT

When you register a third-party application in your tenant, a service principal is created in your Azure AD to represent it. That service principal is used to define the permissions granted to the application and the permissions to manage the app itself in the tenant.

As a common practice, an application vendor defines all the permissions the app requires to function correctly, and you grant consent for those permissions while registering the app in a tenant. In some cases, e.g., working in regulated industries or environments with high-security requirements, you might need to reduce the list of permissions requested during the app registration to specific ones only.

In this blog post, I will share some tips on how to update the OAuth2 permission grants for an existing application (a service principal, if you want to be technically precise) in your tenant.

Please note that changing (revoking) OAuth2 permission grants for an app might impact its functionality. If you are to modify permissions to a third-party application, make sure to consult with its vendor first.

Problem statement

Suppose you explore the options available to you to review application permissions on the Azure portal. In that case, you might notice that most of them allow only removing all OAuth2 permission grants at once. The sample PowerShell script provided by Microsoft uses the AzureAD cmdlets that dont have the option to modify the existing grants:

The recently updated Update-AzADServicePrincipalcmdlet uses Microsoft Graph API and doesnt have sufficient documentation on constructing input parameters to make it work properly. The same lack of documentation affects the Update-MgOauth2PermissionGrant cmdlet from the Microsoft Graph PowerShell SDK, making using it quite problematic, too. My attempts to use the Microsoft.Graph cmdlets ended up with an error stating that the scope property defining the grants is read-only and cannot be updated.

Using the Azure CLI for the task doesnt help either, as it can create new grants but not modify the existing ones.

Solution

After unsuccessful attempts to use the Azure Portal, Azure PowerShell, Microsoft Graph PowerShell SDK and Azure CLI to update the OAuth2 permission grants, the only option left was to work with the Microsoft Graph REST API, which provides an endpoint for updating the delegated permission grants.

The resulting execution sequence to update the OAuth2 permission grants for an Azure AD service principal will be the following:

Go to Graph Explorer and log in with the admin account for your tenant. Consent for the 'Directory.ReadWrite.All' permissions.
Search for the Client ID of the service principal you want to update the grants to:
GET https://graph.microsoft.com/v1.0/servicePrincipals?$search="displayName:your_app_name"

Pay attention to the ConsistencyLevel header that must be set to eventual.

Search for the OAuth2 permission grants by the Client ID you identified:
GET https://graph.microsoft.com/v1.0/oauth2PermissionGrants

[Optional] Get the specific grant only
GET https://graph.microsoft.com/v1.0/oauth2PermissionGrants/{id}
Where ID is the unique identifier of the grant and NOT the Client ID of your app. It has a format different from the common GUID-like identifiers.
Update the required grant with a new scope by using the PATCH method:
PATCH https://graph.microsoft.com/v1.0/oauth2PermissionGrants/{id}
and providing a request body in the following format:
{ "scope": " offline_access Directory.AccessAsUser.All Directory.Read.All Group.ReadWrite.All Group.Read.All GroupMember.Read.All GroupMember.ReadWrite.All User.Read.All User.Read" }
The scope property of the request must enumerate the complete list of granted permissions separated by spaces.
[Optional] Verify that the OAuth2 permission grant was updated with the scope provided earlier:
GET https://graph.microsoft.com/v1.0/oauth2PermissionGrants/{id}

Additionally, you can check that the Azure portal displays the updated app permissions, too.

I hope that the workaround will be helpful for anyone looking to modify existing app permissions without completely revoking them or deleting the app registration.

If you want to get recent updates to my blog, please subscribe with your email using the form below 👇

]]>

How to connect to Azure Database for MySQL from Ghost container

Andrew Matveychuk — Mon, 06 Sep 2021 15:39:20 GMT

Continuing the topic of hosting Ghost on Azure, I decided to document some nuances of connecting to Azure Database for MySQL from a Ghost Docker container hosted on Azure Web Apps for Containers. Moreover, I will shed some light on doing it securely from any Node.js application in this post.

If you are more interested in the technical details, feel free to skip my reasoning for using Azure Database for MySQL as a backend for Ghost in the first section and scroll straight to the second one.

Why fall back to a managed MySQL server from the multi-container app

In Dec 2020, when I came up with the idea of hosting Ghost on Azure as a multi-container app using Docker Compose support in Azure App Service, that configuration worked fine. I reduced the application hosting costs by hosting MySQL as a container on the same App Service plan as the Ghost container. However, starting from May 2021, I received a few notifications indicating that the original deployment configuration stopped working, and the following error occurred in App Service container logs:

ER_HOST_NOT_PRIVILEGED: Host 172.X.X.X is not allowed to connect to this MySQL server

Nothing was changed in the deployment configuration, and I started looking for possible root causes of that issue.

As the multi-container support on App Service is still in preview, changes to that service might have caused the containers to crash. Unfortunately, I didnt find any mentions of service updates in the official docs or on community resources.

Another possible cause of the error is some updates to the MySQL container image I was using the image with the mysql:5.7 tag and didnt stick to a specific build in that minor version.

The attempts to redeploy the app using a specific MySQL container image that was up to date when I created the original Docker Compose configuration didnt succeed. Neither successful were attempts to upgrade the setup to using a MySQL 8 container. Also, it is worth mentioning that the same configurations that produced errors while running on App Service worked like a charm in a local Docker Desktop environment, so the error wasnt related to the well-known MySQL permission nuance when a user account can connect to the server from localhost only.

After troubleshooting the issue for a few days, to say that I was confused was to say nothing. Continuing to play a game of numbers when you dont even know the details of how Docker networking works in App Service didnt seem reasonable. So, it was high time to migrate to something more stable, supported and well-documented.

Setting up Azure Database for MySQL

You can definitely go to the Azure portal and use the provisioning wizard to create a new instance of the Azure Database for MySQL. However, that approach didnt fit my idea of an unattended, reproducible, easy-to-use one-click deployment concept. As I already implemented that concept with ARM templates and the Deploy to Azure button, it was natural to refactor the existing deployment templates to use a managed MySQL database instead of the MySQL container. Apart from that, as Azure Bicep a new declarative language for Azure resource deployment reached its general availability, I decided to migrate my deployment configuration to that new format completely.

The Bicep team has already published many Bicep snippets for Azure resources, so you can use them as a reference.

After some trials and errors, my resulting configuration for Azure Database for MySQL resource looked like the following:

https://gist.github.com/andrewmatveychuk/658beb5b8e7e2ecf29e81c362a57fd8e

Leaving the Bicep syntax nuances aside, the resource configuration has a few important aspects to consider. First, it enforces encryption for server connections and sets the minimal version of TLS for incoming connections. Secondly, it configures server firewall rules to accept incoming connections from Azure services only. Lastly, it uses password authentication.

The database password is specified during a template deployment and saved in an Azure Key Vault. The App Service hosting the Ghost application uses that password via Azure Key vault secret reference to authenticate to the database.

Now, lets look into the updated Ghost app configuration.

Configuring Ghost app for Azure Database for MySQL

The Ghost team provided comprehensive documentation on configuring a Ghost database connection to MySQL. That documentation even contains a detailed description of different client SSL options. To pass that nested JSON key structure as configuration for Azure App Service, all semicolons should be replaced by __ (double underscore) to represent the same configuration as key-value pairs in the app settings.

Apart from that, Ghost, as a Node.js application, also relies on the native runtime TLS options. That said, in the SSL section of your config, you can specify all tls.createSecureContext options. For example, in addition to providing your custom certificate (if needed), you can set the minimum and maximum TLS versions to be used when establishing connections:

https://gist.github.com/andrewmatveychuk/f872c6b1cbde374c7a544fd60d72d0b8

Pay attention that the secureProtocol configuration option widely referenced on the Internet is deprecated and should not be used.

Please note that the Baltimore CyberTrust Root certificate is included in the Mozilla Included CA Certificate List. As documented in the Azure Database for MySQL documentation (Configure SSL), you dont need to download and embed it in your Ghost configuration.

Try sample configuration

After much work on migrating from ARM templates to Bicep and configuring the Ghost app to use Azure Database for MySQL, the new one-click Ghost deployment on Azure App Service is finally ready. You can quickly try it out by clicking the Deploy to Azure button in the project repository on GitHub.

As Bicep makes it much easier to build modular templates, I discontinued two separate templates for deployments with Azure CDN and Azure Front Door and created the single deployment templates with conditional deployment logic you need to select the desired option:

If you experience any issues with the deployment, please submit an issue on GitHub or post it in the comments below 👇

Stay tuned to hear about my impressions on the new authoring experience for Azure deployments using Azure Bicep 😉

]]>

How to deploy to different tenants with Azure DevOps

Andrew Matveychuk — Wed, 30 Jun 2021 07:25:02 GMT

Is it possible to deploy to an Azure subscription in another Azure AD tenant with Azure DevOps? How can I configure my Azure Resource Manager service connections in Azure DevOps to point to different tenants? Can I configure multi-tenant deployments with Azure DevOps? I hear those questions occasionally, so lets try to answer them in this post.

Although most organizations, especially with centralized IT management, prefer to build and operate their infrastructure within a single Azure AD tenant, there are still enough corner cases when you need to span your deployment process across multiple tenants. For example, some enterprises prefer completely isolating their development/test environments, including identity providers. Others, like managed service providers (aka MSP), usually provide their services to multiple clients and, therefore, have to operate in a multi-tenant environment.

Azure DevOps, a common choice for application lifecycle management when a company mainly works in the Microsoft ecosystem, can be used to set up your CI/CD processes along with underlying infrastructure provisioning in the Azure cloud so that deployments are performed consistently, repeatably, and automatically, establishing the foundation for the Flow in DevOps methodology.

Here and now on, I refer to Azure DevOps Services and not to on-premises Azure DevOps Server.

Connecting your Azure DevOps organization to an Azure AD tenant usually occurs under the table while onboarding to Azure DevOps services, and creating service connections to Azure subscriptions in the same tenant is pretty intuitive when you follow the default service recommendations. On the contrary, when you need to configure your deployment into an Azure subscription bound to a tenant different from the one used in your Azure DevOps configuration settings, things become a bit trickier. So, let me clear the air first about the common misconception about the relationship between Azure AD and Azure DevOps.

Azure AD and Azure DevOps

To clarify, connecting Azure DevOps to Azure Active Directory has relatively little to do with Azure Resource Manager service connections. That connection just provides a means for your users to authenticate to Azure DevOps using the same credentials they use to log in to Office 365 and other Microsoft services used in the organization.

Azure AD tenant, which is an instance of Azure Active Directory services, provides cloud-based Identity as a Service (IDaaS) for your organization. From user experience, the integration between Azure AD and Azure DevOps simplifies the configuration of the service connections when speaking of Azure services. For instance, when creating a new Azure Resource Manager service connection in Azure DevOps, users can see all Azure subscriptions they can access that tenant. However, those connections require using separate identities, aka service principals, to function.

Be aware that reconnecting your Azure DevOps organization to another Azure AD tenant is a somewhat destructive action. That way, you change the identity provider for Azure DevOps as an application. It was not designed to help you with multi-tenant deployment scenarios.

Now, lets talk about what setups you can use to connect from Azure DevOps to Azure services in another tenant.

Azure Resource Manager service connection with an existing service principal

When creating an Azure Resource Manager service connection, you can choose to configure one using an existing service principal. Moreover, that service principal doesnt have to be in the same tenant your Azure DevOps organization is connected to.

For that scenario, you need to know the ID and name of the target Azure subscription (or management group if you plan to deploy in that scope), a target Azure AD tenant ID and a service principal ID with its key or certificate in that tenant to use in the connection. Of course, in a fresh new environment, you must create that service principal first. Nevertheless, that approach is well-documented and straightforward.

The main advantage of this approach is that the setup time is minimal, plus the configuration is performed only in two places a target Azure AD tenant and your Azure DevOps service connection. Still, that initial simplicity has its maintenance cost. When you have only a few service principals in a few different tenants to maintain, the overhead of refreshing expired service principal keys or certificates and updating the corresponding service connections is insignificant. However, when you have hundreds of tenants in management and even more Azure DevOps service connections to them, their maintenance might become a never-ending and time-consuming routine.

By default, when you create an Azure Resource Manager service connection using automated security, Azure DevOps creates an app registration with a secret thats valid for two years. When creating an app registration with service principal manually to use for Azure DevOps connection, you can increase that expiration period or even set the secret to never expire so that you dont need to go back and update them. However, I would not recommend doing that from a security perspective.

Azure DevOps service connections with Azure Lighthouse

Another approach to connecting Azure DevOps to resources in a different tenant is using delegated access via Azure Lighthouse. Although it was not explicitly designed to address the challenge of deploying to different or multi-tenant environments, Azure Lighthouse provides excellent options for unifying access experiences in multi-tenant environments.

Conceptually, when you onboard Azure subscriptions (or specific resource groups) to Azure Lighthouse for delegated management, you provide particular sets of permissions to them for identities from another (management) tenant. Those identities can be individual user accounts, security groups, or service principals. For example, you can configure delegated access to another tenant for specific security groups in your tenant, which is a suggested practice, and include a service principal used by Azure DevOps service connections into one or a few of those groups to get the permissions you need for your deployments.

In that scenario, you still need to manually create a service principal for use in Azure DevOps service connections, but you provision it in your (management) tenant that is connected to your Azure DevOps organization.

Note that with Azure Lighthouse, you cannot delegate the Owner role or manage access to resources in delegated tenant except for limited use in assigning roles to managed identities. If your deployment scenarios require some automated access configurations, you might consider the previously described approach or the next one.

The benefit of this configuration is that you can create a single or a few service principals that you can reuse when connecting to different tenants, effectively making your architecture more scalable. You will need to spend less time in the future on regenerating the keys and updating the connections. What is more, when the target tenant is offboarding from external management via Azure Lighthouse, that will remove all the accesses to resources in that tenant that were granted to external Azure DevOps service connections, so you dont have to worry about revoking access on some hell-knows-what service principals.

Implementing such a setup requires more time to initialize, provide delegated access via Azure Lighthouse, and maintain delegations relevant to your deployment scenarios.

Azure DevOps service connections and Administer on Behalf of (AOBO)

Administer on Behalf of, or AOBO for short is a functionality of the CSP program that makes all users with the Admin Agent role in your tenant owners of the subscriptions that you create through the CSP program. So, technically you can add your service principal(s) provisioned for Azure DevOps in your tenant to the Admin agents group to grant them access to the subscriptions in managed tenants.

That approach is quite similar to the delegated access via Azure Lighthouse, but its less flexible. Basically, it follows the all-or-nothing principle. Once identity is included in the Admin agents group, it has complete access to all resources in all subscriptions managed by a CSP. While with Azure Lighthouse, you have an opportunity to configure more granular and least-privileged access.

On the positive side, as the members of the Admin agents group are granted the Owner permissions to the subscriptions, they are free of Azure Lighthouses limitation of controlling access to managed resources.

Obviously, configuring Azure DevOps service connections via AOBO is limited only to the subscriptions in your CSP program. In other words, its not suitable for deploying into subscriptions that arent linked to your CSP organization.

The three mentioned approaches to deploying into a different tenant(s) with Azure DevOps arent exclusive. Azure DevOps as a platform can offer you many other options and integrations with external apps and services. For example, you might choose to store your connection credentials to other tenants in secure storage like Azure Key Vault and fetch them in your Azure Pipelines to consume in deployment PowerShell scripts. Also, you can trigger deployment by some external service like Octopus Deploy or Jenkins. The final choice of the deployment method will depend on your existing infrastructure configuration and requirements for the desired solution.

And what is your experience with deploying to different tenants in Azure DevOps? Share it in the comments below 👇

]]>