Commit

feat: [PAYMCLOUD-192] Add OpenCost Terraform module for AKS deployment (#395)

* Add OpenCost Terraform module for AKS deployment

This commit introduces a Terraform module to deploy OpenCost on AKS with Azure Managed Identities and Prometheus integration. It includes resources for role definitions, role assignments, Kubernetes configuration, and Helm chart deployments for OpenCost and Prometheus. Documentation (README) and input validation for variables are also provided.

* Refactor variable usage and simplify role naming logic

Replaced `prefix` variable with `project` and moved logic for `env_short` and `location` into `locals`. Removed unused variables to clean up the code. Adjusted Helm provider version to use a broader compatible range.

* Update Helm provider version constraint in Terraform

Changed the Helm provider version constraint from "~> 2.0.0" to ">= 2.0.0" to allow newer versions. The new constraint keeps 2.0.0 as the minimum but sets no upper bound, so future releases, including new major versions, are accepted.

* Configure external Prometheus URL for OpenCost

Add a new Helm value that sets `opencost.prometheus.external.url` to the address of the Prometheus service within the Kubernetes cluster. This lets OpenCost connect to Prometheus through the external URL setting.

* Refactor Prometheus variables into a single configuration object

Replaced standalone Prometheus variables with a unified `prometheus_config` object to simplify configuration and improve maintainability. Adjusted references in the main Terraform file to use the new structure. Updated default values and descriptions accordingly.

* Swap `service_port` and `chart_version` in Prometheus config

Reordered the variables in the `prometheus_config` object to align with expected types and defaults. This ensures clarity and maintains consistency in the configuration structure. No functional behavior is affected.

* Remove Helm deployment for prometheus-opencost-exporter.

Commented out the Helm release block for prometheus-opencost-exporter in the main Terraform configuration. This change effectively disables its deployment while retaining the code for potential future use.

* Enable ServiceMonitor for metrics collection

Added `metrics.serviceMonitor.enabled` configuration in the Helm chart setup to activate ServiceMonitor. This ensures metrics are properly collected and integrated with Prometheus.

* Replace OpenCost Helm chart with Prometheus OpenCost Exporter

Switched from the deprecated OpenCost Helm chart to the Prometheus OpenCost Exporter chart. Updated resource definitions to align with the new chart repository and version, and adjusted configurations where necessary. This ensures better compatibility and alignment with Prometheus ecosystem standards.

* Add ServiceMonitor and cleanup Helm deployment block

Introduced a ServiceMonitor resource for Prometheus to scrape OpenCost metrics. Removed the commented-out Helm deployment block for clarity and maintenance. Refined comments and output descriptions for better readability.

* Add cost_analysis_enabled variable for Kubernetes cluster

Introduced a new variable `cost_analysis_enabled` that enables cost analysis for the Kubernetes cluster when set to true. The feature adds namespace and deployment details to the Cost Analysis views in the Azure portal and requires `sku_tier` to be Standard or Premium. Defaults to false for backward compatibility.

* Disable ServiceMonitor resource for OpenCost.

Commented out the ServiceMonitor resource configuration, which disables it. The change may have been needed to prevent conflicts, or because of deprecation or operational adjustments.

* Remove commented-out ServiceMonitor resource block

The ServiceMonitor resource block was unused and commented out. Removing it helps to clean up the configuration and improve readability of the Terraform file.
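
As a rough illustration of how the module added by this commit might be consumed, the sketch below calls it from a root configuration. The `source` path, the module label `opencost`, and every value are assumptions for illustration only; the input names (`project`, `env`, `aks_name`, `aks_rg_name`, `kubernetes_namespace`) are the ones declared in `kubernetes_opencosts/99_variables.tf`.

```hcl
# Hypothetical caller: the source path and all values are placeholders.
module "opencost" {
  source = "./kubernetes_opencosts" # assumed local path to the module

  project = "cstar" # module default; validation allows at most 6 characters
  env     = "dev"   # validation allows at most 3 characters; the first one becomes env_short

  aks_name    = "example-aks"    # placeholder AKS cluster name
  aks_rg_name = "example-aks-rg" # placeholder resource group name

  # The module reads this namespace through a data source, so it must already exist.
  kubernetes_namespace = "monitoring"
}
```

The caller is presumably still responsible for configuring the `kubernetes` and `helm` providers, since the module itself only declares a `provider "azurerm"` block.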
ffppa authored Jan 2, 2025
1 parent bfddd6f commit b9dd50d
Showing 8 changed files with 286 additions and 0 deletions.
1 change: 1 addition & 0 deletions kubernetes_cluster/01_main.tf
@@ -76,6 +76,7 @@ resource "azurerm_kubernetes_cluster" "this" {

  workload_identity_enabled = var.workload_identity_enabled
  oidc_issuer_enabled       = local.oidc_issuer_enabled
  cost_analysis_enabled     = var.cost_analysis_enabled

  dynamic "network_profile" {
    for_each = var.network_profile != null ? [var.network_profile] : []
8 changes: 8 additions & 0 deletions kubernetes_cluster/99_variables.tf
@@ -311,6 +311,14 @@ variable "addon_azure_pod_identity_enabled" {
  default = false
}

# The sku_tier must be set to Standard or Premium to enable this feature.
# Enabling this will add Kubernetes Namespace and Deployment details to the Cost Analysis views in the Azure portal.
variable "cost_analysis_enabled" {
  type        = bool
  default     = false
  description = "(Optional) Should cost analysis be enabled for this Kubernetes Cluster? Defaults to false."
}

#
# 📄 Logs
#
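
The comment on `cost_analysis_enabled` says the feature needs `sku_tier` set to Standard or Premium, but the diff does not enforce it. One possible guard is sketched below as a standalone fragment. It assumes Terraform 1.5 or later (for `check` blocks) and assumes the module exposes a `sku_tier` string variable, which is not shown in this diff; both variables are redeclared here only so the fragment is self-contained.

```hcl
# Sketch only, not part of the commit.
# Assumption: the module has a `sku_tier` variable; its real default may differ.
variable "sku_tier" {
  type    = string
  default = "Free"
}

variable "cost_analysis_enabled" {
  type    = bool
  default = false
}

# Requires Terraform >= 1.5. The condition holds when cost analysis is off,
# or when it is on and the tier is one of the two supported values.
check "cost_analysis_requires_paid_tier" {
  assert {
    condition     = !var.cost_analysis_enabled || contains(["Standard", "Premium"], var.sku_tier)
    error_message = "cost_analysis_enabled = true requires sku_tier to be Standard or Premium."
  }
}
```

A failed `check` is reported as a warning and does not block the plan or apply, so this only surfaces the mismatch early; the Azure API remains the final authority.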
1 change: 1 addition & 0 deletions kubernetes_cluster/README.md
@@ -564,6 +564,7 @@ No modules.
| <a name="input_addon_azure_policy_enabled"></a> [addon\_azure\_policy\_enabled](#input\_addon\_azure\_policy\_enabled) | Should the Azure Policy addon be enabled for this Node Pool? | `bool` | `false` | no |
| <a name="input_alerts_enabled"></a> [alerts\_enabled](#input\_alerts\_enabled) | Should Metrics Alert be enabled? | `bool` | `true` | no |
| <a name="input_automatic_channel_upgrade"></a> [automatic\_channel\_upgrade](#input\_automatic\_channel\_upgrade) | (Optional) The upgrade channel for this Kubernetes Cluster. Possible values are patch, rapid, node-image and stable. Omitting this field sets this value to none. | `string` | `null` | no |
| <a name="input_cost_analysis_enabled"></a> [cost\_analysis\_enabled](#input\_cost\_analysis\_enabled) | (Optional) Should cost analysis be enabled for this Kubernetes Cluster? Defaults to false. | `bool` | `false` | no |
| <a name="input_custom_logs_alerts"></a> [custom\_logs\_alerts](#input\_custom\_logs\_alerts) | Map of name = criteria objects | <pre>map(object({<br/> # (Optional) Specifies the display name of the alert rule.<br/> display_name = string<br/> # (Optional) Specifies the description of the scheduled query rule.<br/> description = string<br/> # Assuming each.value includes this attribute for Kusto Query Language (KQL)<br/> query = string<br/> # (Required) Severity of the alert. Should be an integer between 0 and 4.<br/> # Value of 0 is severest.<br/> severity = number<br/> # (Required) Specifies the period of time in ISO 8601 duration format on<br/> # which the Scheduled Query Rule will be executed (bin size).<br/> # If evaluation_frequency is PT1M, possible values are PT1M, PT5M, PT10M,<br/> # PT15M, PT30M, PT45M, PT1H, PT2H, PT3H, PT4H, PT5H, and PT6H. Otherwise,<br/> # possible values are PT5M, PT10M, PT15M, PT30M, PT45M, PT1H, PT2H, PT3H,<br/> # PT4H, PT5H, PT6H, P1D, and P2D.<br/> window_duration = optional(string)<br/> # (Optional) How often the scheduled query rule is evaluated, represented<br/> # in ISO 8601 duration format. Possible values are PT1M, PT5M, PT10M, PT15M,<br/> # PT30M, PT45M, PT1H, PT2H, PT3H, PT4H, PT5H, PT6H, P1D.<br/> evaluation_frequency = string<br/> # Evaluation operation for rule - 'GreaterThan', GreaterThanOrEqual',<br/> # 'LessThan', or 'LessThanOrEqual'.<br/> operator = string<br/> # Result or count threshold based on which rule should be triggered.<br/> # Values must be between 0 and 10000 inclusive.<br/> threshold = number<br/> # (Required) The type of aggregation to apply to the data points in<br/> # aggregation granularity. Possible values are Average, Count, Maximum,<br/> # Minimum,and Total.<br/> time_aggregation_method = string<br/> # (Optional) Specifies the column containing the resource ID. The content<br/> # of the column must be an uri formatted as resource ID.<br/> resource_id_column = optional(string)<br/><br/> # (Optional) Specifies the column containing the metric measure number.<br/> metric_measure_column = optional(string)<br/><br/> dimension = list(object(<br/> {<br/> # (Required) Name of the dimension.<br/> name = string<br/> # (Required) Operator for dimension values. Possible values are<br/> # Exclude,and Include.<br/> operator = string<br/> # (Required) List of dimension values. Use a wildcard * to collect all.<br/> values = list(string)<br/> }<br/> ))<br/><br/> # (Required) Specifies the number of violations to trigger an alert.<br/> # Should be smaller or equal to number_of_evaluation_periods.<br/> # Possible value is integer between 1 and 6.<br/> minimum_failing_periods_to_trigger_alert = number<br/> # (Required) Specifies the number of aggregated look-back points.<br/> # The look-back time window is calculated based on the aggregation<br/> # granularity window_duration and the selected number of aggregated points.<br/> # Possible value is integer between 1 and 6.<br/> number_of_evaluation_periods = number<br/><br/> # (Optional) Specifies the flag that indicates whether the alert should<br/> # be automatically resolved or not. Value should be true or false.<br/> # The default is false.<br/> auto_mitigation_enabled = optional(bool)<br/> # (Optional) Specifies the flag which indicates whether this scheduled<br/> # query rule check if storage is configured. Value should be true or false.<br/> # The default is false.<br/> workspace_alerts_storage_enabled = optional(bool)<br/> # (Optional) Specifies the flag which indicates whether the provided<br/> # query should be validated or not. The default is false.<br/> skip_query_validation = optional(bool)<br/> }))</pre> | `{}` | no |
| <a name="input_custom_metric_alerts"></a> [custom\_metric\_alerts](#input\_custom\_metric\_alerts) | Map of name = criteria objects | <pre>map(object({<br/> # criteria.*.aggregation to be one of [Average Count Minimum Maximum Total]<br/> aggregation = string<br/> # "Insights.Container/pods" "Insights.Container/nodes"<br/> metric_namespace = string<br/> metric_name = string<br/> # criteria.0.operator to be one of [Equals NotEquals GreaterThan GreaterThanOrEqual LessThan LessThanOrEqual]<br/> operator = string<br/> threshold = number<br/> # Possible values are PT1M, PT5M, PT15M, PT30M and PT1H<br/> frequency = string<br/> # Possible values are PT1M, PT5M, PT15M, PT30M, PT1H, PT6H, PT12H and P1D.<br/> window_size = string<br/> # Skip metrics validation<br/> skip_metric_validation = optional(bool, false)<br/><br/> dimension = list(object(<br/> {<br/> name = string<br/> operator = string<br/> values = list(string)<br/> }<br/> ))<br/> }))</pre> | `{}` | no |
| <a name="input_default_metric_alerts"></a> [default\_metric\_alerts](#input\_default\_metric\_alerts) | Map of name = criteria objects | <pre>map(object({<br/> # criteria.*.aggregation to be one of [Average Count Minimum Maximum Total]<br/> aggregation = string<br/> # (Optional) Specifies the description of the scheduled metric rule.<br/> description = optional(string)<br/> # "Insights.Container/pods" "Insights.Container/nodes"<br/> metric_namespace = string<br/> metric_name = string<br/> # criteria.0.operator to be one of [Equals NotEquals GreaterThan GreaterThanOrEqual LessThan LessThanOrEqual]<br/> operator = string<br/> threshold = number<br/> # Possible values are 0, 1, 2, 3 and 4. Defaults to 3.<br/> severity = optional(number)<br/> # Possible values are PT1M, PT5M, PT15M, PT30M and PT1H<br/> frequency = string<br/> # Possible values are PT1M, PT5M, PT15M, PT30M, PT1H, PT6H, PT12H and P1D.<br/> window_size = string<br/> # Skip metrics validation<br/> skip_metric_validation = optional(bool, false)<br/><br/><br/> dimension = list(object(<br/> {<br/> name = string<br/> operator = string<br/> values = list(string)<br/> }<br/> ))<br/> }))</pre> | <pre>{<br/> "node_cpu_usage_percentage": {<br/> "aggregation": "Average",<br/> "description": "High node cpu usage",<br/> "dimension": [<br/> {<br/> "name": "node",<br/> "operator": "Include",<br/> "values": [<br/> "*"<br/> ]<br/> }<br/> ],<br/> "frequency": "PT15M",<br/> "metric_name": "node_cpu_usage_percentage",<br/> "metric_namespace": "Microsoft.ContainerService/managedClusters",<br/> "operator": "GreaterThan",<br/> "severity": 2,<br/> "threshold": 80,<br/> "window_size": "PT1H"<br/> },<br/> "node_memory_working_set_percentage": {<br/> "aggregation": "Average",<br/> "description": "High node memory usage",<br/> "dimension": [<br/> {<br/> "name": "node",<br/> "operator": "Include",<br/> "values": [<br/> "*"<br/> ]<br/> }<br/> ],<br/> "frequency": "PT15M",<br/> "metric_name": "node_memory_working_set_percentage",<br/> "metric_namespace": "Microsoft.ContainerService/managedClusters",<br/> "operator": "GreaterThan",<br/> "severity": 2,<br/> "threshold": 80,<br/> "window_size": "PT1H"<br/> },<br/> "pods_failed": {<br/> "aggregation": "Average",<br/> "description": "Pod state phase failed",<br/> "dimension": [<br/> {<br/> "name": "phase",<br/> "operator": "Include",<br/> "values": [<br/> "Failed"<br/> ]<br/> },<br/> {<br/> "name": "namespace",<br/> "operator": "Include",<br/> "values": [<br/> "*"<br/> ]<br/> }<br/> ],<br/> "frequency": "PT15M",<br/> "metric_name": "kube_pod_status_phase",<br/> "metric_namespace": "Microsoft.ContainerService/managedClusters",<br/> "operator": "GreaterThan",<br/> "severity": 1,<br/> "threshold": 0,<br/> "window_size": "PT1H"<br/> }<br/>}</pre> | no |
11 changes: 11 additions & 0 deletions kubernetes_opencosts/00_data.tf
@@ -0,0 +1,11 @@

data "azurerm_kubernetes_cluster" "aks" {
  name                = var.aks_name
  resource_group_name = var.aks_rg_name
}

data "kubernetes_namespace" "monitoring" {
  metadata {
    name = var.kubernetes_namespace
  }
}
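
The module reads a Kubernetes namespace and installs a Helm release, but the diff does not show how the `kubernetes` and `helm` providers reach the cluster. A commonly used wiring is sketched below, fed from the `azurerm_kubernetes_cluster` data source above. It is an assumption, not the author's setup: it presumes the cluster's `kube_config` exposes client certificates (which may not hold when local accounts are disabled) and uses the nested `kubernetes { }` block syntax of the 2.x Helm provider.

```hcl
# Sketch only: provider wiring is not part of this commit.
# `local.kube` is a hypothetical helper name introduced for readability.
locals {
  kube = data.azurerm_kubernetes_cluster.aks.kube_config[0]
}

provider "kubernetes" {
  host                   = local.kube.host
  client_certificate     = base64decode(local.kube.client_certificate)
  client_key             = base64decode(local.kube.client_key)
  cluster_ca_certificate = base64decode(local.kube.cluster_ca_certificate)
}

# Block syntax as used by hashicorp/helm 2.x.
provider "helm" {
  kubernetes {
    host                   = local.kube.host
    client_certificate     = base64decode(local.kube.client_certificate)
    client_key             = base64decode(local.kube.client_key)
    cluster_ca_certificate = base64decode(local.kube.cluster_ca_certificate)
  }
}
```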
119 changes: 119 additions & 0 deletions kubernetes_opencosts/01_main.tf
@@ -0,0 +1,119 @@
locals {
  env_short = substr(var.env, 0, 1)
  location  = data.azurerm_kubernetes_cluster.aks.location
}

resource "azurerm_role_definition" "open_cost_role" {
  name        = "${var.project}-${local.env_short}-${local.location}-OpenCostRole"
  scope       = data.azurerm_subscription.current.id
  description = "Rate Card query role"
  permissions {
    actions = [
      "Microsoft.Compute/virtualMachines/vmSizes/read",
      "Microsoft.Resources/subscriptions/locations/read",
      "Microsoft.Resources/providers/read",
      "Microsoft.ContainerService/containerServices/read",
      "Microsoft.Commerce/RateCard/read"
    ]
    not_actions = []
  }
  assignable_scopes = [
    data.azurerm_subscription.current.id
  ]
}

# Create an Azure User-Assigned Managed Identity (UAMI)
resource "azurerm_user_assigned_identity" "opencost_identity" {
  name                = "${var.project}-${local.env_short}-${local.location}-opencost-managed-identity"
  location            = local.location
  resource_group_name = data.azurerm_kubernetes_cluster.aks.resource_group_name
}

# Assign role to UAMI
resource "azurerm_role_assignment" "opencost_identity_role" {
  principal_id         = azurerm_user_assigned_identity.opencost_identity.principal_id
  role_definition_name = azurerm_role_definition.open_cost_role.name
  scope                = data.azurerm_subscription.current.id
}

# Identity Details
output "managed_identity_details" {
  description = "Details of the User-Assigned Managed Identity for OpenCost"
  value = jsonencode({
    identity_id  = azurerm_user_assigned_identity.opencost_identity.id
    principal_id = azurerm_user_assigned_identity.opencost_identity.principal_id
    client_id    = azurerm_user_assigned_identity.opencost_identity.client_id
    subscription = data.azurerm_subscription.current.id
    tenant       = data.azurerm_client_config.current.tenant_id
  })
}

# Kubernetes Secret configs and identity
resource "kubernetes_secret" "azure_managed_identity_refs" {
  metadata {
    name      = "azure-managed-identity"
    namespace = data.kubernetes_namespace.monitoring.metadata[0].name
  }

  data = {
    "client-id"    = azurerm_user_assigned_identity.opencost_identity.client_id
    "principal-id" = azurerm_user_assigned_identity.opencost_identity.principal_id
    "identity-id"  = azurerm_user_assigned_identity.opencost_identity.id
    "tenant-id"    = data.azurerm_client_config.current.tenant_id
  }

  type = "Opaque"
}

# Helm deployment for "prometheus-opencost-exporter"
resource "helm_release" "prometheus_opencost_exporter" {
  name       = "prometheus-opencost-exporter"
  namespace  = data.kubernetes_namespace.monitoring.metadata[0].name
  chart      = "prometheus-opencost-exporter"
  repository = "https://prometheus-community.github.io/helm-charts"
  version    = "0.1.1" # Adjust the version as needed

  # Set additional values for the Helm chart if required
  set {
    name  = "extraVolumes[0].name"
    value = "azure-managed-identity-secret"
  }

  set {
    name  = "extraVolumes[0].secret.secretName"
    value = kubernetes_secret.azure_managed_identity_refs.metadata[0].name
  }

  set {
    name  = "opencost.exporter.extraVolumeMounts[0].mountPath"
    value = "/var/secrets"
  }

  set {
    name  = "opencost.exporter.extraVolumeMounts[0].name"
    value = "azure-managed-identity-secret"
  }

  set {
    name  = "opencost.prometheus.external.url"
    value = var.prometheus_config.external_url
  }

  set {
    name  = "opencost.prometheus.internal.namespaceName"
    value = var.prometheus_config.namespace
  }
  set {
    name  = "opencost.prometheus.internal.port"
    value = var.prometheus_config.service_port
  }
  set {
    name  = "opencost.prometheus.internal.serviceName"
    value = var.prometheus_config.service_name
  }

  set {
    name  = "metrics.serviceMonitor.enabled"
    value = "true"
  }
}
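
The release above passes `var.prometheus_config.external_url` straight through to `opencost.prometheus.external.url`, and the commit message describes that URL as pointing at the Prometheus service inside the cluster. A minimal sketch of how such a URL could be derived from the namespace, service name, and port is shown below. The local names are hypothetical, and the fragment assumes plain HTTP and the default cluster domain `cluster.local`; the values mirror the defaults of `prometheus_config`.

```hcl
# Sketch only: these locals are illustrative and not part of the module.
locals {
  prometheus_namespace    = "monitoring"
  prometheus_service_name = "prometheus-service"
  prometheus_service_port = "9090"

  # Standard in-cluster service DNS form: <service>.<namespace>.svc.<cluster domain>
  prometheus_in_cluster_url = format(
    "http://%s.%s.svc.cluster.local:%s",
    local.prometheus_service_name,
    local.prometheus_namespace,
    local.prometheus_service_port
  )
}

output "prometheus_in_cluster_url" {
  # Evaluates to "http://prometheus-service.monitoring.svc.cluster.local:9090"
  value = local.prometheus_in_cluster_url
}
```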
59 changes: 59 additions & 0 deletions kubernetes_opencosts/99_variables.tf
@@ -0,0 +1,59 @@
variable "project" {
  type    = string
  default = "cstar"
  validation {
    condition = (
      length(var.project) <= 6
    )
    error_message = "Max length is 6 chars."
  }
}

variable "env" {
  type = string
  validation {
    condition = (
      length(var.env) <= 3
    )
    error_message = "Max length is 3 chars."
  }
}

# AKS Variables
###################

variable "aks_name" {
  type        = string
  description = "(Required) Name of AKS cluster in Azure"
}

variable "aks_rg_name" {
  type        = string
  description = "(Required) Name of AKS cluster resource group in Azure"
}

variable "kubernetes_namespace" {
  type    = string
  default = "monitoring"
}

# Prometheus variables
########################

variable "prometheus_config" {
  type = object({
    service_port  = string
    external_url  = optional(string, "")
    namespace     = string
    service_name  = string
    chart_version = optional(string, "1.42.3")
  })
  description = "Configuration object for Prometheus deployment, including chart version, optional external URL, namespace, service name, service port, and other related settings."
  default = {
    namespace     = "monitoring"
    service_name  = "prometheus-service"
    service_port  = 9090
    chart_version = "1.42.3"
    external_url  = ""
  }
}
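
Because `prometheus_config` is a single object, a caller who overrides it replaces the whole default and has to supply every non-optional attribute (`namespace`, `service_name`, `service_port`); only `external_url` and `chart_version` fall back to their `optional(...)` defaults. The fragment below is a hypothetical `terraform.tfvars` entry showing this; the namespace, service name, port, and URL are made-up values.

```hcl
# Hypothetical override: all values are placeholders.
prometheus_config = {
  namespace    = "observability"
  service_name = "prometheus-server"
  service_port = "80"
  external_url = "http://prometheus-server.observability.svc.cluster.local:80"
  # chart_version is omitted on purpose, so optional(string, "1.42.3") supplies it.
}
```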
27 changes: 27 additions & 0 deletions kubernetes_opencosts/99_versions.tf
@@ -0,0 +1,27 @@
terraform {
  required_version = ">= 1.3.0"

  required_providers {
    helm = {
      source  = "hashicorp/helm"
      version = ">= 2.0.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "<= 2.33.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "<= 3.116.0"
    }
  }
}

provider "azurerm" {
  features {}
  # Configuration options
}

data "azurerm_subscription" "current" {}

data "azurerm_client_config" "current" {}
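
The commit message describes moving the Helm provider constraint from `"~> 2.0.0"` to `">= 2.0.0"`. The fragment below compares those two forms with a bounded alternative. The bounded constraint is a suggestion under the assumption that the module's `set { }` blocks target the 2.x provider syntax; it is not what the commit uses.

```hcl
# Sketch only: a bounded variant of the helm constraint, not the committed one.
terraform {
  required_providers {
    helm = {
      source = "hashicorp/helm"
      # "~> 2.0.0"          allows only 2.0.x patch releases.
      # ">= 2.0.0"          allows every later release, including new major versions.
      # ">= 2.0.0, < 3.0.0" keeps the wider range but stops before the next major version.
      version = ">= 2.0.0, < 3.0.0"
    }
  }
}
```

An upper bound trades some flexibility for protection against breaking changes in a future major release; whether that trade is worthwhile depends on how the module is pinned by its callers.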