# Copyright 2023 Canonical Ltd.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
15 """Prometheus Scrape Library.
19 This document explains how to integrate with the Prometheus charm
20 for the purpose of providing a metrics endpoint to Prometheus. It
21 also explains how alternative implementations of the Prometheus charms
22 may maintain the same interface and be backward compatible with all
23 currently integrated charms. Finally this document is the
24 authoritative reference on the structure of relation data that is
25 shared between Prometheus charms and any other charm that intends to
26 provide a scrape target for Prometheus.

Source code can be found on GitHub at:
https://github.com/canonical/prometheus-k8s-operator/tree/main/lib/charms/prometheus_k8s

Using this library requires you to fetch the juju_topology library from
[observability-libs](https://charmhub.io/observability-libs/libraries/juju_topology):

`charmcraft fetch-lib charms.observability_libs.v0.juju_topology`

## Provider Library Usage

This Prometheus charm interacts with its scrape targets using its
charm library. Charms seeking to expose metrics endpoints for the
Prometheus charm must do so using the `MetricsEndpointProvider`
object from this charm library. For the simplest use cases, using the
`MetricsEndpointProvider` object only requires instantiating it,
typically in the constructor of your charm (the one which exposes a
metrics endpoint). The `MetricsEndpointProvider` constructor requires
the name of the relation over which a scrape target (metrics endpoint)
is exposed to the Prometheus charm. This relation must use the
`prometheus_scrape` interface. By default, the address of the metrics
endpoint is set to the unit IP address by each unit of the
`MetricsEndpointProvider` charm. These units set their address in
response to the `PebbleReady` event of each container in the unit,
since container restarts of Kubernetes charms can result in changed
IP addresses. The default name for the metrics endpoint relation is
`metrics-endpoint`. It is strongly recommended to use the same
relation name for consistency across charms; doing so also obviates the
need for an additional constructor argument. The
`MetricsEndpointProvider` object may be instantiated as follows:

    from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointProvider

    def __init__(self, *args):
        super().__init__(*args)
        ...
        self.metrics_endpoint = MetricsEndpointProvider(self)

Note that the first argument (`self`) to `MetricsEndpointProvider` is
always a reference to the parent (scrape target) charm.

An instantiated `MetricsEndpointProvider` object will ensure that each
unit of its parent charm is a scrape target for the
`MetricsEndpointConsumer` (Prometheus) charm. By default,
`MetricsEndpointProvider` assumes each unit of the provider charm
exports its metrics at the `/metrics` path on port 80. These
defaults may be changed by providing the `MetricsEndpointProvider`
constructor an optional argument (`jobs`) that represents a
Prometheus scrape job specification using Python standard data
structures. This job specification is a subset of Prometheus' own
[scrape configuration](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)
format but represented using Python data structures. More than one job
may be provided using the `jobs` argument. Hence `jobs` accepts a list
of dictionaries where each dictionary represents one `<scrape_config>`
object as described in the Prometheus documentation. The currently
supported configuration subset is: `job_name`, `metrics_path`,
`static_configs`.

Suppose it is required to change the port on which scraped metrics are
exposed to 8000. This may be done by providing the following data
structure as the value of `jobs`.

    jobs = [
        {
            "static_configs": [
                {
                    "targets": ["*:8000"]
                }
            ]
        }
    ]

The wildcard ("*") host specification implies that the scrape targets
will automatically be set to the host addresses advertised by each
unit of the provider charm.

It is also possible to change the metrics path and scrape multiple
targets, for example:

    jobs = [
        {
            "metrics_path": "/my-metrics-path",
            "static_configs": [
                {
                    "targets": ["*:8000", "*:8081"],
                }
            ]
        }
    ]

More complex scrape configurations are possible. For example:

    jobs = [
        {
            "static_configs": [
                {
                    "targets": ["10.1.32.215:7000", "*:8000"],
                    "labels": {
                        "some_key": "some-value"
                    }
                }
            ]
        }
    ]

This example scrapes the target "10.1.32.215" at port 7000 in addition
to scraping each unit at port 8000. There is, however, one difference
between wildcard targets (specified using "*") and fully qualified
targets (such as "10.1.32.215"). The Prometheus charm automatically
associates labels with metrics generated by each target. These labels
localise the source of metrics within the Juju topology by specifying
its "model name", "model UUID", "application name" and "unit
name". The unit name label, however, is associated only with wildcard
targets, not with fully qualified targets.

Multiple jobs with different metrics paths and labels are allowed, but
each job must be given a unique name:

    jobs = [
        {
            "job_name": "my-first-job",
            "metrics_path": "one-path",
            "static_configs": [
                {
                    "targets": ["*:7000"],
                    "labels": {
                        "some_key": "some-value"
                    }
                }
            ]
        },
        {
            "job_name": "my-second-job",
            "metrics_path": "another-path",
            "static_configs": [
                {
                    "targets": ["*:8000"],
                    "labels": {
                        "some_other_key": "some-other-value"
                    }
                }
            ]
        }
    ]

**Important:** `job_name` should be a fixed string (e.g. a hardcoded literal).
If you include variable elements, like your `unit.name`, it may break
the continuity of the metrics time series gathered by Prometheus when the leader unit
changes (e.g. on upgrade or rescale).
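
As an illustrative sketch of the difference (not taken from the library itself):

    # Good: the job name is stable across leader changes and upgrades.
    jobs = [{"job_name": "my-app-metrics", "static_configs": [{"targets": ["*:8000"]}]}]

    # Risky: the job name varies with the unit that happens to set it,
    # breaking the continuity of the gathered time series.
    jobs = [{"job_name": self.unit.name, "static_configs": [{"targets": ["*:8000"]}]}]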

Additionally, it is also technically possible, but **strongly discouraged**, to
configure the following scrape-related settings, which behave as described by the
[Prometheus documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config):

- `scrape_interval`
- `scrape_timeout`
- `proxy_url`
- `relabel_configs`
- `metrics_relabel_configs`
- `sample_limit`
- `label_limit`
- `label_name_length_limit`
- `label_value_length_limit`

The settings above are supported by the `prometheus_scrape` library only for the sake of
specialized facilities like the [Prometheus Scrape Config](https://charmhub.io/prometheus-scrape-config-k8s)
charm. Virtually no charms should use these settings, and charmers definitely **should not**
expose them to the Juju administrator via configuration options.

## Consumer Library Usage

The `MetricsEndpointConsumer` object may be used by Prometheus
charms to manage relations with their scrape targets. For this
purpose, a Prometheus charm needs to do two things:

1. Instantiate the `MetricsEndpointConsumer` object by providing it a
reference to the parent (Prometheus) charm and optionally the name of
the relation that the Prometheus charm uses to interact with scrape
targets. This relation must conform to the `prometheus_scrape`
interface and it is strongly recommended that this relation be named
`metrics-endpoint`, which is its default value.

For example a Prometheus charm may instantiate the
`MetricsEndpointConsumer` in its constructor as follows:

    from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointConsumer

    def __init__(self, *args):
        super().__init__(*args)
        ...
        self.metrics_consumer = MetricsEndpointConsumer(self)

2. A Prometheus charm also needs to respond to the
`TargetsChangedEvent` event of the `MetricsEndpointConsumer` by adding itself as
an observer for these events, as in:

    self.framework.observe(
        self.metrics_consumer.on.targets_changed,
        self._on_scrape_targets_changed,
    )

In responding to the `TargetsChangedEvent` event the Prometheus
charm must update the Prometheus configuration so that any new scrape
targets are added and/or old ones removed from the list of scraped
endpoints. For this purpose the `MetricsEndpointConsumer` object
exposes a `jobs()` method that returns a list of scrape jobs. Each
element of this list is the Prometheus scrape configuration for that
job. In order to update the Prometheus configuration, the Prometheus
charm needs to replace the current list of jobs with the list provided
by `jobs()` as follows:

    def _on_scrape_targets_changed(self, event):
        ...
        scrape_jobs = self.metrics_consumer.jobs()
        for job in scrape_jobs:
            prometheus_scrape_config.append(job)
        ...
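
As a slightly fuller sketch, a handler might regenerate the whole `scrape_configs`
section and push it to the workload container. The names `PROMETHEUS_CONFIG_PATH`,
`self._base_config()` and `self.container` below are hypothetical placeholders,
not part of this library:

    def _on_scrape_targets_changed(self, event):
        # hypothetical: global sections of prometheus.yml, managed by the charm
        config = self._base_config()
        config["scrape_configs"] = self.metrics_consumer.jobs()
        self.container.push(PROMETHEUS_CONFIG_PATH, yaml.safe_dump(config), make_dirs=True)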

## Alerting Rules

This charm library also supports gathering alerting rules from all
related `MetricsEndpointProvider` charms and enabling corresponding alerts within the
Prometheus charm. Alert rules are automatically gathered by `MetricsEndpointProvider`
charms when using this library, from a directory conventionally named
`prometheus_alert_rules`. This directory must reside at the top level
in the `src` folder of the provider charm. Each file in this directory
is assumed to be in one of two formats:
- the official prometheus alert rule format, conforming to the
[Prometheus docs](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
- a single rule format, which is a simplified subset of the official format,
comprising a single alert rule per file, using the same YAML fields.

The file name must have one of the following extensions:
- `.rule`
- `.rules`
- `.yml`
- `.yaml`

An example of the contents of such a file in the custom single rule
format is shown below.

    alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{my_key=my_value} > 0.5
    for: 10m
    labels:
      severity: Medium
    annotations:
      summary: High request latency for {{ $labels.instance }}.

The `MetricsEndpointProvider` will read all available alert rules and
also inject "filtering labels" into the alert expressions. The
filtering labels ensure that alert rules are localised to the metrics
provider charm's Juju topology (application, model and its UUID). Such
a topology filter is essential to ensure that alert rules submitted by
one provider charm generate alerts only for that same charm. When
alert rules are embedded in a charm, and the charm is deployed as a
Juju application, the alert rules from that application have their
expressions automatically updated to filter for metrics coming from
the units of that application alone. This removes the risk of spurious
evaluation, e.g., when you have multiple deployments of the same charm
monitored by the same Prometheus.

Not all alerts one may want to specify can be embedded in a
charm. Some alert rules will be specific to a user's use case. This is
the case, for example, for alert rules that are based on business
constraints, like expecting a certain amount of requests to a specific
API every five minutes. Such alert rules can be specified via the
[COS Config Charm](https://charmhub.io/cos-configuration-k8s),
which allows importing alert rules and other settings like dashboards
from a Git repository.

Gathering alert rules and generating rule files within the Prometheus
charm is easily done using the `alerts()` method of
`MetricsEndpointConsumer`. Alerts generated by Prometheus will
automatically include Juju topology labels in the alerts. These labels
indicate the source of the alert. The following labels are
automatically included with each alert:

- `juju_model`
- `juju_model_uuid`
- `juju_application`

## Relation Data

The Prometheus charm uses both application and unit relation data to
obtain information regarding its scrape jobs, alert rules and scrape
targets. This relation data is in JSON format and it closely resembles
the YAML structure of Prometheus
[scrape configuration](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config).

Units of metrics provider charms advertise their names and addresses
over unit relation data using the `prometheus_scrape_unit_name` and
`prometheus_scrape_unit_address` keys. The `scrape_metadata`,
`scrape_jobs` and `alert_rules` keys in the application relation data
of metrics provider charms hold the eponymous information.
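
As an illustrative sketch (the field values below are made up, not a normative
example), the relation data for a single provider unit and its application might
look like:

    # unit relation data
    {
        "prometheus_scrape_unit_name": "myapp/0",
        "prometheus_scrape_unit_address": "10.1.32.215"
    }

    # application relation data (values are JSON-encoded strings)
    {
        "scrape_metadata": "{\"model\": ..., \"model_uuid\": ..., \"application\": ...}",
        "scrape_jobs": "[{\"metrics_path\": \"/metrics\", \"static_configs\": ...}]",
        "alert_rules": "{\"groups\": [...]}"
    }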
"""

import copy
import hashlib
import json
import logging
import os
import re
import subprocess
from collections import defaultdict
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
from urllib.error import HTTPError, URLError
from urllib.parse import urlparse
from urllib.request import urlopen

import yaml
from charms.observability_libs.v0.juju_topology import JujuTopology
from ops.charm import CharmBase, RelationRole
from ops.framework import (
    BoundEvent,
    EventBase,
    EventSource,
    Object,
    ObjectEvents,
    StoredDict,
    StoredList,
)
from ops.model import Relation

# The unique Charmhub library identifier, never change it
LIBID = "bc84295fef5f4049878f07b131968ee2"

# Increment this major API version when introducing breaking changes
LIBAPI = 0

# Increment this PATCH version before using `charmcraft publish-lib` or reset
# to 0 if you are raising the major API version

logger = logging.getLogger(__name__)
398 "metrics_relabel_configs",
401 "label_name_length_limit",
402 "label_value_length_limit",
408 "metrics_path": "/metrics",
409 "static_configs": [{"targets": ["*:80"]}],
413 DEFAULT_RELATION_NAME
= "metrics-endpoint"
414 RELATION_INTERFACE_NAME
= "prometheus_scrape"
416 DEFAULT_ALERT_RULES_RELATIVE_PATH
= "./src/prometheus_alert_rules"


class PrometheusConfig:
    """A namespace for utility functions for manipulating the prometheus config dict."""

    # relabel instance labels so that instance identifiers are globally unique and
    # stable over unit recreation
    topology_relabel_config = {
        "source_labels": ["juju_model", "juju_model_uuid", "juju_application"],
        "separator": "_",
        "target_label": "instance",
        "regex": "(.*)",
    }

    topology_relabel_config_wildcard = {
        "source_labels": ["juju_model", "juju_model_uuid", "juju_application", "juju_unit"],
        "separator": "_",
        "target_label": "instance",
        "regex": "(.*)",
    }

    @staticmethod
    def sanitize_scrape_config(job: dict) -> dict:
        """Restrict permissible scrape configuration options.

        If job is empty then a default job is returned. The default job is:

            {
                "metrics_path": "/metrics",
                "static_configs": [{"targets": ["*:80"]}],
            }

        Args:
            job: a dict containing a single Prometheus job
                specification.

        Returns:
            a dictionary containing a sanitized job specification.
        """
        sanitized_job = DEFAULT_JOB.copy()
        sanitized_job.update({key: value for key, value in job.items() if key in ALLOWED_KEYS})
        return sanitized_job
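
    # Illustrative usage (a sketch, not part of the library's API surface): keys
    # outside ALLOWED_KEYS are dropped, and defaults are kept unless overridden.
    #
    #     >>> PrometheusConfig.sanitize_scrape_config({"job_name": "foo", "bogus": 1})
    #     {'metrics_path': '/metrics', 'static_configs': [{'targets': ['*:80']}], 'job_name': 'foo'}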

    @staticmethod
    def sanitize_scrape_configs(scrape_configs: List[dict]) -> List[dict]:
        """A vectorized version of `sanitize_scrape_config`."""
        return [PrometheusConfig.sanitize_scrape_config(job) for job in scrape_configs]

    @staticmethod
    def prefix_job_names(scrape_configs: List[dict], prefix: str) -> List[dict]:
        """Adds the given prefix to all the job names in the given scrape_configs list."""
        modified_scrape_configs = []
        for scrape_config in scrape_configs:
            job_name = scrape_config.get("job_name")
            modified = scrape_config.copy()
            modified["job_name"] = prefix + "_" + job_name if job_name else prefix
            modified_scrape_configs.append(modified)

        return modified_scrape_configs
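
    # Illustrative usage (a sketch): jobs with a name get "<prefix>_" prepended,
    # while nameless jobs get the prefix as their whole name.
    #
    #     >>> PrometheusConfig.prefix_job_names([{"job_name": "api"}, {}], "juju_mymodel")
    #     [{'job_name': 'juju_mymodel_api'}, {'job_name': 'juju_mymodel'}]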

    @staticmethod
    def expand_wildcard_targets_into_individual_jobs(
        scrape_jobs: List[dict],
        hosts: Dict[str, Tuple[str, str]],
        topology: Optional[JujuTopology] = None,
    ) -> List[dict]:
        """Extract wildcard hosts from the given scrape_configs list into separate jobs.

        Args:
            scrape_jobs: list of scrape jobs.
            hosts: a dictionary mapping host names to host address for
                all units of the relation for which this job configuration
                must be generated.
            topology: optional arg for adding topology labels to scrape targets.
        """
        # hosts = self._relation_hosts(relation)

        modified_scrape_jobs = []
        for job in scrape_jobs:
            static_configs = job.get("static_configs")
            if not static_configs:
                continue

            # When a single unit specified more than one wildcard target, then they are expanded
            # into a static_config per target
            non_wildcard_static_configs = []

            for static_config in static_configs:
                targets = static_config.get("targets")
                if not targets:
                    continue

                # All non-wildcard targets remain in the same static_config
                non_wildcard_targets = []

                # All wildcard targets are extracted to a job per unit. If multiple wildcard
                # targets are specified, they remain in the same static_config (per unit).
                wildcard_targets = []

                for target in targets:
                    match = re.compile(r"\*(?:(:\d+))?").match(target)
                    if match:
                        # This is a wildcard target.
                        # Need to expand into separate jobs and remove it from this job here
                        wildcard_targets.append(target)
                    else:
                        # This is not a wildcard target. Copy it over into its own static_config.
                        non_wildcard_targets.append(target)

                # All non-wildcard targets remain in the same static_config
                if non_wildcard_targets:
                    non_wildcard_static_config = static_config.copy()
                    non_wildcard_static_config["targets"] = non_wildcard_targets

                    if topology:
                        # When non-wildcard targets (aka fully qualified hostnames) are specified,
                        # there is no reliable way to determine the name (Juju topology unit name)
                        # for such a target. Therefore labeling with Juju topology, excluding the
                        # unit name.
                        non_wildcard_static_config["labels"] = {
                            **non_wildcard_static_config.get("labels", {}),
                            **topology.label_matcher_dict,
                        }

                    non_wildcard_static_configs.append(non_wildcard_static_config)

                # Extract wildcard targets into individual jobs
                if wildcard_targets:
                    for unit_name, (unit_hostname, unit_path) in hosts.items():
                        modified_job = job.copy()
                        modified_job["static_configs"] = [static_config.copy()]
                        modified_static_config = modified_job["static_configs"][0]
                        modified_static_config["targets"] = [
                            target.replace("*", unit_hostname) for target in wildcard_targets
                        ]

                        unit_num = unit_name.split("/")[-1]
                        job_name = modified_job.get("job_name", "unnamed-job") + "-" + unit_num
                        modified_job["job_name"] = job_name
                        modified_job["metrics_path"] = unit_path + (
                            job.get("metrics_path") or "/metrics"
                        )

                        if topology:
                            # Add topology labels
                            modified_static_config["labels"] = {
                                **modified_static_config.get("labels", {}),
                                **topology.label_matcher_dict,
                                **{"juju_unit": unit_name},
                            }

                            # Instance relabeling for topology should be last in order.
                            modified_job["relabel_configs"] = modified_job.get(
                                "relabel_configs", []
                            ) + [PrometheusConfig.topology_relabel_config_wildcard]

                        modified_scrape_jobs.append(modified_job)

            if non_wildcard_static_configs:
                modified_job = job.copy()
                modified_job["static_configs"] = non_wildcard_static_configs
                modified_job["metrics_path"] = modified_job.get("metrics_path") or "/metrics"

                if topology:
                    # Instance relabeling for topology should be last in order.
                    modified_job["relabel_configs"] = modified_job.get("relabel_configs", []) + [
                        PrometheusConfig.topology_relabel_config
                    ]

                modified_scrape_jobs.append(modified_job)

        return modified_scrape_jobs
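
    # Illustrative usage (a sketch): a job with a wildcard target is expanded into
    # one job per unit in `hosts`, with "*" replaced by that unit's address.
    #
    #     >>> PrometheusConfig.expand_wildcard_targets_into_individual_jobs(
    #     ...     [{"job_name": "api", "static_configs": [{"targets": ["*:8000"]}]}],
    #     ...     {"app/0": ("10.0.0.1", "")},
    #     ... )
    #     [{'job_name': 'api-0', 'static_configs': [{'targets': ['10.0.0.1:8000']}], 'metrics_path': '/metrics'}]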

    @staticmethod
    def render_alertmanager_static_configs(alertmanagers: List[str]):
        """Render the alertmanager static_configs section from a list of URLs.

        Each target must be in the hostname:port format, and prefixes are specified in a separate
        key. Therefore, with ingress in place, one would need to extract the path into the
        `path_prefix` key, which is higher up in the config hierarchy.

        https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config

        Args:
            alertmanagers: List of alertmanager URLs.

        Returns:
            A dict representation for the static_configs section.
        """
        # Make sure it's a valid url so urlparse could parse it.
        scheme = re.compile(r"^https?://")
        sanitized = [am if scheme.search(am) else "http://" + am for am in alertmanagers]

        # Create a mapping from paths to netlocs
        # Group alertmanager targets into a dictionary of lists:
        # {path: [netloc1, netloc2]}
        paths = defaultdict(list)  # type: Dict[str, List[str]]
        for parsed in map(urlparse, sanitized):
            path = parsed.path or "/"
            paths[path].append(parsed.netloc)

        return {
            "alertmanagers": [
                {"path_prefix": path_prefix, "static_configs": [{"targets": netlocs}]}
                for path_prefix, netlocs in paths.items()
            ]
        }
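
    # Illustrative usage (a sketch): URLs are grouped by path prefix, and a scheme
    # is assumed for bare host:port targets.
    #
    #     >>> PrometheusConfig.render_alertmanager_static_configs(
    #     ...     ["http://am.example:9093/ingress", "am2.example:9093"]
    #     ... )
    #     {'alertmanagers': [{'path_prefix': '/ingress', 'static_configs': [{'targets': ['am.example:9093']}]}, {'path_prefix': '/', 'static_configs': [{'targets': ['am2.example:9093']}]}]}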


class RelationNotFoundError(Exception):
    """Raised if no relation with the given name is found."""

    def __init__(self, relation_name: str):
        self.relation_name = relation_name
        self.message = "No relation named '{}' found".format(relation_name)

        super().__init__(self.message)


class RelationInterfaceMismatchError(Exception):
    """Raised if the relation with the given name has a different interface."""

    def __init__(
        self,
        relation_name: str,
        expected_relation_interface: str,
        actual_relation_interface: str,
    ):
        self.relation_name = relation_name
        self.expected_relation_interface = expected_relation_interface
        self.actual_relation_interface = actual_relation_interface
        self.message = (
            "The '{}' relation has '{}' as interface rather than the expected '{}'".format(
                relation_name, actual_relation_interface, expected_relation_interface
            )
        )

        super().__init__(self.message)


class RelationRoleMismatchError(Exception):
    """Raised if the relation with the given name has a different role."""

    def __init__(
        self,
        relation_name: str,
        expected_relation_role: RelationRole,
        actual_relation_role: RelationRole,
    ):
        self.relation_name = relation_name
        self.expected_relation_role = expected_relation_role
        self.actual_relation_role = actual_relation_role
        self.message = "The '{}' relation has role '{}' rather than the expected '{}'".format(
            relation_name, repr(actual_relation_role), repr(expected_relation_role)
        )

        super().__init__(self.message)


class InvalidAlertRuleEvent(EventBase):
    """Event emitted when alert rule files are not parsable.

    Enables us to set a clear status on the provider.
    """

    def __init__(self, handle, errors: str = "", valid: bool = False):
        super().__init__(handle)
        self.errors = errors
        self.valid = valid

    def snapshot(self) -> Dict:
        """Save alert rule information."""
        return {
            "valid": self.valid,
            "errors": self.errors,
        }

    def restore(self, snapshot):
        """Restore alert rule information."""
        self.valid = snapshot["valid"]
        self.errors = snapshot["errors"]


class InvalidScrapeJobEvent(EventBase):
    """Event emitted when scrape jobs are not valid."""

    def __init__(self, handle, errors: str = ""):
        super().__init__(handle)
        self.errors = errors

    def snapshot(self) -> Dict:
        """Save error information."""
        return {"errors": self.errors}

    def restore(self, snapshot):
        """Restore error information."""
        self.errors = snapshot["errors"]


class MetricsEndpointProviderEvents(ObjectEvents):
    """Events raised by `MetricsEndpointProvider`."""

    alert_rule_status_changed = EventSource(InvalidAlertRuleEvent)
    invalid_scrape_job = EventSource(InvalidScrapeJobEvent)


def _type_convert_stored(obj):
    """Convert Stored* to their appropriate types, recursively."""
    if isinstance(obj, StoredList):
        return list(map(_type_convert_stored, obj))
    if isinstance(obj, StoredDict):
        rdict = {}  # type: Dict[Any, Any]
        for k in obj.keys():
            rdict[k] = _type_convert_stored(obj[k])
        return rdict
    return obj


def _validate_relation_by_interface_and_direction(
    charm: CharmBase,
    relation_name: str,
    expected_relation_interface: str,
    expected_relation_role: RelationRole,
):
    """Verifies that a relation has the necessary characteristics.

    Verifies that the `relation_name` provided: (1) exists in metadata.yaml,
    (2) declares as interface the interface name passed as `relation_interface`
    and (3) has the right "direction", i.e., it is a relation that `charm`
    provides or requires.

    Args:
        charm: a `CharmBase` object to scan for the matching relation.
        relation_name: the name of the relation to be verified.
        expected_relation_interface: the interface name to be matched by the
            relation named `relation_name`.
        expected_relation_role: whether the `relation_name` must be either
            provided or required by `charm`.

    Raises:
        RelationNotFoundError: If there is no relation in the charm's metadata.yaml
            with the same name as provided via `relation_name` argument.
        RelationInterfaceMismatchError: The relation with the same name as provided
            via `relation_name` argument does not have the same relation interface
            as specified via the `expected_relation_interface` argument.
        RelationRoleMismatchError: If the relation with the same name as provided
            via `relation_name` argument does not have the same role as specified
            via the `expected_relation_role` argument.
    """
    if relation_name not in charm.meta.relations:
        raise RelationNotFoundError(relation_name)

    relation = charm.meta.relations[relation_name]

    actual_relation_interface = relation.interface_name
    if actual_relation_interface != expected_relation_interface:
        raise RelationInterfaceMismatchError(
            relation_name, expected_relation_interface, actual_relation_interface
        )

    if expected_relation_role == RelationRole.provides:
        if relation_name not in charm.meta.provides:
            raise RelationRoleMismatchError(
                relation_name, RelationRole.provides, RelationRole.requires
            )
    elif expected_relation_role == RelationRole.requires:
        if relation_name not in charm.meta.requires:
            raise RelationRoleMismatchError(
                relation_name, RelationRole.requires, RelationRole.provides
            )
    else:
        raise Exception("Unexpected RelationDirection: {}".format(expected_relation_role))


class InvalidAlertRulePathError(Exception):
    """Raised if the alert rules folder cannot be found or is otherwise invalid."""

    def __init__(
        self,
        alert_rules_absolute_path: Path,
        message: str,
    ):
        self.alert_rules_absolute_path = alert_rules_absolute_path
        self.message = message

        super().__init__(self.message)


def _is_official_alert_rule_format(rules_dict: dict) -> bool:
    """Are alert rules in the upstream format as supported by Prometheus.

    Alert rules in dictionary format are in "official" form if they
    contain a "groups" key, since this implies they contain a list of
    alert rule groups.

    Args:
        rules_dict: a set of alert rules in Python dictionary format

    Returns:
        True if alert rules are in official Prometheus file format.
    """
    return "groups" in rules_dict


def _is_single_alert_rule_format(rules_dict: dict) -> bool:
    """Are alert rules in single rule format.

    The Prometheus charm library supports reading of alert rules in a
    custom format that consists of a single alert rule per file. This
    does not conform to the official Prometheus alert rule file format
    which requires that each alert rules file consists of a list of
    alert rule groups and each group consists of a list of alert
    rules.

    Alert rules in dictionary form are considered to be in single rule
    format if, at the very least, they contain the two keys corresponding
    to the alert rule name and alert expression.

    Returns:
        True if alert rule is in single rule file format.
    """
    # one alert rule per file
    return set(rules_dict) >= {"alert", "expr"}
846 """Utility class for amalgamating prometheus alert rule files and injecting juju topology.
848 An `AlertRules` object supports aggregating alert rules from files and directories in both
849 official and single rule file formats using the `add_path()` method. All the alert rules
850 read are annotated with Juju topology labels and amalgamated into a single data structure
851 in the form of a Python dictionary using the `as_dict()` method. Such a dictionary can be
852 easily dumped into JSON format and exchanged over relation data. The dictionary can also
853 be dumped into YAML format and written directly into an alert rules file that is read by
854 Prometheus. Note that multiple `AlertRules` objects must not be written into the same file,
855 since Prometheus allows only a single list of alert rule groups per alert rules file.
857 The official Prometheus format is a YAML file conforming to the Prometheus documentation
858 (https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/).
859 The custom single rule format is a subsection of the official YAML, having a single alert
860 rule, effectively "one alert per file".
863 # This class uses the following terminology for the various parts of a rule file:
864 # - alert rules file: the entire groups[] yaml, including the "groups:" key.
865 # - alert groups (plural): the list of groups[] (a list, i.e. no "groups:" key) - it is a list
866 # of dictionaries that have the "name" and "rules" keys.
867 # - alert group (singular): a single dictionary that has the "name" and "rules" keys.
868 # - alert rules (plural): all the alerts in a given alert group - a list of dictionaries with
869 # the "alert" and "expr" keys.
870 # - alert rule (singular): a single dictionary that has the "alert" and "expr" keys.
872 def __init__(self
, topology
: Optional
[JujuTopology
] = None):
873 """Build and alert rule object.
876 topology: an optional `JujuTopology` instance that is used to annotate all alert rules.
878 self
.topology
= topology
879 self
.tool
= CosTool(None)
880 self
.alert_groups
= [] # type: List[dict]

    def _from_file(self, root_path: Path, file_path: Path) -> List[dict]:
        """Read a rules file from path, injecting juju topology.

        Args:
            root_path: full path to the root rules folder (used only for generating group name)
            file_path: full path to a *.rule file.

        Returns:
            A list of dictionaries representing the rules file, if file is valid (the structure is
            formed by `yaml.safe_load` of the file); an empty list otherwise.
        """
        with file_path.open() as rf:
            # Load a list of rules from file then add labels and filters
            try:
                rule_file = yaml.safe_load(rf)

            except Exception as e:
                logger.error("Failed to read alert rules from %s: %s", file_path.name, e)
                return []

            if not rule_file:
                logger.warning("Empty rules file: %s", file_path.name)
                return []
            if not isinstance(rule_file, dict):
                logger.error("Invalid rules file (must be a dict): %s", file_path.name)
                return []
            if _is_official_alert_rule_format(rule_file):
                alert_groups = rule_file["groups"]
            elif _is_single_alert_rule_format(rule_file):
                # convert to list of alert groups
                # group name is made up from the file name
                alert_groups = [{"name": file_path.stem, "rules": [rule_file]}]
            else:
                # invalid/unsupported
                logger.error("Invalid rules file: %s", file_path.name)
                return []

            # update rules with additional metadata
            for alert_group in alert_groups:
                # update group name with topology and sub-path
                alert_group["name"] = self._group_name(
                    str(root_path),
                    str(file_path),
                    alert_group["name"],
                )

                # add "juju_" topology labels
                for alert_rule in alert_group["rules"]:
                    if "labels" not in alert_rule:
                        alert_rule["labels"] = {}

                    if self.topology:
                        alert_rule["labels"].update(self.topology.label_matcher_dict)
                        # insert juju topology filters into a prometheus alert rule
                        alert_rule["expr"] = self.tool.inject_label_matchers(
                            re.sub(r"%%juju_topology%%,?", "", alert_rule["expr"]),
                            self.topology.label_matcher_dict,
                        )

            return alert_groups

    def _group_name(self, root_path: str, file_path: str, group_name: str) -> str:
        """Generate group name from path and topology.

        The group name is made up of the relative path between the root dir_path, the file path,
        and topology identifier.

        Args:
            root_path: path to the root rules dir.
            file_path: path to rule file.
            group_name: original group name to keep as part of the new augmented group name

        Returns:
            New group name, augmented by juju topology and relative path.
        """
        rel_path = os.path.relpath(os.path.dirname(file_path), root_path)
        rel_path = "" if rel_path == "." else rel_path.replace(os.path.sep, "_")

        # Generate group name:
        #  - name, from juju topology
        #  - suffix, from the relative path of the rule file;
        group_name_parts = [self.topology.identifier] if self.topology else []
        group_name_parts.extend([rel_path, group_name, "alerts"])
        # filter to remove empty strings
        return "_".join(filter(None, group_name_parts))
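
    # Illustrative behavior (a sketch), with no topology set:
    #
    #     >>> AlertRules()._group_name("/rules", "/rules/nested/my.rule", "latency")
    #     'nested_latency_alerts'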

    @classmethod
    def _multi_suffix_glob(
        cls, dir_path: Path, suffixes: List[str], recursive: bool = True
    ) -> list:
        """Helper function for getting all files in a directory that have a matching suffix.

        Args:
            dir_path: path to the directory to glob from.
            suffixes: list of suffixes to include in the glob (items should begin with a period).
            recursive: a flag indicating whether a glob is recursive (nested) or not.

        Returns:
            List of files in `dir_path` that have one of the suffixes specified in `suffixes`.
        """
        all_files_in_dir = dir_path.glob("**/*" if recursive else "*")
        return list(filter(lambda f: f.is_file() and f.suffix in suffixes, all_files_in_dir))

    def _from_dir(self, dir_path: Path, recursive: bool) -> List[dict]:
        """Read all rule files in a directory.

        All rules from files for the same directory are loaded into a single
        group. The generated name of this group includes juju topology.
        By default, only the top directory is scanned; for nested scanning, pass `recursive=True`.

        Args:
            dir_path: directory containing *.rule files (alert rules without groups).
            recursive: flag indicating whether to scan for rule files recursively.

        Returns:
            a list of dictionaries representing prometheus alert rule groups, each dictionary
            representing an alert group (structure determined by `yaml.safe_load`).
        """
        alert_groups = []  # type: List[dict]

        # Gather all alerts into a list of groups
        for file_path in self._multi_suffix_glob(
            dir_path, [".rule", ".rules", ".yml", ".yaml"], recursive
        ):
            alert_groups_from_file = self._from_file(dir_path, file_path)
            if alert_groups_from_file:
                logger.debug("Reading alert rule from %s", file_path)
                alert_groups.extend(alert_groups_from_file)

        return alert_groups

    def add_path(self, path: str, *, recursive: bool = False) -> None:
        """Add rules from a dir path.

        All rules from files are aggregated into a data structure representing a single rule file.
        All group names are augmented with juju topology.

        Args:
            path: either a rules file or a dir of rules files.
            recursive: whether to read files recursively or not (no impact if `path` is a file).
        """
        path = Path(path)  # type: Path
        if path.is_dir():
            self.alert_groups.extend(self._from_dir(path, recursive))
        elif path.is_file():
            self.alert_groups.extend(self._from_file(path.parent, path))
        else:
            logger.debug("Alert rules path does not exist: %s", path)

    def as_dict(self) -> dict:
        """Return standard alert rules file in dict representation.

        Returns:
            a dictionary containing a single list of alert rule groups.
            The list of alert rule groups is provided as value of the
            "groups" dictionary key.
        """
        return {"groups": self.alert_groups} if self.alert_groups else {}
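
    # Illustrative usage (a sketch of how the provider side employs this class;
    # "./src/prometheus_alert_rules" is the conventional rules directory):
    #
    #     alert_rules = AlertRules(topology=JujuTopology.from_charm(charm))
    #     alert_rules.add_path("./src/prometheus_alert_rules", recursive=True)
    #     relation_payload = json.dumps(alert_rules.as_dict())  # ready for relation data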


class TargetsChangedEvent(EventBase):
    """Event emitted when Prometheus scrape targets change."""

    def __init__(self, handle, relation_id):
        super().__init__(handle)
        self.relation_id = relation_id

    def snapshot(self):
        """Save scrape target relation information."""
        return {"relation_id": self.relation_id}

    def restore(self, snapshot):
        """Restore scrape target relation information."""
        self.relation_id = snapshot["relation_id"]


class MonitoringEvents(ObjectEvents):
    """Event descriptor for events raised by `MetricsEndpointConsumer`."""

    targets_changed = EventSource(TargetsChangedEvent)


class MetricsEndpointConsumer(Object):
    """A Prometheus based Monitoring service."""

    on = MonitoringEvents()

    def __init__(self, charm: CharmBase, relation_name: str = DEFAULT_RELATION_NAME):
        """A Prometheus based Monitoring service.

        Args:
            charm: a `CharmBase` instance that manages this
                instance of the Prometheus service.
            relation_name: an optional string name of the relation between `charm`
                and the Prometheus charmed service. The default is "metrics-endpoint".
                It is strongly advised not to change the default, so that people
                deploying your charm will have a consistent experience with all
                other charms that consume metrics endpoints.

        Raises:
            RelationNotFoundError: If there is no relation in the charm's metadata.yaml
                with the same name as provided via `relation_name` argument.
            RelationInterfaceMismatchError: The relation with the same name as provided
                via `relation_name` argument does not have the `prometheus_scrape` relation
                interface.
            RelationRoleMismatchError: If the relation with the same name as provided
                via `relation_name` argument does not have the `RelationRole.requires`
                role.
        """
        _validate_relation_by_interface_and_direction(
            charm, relation_name, RELATION_INTERFACE_NAME, RelationRole.requires
        )

        super().__init__(charm, relation_name)
        self._charm = charm
        self._relation_name = relation_name
        self._tool = CosTool(self._charm)
        events = self._charm.on[relation_name]
        self.framework.observe(events.relation_changed, self._on_metrics_provider_relation_changed)
        self.framework.observe(
            events.relation_departed, self._on_metrics_provider_relation_departed
        )

    def _on_metrics_provider_relation_changed(self, event):
        """Handle changes with related metrics providers.

        Anytime there are changes in relations between Prometheus
        and metrics provider charms the Prometheus charm is informed,
        through a `TargetsChangedEvent` event. The Prometheus charm can
        then choose to update its scrape configuration.

        Args:
            event: a `CharmEvent` in response to which the Prometheus
                charm must update its scrape configuration.
        """
        rel_id = event.relation.id

        self.on.targets_changed.emit(relation_id=rel_id)

    def _on_metrics_provider_relation_departed(self, event):
        """Update job config when a metrics provider departs.

        When a metrics provider departs the Prometheus charm is informed
        through a `TargetsChangedEvent` event so that it can update its
        scrape configuration to ensure that the departed metrics provider
        is removed from the list of scrape jobs.

        Args:
            event: a `CharmEvent` that indicates a metrics provider
                unit has departed.
        """
        rel_id = event.relation.id
        self.on.targets_changed.emit(relation_id=rel_id)

    def jobs(self) -> list:
        """Fetch the list of scrape jobs.

        Returns:
            A list consisting of all the static scrape configurations
            for each related `MetricsEndpointProvider` that has specified
            its scrape targets.
        """
        scrape_jobs = []

        for relation in self._charm.model.relations[self._relation_name]:
            static_scrape_jobs = self._static_scrape_config(relation)
            if static_scrape_jobs:
                # Duplicate job names will cause validate_scrape_jobs to fail.
                # Therefore we need to dedupe here and after all jobs are collected.
                static_scrape_jobs = _dedupe_job_names(static_scrape_jobs)
                try:
                    self._tool.validate_scrape_jobs(static_scrape_jobs)
                except subprocess.CalledProcessError as e:
                    if self._charm.unit.is_leader():
                        data = json.loads(relation.data[self._charm.app].get("event", "{}"))
                        data["scrape_job_errors"] = str(e)
                        relation.data[self._charm.app]["event"] = json.dumps(data)
                else:
                    scrape_jobs.extend(static_scrape_jobs)

        scrape_jobs = _dedupe_job_names(scrape_jobs)

        return scrape_jobs

    def alerts(self) -> dict:
        """Fetch alerts for all relations.

        A Prometheus alert rules file consists of a list of "groups". Each
        group consists of a list of alerts (`rules`) that are sequentially
        executed. This method returns all the alert rules provided by each
        related metrics provider charm. These rules may be used to generate a
        separate alert rules file for each relation since the returned list
        of alert groups is indexed by that relation's Juju topology identifier.
        The Juju topology identifier string includes substrings that identify
        alert rule related metadata such as the Juju model, model UUID and the
        application name from where the alert rule originates. Since this
        topology identifier is globally unique, it may be used for instance as
        the name for the file into which the list of alert rule groups are
        written. For each relation, the structure of data returned is a dictionary
        representation of a standard prometheus rules file:

            {"groups": [{"name": ...}, ...]}

        per official prometheus documentation
        https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

        The value of the `groups` key is such that it may be used to generate
        a Prometheus alert rules file directly using `yaml.dump` but the
        `groups` key itself must be included as this is required by Prometheus.

        For example the list of alert rule groups returned by this method may
        be written into files consumed by Prometheus as follows:

            for topology_identifier, alert_rule_groups in self.metrics_consumer.alerts().items():
                filename = "juju_" + topology_identifier + ".rules"
                path = os.path.join(PROMETHEUS_RULES_DIR, filename)
                rules = yaml.safe_dump(alert_rule_groups)
                container.push(path, rules, make_dirs=True)

        Returns:
            A dictionary mapping the Juju topology identifier of the source charm to
            its list of alert rule groups.
        """
        alerts = {}  # type: Dict[str, dict] # mapping b/w juju identifiers and alert rule files
        for relation in self._charm.model.relations[self._relation_name]:
            if not relation.units or not relation.app:
                continue

            alert_rules = json.loads(relation.data[relation.app].get("alert_rules", "{}"))
            if not alert_rules:
                continue

            alert_rules = self._inject_alert_expr_labels(alert_rules)

            identifier, topology = self._get_identifier_by_alert_rules(alert_rules)
            if not topology:
                try:
                    scrape_metadata = json.loads(relation.data[relation.app]["scrape_metadata"])
                    identifier = JujuTopology.from_dict(scrape_metadata).identifier
                    alerts[identifier] = self._tool.apply_label_matchers(alert_rules)  # type: ignore

                except KeyError as e:
                    logger.debug(
                        "Relation %s has no 'scrape_metadata': %s",
                        relation.id,
                        e,
                    )

            if not identifier:
                logger.error(
                    "Alert rules were found but no usable group or identifier was present."
                )
                continue

            alerts[identifier] = alert_rules

            _, errmsg = self._tool.validate_alert_rules(alert_rules)
            if errmsg:
                if alerts[identifier]:
                    del alerts[identifier]
                if self._charm.unit.is_leader():
                    data = json.loads(relation.data[self._charm.app].get("event", "{}"))
                    data["errors"] = errmsg
                    relation.data[self._charm.app]["event"] = json.dumps(data)
                continue

        return alerts

    def _get_identifier_by_alert_rules(
        self, rules: dict
    ) -> Tuple[Union[str, None], Union[JujuTopology, None]]:
        """Determine an appropriate dict key for alert rules.

        The key is used as the filename when writing alerts to disk, so the structure
        and uniqueness is important.

        Args:
            rules: a dict of alert rules

        Returns:
            A tuple containing an identifier, if found, and a JujuTopology, if it could
            be constructed.
        """
        if "groups" not in rules:
            logger.debug("No alert groups were found in relation data")
            return None, None

        # Construct an ID based on what's in the alert rules if they have labels
        for group in rules["groups"]:
            try:
                labels = group["rules"][0]["labels"]
                topology = JujuTopology(
                    # Don't try to safely get required constructor fields. There's already
                    # a handler for KeyErrors
                    model_uuid=labels["juju_model_uuid"],
                    model=labels["juju_model"],
                    application=labels["juju_application"],
                    unit=labels.get("juju_unit", ""),
                    charm_name=labels.get("juju_charm", ""),
                )
                return topology.identifier, topology
            except KeyError:
                logger.debug("Alert rules were found but no usable labels were present")
                continue

        logger.warning(
            "No labeled alert rules were found, and no 'scrape_metadata' "
            "was available. Using the alert group name as filename."
        )
        try:
            for group in rules["groups"]:
                return group["name"], None
        except KeyError:
            logger.debug("No group name was found to use as identifier")

        return None, None

    def _inject_alert_expr_labels(self, rules: Dict[str, Any]) -> Dict[str, Any]:
        """Iterate through alert rules and inject topology into expressions.

        Args:
            rules: a dict of alert rules
        """
        if "groups" not in rules:
            return rules

        modified_groups = []
        for group in rules["groups"]:
            # Copy off rules, so we don't modify an object we're iterating over
            rules_copy = group["rules"]
            for idx, rule in enumerate(rules_copy):
                labels = rule.get("labels")

                if labels:
                    try:
                        topology = JujuTopology(
                            # Don't try to safely get required constructor fields. There's already
                            # a handler for KeyErrors
                            model_uuid=labels["juju_model_uuid"],
                            model=labels["juju_model"],
                            application=labels["juju_application"],
                            unit=labels.get("juju_unit", ""),
                            charm_name=labels.get("juju_charm", ""),
                        )

                        # Inject topology and put it back in the list
                        rule["expr"] = self._tool.inject_label_matchers(
                            re.sub(r"%%juju_topology%%,?", "", rule["expr"]),
                            topology.label_matcher_dict,
                        )
                    except KeyError:
                        # Some required JujuTopology key is missing. Just move on.
                        pass

                    group["rules"][idx] = rule

            modified_groups.append(group)

        rules["groups"] = modified_groups
        return rules

    def _static_scrape_config(self, relation) -> list:
        """Generate the static scrape configuration for a single relation.

        If the relation data includes `scrape_metadata` then the value
        of this key is used to annotate the scrape jobs with Juju
        Topology labels before returning them.

        Args:
            relation: an `ops.model.Relation` object whose static
                scrape configuration is required.

        Returns:
            A list (possibly empty) of scrape jobs. Each job is a
            valid Prometheus scrape configuration for that job,
            represented as a Python dictionary.
        """
        if not relation.units:
            return []

        scrape_jobs = json.loads(relation.data[relation.app].get("scrape_jobs", "[]"))

        if not scrape_jobs:
            return []

        scrape_metadata = json.loads(relation.data[relation.app].get("scrape_metadata", "{}"))

        if not scrape_metadata:
            return scrape_jobs

        topology = JujuTopology.from_dict(scrape_metadata)

        job_name_prefix = "juju_{}_prometheus_scrape".format(topology.identifier)
        scrape_jobs = PrometheusConfig.prefix_job_names(scrape_jobs, job_name_prefix)
        scrape_jobs = PrometheusConfig.sanitize_scrape_configs(scrape_jobs)

        hosts = self._relation_hosts(relation)

        scrape_jobs = PrometheusConfig.expand_wildcard_targets_into_individual_jobs(
            scrape_jobs, hosts, topology
        )

        return scrape_jobs

    def _relation_hosts(self, relation: Relation) -> Dict[str, Tuple[str, str]]:
        """Returns a mapping from unit names to (address, path) tuples, for the given relation."""
        hosts = {}
        for unit in relation.units:
            # TODO deprecate and remove unit.name
            unit_name = relation.data[unit].get("prometheus_scrape_unit_name") or unit.name
            # TODO deprecate and remove "prometheus_scrape_host"
            unit_address = relation.data[unit].get(
                "prometheus_scrape_unit_address"
            ) or relation.data[unit].get("prometheus_scrape_host")
            unit_path = relation.data[unit].get("prometheus_scrape_unit_path", "")
            if unit_name and unit_address:
                hosts.update({unit_name: (unit_address, unit_path)})

        return hosts

    def _target_parts(self, target) -> list:
        """Extract host and port from a wildcard target.

        Args:
            target: a string specifying a scrape target. A
                scrape target is expected to have the format
                "host:port". The host part may be a wildcard
                "*" and the port part can be missing (along
                with ":") in which case port is set to 80.

        Returns:
            a list with target host and port as in [host, port]
        """
        if ":" in target:
            parts = target.split(":")
        else:
            parts = [target, "80"]

        return parts


def _dedupe_job_names(jobs: List[dict]):
    """Deduplicate a list of dicts by appending a hash to the value of the 'job_name' key.

    Additionally, fully de-duplicate any identical jobs.

    Args:
        jobs: A list of prometheus scrape jobs
    """
    jobs_copy = copy.deepcopy(jobs)

    # Convert to a dict with job names as keys
    # I think this line is O(n^2) but it should be okay given the list sizes
    jobs_dict = {
        job["job_name"]: list(filter(lambda x: x["job_name"] == job["job_name"], jobs_copy))
        for job in jobs_copy
    }

    # If multiple jobs have the same name, convert the name to "name_<hash-of-job>"
    for key in jobs_dict:
        if len(jobs_dict[key]) > 1:
            for job in jobs_dict[key]:
                job_json = json.dumps(job)
                hashed = hashlib.sha256(job_json.encode()).hexdigest()
                job["job_name"] = "{}_{}".format(job["job_name"], hashed)
    new_jobs = []
    for key in jobs_dict:
        new_jobs.extend(list(jobs_dict[key]))

    # Deduplicate jobs which are equal
    # Again this is O(n^2) but it should be okay
    deduped_jobs = []
    seen = set()
    for job in new_jobs:
        job_json = json.dumps(job)
        hashed = hashlib.sha256(job_json.encode()).hexdigest()
        if hashed in seen:
            continue
        seen.add(hashed)
        deduped_jobs.append(job)

    return deduped_jobs


def _resolve_dir_against_charm_path(charm: CharmBase, *path_elements: str) -> str:
    """Resolve the provided path items against the directory of the main file.

    Look up the directory of the `main.py` file being executed. This is normally
    going to be the charm.py file of the charm including this library. Then, resolve
    the provided path elements and, if the result path exists and is a directory,
    return its absolute path; otherwise, raise an exception.

    Raises:
        InvalidAlertRulePathError, if the path does not exist or is not a directory.
    """
    charm_dir = Path(str(charm.charm_dir))
    if not charm_dir.exists() or not charm_dir.is_dir():
        # Operator Framework does not currently expose a robust
        # way to determine the top level charm source directory
        # that is consistent across deployed charms and unit tests
        # Hence for unit tests the current working directory is used
        # TODO: updated this logic when the following ticket is resolved
        # https://github.com/canonical/operator/issues/643
        charm_dir = Path(os.getcwd())

    alerts_dir_path = charm_dir.absolute().joinpath(*path_elements)

    if not alerts_dir_path.exists():
        raise InvalidAlertRulePathError(alerts_dir_path, "directory does not exist")
    if not alerts_dir_path.is_dir():
        raise InvalidAlertRulePathError(alerts_dir_path, "is not a directory")

    return str(alerts_dir_path)


class MetricsEndpointProvider(Object):
    """A metrics endpoint for Prometheus."""

    on = MetricsEndpointProviderEvents()

    def __init__(
        self,
        charm,
        relation_name: str = DEFAULT_RELATION_NAME,
        jobs=None,
        alert_rules_path: str = DEFAULT_ALERT_RULES_RELATIVE_PATH,
        refresh_event: Optional[Union[BoundEvent, List[BoundEvent]]] = None,
        external_url: str = "",
        lookaside_jobs_callable: Optional[Callable] = None,
    ):
        """Construct a metrics provider for a Prometheus charm.

        If your charm exposes a Prometheus metrics endpoint, the
        `MetricsEndpointProvider` object enables your charm to easily
        communicate how to reach that metrics endpoint.

        By default, a charm instantiating this object has the metrics
        endpoints of each of its units scraped by the related Prometheus
        charms. The scraped metrics are automatically tagged by the
        Prometheus charms with Juju topology data via the
        `juju_model_name`, `juju_model_uuid`, `juju_application_name`
        and `juju_unit` labels. To support such tagging `MetricsEndpointProvider`
        automatically forwards scrape metadata to a `MetricsEndpointConsumer`
        (Prometheus charm).

        Scrape targets provided by `MetricsEndpointProvider` can be
        customized when instantiating this object. For example in the
        case of a charm exposing the metrics endpoint for each of its
        units on port 8080 and the `/metrics` path, the
        `MetricsEndpointProvider` can be instantiated as follows:

            self.metrics_endpoint_provider = MetricsEndpointProvider(
                self,
                jobs=[{
                    "static_configs": [{"targets": ["*:8080"]}],
                }])

        The notation `*:<port>` means "scrape each unit of this charm on port
        `<port>`".

        In case the metrics endpoints are not on the standard `/metrics` path,
        a custom path can be specified as follows:

            self.metrics_endpoint_provider = MetricsEndpointProvider(
                self,
                jobs=[{
                    "metrics_path": "/my/strange/metrics/path",
                    "static_configs": [{"targets": ["*:8080"]}],
                }])

        Note how the `jobs` argument is a list: this allows you to expose multiple
        combinations of paths "metrics_path" and "static_configs" in case your charm
        exposes multiple endpoints, which could happen, for example, when you have
        multiple workload containers, with applications in each needing to be scraped.
        The structure of the objects in the `jobs` list is one-to-one with the
        `scrape_config` configuration item of Prometheus' own configuration (see
        https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
        ), but with only a subset of the fields allowed. The permitted fields are
        listed in the `ALLOWED_KEYS` object in this charm library module.

        It is also possible to specify alert rules. By default, this library will look
        into the `<charm_parent_dir>/prometheus_alert_rules` directory, which in a
        standard charm layout resolves to `src/prometheus_alert_rules`. Each alert
        rule goes into a separate `*.rule` file. If the syntax of a rule is invalid,
        the `MetricsEndpointProvider` logs an error and does not load the particular
        rule.

        To avoid false positives and negatives in the evaluation of alert rules,
        all ingested alert rule expressions are automatically qualified using Juju
        Topology filters. This ensures that alert rules provided by your charm trigger
        alerts based only on data scraped from your charm. For example an alert rule
        such as the following

            alert: UnitUnavailable
            expr: up < 1
            for: 0m

        will be automatically transformed into something along the lines of the following

            alert: UnitUnavailable
            expr: up{juju_model=<model>, juju_model_uuid=<uuid-prefix>, juju_application=<app>} < 1
            for: 0m

        An attempt will be made to validate alert rules prior to loading them into Prometheus.
        If they are invalid, an event will be emitted from this object which charms can respond
        to in order to set a meaningful status for administrators.

        This can be observed via `consumer.on.alert_rule_status_changed` which contains:
            - The error(s) encountered when validating as `errors`
            - A `valid` attribute, which can be used to reset the state of charms if alert rules
              are updated via another mechanism (e.g. `cos-config`) and refreshed.

        Args:
            charm: a `CharmBase` object that manages this
                `MetricsEndpointProvider` object. Typically, this is
                `self` in the instantiating class.
            relation_name: an optional string name of the relation between `charm`
                and the Prometheus charmed service. The default is "metrics-endpoint".
                It is strongly advised not to change the default, so that people
                deploying your charm will have a consistent experience with all
                other charms that provide metrics endpoints.
            jobs: an optional list of dictionaries where each
                dictionary represents the Prometheus scrape
                configuration for a single job. When not provided, a
                default scrape configuration is provided for the
                `/metrics` endpoint polling all units of the charm on port `80`
                using the `MetricsEndpointProvider` object.
            alert_rules_path: an optional path for the location of alert rules
                files. Defaults to "./prometheus_alert_rules",
                resolved relative to the directory hosting the charm entry file.
                The alert rules are automatically updated on charm upgrade.
            refresh_event: an optional bound event or list of bound events which
                will be observed to re-set scrape job data (IP address and others).
            external_url: an optional argument that represents an external url that
                can be generated by an Ingress or a Proxy.
            lookaside_jobs_callable: an optional `Callable` which should be invoked
                when the job configuration is built as a secondary mapping. The callable
                should return a `List[Dict]` which is syntactically identical to the
                `jobs` parameter, but can be updated out of step with the initialization
                of this library without disrupting the 'global' job spec.

        Raises:
            RelationNotFoundError: If there is no relation in the charm's metadata.yaml
                with the same name as provided via `relation_name` argument.
            RelationInterfaceMismatchError: The relation with the same name as provided
                via `relation_name` argument does not have the `prometheus_scrape` relation
                interface.
            RelationRoleMismatchError: If the relation with the same name as provided
                via `relation_name` argument does not have the `RelationRole.provides`
                role.
        """
        _validate_relation_by_interface_and_direction(
            charm, relation_name, RELATION_INTERFACE_NAME, RelationRole.provides
        )

        try:
            alert_rules_path = _resolve_dir_against_charm_path(charm, alert_rules_path)
        except InvalidAlertRulePathError as e:
            logger.debug(
                "Invalid Prometheus alert rules folder at %s: %s",
                e.alert_rules_absolute_path,
                e.message,
            )

        super().__init__(charm, relation_name)
        self.topology = JujuTopology.from_charm(charm)

        self._charm = charm
        self._alert_rules_path = alert_rules_path
        self._relation_name = relation_name
        # sanitize job configurations to the supported subset of parameters
        jobs = [] if jobs is None else jobs
        self._jobs = PrometheusConfig.sanitize_scrape_configs(jobs)

        if external_url:
            external_url = (
                external_url if urlparse(external_url).scheme else ("http://" + external_url)
            )
        self.external_url = external_url
        self._lookaside_jobs = lookaside_jobs_callable

        events = self._charm.on[self._relation_name]
        self.framework.observe(events.relation_changed, self._on_relation_changed)

        if not refresh_event:
            # FIXME remove once podspec charms are verified.
            # `self.set_scrape_job_spec()` is called every re-init so this should not be needed.
            if len(self._charm.meta.containers) == 1:
                if "kubernetes" in self._charm.meta.series:
                    # This is a podspec charm
                    refresh_event = [self._charm.on.update_status]
                else:
                    # This is a sidecar/pebble charm
                    container = list(self._charm.meta.containers.values())[0]
                    refresh_event = [
                        self._charm.on[container.name.replace("-", "_")].pebble_ready
                    ]
            else:
                logger.warning(
                    "%d containers are present in metadata.yaml and "
                    "refresh_event was not specified. Defaulting to update_status. "
                    "Metrics IP may not be set in a timely fashion.",
                    len(self._charm.meta.containers),
                )
                refresh_event = [self._charm.on.update_status]

        if not isinstance(refresh_event, list):
            refresh_event = [refresh_event]

        self.framework.observe(events.relation_joined, self.set_scrape_job_spec)
        for ev in refresh_event:
            self.framework.observe(ev, self.set_scrape_job_spec)

    def _on_relation_changed(self, event):
        """Check for alert rule messages in the relation data before moving on."""
        if self._charm.unit.is_leader():
            ev = json.loads(event.relation.data[event.app].get("event", "{}"))

            if ev:
                valid = bool(ev.get("valid", True))
                errors = ev.get("errors", "")

                if valid and not errors:
                    self.on.alert_rule_status_changed.emit(valid=valid)
                else:
                    self.on.alert_rule_status_changed.emit(valid=valid, errors=errors)

            scrape_errors = ev.get("scrape_job_errors", None)
            if scrape_errors:
                self.on.invalid_scrape_job.emit(errors=scrape_errors)

    def update_scrape_job_spec(self, jobs):
        """Update scrape job specification."""
        self._jobs = PrometheusConfig.sanitize_scrape_configs(jobs)
        self.set_scrape_job_spec()

    def set_scrape_job_spec(self, _=None):
        """Ensure scrape target information is made available to prometheus.

        When a metrics provider charm is related to a prometheus charm, the
        metrics provider sets specification and metadata related to its own
        scrape configuration. This information is set using Juju application
        data. In addition, each of the consumer units also sets its own
        host address in Juju unit relation data.
        """
        self._set_unit_ip()

        if not self._charm.unit.is_leader():
            return

        alert_rules = AlertRules(topology=self.topology)
        alert_rules.add_path(self._alert_rules_path, recursive=True)
        alert_rules_as_dict = alert_rules.as_dict()

        for relation in self._charm.model.relations[self._relation_name]:
            relation.data[self._charm.app]["scrape_metadata"] = json.dumps(self._scrape_metadata)
            relation.data[self._charm.app]["scrape_jobs"] = json.dumps(self._scrape_jobs)

            if alert_rules_as_dict:
                # Update relation data with the string representation of the rule file.
                # Juju topology is already included in the "scrape_metadata" field above.
                # The consumer side of the relation uses this information to name the rules file
                # that is written to the filesystem.
                relation.data[self._charm.app]["alert_rules"] = json.dumps(alert_rules_as_dict)
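
    # For illustration: after `set_scrape_job_spec` runs on the leader, the
    # application databag of each "metrics-endpoint" relation holds JSON-encoded
    # entries along these lines (values are hypothetical):
    #
    #     "scrape_metadata": '{"model": "mymodel", "application": "myapp", ...}'
    #     "scrape_jobs": '[{"metrics_path": "/metrics", "static_configs": [...]}]'
    #     "alert_rules": '{"groups": [...]}'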

    def _set_unit_ip(self, _=None):
        """Set unit host address.

        Each time a metrics provider charm container is restarted it updates its own
        host address in the unit relation data for the prometheus charm.

        The only argument specified is an event, and it is ignored. This is for expediency
        to be able to use this method as an event handler, although no access to the
        event is actually needed.
        """
        for relation in self._charm.model.relations[self._relation_name]:
            unit_ip = str(self._charm.model.get_binding(relation).network.bind_address)

            # TODO store entire url in relation data, instead of only select url parts.

            if self.external_url:
                parsed = urlparse(self.external_url)
                unit_address = parsed.hostname
                path = parsed.path
            elif self._is_valid_unit_address(unit_ip):
                unit_address = unit_ip
                path = ""
            else:
                unit_address = socket.getfqdn()
                path = ""

            relation.data[self._charm.unit]["prometheus_scrape_unit_address"] = unit_address
            relation.data[self._charm.unit]["prometheus_scrape_unit_path"] = path
            relation.data[self._charm.unit]["prometheus_scrape_unit_name"] = str(
                self._charm.model.unit.name
            )

    def _is_valid_unit_address(self, address: str) -> bool:
        """Validate a unit address.

        At present only IP address validation is supported, but
        this may be extended to DNS addresses also, as needed.

        Args:
            address: a string representing a unit address

        Returns:
            True if the address is a valid IP address, False otherwise.
        """
        try:
            _ = ipaddress.ip_address(address)
        except ValueError:
            return False

        return True

    @property
    def _scrape_jobs(self) -> list:
        """Fetch list of scrape jobs.

        Returns:
            A list of dictionaries, where each dictionary specifies a
            single scrape job for Prometheus.
        """
        jobs = self._jobs if self._jobs else [DEFAULT_JOB]
        if callable(self._lookaside_jobs):
            return jobs + PrometheusConfig.sanitize_scrape_configs(self._lookaside_jobs())
        else:
            return jobs
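
    # Note: DEFAULT_JOB, defined earlier in this file, is the fallback scrape
    # configuration polling "/metrics" on port 80 of each unit; roughly (shape
    # is illustrative):
    #
    #     {"metrics_path": "/metrics", "static_configs": [{"targets": ["*:80"]}]}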

    @property
    def _scrape_metadata(self) -> dict:
        """Generate scrape metadata.

        Returns:
            Scrape configuration metadata for this metrics provider charm.
        """
        return self.topology.as_dict()


class PrometheusRulesProvider(Object):
    """Forward rules to Prometheus.

    This object may be used to forward rules to Prometheus. At present it only supports
    forwarding alert rules. This is unlike :class:`MetricsEndpointProvider`, which
    is used for forwarding both scrape targets and associated alert rules. This object
    is typically used when there is a desire to forward rules that apply globally (across
    all deployed charms and units) rather than to a single charm. All rule files are
    forwarded using the same 'prometheus_scrape' interface that is also used by
    `MetricsEndpointProvider`.
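
    For example, a charm whose sole purpose is to ship rule files might
    instantiate this object in its constructor (a minimal sketch; the
    directory path is an assumption):

        self.rules_provider = PrometheusRulesProvider(
            self,
            dir_path="./src/prometheus_alert_rules",
            recursive=True,
        )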

    Args:
        charm: A charm instance that `provides` a relation with the `prometheus_scrape` interface.
        relation_name: Name of the relation in `metadata.yaml` that
            has the `prometheus_scrape` interface.
        dir_path: Root directory for the collection of rule files.
        recursive: Whether to scan for rule files recursively.
    """

    def __init__(
        self,
        charm,
        relation_name: str = DEFAULT_RELATION_NAME,
        dir_path: str = DEFAULT_ALERT_RULES_RELATIVE_PATH,
        recursive=True,
    ):
        super().__init__(charm, relation_name)
        self._charm = charm
        self._relation_name = relation_name
        self._recursive = recursive

        try:
            dir_path = _resolve_dir_against_charm_path(charm, dir_path)
        except InvalidAlertRulePathError as e:
            logger.warning(
                "Invalid Prometheus alert rules folder at %s: %s",
                e.alert_rules_absolute_path,
                e.message,
            )
        self.dir_path = dir_path

        events = self._charm.on[self._relation_name]
        event_sources = [
            events.relation_joined,
            events.relation_changed,
            self._charm.on.leader_elected,
            self._charm.on.upgrade_charm,
        ]

        for event_source in event_sources:
            self.framework.observe(event_source, self._update_relation_data)

    def _reinitialize_alert_rules(self):
        """Reload alert rules and update all relations."""
        self._update_relation_data(None)

    def _update_relation_data(self, _):
        """Update application relation data with alert rules for all relations."""
        if not self._charm.unit.is_leader():
            return

        alert_rules = AlertRules()
        alert_rules.add_path(self.dir_path, recursive=self._recursive)
        alert_rules_as_dict = alert_rules.as_dict()

        logger.info("Updating relation data with rule files from disk")
        for relation in self._charm.model.relations[self._relation_name]:
            relation.data[self._charm.app]["alert_rules"] = json.dumps(
                alert_rules_as_dict,
                sort_keys=True,  # sort, to prevent unnecessary relation_changed events
            )


class MetricsEndpointAggregator(Object):
    """Aggregate metrics from multiple scrape targets.

    `MetricsEndpointAggregator` collects scrape target information from one
    or more related charms and forwards this to a `MetricsEndpointConsumer`
    charm, which may be in a different Juju model. However, it is
    essential that `MetricsEndpointAggregator` itself resides in the same
    model as its scrape targets, as this is currently the only way to
    ensure in Juju that the `MetricsEndpointAggregator` will be able to
    determine the model name and uuid of the scrape targets.

    `MetricsEndpointAggregator` should be used in place of
    `MetricsEndpointProvider` in the following two use cases:

    1. Integrating one or more scrape targets that do not support the
    `prometheus_scrape` interface.

    2. Integrating one or more scrape targets through cross model
    relations (although the [Scrape Config Operator](https://charmhub.io/cos-configuration-k8s)
    may also be used for the purpose of supporting cross model
    relations).

    Using `MetricsEndpointAggregator` to build a Prometheus charm client
    only requires instantiating it. Instantiating
    `MetricsEndpointAggregator` is similar to `MetricsEndpointProvider` except
    that it requires specifying the names of three relations: the
    relation with scrape targets, the relation for alert rules, and
    that with the Prometheus charms. For example

        self._aggregator = MetricsEndpointAggregator(
            self,
            {
                "prometheus": "monitoring",
                "scrape_target": "prometheus-target",
                "alert_rules": "prometheus-rules",
            },
        )

    `MetricsEndpointAggregator` assumes that each unit of a scrape target
    sets in its unit-level relation data two entries with keys
    "hostname" and "port". If it is required to integrate with charms
    that do not honor these assumptions, it is always possible to
    derive from `MetricsEndpointAggregator`, overriding the `_get_targets()`
    method, which is responsible for aggregating the unit name, host
    address ("hostname") and port of the scrape target.
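
    For instance, a conforming scrape target charm could set this data in its
    own relation handler (a minimal sketch; the handler name and port are
    illustrative):

        def _on_prometheus_target_relation_joined(self, event):
            event.relation.data[self.unit]["hostname"] = socket.getfqdn()
            event.relation.data[self.unit]["port"] = "9100"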

    `MetricsEndpointAggregator` also assumes that each unit of a
    scrape target sets in its unit-level relation data a key named
    "groups". The value of this key is expected to be the string
    representation of a list of Prometheus alert rules in YAML format.
    An example of a single such alert rule is

        - alert: HighRequestLatency
          expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
          for: 10m
          labels:
            severity: page
          annotations:
            summary: High request latency

    Once again if it is required to integrate with charms that do not
    honour these assumptions about alert rules then an object derived
    from `MetricsEndpointAggregator` may be used by overriding the
    `_get_alert_rules()` method.
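
    For example (a minimal sketch; the "alert-rules" key is an assumption
    about such a non-conforming charm):

        class CustomRuleAggregator(MetricsEndpointAggregator):
            def _get_alert_rules(self, relation) -> dict:
                rules = {}
                for unit in relation.units:
                    unit_rules = yaml.safe_load(relation.data[unit].get("alert-rules", ""))
                    if unit_rules:
                        rules.update({unit.name: unit_rules})
                return rules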

    `MetricsEndpointAggregator` ensures that Prometheus scrape job
    specifications and alert rules are annotated with Juju topology
    information, just like `MetricsEndpointProvider` and
    `MetricsEndpointConsumer` do.

    By default, `MetricsEndpointAggregator` ensures that Prometheus
    "instance" labels refer to Juju topology. This ensures that
    instance labels are stable over unit recreation. While it is not
    advisable to change this option, if required it can be done by
    setting the "relabel_instance" keyword argument to `False` when
    constructing an aggregator object.
    """

    _stored = StoredState()

    def __init__(
        self,
        charm,
        relation_names: Optional[dict] = None,
        relabel_instance=True,
        resolve_addresses=False,
    ):
        """Construct a `MetricsEndpointAggregator`.

        Args:
            charm: a `CharmBase` object that manages this
                `MetricsEndpointAggregator` object. Typically, this is
                `self` in the instantiating class.
            relation_names: a dictionary with three keys. The value
                of the "scrape_target" and "alert_rules" keys are
                the relation names over which scrape job and alert rule
                information is gathered by this `MetricsEndpointAggregator`.
                And the value of the "prometheus" key is the name of
                the relation with a `MetricsEndpointConsumer` such as
                the Prometheus charm.
            relabel_instance: A boolean flag indicating if Prometheus
                scrape job "instance" labels must refer to Juju Topology.
            resolve_addresses: A boolean flag indicating if the aggregator
                should attempt to perform DNS lookups of targets and append
                a `dns_name` label.
        """
        self._charm = charm

        relation_names = relation_names or {}

        self._prometheus_relation = relation_names.get(
            "prometheus", "downstream-prometheus-scrape"
        )
        self._target_relation = relation_names.get("scrape_target", "prometheus-target")
        self._alert_rules_relation = relation_names.get("alert_rules", "prometheus-rules")

        super().__init__(charm, self._prometheus_relation)
        self._stored.set_default(jobs=[], alert_rules=[])

        self._relabel_instance = relabel_instance
        self._resolve_addresses = resolve_addresses

        # manage Prometheus charm relation events
        prometheus_events = self._charm.on[self._prometheus_relation]
        self.framework.observe(prometheus_events.relation_joined, self._set_prometheus_data)

        # manage list of Prometheus scrape jobs from related scrape targets
        target_events = self._charm.on[self._target_relation]
        self.framework.observe(
            target_events.relation_changed, self._on_prometheus_targets_changed
        )
        self.framework.observe(
            target_events.relation_departed, self._on_prometheus_targets_departed
        )

        # manage alert rules for Prometheus from related scrape targets
        alert_rule_events = self._charm.on[self._alert_rules_relation]
        self.framework.observe(alert_rule_events.relation_changed, self._on_alert_rules_changed)
        self.framework.observe(
            alert_rule_events.relation_departed, self._on_alert_rules_departed
        )

    def _set_prometheus_data(self, event):
        """Ensure every new Prometheus instance is updated.

        Any time a new Prometheus unit joins the relation with
        `MetricsEndpointAggregator`, that Prometheus unit is provided
        with the complete set of existing scrape jobs and alert rules.
        """
        if not self._charm.unit.is_leader():
            return

        jobs = [] + _type_convert_stored(
            self._stored.jobs
        )  # list of scrape jobs, one per relation
        for relation in self.model.relations[self._target_relation]:
            targets = self._get_targets(relation)
            if targets and relation.app:
                jobs.append(self._static_scrape_job(targets, relation.app.name))

        groups = [] + _type_convert_stored(self._stored.alert_rules)  # list of alert rule groups
        for relation in self.model.relations[self._alert_rules_relation]:
            unit_rules = self._get_alert_rules(relation)
            if unit_rules and relation.app:
                appname = relation.app.name
                rules = self._label_alert_rules(unit_rules, appname)
                group = {"name": self.group_name(appname), "rules": rules}
                groups.append(group)

        event.relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs)
        event.relation.data[self._charm.app]["alert_rules"] = json.dumps({"groups": groups})

    def _on_prometheus_targets_changed(self, event):
        """Update scrape jobs in response to scrape target changes.

        When there is any change in relation data with any scrape
        target, the Prometheus scrape job for that specific target is
        updated.
        """
        targets = self._get_targets(event.relation)
        if not targets:
            return

        # new scrape job for the relation that has changed
        self.set_target_job_data(targets, event.relation.app.name)

    def set_target_job_data(self, targets: dict, app_name: str, **kwargs) -> None:
        """Update scrape jobs in response to scrape target changes.

        When there is any change in relation data with any scrape
        target, the Prometheus scrape job for that specific target is
        updated. Additionally, if this method is called manually, do the
        same.

        Args:
            targets: a `dict` containing target information
            app_name: a `str` identifying the application
            kwargs: a `dict` of the extra arguments passed to the function
        """
        if not self._charm.unit.is_leader():
            return

        # new scrape job for the relation that has changed
        updated_job = self._static_scrape_job(targets, app_name, **kwargs)

        for relation in self.model.relations[self._prometheus_relation]:
            jobs = json.loads(relation.data[self._charm.app].get("scrape_jobs", "[]"))
            # list of scrape jobs that have not changed
            jobs = [job for job in jobs if updated_job["job_name"] != job["job_name"]]
            jobs.append(updated_job)
            relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs)

            if not _type_convert_stored(self._stored.jobs) == jobs:
                self._stored.jobs = jobs

    def _on_prometheus_targets_departed(self, event):
        """Remove scrape jobs when a target departs.

        Any time a scrape target departs, any Prometheus scrape job
        associated with that specific scrape target is removed.
        """
        job_name = self._job_name(event.relation.app.name)
        unit_name = event.unit.name
        self.remove_prometheus_jobs(job_name, unit_name)

    def remove_prometheus_jobs(self, job_name: str, unit_name: Optional[str] = ""):
        """Given a job name and unit name, remove scrape jobs associated.

        The `unit_name` parameter is used for automatic, relation data bag-based
        generation, where the unit name in labels can be used to ensure that jobs with
        similar names (which are generated via the app name when scanning relation data
        bags) are not accidentally removed, as their unit name labels will differ.
        For NRPE, the job name is calculated from an ID sent via the NRPE relation, and is
        sufficient to uniquely identify the target.
        """
        if not self._charm.unit.is_leader():
            return

        for relation in self.model.relations[self._prometheus_relation]:
            jobs = json.loads(relation.data[self._charm.app].get("scrape_jobs", "[]"))
            if not jobs:
                continue

            changed_job = [j for j in jobs if j.get("job_name") == job_name]
            if not changed_job:
                continue
            changed_job = changed_job[0]

            # list of scrape jobs that have not changed
            jobs = [job for job in jobs if job.get("job_name") != job_name]

            # list of scrape jobs for units of the same application that still exist
            configs_kept = [
                config
                for config in changed_job["static_configs"]  # type: ignore
                if config.get("labels", {}).get("juju_unit") != unit_name
            ]

            if configs_kept:
                changed_job["static_configs"] = configs_kept  # type: ignore
                jobs.append(changed_job)

            relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs)

            if not _type_convert_stored(self._stored.jobs) == jobs:
                self._stored.jobs = jobs

    def _job_name(self, appname) -> str:
        """Construct a scrape job name.

        Each relation has its own unique scrape job name. All units in
        the relation are scraped as part of the same scrape job.

        Args:
            appname: string name of a related application.

        Returns:
            a string Prometheus scrape job name for the application.
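
            For example, for an application named "myapp" in a model named
            "mymodel" (values illustrative), the job name resembles
            "juju_mymodel_1234567_myapp_prometheus_scrape".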
        """
        return "juju_{}_{}_{}_prometheus_scrape".format(
            self.model.name, self.model.uuid[:7], appname
        )

    def _get_targets(self, relation) -> dict:
        """Fetch scrape targets for a relation.

        Scrape target information is returned for each unit in the
        relation. This information contains the unit name, network
        hostname (or address) for that unit, and port on which a
        metrics endpoint is exposed in that unit.

        Args:
            relation: an `ops.model.Relation` object for which scrape
                targets are required.

        Returns:
            a dictionary whose keys are names of the units in the
            relation. The value associated with each key is itself
            a dictionary of the form

                {"hostname": hostname, "port": port}
        """
        targets = {}
        for unit in relation.units:
            port = relation.data[unit].get("port", 80)
            hostname = relation.data[unit].get("hostname")
            if hostname:
                targets.update({unit.name: {"hostname": hostname, "port": port}})

        return targets

    def _static_scrape_job(self, targets, application_name, **kwargs) -> dict:
        """Construct a static scrape job for an application.

        Args:
            targets: a dictionary providing hostname and port for all
                scrape targets. The keys of this dictionary are unit
                names. Values corresponding to these keys are
                themselves a dictionary with keys "hostname" and
                "port".
            application_name: a string name of the application for
                which this static scrape job is being constructed.
            kwargs: a `dict` of the extra arguments passed to the function

        Returns:
            A dictionary corresponding to a Prometheus static scrape
            job configuration for one application. The returned
            dictionary may be transformed into YAML and appended to
            any existing list of Prometheus static configs.
        """
        juju_model = self.model.name
        juju_model_uuid = self.model.uuid

        job = {
            "job_name": self._job_name(application_name),
            "static_configs": [
                {
                    "targets": ["{}:{}".format(target["hostname"], target["port"])],
                    "labels": {
                        "juju_model": juju_model,
                        "juju_model_uuid": juju_model_uuid,
                        "juju_application": application_name,
                        "juju_unit": unit_name,
                        "host": target["hostname"],
                        # Expanding this will merge the dicts and replace the
                        # topology labels if any were present/found
                        **self._static_config_extra_labels(target),
                    },
                }
                for unit_name, target in targets.items()
            ],
            "relabel_configs": self._relabel_configs + kwargs.get("relabel_configs", []),
        }
        job.update(kwargs.get("updates", {}))

        return job

    def _static_config_extra_labels(self, target: Dict[str, str]) -> Dict[str, str]:
        """Build a list of extra static config parameters, if specified."""
        extra_info = {}

        if self._resolve_addresses:
            try:
                dns_name = socket.gethostbyaddr(target["hostname"])[0]
            except OSError:
                logger.debug("Could not perform DNS lookup for %s", target["hostname"])
                dns_name = target["hostname"]
            extra_info["dns_name"] = dns_name
            label_re = re.compile(r'(?P<label>juju.*?)="(?P<value>.*?)",?')
            try:
                with urlopen(f'http://{target["hostname"]}:{target["port"]}/metrics') as resp:
                    data = resp.read().decode("utf-8").splitlines()
                    for metric in data:
                        for match in label_re.finditer(metric):
                            extra_info[match.group("label")] = match.group("value")
            except (HTTPError, URLError, OSError, ConnectionResetError, Exception) as e:
                logger.debug("Could not scrape target: %s", e)

        return extra_info

    @property
    def _relabel_configs(self) -> list:
        """Create Juju topology relabeling configuration.

        Using Juju topology for instance labels ensures that these
        labels are stable across unit recreation.

        Returns:
            a list of Prometheus relabeling configurations. Each item in
            this list is one relabel configuration.
        """
        return (
            [
                {
                    "source_labels": [
                        "juju_model",
                        "juju_model_uuid",
                        "juju_application",
                        "juju_unit",
                    ],
                    "separator": "_",
                    "target_label": "instance",
                    "regex": "(.*)",
                }
            ]
            if self._relabel_instance
            else []
        )

    def _on_alert_rules_changed(self, event):
        """Update alert rules in response to scrape target changes.

        When there is any change in alert rule relation data for any
        scrape target, the list of alert rules for that specific
        target is updated.
        """
        unit_rules = self._get_alert_rules(event.relation)
        if not unit_rules:
            return

        app_name = event.relation.app.name
        self.set_alert_rule_data(app_name, unit_rules)

    def set_alert_rule_data(self, name: str, unit_rules: dict, label_rules: bool = True) -> None:
        """Update alert rule data.

        The unit rules should be a dict, which has additional Juju topology labels added. For
        rules generated by the NRPE exporter, they are pre-labeled so lookups can be performed.
        """
        if not self._charm.unit.is_leader():
            return

        if label_rules:
            rules = self._label_alert_rules(unit_rules, name)
        else:
            rules = [unit_rules]
        updated_group = {"name": self.group_name(name), "rules": rules}

        for relation in self.model.relations[self._prometheus_relation]:
            alert_rules = json.loads(relation.data[self._charm.app].get("alert_rules", "{}"))
            groups = alert_rules.get("groups", [])
            # list of alert rule groups that have not changed
            for group in groups:
                if group["name"] == updated_group["name"]:
                    group["rules"] = [r for r in group["rules"] if r not in updated_group["rules"]]
                    group["rules"].extend(updated_group["rules"])

            if updated_group["name"] not in [g["name"] for g in groups]:
                groups.append(updated_group)
            relation.data[self._charm.app]["alert_rules"] = json.dumps({"groups": groups})

            if not _type_convert_stored(self._stored.alert_rules) == groups:
                self._stored.alert_rules = groups

    def _on_alert_rules_departed(self, event):
        """Remove alert rules for departed targets.

        Any time a scrape target departs, any alert rules associated
        with that specific scrape target are removed.
        """
        group_name = self.group_name(event.relation.app.name)
        unit_name = event.unit.name
        self.remove_alert_rules(group_name, unit_name)

    def remove_alert_rules(self, group_name: str, unit_name: str) -> None:
        """Remove an alert rule group from relation data."""
        if not self._charm.unit.is_leader():
            return

        for relation in self.model.relations[self._prometheus_relation]:
            alert_rules = json.loads(relation.data[self._charm.app].get("alert_rules", "{}"))
            if not alert_rules:
                continue

            groups = alert_rules.get("groups", [])
            if not groups:
                continue

            changed_group = [group for group in groups if group["name"] == group_name]
            if not changed_group:
                continue
            changed_group = changed_group[0]

            # list of alert rule groups that have not changed
            groups = [group for group in groups if group["name"] != group_name]

            # list of alert rules not associated with departing unit
            rules_kept = [
                rule
                for rule in changed_group.get("rules")  # type: ignore
                if rule.get("labels").get("juju_unit") != unit_name
            ]

            if rules_kept:
                changed_group["rules"] = rules_kept  # type: ignore
                groups.append(changed_group)

            relation.data[self._charm.app]["alert_rules"] = (
                json.dumps({"groups": groups}) if groups else "{}"
            )

            if not _type_convert_stored(self._stored.alert_rules) == groups:
                self._stored.alert_rules = groups

    def _get_alert_rules(self, relation) -> dict:
        """Fetch alert rules for a relation.

        Each unit of the related scrape target may have its own
        associated alert rules. Alert rules for all units are returned
        indexed by unit name.

        Args:
            relation: an `ops.model.Relation` object for which alert
                rules are required.

        Returns:
            a dictionary whose keys are names of the units in the
            relation. The value associated with each key is a list
            of alert rules. Each rule is in dictionary format. The
            structure of each "rule dictionary" corresponds to a single
            Prometheus alert rule.
        """
        rules = {}
        for unit in relation.units:
            unit_rules = yaml.safe_load(relation.data[unit].get("groups", ""))
            if unit_rules:
                rules.update({unit.name: unit_rules})

        return rules

    def group_name(self, unit_name: str) -> str:
        """Construct name for an alert rule group.

        Each unit in a relation may define its own alert rules. All
        rules, for all units in a relation are grouped together and
        given a single alert rule group name.

        Args:
            unit_name: string name of a related application.

        Returns:
            a string Prometheus alert rules group name for the unit.
        """
        unit_name = re.sub(r"/", "_", unit_name)
        return "juju_{}_{}_{}_alert_rules".format(self.model.name, self.model.uuid[:7], unit_name)

    def _label_alert_rules(self, unit_rules, app_name: str) -> list:
        """Apply juju topology labels to alert rules.

        Args:
            unit_rules: a dictionary of alert rules indexed by unit name,
                where each rule is in dictionary format.
            app_name: a string name of the application to which the
                alert rules belong.

        Returns:
            a list of alert rules with Juju topology labels.
        """
        labeled_rules = []
        for unit_name, rules in unit_rules.items():
            for rule in rules:
                # the new JujuTopology removed this, so build it up by hand
                matchers = {
                    "juju_{}".format(k): v
                    for k, v in JujuTopology(self.model.name, self.model.uuid, app_name, unit_name)
                    .as_dict(excluded_keys=["charm_name"])
                    .items()
                }
                rule["labels"].update(matchers.items())
                labeled_rules.append(rule)

        return labeled_rules
2470 """Uses cos-tool to inject label matchers into alert rule expressions and validate rules."""
2475 def __init__(self
, charm
):
2480 """Lazy lookup of the path of cos-tool."""
2484 self
._path
= self
._get
_tool
_path
()
2486 logger
.debug("Skipping injection of juju topology as label matchers")
2487 self
._disabled
= True

    def apply_label_matchers(self, rules) -> dict:
        """Will apply label matchers to the expression of all alerts in all supplied groups."""
        if not self.path:
            return rules
        for group in rules["groups"]:
            rules_in_group = group.get("rules", [])
            for rule in rules_in_group:
                topology = {}
                # if the user for some reason has provided juju_unit, we'll need to honor it
                # in most cases, however, this will be empty
                for label in [
                    "juju_model",
                    "juju_model_uuid",
                    "juju_application",
                    "juju_charm",
                    "juju_unit",
                ]:
                    if label in rule["labels"]:
                        topology[label] = rule["labels"][label]

                rule["expr"] = self.inject_label_matchers(rule["expr"], topology)
        return rules

    def validate_alert_rules(self, rules: dict) -> Tuple[bool, str]:
        """Will validate correctness of alert rules, returning a boolean and any errors."""
        if not self.path:
            logger.debug("`cos-tool` unavailable. Not validating alert correctness.")
            return True, ""

        with tempfile.TemporaryDirectory() as tmpdir:
            rule_path = Path(tmpdir + "/validate_rule.yaml")
            rule_path.write_text(yaml.dump(rules))

            args = [str(self.path), "validate", str(rule_path)]
            # noinspection PyBroadException
            try:
                self._exec(args)
                return True, ""
            except subprocess.CalledProcessError as e:
                logger.debug("Validating the rules failed: %s", e.output)
                return False, ", ".join(
                    [
                        line
                        for line in e.output.decode("utf8").splitlines()
                        if "error validating" in line
                    ]
                )

    def validate_scrape_jobs(self, jobs: list) -> bool:
        """Validate scrape jobs using cos-tool."""
        if not self.path:
            logger.debug("`cos-tool` unavailable. Not validating scrape jobs.")
            return True
        conf = {"scrape_configs": jobs}
        with tempfile.NamedTemporaryFile() as tmpfile:
            with open(tmpfile.name, "w") as f:
                f.write(yaml.safe_dump(conf))
            try:
                self._exec([str(self.path), "validate-config", tmpfile.name])
            except subprocess.CalledProcessError as e:
                logger.error("Validating scrape jobs failed: {}".format(e.output))
                raise
        return True

    def inject_label_matchers(self, expression, topology) -> str:
        """Add label matchers to an expression."""
        if not topology:
            return expression
        if not self.path:
            logger.debug("`cos-tool` unavailable. Leaving expression unchanged: %s", expression)
            return expression
        args = [str(self.path), "transform"]
        args.extend(
            ["--label-matcher={}={}".format(key, value) for key, value in topology.items()]
        )

        args.extend(["{}".format(expression)])
        # noinspection PyBroadException
        try:
            return self._exec(args)
        except subprocess.CalledProcessError as e:
            logger.debug('Applying the expression failed: "%s", falling back to the original', e)
            return expression
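
    # For reference, the transform invocation built above resembles (values
    # illustrative):
    #
    #     cos-tool-amd64 transform \
    #         --label-matcher=juju_model=mymodel \
    #         --label-matcher=juju_application=myapp \
    #         'up < 1'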

    def _get_tool_path(self) -> Optional[Path]:
        arch = platform.machine()
        arch = "amd64" if arch == "x86_64" else arch
        res = "cos-tool-{}".format(arch)
        try:
            path = Path(res).resolve()
            path.chmod(0o777)
            return path
        except NotImplementedError:
            logger.debug("System lacks support for chmod")
        except FileNotFoundError:
            logger.debug('Could not locate cos-tool at: "{}"'.format(res))
        return None

    def _exec(self, cmd) -> str:
        result = subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        return result.stdout.decode("utf-8").strip()