# Copyright 2023 Canonical Ltd.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
15 """Prometheus Scrape Library.
19 This document explains how to integrate with the Prometheus charm
20 for the purpose of providing a metrics endpoint to Prometheus. It
21 also explains how alternative implementations of the Prometheus charms
22 may maintain the same interface and be backward compatible with all
23 currently integrated charms. Finally this document is the
24 authoritative reference on the structure of relation data that is
25 shared between Prometheus charms and any other charm that intends to
26 provide a scrape target for Prometheus.

Source code can be found on GitHub at:
https://github.com/canonical/prometheus-k8s-operator/tree/main/lib/charms/prometheus_k8s

Using this library requires you to fetch the juju_topology library from
[observability-libs](https://charmhub.io/observability-libs/libraries/juju_topology):

`charmcraft fetch-lib charms.observability_libs.v0.juju_topology`

## Provider Library Usage

This Prometheus charm interacts with its scrape targets using its
charm library. Charms seeking to expose metrics endpoints for the
Prometheus charm must do so using the `MetricsEndpointProvider`
object from this charm library. For the simplest use cases, using the
`MetricsEndpointProvider` object only requires instantiating it,
typically in the constructor of your charm (the one which exposes a
metrics endpoint). The `MetricsEndpointProvider` constructor requires
the name of the relation over which a scrape target (metrics endpoint)
is exposed to the Prometheus charm. This relation must use the
`prometheus_scrape` interface. By default, the address of the metrics
endpoint is set to the unit IP address by each unit of the
`MetricsEndpointProvider` charm. These units set their address in
response to the `PebbleReady` event of each container in the unit,
since container restarts of Kubernetes charms can result in changed
IP addresses. The default name for the metrics endpoint relation is
`metrics-endpoint`. It is strongly recommended to use the same
relation name for consistency across charms; doing so also obviates the
need for an additional constructor argument. The
`MetricsEndpointProvider` object may be instantiated as follows:

    from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointProvider

    def __init__(self, *args):
        super().__init__(*args)
        ...
        self.metrics_endpoint = MetricsEndpointProvider(self)

Note that the first argument (`self`) to `MetricsEndpointProvider` is
always a reference to the parent (scrape target) charm.

An instantiated `MetricsEndpointProvider` object will ensure that each
unit of its parent charm is a scrape target for the
`MetricsEndpointConsumer` (Prometheus) charm. By default,
`MetricsEndpointProvider` assumes each unit of the provider charm
exports its metrics at the `/metrics` path on port 80. These
defaults may be changed by providing the `MetricsEndpointProvider`
constructor an optional argument (`jobs`) that represents a
Prometheus scrape job specification using Python standard data
structures. This job specification is a subset of Prometheus' own
[scrape configuration](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)
format but represented using Python data structures. More than one job
may be provided using the `jobs` argument. Hence `jobs` accepts a list
of dictionaries where each dictionary represents one `<scrape_config>`
object as described in the Prometheus documentation. The currently
supported configuration subset is: `job_name`, `metrics_path`,
`static_configs`.

Suppose it is required to change the port on which scraped metrics are
exposed to 8000. This may be done by providing the following data
structure as the value of `jobs`.

    jobs = [
        {
            "static_configs": [
                {
                    "targets": ["*:8000"]
                }
            ]
        }
    ]

The wildcard ("*") host specification implies that the scrape targets
will automatically be set to the host addresses advertised by each
unit of the provider charm.

It is also possible to change the metrics path and scrape multiple
targets, for example:

    jobs = [
        {
            "metrics_path": "/my-metrics-path",
            "static_configs": [
                {
                    "targets": ["*:8000", "*:8081"],
                }
            ]
        }
    ]

More complex scrape configurations are possible. For example:

    jobs = [
        {
            "static_configs": [
                {
                    "targets": ["10.1.32.215:7000", "*:8000"],
                    "labels": {
                        "some_key": "some-value"
                    }
                }
            ]
        }
    ]

This example scrapes the target "10.1.32.215" at port 7000 in addition
to scraping each unit at port 8000. There is, however, one difference
between wildcard targets (specified using "*") and fully qualified
targets (such as "10.1.32.215"). The Prometheus charm automatically
associates labels with metrics generated by each target. These labels
localise the source of metrics within the Juju topology by specifying
its "model name", "model UUID", "application name" and "unit
name". The unit name label, however, is associated only with wildcard
targets, not with fully qualified targets.

Multiple jobs with different metrics paths and labels are allowed, but
each job must be given a unique name:

    jobs = [
        {
            "job_name": "my-first-job",
            "metrics_path": "one-path",
            "static_configs": [
                {
                    "targets": ["*:7000"],
                    "labels": {
                        "some_key": "some-value"
                    }
                }
            ]
        },
        {
            "job_name": "my-second-job",
            "metrics_path": "another-path",
            "static_configs": [
                {
                    "targets": ["*:8000"],
                    "labels": {
                        "some_other_key": "some-other-value"
                    }
                }
            ]
        }
    ]

**Important:** `job_name` should be a fixed string (e.g. a hardcoded literal).
If you include variable elements, like your `unit.name`, it may break
the continuity of the metrics time series gathered by Prometheus when the leader unit
changes (e.g. on upgrade or rescale).
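
As an illustrative sketch of the difference (not taken from the library itself):

    # Good: the job name is stable across leader changes and upgrades.
    jobs = [{"job_name": "my-app-metrics", "static_configs": [{"targets": ["*:8000"]}]}]

    # Risky: the job name varies with the unit that happens to set it,
    # breaking the continuity of the gathered time series.
    jobs = [{"job_name": self.unit.name, "static_configs": [{"targets": ["*:8000"]}]}]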

Additionally, it is also technically possible, but **strongly discouraged**, to
configure the following scrape-related settings, which behave as described by the
[Prometheus documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config):

- `scrape_interval`
- `scrape_timeout`
- `proxy_url`
- `relabel_configs`
- `metrics_relabel_configs`
- `sample_limit`
- `label_limit`
- `label_name_length_limit`
- `label_value_length_limit`

The settings above are supported by the `prometheus_scrape` library only for the sake of
specialized facilities like the [Prometheus Scrape Config](https://charmhub.io/prometheus-scrape-config-k8s)
charm. Virtually no charms should use these settings, and charmers definitely **should not**
expose them to the Juju administrator via configuration options.

## Consumer Library Usage

The `MetricsEndpointConsumer` object may be used by Prometheus
charms to manage relations with their scrape targets. For this
purpose, a Prometheus charm needs to do two things:

1. Instantiate the `MetricsEndpointConsumer` object by providing it a
reference to the parent (Prometheus) charm and optionally the name of
the relation that the Prometheus charm uses to interact with scrape
targets. This relation must conform to the `prometheus_scrape`
interface and it is strongly recommended that this relation be named
`metrics-endpoint`, which is its default value.

For example a Prometheus charm may instantiate the
`MetricsEndpointConsumer` in its constructor as follows:

    from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointConsumer

    def __init__(self, *args):
        super().__init__(*args)
        ...
        self.metrics_consumer = MetricsEndpointConsumer(self)

2. A Prometheus charm also needs to respond to the
`TargetsChangedEvent` event of the `MetricsEndpointConsumer` by adding itself as
an observer for these events, as in:

    self.framework.observe(
        self.metrics_consumer.on.targets_changed,
        self._on_scrape_targets_changed,
    )

In responding to the `TargetsChangedEvent` event the Prometheus
charm must update the Prometheus configuration so that any new scrape
targets are added and/or old ones removed from the list of scraped
endpoints. For this purpose the `MetricsEndpointConsumer` object
exposes a `jobs()` method that returns a list of scrape jobs. Each
element of this list is the Prometheus scrape configuration for that
job. In order to update the Prometheus configuration, the Prometheus
charm needs to replace the current list of jobs with the list provided
by `jobs()` as follows:

    def _on_scrape_targets_changed(self, event):
        ...
        scrape_jobs = self.metrics_consumer.jobs()
        for job in scrape_jobs:
            prometheus_scrape_config.append(job)
        ...
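
As a slightly fuller sketch, a handler might regenerate the whole `scrape_configs`
section and push it to the workload container. The names `PROMETHEUS_CONFIG_PATH`,
`self._base_config()` and `self.container` below are hypothetical placeholders,
not part of this library:

    def _on_scrape_targets_changed(self, event):
        # hypothetical: global sections of prometheus.yml, managed by the charm
        config = self._base_config()
        config["scrape_configs"] = self.metrics_consumer.jobs()
        self.container.push(PROMETHEUS_CONFIG_PATH, yaml.safe_dump(config), make_dirs=True)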

## Alerting Rules

This charm library also supports gathering alerting rules from all
related `MetricsEndpointProvider` charms and enabling corresponding alerts within the
Prometheus charm. Alert rules are automatically gathered by `MetricsEndpointProvider`
charms when using this library, from a directory conventionally named
`prometheus_alert_rules`. This directory must reside at the top level
in the `src` folder of the provider charm. Each file in this directory
is assumed to be in one of two formats:
- the official prometheus alert rule format, conforming to the
[Prometheus docs](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
- a single rule format, which is a simplified subset of the official format,
comprising a single alert rule per file, using the same YAML fields.

The file name must have one of the following extensions:
- `.rule`
- `.rules`
- `.yml`
- `.yaml`

An example of the contents of such a file in the custom single rule
format is shown below.

    alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{my_key=my_value} > 0.5
    for: 10m
    labels:
      severity: Medium
    annotations:
      summary: High request latency for {{ $labels.instance }}.

The `MetricsEndpointProvider` will read all available alert rules and
also inject "filtering labels" into the alert expressions. The
filtering labels ensure that alert rules are localised to the metrics
provider charm's Juju topology (application, model and its UUID). Such
a topology filter is essential to ensure that alert rules submitted by
one provider charm generate alerts only for that same charm. When
alert rules are embedded in a charm, and the charm is deployed as a
Juju application, the alert rules from that application have their
expressions automatically updated to filter for metrics coming from
the units of that application alone. This removes the risk of spurious
evaluation, e.g., when you have multiple deployments of the same charm
monitored by the same Prometheus.

Not all alerts one may want to specify can be embedded in a
charm. Some alert rules will be specific to a user's use case. This is
the case, for example, for alert rules that are based on business
constraints, like expecting a certain amount of requests to a specific
API every five minutes. Such alert rules can be specified via the
[COS Config Charm](https://charmhub.io/cos-configuration-k8s),
which allows importing alert rules and other settings like dashboards
from a Git repository.

Gathering alert rules and generating rule files within the Prometheus
charm is easily done using the `alerts()` method of
`MetricsEndpointConsumer`. Alerts generated by Prometheus will
automatically include Juju topology labels in the alerts. These labels
indicate the source of the alert. The following labels are
automatically included with each alert:

- `juju_model`
- `juju_model_uuid`
- `juju_application`

## Relation Data

The Prometheus charm uses both application and unit relation data to
obtain information regarding its scrape jobs, alert rules and scrape
targets. This relation data is in JSON format and it closely resembles
the YAML structure of Prometheus
[scrape configuration](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config).

Units of metrics provider charms advertise their names and addresses
over unit relation data using the `prometheus_scrape_unit_name` and
`prometheus_scrape_unit_address` keys. The `scrape_metadata`,
`scrape_jobs` and `alert_rules` keys in the application relation data
of metrics provider charms hold the eponymous information.
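
As an illustrative sketch (the field values below are made up, not a normative
example), the relation data for a single provider unit and its application might
look like:

    # unit relation data
    {
        "prometheus_scrape_unit_name": "myapp/0",
        "prometheus_scrape_unit_address": "10.1.32.215"
    }

    # application relation data (values are JSON-encoded strings)
    {
        "scrape_metadata": "{\"model\": ..., \"model_uuid\": ..., \"application\": ...}",
        "scrape_jobs": "[{\"metrics_path\": \"/metrics\", \"static_configs\": ...}]",
        "alert_rules": "{\"groups\": [...]}"
    }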
"""

import copy
import hashlib
import json
import logging
import os
import re
import subprocess
from collections import defaultdict
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
from urllib.error import HTTPError, URLError
from urllib.parse import urlparse
from urllib.request import urlopen

import yaml
from charms.observability_libs.v0.juju_topology import JujuTopology
from ops.charm import CharmBase, RelationRole
from ops.framework import (
    BoundEvent,
    EventBase,
    EventSource,
    Object,
    ObjectEvents,
    StoredDict,
    StoredList,
)
from ops.model import Relation

# The unique Charmhub library identifier, never change it
LIBID = "bc84295fef5f4049878f07b131968ee2"

# Increment this major API version when introducing breaking changes
LIBAPI = 0

# Increment this PATCH version before using `charmcraft publish-lib` or reset
# to 0 if you are raising the major API version

logger = logging.getLogger(__name__)
398 "metrics_relabel_configs",
401 "label_name_length_limit",
402 "label_value_length_limit",
408 "metrics_path": "/metrics",
409 "static_configs": [{"targets": ["*:80"]}],
413 DEFAULT_RELATION_NAME
= "metrics-endpoint"
414 RELATION_INTERFACE_NAME
= "prometheus_scrape"
416 DEFAULT_ALERT_RULES_RELATIVE_PATH
= "./src/prometheus_alert_rules"


class PrometheusConfig:
    """A namespace for utility functions for manipulating the prometheus config dict."""

    # relabel instance labels so that instance identifiers are globally unique and
    # stable over unit recreation
    topology_relabel_config = {
        "source_labels": ["juju_model", "juju_model_uuid", "juju_application"],
        "separator": "_",
        "target_label": "instance",
        "regex": "(.*)",
    }

    topology_relabel_config_wildcard = {
        "source_labels": ["juju_model", "juju_model_uuid", "juju_application", "juju_unit"],
        "separator": "_",
        "target_label": "instance",
        "regex": "(.*)",
    }

    @staticmethod
    def sanitize_scrape_config(job: dict) -> dict:
        """Restrict permissible scrape configuration options.

        If job is empty then a default job is returned. The default job is:

            {
                "metrics_path": "/metrics",
                "static_configs": [{"targets": ["*:80"]}],
            }

        Args:
            job: a dict containing a single Prometheus job
                specification.

        Returns:
            a dictionary containing a sanitized job specification.
        """
        sanitized_job = DEFAULT_JOB.copy()
        sanitized_job.update({key: value for key, value in job.items() if key in ALLOWED_KEYS})
        return sanitized_job
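
    # Illustrative usage (a sketch, not part of the library's API surface): keys
    # outside ALLOWED_KEYS are dropped, and defaults are kept unless overridden.
    #
    #     >>> PrometheusConfig.sanitize_scrape_config({"job_name": "foo", "bogus": 1})
    #     {'metrics_path': '/metrics', 'static_configs': [{'targets': ['*:80']}], 'job_name': 'foo'}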

    @staticmethod
    def sanitize_scrape_configs(scrape_configs: List[dict]) -> List[dict]:
        """A vectorized version of `sanitize_scrape_config`."""
        return [PrometheusConfig.sanitize_scrape_config(job) for job in scrape_configs]

    @staticmethod
    def prefix_job_names(scrape_configs: List[dict], prefix: str) -> List[dict]:
        """Adds the given prefix to all the job names in the given scrape_configs list."""
        modified_scrape_configs = []
        for scrape_config in scrape_configs:
            job_name = scrape_config.get("job_name")
            modified = scrape_config.copy()
            modified["job_name"] = prefix + "_" + job_name if job_name else prefix
            modified_scrape_configs.append(modified)

        return modified_scrape_configs
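
    # Illustrative usage (a sketch): jobs with a name get "<prefix>_" prepended,
    # while nameless jobs get the prefix as their whole name.
    #
    #     >>> PrometheusConfig.prefix_job_names([{"job_name": "api"}, {}], "juju_mymodel")
    #     [{'job_name': 'juju_mymodel_api'}, {'job_name': 'juju_mymodel'}]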

    @staticmethod
    def expand_wildcard_targets_into_individual_jobs(
        scrape_jobs: List[dict],
        hosts: Dict[str, Tuple[str, str]],
        topology: Optional[JujuTopology] = None,
    ) -> List[dict]:
        """Extract wildcard hosts from the given scrape_configs list into separate jobs.

        Args:
            scrape_jobs: list of scrape jobs.
            hosts: a dictionary mapping host names to host address for
                all units of the relation for which this job configuration
                must be generated.
            topology: optional arg for adding topology labels to scrape targets.
        """
        # hosts = self._relation_hosts(relation)

        modified_scrape_jobs = []
        for job in scrape_jobs:
            static_configs = job.get("static_configs")
            if not static_configs:
                continue

            # When a single unit specified more than one wildcard target, then they are expanded
            # into a static_config per target
            non_wildcard_static_configs = []

            for static_config in static_configs:
                targets = static_config.get("targets")
                if not targets:
                    continue

                # All non-wildcard targets remain in the same static_config
                non_wildcard_targets = []

                # All wildcard targets are extracted to a job per unit. If multiple wildcard
                # targets are specified, they remain in the same static_config (per unit).
                wildcard_targets = []

                for target in targets:
                    match = re.compile(r"\*(?:(:\d+))?").match(target)
                    if match:
                        # This is a wildcard target.
                        # Need to expand into separate jobs and remove it from this job here
                        wildcard_targets.append(target)
                    else:
                        # This is not a wildcard target. Copy it over into its own static_config.
                        non_wildcard_targets.append(target)

                # All non-wildcard targets remain in the same static_config
                if non_wildcard_targets:
                    non_wildcard_static_config = static_config.copy()
                    non_wildcard_static_config["targets"] = non_wildcard_targets

                    if topology:
                        # When non-wildcard targets (aka fully qualified hostnames) are specified,
                        # there is no reliable way to determine the name (Juju topology unit name)
                        # for such a target. Therefore labeling with Juju topology, excluding the
                        # unit name.
                        non_wildcard_static_config["labels"] = {
                            **non_wildcard_static_config.get("labels", {}),
                            **topology.label_matcher_dict,
                        }

                    non_wildcard_static_configs.append(non_wildcard_static_config)

                # Extract wildcard targets into individual jobs
                if wildcard_targets:
                    for unit_name, (unit_hostname, unit_path) in hosts.items():
                        modified_job = job.copy()
                        modified_job["static_configs"] = [static_config.copy()]
                        modified_static_config = modified_job["static_configs"][0]
                        modified_static_config["targets"] = [
                            target.replace("*", unit_hostname) for target in wildcard_targets
                        ]

                        unit_num = unit_name.split("/")[-1]
                        job_name = modified_job.get("job_name", "unnamed-job") + "-" + unit_num
                        modified_job["job_name"] = job_name
                        modified_job["metrics_path"] = unit_path + (
                            job.get("metrics_path") or "/metrics"
                        )

                        if topology:
                            # Add topology labels
                            modified_static_config["labels"] = {
                                **modified_static_config.get("labels", {}),
                                **topology.label_matcher_dict,
                                **{"juju_unit": unit_name},
                            }

                            # Instance relabeling for topology should be last in order.
                            modified_job["relabel_configs"] = modified_job.get(
                                "relabel_configs", []
                            ) + [PrometheusConfig.topology_relabel_config_wildcard]

                        modified_scrape_jobs.append(modified_job)

            if non_wildcard_static_configs:
                modified_job = job.copy()
                modified_job["static_configs"] = non_wildcard_static_configs
                modified_job["metrics_path"] = modified_job.get("metrics_path") or "/metrics"

                if topology:
                    # Instance relabeling for topology should be last in order.
                    modified_job["relabel_configs"] = modified_job.get("relabel_configs", []) + [
                        PrometheusConfig.topology_relabel_config
                    ]

                modified_scrape_jobs.append(modified_job)

        return modified_scrape_jobs
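
    # Illustrative usage (a sketch): a job with a wildcard target is expanded into
    # one job per unit in `hosts`, with "*" replaced by that unit's address.
    #
    #     >>> PrometheusConfig.expand_wildcard_targets_into_individual_jobs(
    #     ...     [{"job_name": "api", "static_configs": [{"targets": ["*:8000"]}]}],
    #     ...     {"app/0": ("10.0.0.1", "")},
    #     ... )
    #     [{'job_name': 'api-0', 'static_configs': [{'targets': ['10.0.0.1:8000']}], 'metrics_path': '/metrics'}]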

    @staticmethod
    def render_alertmanager_static_configs(alertmanagers: List[str]):
        """Render the alertmanager static_configs section from a list of URLs.

        Each target must be in the hostname:port format, and prefixes are specified in a separate
        key. Therefore, with ingress in place, one would need to extract the path into the
        `path_prefix` key, which is higher up in the config hierarchy.

        https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config

        Args:
            alertmanagers: List of alertmanager URLs.

        Returns:
            A dict representation for the static_configs section.
        """
        # Make sure it's a valid url so urlparse could parse it.
        scheme = re.compile(r"^https?://")
        sanitized = [am if scheme.search(am) else "http://" + am for am in alertmanagers]

        # Create a mapping from paths to netlocs
        # Group alertmanager targets into a dictionary of lists:
        # {path: [netloc1, netloc2]}
        paths = defaultdict(list)  # type: Dict[str, List[str]]
        for parsed in map(urlparse, sanitized):
            path = parsed.path or "/"
            paths[path].append(parsed.netloc)

        return {
            "alertmanagers": [
                {"path_prefix": path_prefix, "static_configs": [{"targets": netlocs}]}
                for path_prefix, netlocs in paths.items()
            ]
        }
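
    # Illustrative usage (a sketch): URLs are grouped by path prefix, and a scheme
    # is assumed for bare host:port targets.
    #
    #     >>> PrometheusConfig.render_alertmanager_static_configs(
    #     ...     ["http://am.example:9093/ingress", "am2.example:9093"]
    #     ... )
    #     {'alertmanagers': [{'path_prefix': '/ingress', 'static_configs': [{'targets': ['am.example:9093']}]}, {'path_prefix': '/', 'static_configs': [{'targets': ['am2.example:9093']}]}]}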


class RelationNotFoundError(Exception):
    """Raised if no relation with the given name is found."""

    def __init__(self, relation_name: str):
        self.relation_name = relation_name
        self.message = "No relation named '{}' found".format(relation_name)

        super().__init__(self.message)


class RelationInterfaceMismatchError(Exception):
    """Raised if the relation with the given name has a different interface."""

    def __init__(
        self,
        relation_name: str,
        expected_relation_interface: str,
        actual_relation_interface: str,
    ):
        self.relation_name = relation_name
        self.expected_relation_interface = expected_relation_interface
        self.actual_relation_interface = actual_relation_interface
        self.message = (
            "The '{}' relation has '{}' as interface rather than the expected '{}'".format(
                relation_name, actual_relation_interface, expected_relation_interface
            )
        )

        super().__init__(self.message)


class RelationRoleMismatchError(Exception):
    """Raised if the relation with the given name has a different role."""

    def __init__(
        self,
        relation_name: str,
        expected_relation_role: RelationRole,
        actual_relation_role: RelationRole,
    ):
        self.relation_name = relation_name
        self.expected_relation_role = expected_relation_role
        self.actual_relation_role = actual_relation_role
        self.message = "The '{}' relation has role '{}' rather than the expected '{}'".format(
            relation_name, repr(actual_relation_role), repr(expected_relation_role)
        )

        super().__init__(self.message)


class InvalidAlertRuleEvent(EventBase):
    """Event emitted when alert rule files are not parsable.

    Enables us to set a clear status on the provider.
    """

    def __init__(self, handle, errors: str = "", valid: bool = False):
        super().__init__(handle)
        self.errors = errors
        self.valid = valid

    def snapshot(self) -> Dict:
        """Save alert rule information."""
        return {
            "valid": self.valid,
            "errors": self.errors,
        }

    def restore(self, snapshot):
        """Restore alert rule information."""
        self.valid = snapshot["valid"]
        self.errors = snapshot["errors"]


class InvalidScrapeJobEvent(EventBase):
    """Event emitted when scrape jobs are not valid."""

    def __init__(self, handle, errors: str = ""):
        super().__init__(handle)
        self.errors = errors

    def snapshot(self) -> Dict:
        """Save error information."""
        return {"errors": self.errors}

    def restore(self, snapshot):
        """Restore error information."""
        self.errors = snapshot["errors"]


class MetricsEndpointProviderEvents(ObjectEvents):
    """Events raised by `MetricsEndpointProvider`."""

    alert_rule_status_changed = EventSource(InvalidAlertRuleEvent)
    invalid_scrape_job = EventSource(InvalidScrapeJobEvent)


def _type_convert_stored(obj):
    """Convert Stored* to their appropriate types, recursively."""
    if isinstance(obj, StoredList):
        return list(map(_type_convert_stored, obj))
    if isinstance(obj, StoredDict):
        rdict = {}  # type: Dict[Any, Any]
        for k in obj.keys():
            rdict[k] = _type_convert_stored(obj[k])
        return rdict
    return obj


def _validate_relation_by_interface_and_direction(
    charm: CharmBase,
    relation_name: str,
    expected_relation_interface: str,
    expected_relation_role: RelationRole,
):
    """Verifies that a relation has the necessary characteristics.

    Verifies that the `relation_name` provided: (1) exists in metadata.yaml,
    (2) declares as interface the interface name passed as `relation_interface`
    and (3) has the right "direction", i.e., it is a relation that `charm`
    provides or requires.

    Args:
        charm: a `CharmBase` object to scan for the matching relation.
        relation_name: the name of the relation to be verified.
        expected_relation_interface: the interface name to be matched by the
            relation named `relation_name`.
        expected_relation_role: whether the `relation_name` must be either
            provided or required by `charm`.

    Raises:
        RelationNotFoundError: If there is no relation in the charm's metadata.yaml
            with the same name as provided via `relation_name` argument.
        RelationInterfaceMismatchError: The relation with the same name as provided
            via `relation_name` argument does not have the same relation interface
            as specified via the `expected_relation_interface` argument.
        RelationRoleMismatchError: If the relation with the same name as provided
            via `relation_name` argument does not have the same role as specified
            via the `expected_relation_role` argument.
    """
    if relation_name not in charm.meta.relations:
        raise RelationNotFoundError(relation_name)

    relation = charm.meta.relations[relation_name]

    actual_relation_interface = relation.interface_name
    if actual_relation_interface != expected_relation_interface:
        raise RelationInterfaceMismatchError(
            relation_name, expected_relation_interface, actual_relation_interface
        )

    if expected_relation_role == RelationRole.provides:
        if relation_name not in charm.meta.provides:
            raise RelationRoleMismatchError(
                relation_name, RelationRole.provides, RelationRole.requires
            )
    elif expected_relation_role == RelationRole.requires:
        if relation_name not in charm.meta.requires:
            raise RelationRoleMismatchError(
                relation_name, RelationRole.requires, RelationRole.provides
            )
    else:
        raise Exception("Unexpected RelationDirection: {}".format(expected_relation_role))


class InvalidAlertRulePathError(Exception):
    """Raised if the alert rules folder cannot be found or is otherwise invalid."""

    def __init__(
        self,
        alert_rules_absolute_path: Path,
        message: str,
    ):
        self.alert_rules_absolute_path = alert_rules_absolute_path
        self.message = message

        super().__init__(self.message)


def _is_official_alert_rule_format(rules_dict: dict) -> bool:
    """Are alert rules in the upstream format as supported by Prometheus.

    Alert rules in dictionary format are in "official" form if they
    contain a "groups" key, since this implies they contain a list of
    alert rule groups.

    Args:
        rules_dict: a set of alert rules in Python dictionary format

    Returns:
        True if alert rules are in official Prometheus file format.
    """
    return "groups" in rules_dict


def _is_single_alert_rule_format(rules_dict: dict) -> bool:
    """Are alert rules in single rule format.

    The Prometheus charm library supports reading of alert rules in a
    custom format that consists of a single alert rule per file. This
    does not conform to the official Prometheus alert rule file format
    which requires that each alert rules file consists of a list of
    alert rule groups and each group consists of a list of alert
    rules.

    Alert rules in dictionary form are considered to be in single rule
    format if, at the very least, they contain the two keys corresponding
    to the alert rule name and alert expression.

    Returns:
        True if alert rule is in single rule file format.
    """
    # one alert rule per file
    return set(rules_dict) >= {"alert", "expr"}
846 """Utility class for amalgamating prometheus alert rule files and injecting juju topology.
848 An `AlertRules` object supports aggregating alert rules from files and directories in both
849 official and single rule file formats using the `add_path()` method. All the alert rules
850 read are annotated with Juju topology labels and amalgamated into a single data structure
851 in the form of a Python dictionary using the `as_dict()` method. Such a dictionary can be
852 easily dumped into JSON format and exchanged over relation data. The dictionary can also
853 be dumped into YAML format and written directly into an alert rules file that is read by
854 Prometheus. Note that multiple `AlertRules` objects must not be written into the same file,
855 since Prometheus allows only a single list of alert rule groups per alert rules file.
857 The official Prometheus format is a YAML file conforming to the Prometheus documentation
858 (https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/).
859 The custom single rule format is a subsection of the official YAML, having a single alert
860 rule, effectively "one alert per file".
863 # This class uses the following terminology for the various parts of a rule file:
864 # - alert rules file: the entire groups[] yaml, including the "groups:" key.
865 # - alert groups (plural): the list of groups[] (a list, i.e. no "groups:" key) - it is a list
866 # of dictionaries that have the "name" and "rules" keys.
867 # - alert group (singular): a single dictionary that has the "name" and "rules" keys.
868 # - alert rules (plural): all the alerts in a given alert group - a list of dictionaries with
869 # the "alert" and "expr" keys.
870 # - alert rule (singular): a single dictionary that has the "alert" and "expr" keys.
872 def __init__(self
, topology
: Optional
[JujuTopology
] = None):
873 """Build and alert rule object.
876 topology: an optional `JujuTopology` instance that is used to annotate all alert rules.
878 self
.topology
= topology
879 self
.tool
= CosTool(None)
880 self
.alert_groups
= [] # type: List[dict]

    def _from_file(self, root_path: Path, file_path: Path) -> List[dict]:
        """Read a rules file from path, injecting juju topology.

        Args:
            root_path: full path to the root rules folder (used only for generating group name)
            file_path: full path to a *.rule file.

        Returns:
            A list of dictionaries representing the rules file, if file is valid (the structure is
            formed by `yaml.safe_load` of the file); an empty list otherwise.
        """
        with file_path.open() as rf:
            # Load a list of rules from file then add labels and filters
            try:
                rule_file = yaml.safe_load(rf)

            except Exception as e:
                logger.error("Failed to read alert rules from %s: %s", file_path.name, e)
                return []

            if not rule_file:
                logger.warning("Empty rules file: %s", file_path.name)
                return []
            if not isinstance(rule_file, dict):
                logger.error("Invalid rules file (must be a dict): %s", file_path.name)
                return []
            if _is_official_alert_rule_format(rule_file):
                alert_groups = rule_file["groups"]
            elif _is_single_alert_rule_format(rule_file):
                # convert to list of alert groups
                # group name is made up from the file name
                alert_groups = [{"name": file_path.stem, "rules": [rule_file]}]
            else:
                # invalid/unsupported
                logger.error("Invalid rules file: %s", file_path.name)
                return []

            # update rules with additional metadata
            for alert_group in alert_groups:
                # update group name with topology and sub-path
                alert_group["name"] = self._group_name(
                    str(root_path),
                    str(file_path),
                    alert_group["name"],
                )

                # add "juju_" topology labels
                for alert_rule in alert_group["rules"]:
                    if "labels" not in alert_rule:
                        alert_rule["labels"] = {}

                    if self.topology:
                        alert_rule["labels"].update(self.topology.label_matcher_dict)
                        # insert juju topology filters into a prometheus alert rule
                        alert_rule["expr"] = self.tool.inject_label_matchers(
                            re.sub(r"%%juju_topology%%,?", "", alert_rule["expr"]),
                            self.topology.label_matcher_dict,
                        )

            return alert_groups

    def _group_name(self, root_path: str, file_path: str, group_name: str) -> str:
        """Generate group name from path and topology.

        The group name is made up of the relative path between the root dir_path, the file path,
        and topology identifier.

        Args:
            root_path: path to the root rules dir.
            file_path: path to rule file.
            group_name: original group name to keep as part of the new augmented group name

        Returns:
            New group name, augmented by juju topology and relative path.
        """
        rel_path = os.path.relpath(os.path.dirname(file_path), root_path)
        rel_path = "" if rel_path == "." else rel_path.replace(os.path.sep, "_")

        # Generate group name:
        #  - name, from juju topology
        #  - suffix, from the relative path of the rule file;
        group_name_parts = [self.topology.identifier] if self.topology else []
        group_name_parts.extend([rel_path, group_name, "alerts"])
        # filter to remove empty strings
        return "_".join(filter(None, group_name_parts))
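
    # Illustrative behavior (a sketch), with no topology set:
    #
    #     >>> AlertRules()._group_name("/rules", "/rules/nested/my.rule", "latency")
    #     'nested_latency_alerts'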

    @classmethod
    def _multi_suffix_glob(
        cls, dir_path: Path, suffixes: List[str], recursive: bool = True
    ) -> list:
        """Helper function for getting all files in a directory that have a matching suffix.

        Args:
            dir_path: path to the directory to glob from.
            suffixes: list of suffixes to include in the glob (items should begin with a period).
            recursive: a flag indicating whether a glob is recursive (nested) or not.

        Returns:
            List of files in `dir_path` that have one of the suffixes specified in `suffixes`.
        """
        all_files_in_dir = dir_path.glob("**/*" if recursive else "*")
        return list(filter(lambda f: f.is_file() and f.suffix in suffixes, all_files_in_dir))

    def _from_dir(self, dir_path: Path, recursive: bool) -> List[dict]:
        """Read all rule files in a directory.

        All rules from files for the same directory are loaded into a single
        group. The generated name of this group includes juju topology.
        By default, only the top directory is scanned; for nested scanning, pass `recursive=True`.

        Args:
            dir_path: directory containing *.rule files (alert rules without groups).
            recursive: flag indicating whether to scan for rule files recursively.

        Returns:
            a list of dictionaries representing prometheus alert rule groups, each dictionary
            representing an alert group (structure determined by `yaml.safe_load`).
        """
        alert_groups = []  # type: List[dict]

        # Gather all alerts into a list of groups
        for file_path in self._multi_suffix_glob(
            dir_path, [".rule", ".rules", ".yml", ".yaml"], recursive
        ):
            alert_groups_from_file = self._from_file(dir_path, file_path)
            if alert_groups_from_file:
                logger.debug("Reading alert rule from %s", file_path)
                alert_groups.extend(alert_groups_from_file)

        return alert_groups

    def add_path(self, path: str, *, recursive: bool = False) -> None:
        """Add rules from a dir path.

        All rules from files are aggregated into a data structure representing a single rule file.
        All group names are augmented with juju topology.

        Args:
            path: either a rules file or a dir of rules files.
            recursive: whether to read files recursively or not (no impact if `path` is a file).
        """
        path = Path(path)  # type: Path
        if path.is_dir():
            self.alert_groups.extend(self._from_dir(path, recursive))
        elif path.is_file():
            self.alert_groups.extend(self._from_file(path.parent, path))
        else:
            logger.debug("Alert rules path does not exist: %s", path)

    def as_dict(self) -> dict:
        """Return standard alert rules file in dict representation.

        Returns:
            a dictionary containing a single list of alert rule groups.
            The list of alert rule groups is provided as value of the
            "groups" dictionary key.
        """
        return {"groups": self.alert_groups} if self.alert_groups else {}
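
    # Illustrative usage (a sketch of how the provider side employs this class;
    # "./src/prometheus_alert_rules" is the conventional rules directory):
    #
    #     alert_rules = AlertRules(topology=JujuTopology.from_charm(charm))
    #     alert_rules.add_path("./src/prometheus_alert_rules", recursive=True)
    #     relation_payload = json.dumps(alert_rules.as_dict())  # ready for relation data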


class TargetsChangedEvent(EventBase):
    """Event emitted when Prometheus scrape targets change."""

    def __init__(self, handle, relation_id):
        super().__init__(handle)
        self.relation_id = relation_id

    def snapshot(self):
        """Save scrape target relation information."""
        return {"relation_id": self.relation_id}

    def restore(self, snapshot):
        """Restore scrape target relation information."""
        self.relation_id = snapshot["relation_id"]


class MonitoringEvents(ObjectEvents):
    """Event descriptor for events raised by `MetricsEndpointConsumer`."""

    targets_changed = EventSource(TargetsChangedEvent)


class MetricsEndpointConsumer(Object):
    """A Prometheus based Monitoring service."""

    on = MonitoringEvents()

    def __init__(self, charm: CharmBase, relation_name: str = DEFAULT_RELATION_NAME):
        """A Prometheus based Monitoring service.

        Args:
            charm: a `CharmBase` instance that manages this
                instance of the Prometheus service.
            relation_name: an optional string name of the relation between `charm`
                and the Prometheus charmed service. The default is "metrics-endpoint".
                It is strongly advised not to change the default, so that people
                deploying your charm will have a consistent experience with all
                other charms that consume metrics endpoints.

        Raises:
            RelationNotFoundError: If there is no relation in the charm's metadata.yaml
                with the same name as provided via `relation_name` argument.
            RelationInterfaceMismatchError: The relation with the same name as provided
                via `relation_name` argument does not have the `prometheus_scrape` relation
                interface.
            RelationRoleMismatchError: If the relation with the same name as provided
                via `relation_name` argument does not have the `RelationRole.requires`
                role.
        """
        _validate_relation_by_interface_and_direction(
            charm, relation_name, RELATION_INTERFACE_NAME, RelationRole.requires
        )

        super().__init__(charm, relation_name)
        self._charm = charm
        self._relation_name = relation_name
        self._tool = CosTool(self._charm)
        events = self._charm.on[relation_name]
        self.framework.observe(events.relation_changed, self._on_metrics_provider_relation_changed)
        self.framework.observe(
            events.relation_departed, self._on_metrics_provider_relation_departed
        )

    def _on_metrics_provider_relation_changed(self, event):
        """Handle changes with related metrics providers.

        Anytime there are changes in relations between Prometheus
        and metrics provider charms the Prometheus charm is informed,
        through a `TargetsChangedEvent` event. The Prometheus charm can
        then choose to update its scrape configuration.

        Args:
            event: a `CharmEvent` in response to which the Prometheus
                charm must update its scrape configuration.
        """
        rel_id = event.relation.id

        self.on.targets_changed.emit(relation_id=rel_id)

    def _on_metrics_provider_relation_departed(self, event):
        """Update job config when a metrics provider departs.

        When a metrics provider departs the Prometheus charm is informed
        through a `TargetsChangedEvent` event so that it can update its
        scrape configuration to ensure that the departed metrics provider
        is removed from the list of scrape jobs.

        Args:
            event: a `CharmEvent` that indicates a metrics provider
                unit has departed.
        """
        rel_id = event.relation.id
        self.on.targets_changed.emit(relation_id=rel_id)

    def jobs(self) -> list:
        """Fetch the list of scrape jobs.

        Returns:
            A list consisting of all the static scrape configurations
            for each related `MetricsEndpointProvider` that has specified
            its scrape targets.
        """
        scrape_jobs = []

        for relation in self._charm.model.relations[self._relation_name]:
            static_scrape_jobs = self._static_scrape_config(relation)
            if static_scrape_jobs:
                # Duplicate job names will cause validate_scrape_jobs to fail.
                # Therefore we need to dedupe here and after all jobs are collected.
                static_scrape_jobs = _dedupe_job_names(static_scrape_jobs)
                try:
                    self._tool.validate_scrape_jobs(static_scrape_jobs)
                except subprocess.CalledProcessError as e:
                    if self._charm.unit.is_leader():
                        data = json.loads(relation.data[self._charm.app].get("event", "{}"))
                        data["scrape_job_errors"] = str(e)
                        relation.data[self._charm.app]["event"] = json.dumps(data)
                else:
                    scrape_jobs.extend(static_scrape_jobs)

        scrape_jobs = _dedupe_job_names(scrape_jobs)

        return scrape_jobs

    def alerts(self) -> dict:
        """Fetch alerts for all relations.

        A Prometheus alert rules file consists of a list of "groups". Each
        group consists of a list of alerts (`rules`) that are sequentially
        executed. This method returns all the alert rules provided by each
        related metrics provider charm. These rules may be used to generate a
        separate alert rules file for each relation since the returned list
        of alert groups is indexed by that relation's Juju topology identifier.
        The Juju topology identifier string includes substrings that identify
        alert rule related metadata such as the Juju model, model UUID and the
        application name from where the alert rule originates. Since this
        topology identifier is globally unique, it may be used for instance as
        the name for the file into which the list of alert rule groups are
        written. For each relation, the structure of data returned is a dictionary
        representation of a standard prometheus rules file:

            {"groups": [{"name": ...}, ...]}

        per official prometheus documentation
        https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

        The value of the `groups` key is such that it may be used to generate
        a Prometheus alert rules file directly using `yaml.dump` but the
        `groups` key itself must be included as this is required by Prometheus.

        For example the list of alert rule groups returned by this method may
        be written into files consumed by Prometheus as follows:

            for topology_identifier, alert_rule_groups in self.metrics_consumer.alerts().items():
                filename = "juju_" + topology_identifier + ".rules"
                path = os.path.join(PROMETHEUS_RULES_DIR, filename)
                rules = yaml.safe_dump(alert_rule_groups)
                container.push(path, rules, make_dirs=True)

        Returns:
            A dictionary mapping the Juju topology identifier of the source charm to
            its list of alert rule groups.
        """
        alerts = {}  # type: Dict[str, dict] # mapping b/w juju identifiers and alert rule files
        for relation in self._charm.model.relations[self._relation_name]:
            if not relation.units or not relation.app:
                continue

            alert_rules = json.loads(relation.data[relation.app].get("alert_rules", "{}"))
            if not alert_rules:
                continue

            alert_rules = self._inject_alert_expr_labels(alert_rules)

            identifier, topology = self._get_identifier_by_alert_rules(alert_rules)
            if not topology:
                try:
                    scrape_metadata = json.loads(relation.data[relation.app]["scrape_metadata"])
                    identifier = JujuTopology.from_dict(scrape_metadata).identifier
                    alerts[identifier] = self._tool.apply_label_matchers(alert_rules)  # type: ignore

                except KeyError as e:
                    logger.debug(
                        "Relation %s has no 'scrape_metadata': %s",
                        relation.id,
                        e,
                    )

            if not identifier:
                logger.error(
                    "Alert rules were found but no usable group or identifier was present."
                )
                continue

            alerts[identifier] = alert_rules

            _, errmsg = self._tool.validate_alert_rules(alert_rules)
            if errmsg:
                if alerts[identifier]:
                    del alerts[identifier]
                if self._charm.unit.is_leader():
                    data = json.loads(relation.data[self._charm.app].get("event", "{}"))
                    data["errors"] = errmsg
                    relation.data[self._charm.app]["event"] = json.dumps(data)
                continue

        return alerts

    def _get_identifier_by_alert_rules(
        self, rules: dict
    ) -> Tuple[Union[str, None], Union[JujuTopology, None]]:
        """Determine an appropriate dict key for alert rules.

        The key is used as the filename when writing alerts to disk, so the structure
        and uniqueness is important.

        Args:
            rules: a dict of alert rules

        Returns:
            A tuple containing an identifier, if found, and a JujuTopology, if it could
            be constructed.
        """
        if "groups" not in rules:
            logger.debug("No alert groups were found in relation data")
            return None, None

        # Construct an ID based on what's in the alert rules if they have labels
        for group in rules["groups"]:
            try:
                labels = group["rules"][0]["labels"]
                topology = JujuTopology(
                    # Don't try to safely get required constructor fields. There's already
                    # a handler for KeyErrors
                    model_uuid=labels["juju_model_uuid"],
                    model=labels["juju_model"],
                    application=labels["juju_application"],
                    unit=labels.get("juju_unit", ""),
                    charm_name=labels.get("juju_charm", ""),
                )
                return topology.identifier, topology
            except KeyError:
                logger.debug("Alert rules were found but no usable labels were present")
                continue

        logger.warning(
            "No labeled alert rules were found, and no 'scrape_metadata' "
            "was available. Using the alert group name as filename."
        )
        try:
            for group in rules["groups"]:
                return group["name"], None
        except KeyError:
            logger.debug("No group name was found to use as identifier")

        return None, None

    def _inject_alert_expr_labels(self, rules: Dict[str, Any]) -> Dict[str, Any]:
        """Iterate through alert rules and inject topology into expressions.

        Args:
            rules: a dict of alert rules
        """
        if "groups" not in rules:
            return rules

        modified_groups = []
        for group in rules["groups"]:
            # Copy off rules, so we don't modify an object we're iterating over
            rules_copy = group["rules"]
            for idx, rule in enumerate(rules_copy):
                labels = rule.get("labels")

                if labels:
                    try:
                        topology = JujuTopology(
                            # Don't try to safely get required constructor fields. There's already
                            # a handler for KeyErrors
                            model_uuid=labels["juju_model_uuid"],
                            model=labels["juju_model"],
                            application=labels["juju_application"],
                            unit=labels.get("juju_unit", ""),
                            charm_name=labels.get("juju_charm", ""),
                        )

                        # Inject topology and put it back in the list
                        rule["expr"] = self._tool.inject_label_matchers(
                            re.sub(r"%%juju_topology%%,?", "", rule["expr"]),
                            topology.label_matcher_dict,
                        )
                    except KeyError:
                        # Some required JujuTopology key is missing. Just move on.
                        pass

                    group["rules"][idx] = rule

            modified_groups.append(group)

        rules["groups"] = modified_groups
        return rules

    def _static_scrape_config(self, relation) -> list:
        """Generate the static scrape configuration for a single relation.

        If the relation data includes `scrape_metadata` then the value
        of this key is used to annotate the scrape jobs with Juju
        Topology labels before returning them.

        Args:
            relation: an `ops.model.Relation` object whose static
                scrape configuration is required.

        Returns:
            A list (possibly empty) of scrape jobs. Each job is a
            valid Prometheus scrape configuration for that job,
            represented as a Python dictionary.
        """
        if not relation.units:
            return []

        scrape_jobs = json.loads(relation.data[relation.app].get("scrape_jobs", "[]"))

        if not scrape_jobs:
            return []

        scrape_metadata = json.loads(relation.data[relation.app].get("scrape_metadata", "{}"))

        if not scrape_metadata:
            return scrape_jobs

        topology = JujuTopology.from_dict(scrape_metadata)

        job_name_prefix = "juju_{}_prometheus_scrape".format(topology.identifier)
        scrape_jobs = PrometheusConfig.prefix_job_names(scrape_jobs, job_name_prefix)
        scrape_jobs = PrometheusConfig.sanitize_scrape_configs(scrape_jobs)

        hosts = self._relation_hosts(relation)

        scrape_jobs = PrometheusConfig.expand_wildcard_targets_into_individual_jobs(
            scrape_jobs, hosts, topology
        )

        return scrape_jobs

    def _relation_hosts(self, relation: Relation) -> Dict[str, Tuple[str, str]]:
        """Returns a mapping from unit names to (address, path) tuples, for the given relation."""
        hosts = {}
        for unit in relation.units:
            # TODO deprecate and remove unit.name
            unit_name = relation.data[unit].get("prometheus_scrape_unit_name") or unit.name
            # TODO deprecate and remove "prometheus_scrape_host"
            unit_address = relation.data[unit].get(
                "prometheus_scrape_unit_address"
            ) or relation.data[unit].get("prometheus_scrape_host")
            unit_path = relation.data[unit].get("prometheus_scrape_unit_path", "")
            if unit_name and unit_address:
                hosts.update({unit_name: (unit_address, unit_path)})

        return hosts

    def _target_parts(self, target) -> list:
        """Extract host and port from a wildcard target.

        Args:
            target: a string specifying a scrape target. A
                scrape target is expected to have the format
                "host:port". The host part may be a wildcard
                "*" and the port part can be missing (along
                with ":") in which case port is set to 80.

        Returns:
            a list with target host and port as in [host, port]
        """
        if ":" in target:
            parts = target.split(":")
        else:
            parts = [target, "80"]

        return parts


def _dedupe_job_names(jobs: List[dict]):
    """Deduplicate a list of dicts by appending a hash to the value of the 'job_name' key.

    Additionally, fully de-duplicate any identical jobs.

    Args:
        jobs: A list of prometheus scrape jobs
    """
    jobs_copy = copy.deepcopy(jobs)

    # Convert to a dict with job names as keys
    # I think this line is O(n^2) but it should be okay given the list sizes
    jobs_dict = {
        job["job_name"]: list(filter(lambda x: x["job_name"] == job["job_name"], jobs_copy))
        for job in jobs_copy
    }

    # If multiple jobs have the same name, convert the name to "name_<hash-of-job>"
    for key in jobs_dict:
        if len(jobs_dict[key]) > 1:
            for job in jobs_dict[key]:
                job_json = json.dumps(job)
                hashed = hashlib.sha256(job_json.encode()).hexdigest()
                job["job_name"] = "{}_{}".format(job["job_name"], hashed)
    new_jobs = []
    for key in jobs_dict:
        new_jobs.extend(list(jobs_dict[key]))

    # Deduplicate jobs which are equal
    # Again this is O(n^2) but it should be okay
    deduped_jobs = []
    seen = set()
    for job in new_jobs:
        job_json = json.dumps(job)
        hashed = hashlib.sha256(job_json.encode()).hexdigest()
        if hashed in seen:
            continue
        seen.add(hashed)
        deduped_jobs.append(job)

    return deduped_jobs


def _resolve_dir_against_charm_path(charm: CharmBase, *path_elements: str) -> str:
    """Resolve the provided path items against the directory of the main file.

    Look up the directory of the `main.py` file being executed. This is normally
    going to be the charm.py file of the charm including this library. Then, resolve
    the provided path elements and, if the result path exists and is a directory,
    return its absolute path; otherwise, raise an exception.

    Raises:
        InvalidAlertRulePathError, if the path does not exist or is not a directory.
    """
    charm_dir = Path(str(charm.charm_dir))
    if not charm_dir.exists() or not charm_dir.is_dir():
        # Operator Framework does not currently expose a robust
        # way to determine the top level charm source directory
        # that is consistent across deployed charms and unit tests
        # Hence for unit tests the current working directory is used
        # TODO: updated this logic when the following ticket is resolved
        # https://github.com/canonical/operator/issues/643
        charm_dir = Path(os.getcwd())

    alerts_dir_path = charm_dir.absolute().joinpath(*path_elements)

    if not alerts_dir_path.exists():
        raise InvalidAlertRulePathError(alerts_dir_path, "directory does not exist")
    if not alerts_dir_path.is_dir():
        raise InvalidAlertRulePathError(alerts_dir_path, "is not a directory")

    return str(alerts_dir_path)


class MetricsEndpointProvider(Object):
    """A metrics endpoint for Prometheus."""

    on = MetricsEndpointProviderEvents()

    def __init__(
        self,
        charm,
        relation_name: str = DEFAULT_RELATION_NAME,
        jobs=None,
        alert_rules_path: str = DEFAULT_ALERT_RULES_RELATIVE_PATH,
        refresh_event: Optional[Union[BoundEvent, List[BoundEvent]]] = None,
        external_url: str = "",
        lookaside_jobs_callable: Optional[Callable] = None,
    ):
        """Construct a metrics provider for a Prometheus charm.

        If your charm exposes a Prometheus metrics endpoint, the
        `MetricsEndpointProvider` object enables your charm to easily
        communicate how to reach that metrics endpoint.

        By default, a charm instantiating this object has the metrics
        endpoints of each of its units scraped by the related Prometheus
        charms. The scraped metrics are automatically tagged by the
        Prometheus charms with Juju topology data via the
        `juju_model_name`, `juju_model_uuid`, `juju_application_name`
        and `juju_unit` labels. To support such tagging `MetricsEndpointProvider`
        automatically forwards scrape metadata to a `MetricsEndpointConsumer`
        (Prometheus charm).

        Scrape targets provided by `MetricsEndpointProvider` can be
        customized when instantiating this object. For example in the
        case of a charm exposing the metrics endpoint for each of its
        units on port 8080 and the `/metrics` path, the
        `MetricsEndpointProvider` can be instantiated as follows:

            self.metrics_endpoint_provider = MetricsEndpointProvider(
                self,
                jobs=[{
                    "static_configs": [{"targets": ["*:8080"]}],
                }])

        The notation `*:<port>` means "scrape each unit of this charm on port
        `<port>`".

        In case the metrics endpoints are not on the standard `/metrics` path,
        a custom path can be specified as follows:

            self.metrics_endpoint_provider = MetricsEndpointProvider(
                self,
                jobs=[{
                    "metrics_path": "/my/strange/metrics/path",
                    "static_configs": [{"targets": ["*:8080"]}],
                }])

        Note how the `jobs` argument is a list: this allows you to expose multiple
        combinations of paths "metrics_path" and "static_configs" in case your charm
        exposes multiple endpoints, which could happen, for example, when you have
        multiple workload containers, with applications in each needing to be scraped.
        The structure of the objects in the `jobs` list is one-to-one with the
        `scrape_config` configuration item of Prometheus' own configuration (see
        https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
        ), but with only a subset of the fields allowed. The permitted fields are
        listed in the `ALLOWED_KEYS` object in this charm library module.

        It is also possible to specify alert rules. By default, this library will look
        into the `<charm_parent_dir>/prometheus_alert_rules` directory, which in a
        standard charm layout resolves to `src/prometheus_alert_rules`. Each alert
        rule goes into a separate `*.rule` file. If the syntax of a rule is invalid,
        the `MetricsEndpointProvider` logs an error and does not load the particular
        rule.

        To avoid false positives and negatives in the evaluation of alert rules,
        all ingested alert rule expressions are automatically qualified using Juju
        Topology filters. This ensures that alert rules provided by your charm trigger
        alerts based only on data scraped from your charm. For example an alert rule
        such as the following

            alert: UnitUnavailable
            expr: up < 1
            for: 0m

        will be automatically transformed into something along the lines of the following

            alert: UnitUnavailable
            expr: up{juju_model=<model>, juju_model_uuid=<uuid-prefix>, juju_application=<app>} < 1
            for: 0m

        An attempt will be made to validate alert rules prior to loading them into Prometheus.
        If they are invalid, an event will be emitted from this object which charms can respond
        to in order to set a meaningful status for administrators.

        This can be observed via `consumer.on.alert_rule_status_changed` which contains:
            - The error(s) encountered when validating as `errors`
            - A `valid` attribute, which can be used to reset the state of charms if alert rules
              are updated via another mechanism (e.g. `cos-config`) and refreshed.

        Args:
            charm: a `CharmBase` object that manages this
                `MetricsEndpointProvider` object. Typically, this is
                `self` in the instantiating class.
            relation_name: an optional string name of the relation between `charm`
                and the Prometheus charmed service. The default is "metrics-endpoint".
                It is strongly advised not to change the default, so that people
                deploying your charm will have a consistent experience with all
                other charms that provide metrics endpoints.
            jobs: an optional list of dictionaries where each
                dictionary represents the Prometheus scrape
                configuration for a single job. When not provided, a
                default scrape configuration is provided for the
                `/metrics` endpoint polling all units of the charm on port `80`
                using the `MetricsEndpointProvider` object.
            alert_rules_path: an optional path for the location of alert rules
                files. Defaults to "./prometheus_alert_rules",
                resolved relative to the directory hosting the charm entry file.
                The alert rules are automatically updated on charm upgrade.
            refresh_event: an optional bound event or list of bound events which
                will be observed to re-set scrape job data (IP address and others).
            external_url: an optional argument that represents an external url that
                can be generated by an Ingress or a Proxy.
            lookaside_jobs_callable: an optional `Callable` which should be invoked
                when the job configuration is built as a secondary mapping. The callable
                should return a `List[Dict]` which is syntactically identical to the
                `jobs` parameter, but can be updated out of step with the initialization
                of this library without disrupting the 'global' job spec.

        Raises:
            RelationNotFoundError: If there is no relation in the charm's metadata.yaml
                with the same name as provided via `relation_name` argument.
            RelationInterfaceMismatchError: The relation with the same name as provided
                via `relation_name` argument does not have the `prometheus_scrape` relation
                interface.
            RelationRoleMismatchError: If the relation with the same name as provided
                via `relation_name` argument does not have the `RelationRole.provides`
                role.
        """
        _validate_relation_by_interface_and_direction(
            charm, relation_name, RELATION_INTERFACE_NAME, RelationRole.provides
        )

        try:
            alert_rules_path = _resolve_dir_against_charm_path(charm, alert_rules_path)
        except InvalidAlertRulePathError as e:
            logger.debug(
                "Invalid Prometheus alert rules folder at %s: %s",
                e.alert_rules_absolute_path,
                e.message,
            )

        super().__init__(charm, relation_name)
        self.topology = JujuTopology.from_charm(charm)

        self._charm = charm
        self._alert_rules_path = alert_rules_path
        self._relation_name = relation_name
        # sanitize job configurations to the supported subset of parameters
        jobs = [] if jobs is None else jobs
        self._jobs = PrometheusConfig.sanitize_scrape_configs(jobs)

        if external_url:
            external_url = (
                external_url if urlparse(external_url).scheme else ("http://" + external_url)
            )
        self.external_url = external_url
        self._lookaside_jobs = lookaside_jobs_callable

        events = self._charm.on[self._relation_name]
        self.framework.observe(events.relation_changed, self._on_relation_changed)

        if not refresh_event:
            # FIXME remove once podspec charms are verified.
            # `self.set_scrape_job_spec()` is called every re-init so this should not be needed.
            if len(self._charm.meta.containers) == 1:
                if "kubernetes" in self._charm.meta.series:
                    # This is a podspec charm
                    refresh_event = [self._charm.on.update_status]
                else:
                    # This is a sidecar/pebble charm
                    container = list(self._charm.meta.containers.values())[0]
                    refresh_event = [
                        self._charm.on[container.name.replace("-", "_")].pebble_ready
                    ]
            else:
                logger.warning(
                    "%d containers are present in metadata.yaml and "
                    "refresh_event was not specified. Defaulting to update_status. "
                    "Metrics IP may not be set in a timely fashion.",
                    len(self._charm.meta.containers),
                )
                refresh_event = [self._charm.on.update_status]

        if not isinstance(refresh_event, list):
            refresh_event = [refresh_event]

        self.framework.observe(events.relation_joined, self.set_scrape_job_spec)
        for ev in refresh_event:
            self.framework.observe(ev, self.set_scrape_job_spec)

    def _on_relation_changed(self, event):
        """Check for alert rule messages in the relation data before moving on."""
        if self._charm.unit.is_leader():
            ev = json.loads(event.relation.data[event.app].get("event", "{}"))

            if ev:
                valid = bool(ev.get("valid", True))
                errors = ev.get("errors", "")

                if valid and not errors:
                    self.on.alert_rule_status_changed.emit(valid=valid)
                else:
                    self.on.alert_rule_status_changed.emit(valid=valid, errors=errors)

            scrape_errors = ev.get("scrape_job_errors", None)
            if scrape_errors:
                self.on.invalid_scrape_job.emit(errors=scrape_errors)

    def update_scrape_job_spec(self, jobs):
        """Update scrape job specification."""
        self._jobs = PrometheusConfig.sanitize_scrape_configs(jobs)
        self.set_scrape_job_spec()

    def set_scrape_job_spec(self, _=None):
        """Ensure scrape target information is made available to prometheus.

        When a metrics provider charm is related to a prometheus charm, the
        metrics provider sets specification and metadata related to its own
        scrape configuration. This information is set using Juju application
        data. In addition, each of the consumer units also sets its own
        host address in Juju unit relation data.
        """
        self._set_unit_ip()

        if not self._charm.unit.is_leader():
            return

        alert_rules = AlertRules(topology=self.topology)
        alert_rules.add_path(self._alert_rules_path, recursive=True)
        alert_rules_as_dict = alert_rules.as_dict()

        for relation in self._charm.model.relations[self._relation_name]:
            relation.data[self._charm.app]["scrape_metadata"] = json.dumps(self._scrape_metadata)
            relation.data[self._charm.app]["scrape_jobs"] = json.dumps(self._scrape_jobs)

            if alert_rules_as_dict:
                # Update relation data with the string representation of the rule file.
                # Juju topology is already included in the "scrape_metadata" field above.
                # The consumer side of the relation uses this information to name the rules file
                # that is written to the filesystem.
                relation.data[self._charm.app]["alert_rules"] = json.dumps(alert_rules_as_dict)
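
    # For illustration: after `set_scrape_job_spec` runs on the leader, the
    # application databag of each "metrics-endpoint" relation holds JSON-encoded
    # entries along these lines (values are hypothetical):
    #
    #     "scrape_metadata": '{"model": "mymodel", "application": "myapp", ...}'
    #     "scrape_jobs": '[{"metrics_path": "/metrics", "static_configs": [...]}]'
    #     "alert_rules": '{"groups": [...]}'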

    def _set_unit_ip(self, _=None):
        """Set unit host address.

        Each time a metrics provider charm container is restarted it updates its own
        host address in the unit relation data for the prometheus charm.

        The only argument specified is an event, and it is ignored. This is for expediency
        to be able to use this method as an event handler, although no access to the
        event is actually needed.
        """
        for relation in self._charm.model.relations[self._relation_name]:
            unit_ip = str(self._charm.model.get_binding(relation).network.bind_address)

            # TODO store entire url in relation data, instead of only select url parts.

            if self.external_url:
                parsed = urlparse(self.external_url)
                unit_address = parsed.hostname
                path = parsed.path
            elif self._is_valid_unit_address(unit_ip):
                unit_address = unit_ip
                path = ""
            else:
                unit_address = socket.getfqdn()
                path = ""

            relation.data[self._charm.unit]["prometheus_scrape_unit_address"] = unit_address
            relation.data[self._charm.unit]["prometheus_scrape_unit_path"] = path
            relation.data[self._charm.unit]["prometheus_scrape_unit_name"] = str(
                self._charm.model.unit.name
            )

    def _is_valid_unit_address(self, address: str) -> bool:
        """Validate a unit address.

        At present only IP address validation is supported, but
        this may be extended to DNS addresses also, as needed.

        Args:
            address: a string representing a unit address

        Returns:
            True if the address is a valid IP address, False otherwise.
        """
        try:
            _ = ipaddress.ip_address(address)
        except ValueError:
            return False

        return True

    @property
    def _scrape_jobs(self) -> list:
        """Fetch list of scrape jobs.

        Returns:
            A list of dictionaries, where each dictionary specifies a
            single scrape job for Prometheus.
        """
        jobs = self._jobs if self._jobs else [DEFAULT_JOB]
        if callable(self._lookaside_jobs):
            return jobs + PrometheusConfig.sanitize_scrape_configs(self._lookaside_jobs())
        else:
            return jobs
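
    # Note: DEFAULT_JOB, defined earlier in this file, is the fallback scrape
    # configuration polling "/metrics" on port 80 of each unit; roughly (shape
    # is illustrative):
    #
    #     {"metrics_path": "/metrics", "static_configs": [{"targets": ["*:80"]}]}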

    @property
    def _scrape_metadata(self) -> dict:
        """Generate scrape metadata.

        Returns:
            Scrape configuration metadata for this metrics provider charm.
        """
        return self.topology.as_dict()


class PrometheusRulesProvider(Object):
    """Forward rules to Prometheus.

    This object may be used to forward rules to Prometheus. At present it only supports
    forwarding alert rules. This is unlike :class:`MetricsEndpointProvider`, which
    is used for forwarding both scrape targets and associated alert rules. This object
    is typically used when there is a desire to forward rules that apply globally (across
    all deployed charms and units) rather than to a single charm. All rule files are
    forwarded using the same 'prometheus_scrape' interface that is also used by
    `MetricsEndpointProvider`.
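
    For example, a charm whose sole purpose is to ship rule files might
    instantiate this object in its constructor (a minimal sketch; the
    directory path is an assumption):

        self.rules_provider = PrometheusRulesProvider(
            self,
            dir_path="./src/prometheus_alert_rules",
            recursive=True,
        )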

    Args:
        charm: A charm instance that `provides` a relation with the `prometheus_scrape` interface.
        relation_name: Name of the relation in `metadata.yaml` that
            has the `prometheus_scrape` interface.
        dir_path: Root directory for the collection of rule files.
        recursive: Whether to scan for rule files recursively.
    """

    def __init__(
        self,
        charm,
        relation_name: str = DEFAULT_RELATION_NAME,
        dir_path: str = DEFAULT_ALERT_RULES_RELATIVE_PATH,
        recursive=True,
    ):
        super().__init__(charm, relation_name)
        self._charm = charm
        self._relation_name = relation_name
        self._recursive = recursive

        try:
            dir_path = _resolve_dir_against_charm_path(charm, dir_path)
        except InvalidAlertRulePathError as e:
            logger.warning(
                "Invalid Prometheus alert rules folder at %s: %s",
                e.alert_rules_absolute_path,
                e.message,
            )
        self.dir_path = dir_path

        events = self._charm.on[self._relation_name]
        event_sources = [
            events.relation_joined,
            events.relation_changed,
            self._charm.on.leader_elected,
            self._charm.on.upgrade_charm,
        ]

        for event_source in event_sources:
            self.framework.observe(event_source, self._update_relation_data)

    def _reinitialize_alert_rules(self):
        """Reload alert rules and update all relations."""
        self._update_relation_data(None)

    def _update_relation_data(self, _):
        """Update application relation data with alert rules for all relations."""
        if not self._charm.unit.is_leader():
            return

        alert_rules = AlertRules()
        alert_rules.add_path(self.dir_path, recursive=self._recursive)
        alert_rules_as_dict = alert_rules.as_dict()

        logger.info("Updating relation data with rule files from disk")
        for relation in self._charm.model.relations[self._relation_name]:
            relation.data[self._charm.app]["alert_rules"] = json.dumps(
                alert_rules_as_dict,
                sort_keys=True,  # sort, to prevent unnecessary relation_changed events
            )


class MetricsEndpointAggregator(Object):
    """Aggregate metrics from multiple scrape targets.

    `MetricsEndpointAggregator` collects scrape target information from one
    or more related charms and forwards this to a `MetricsEndpointConsumer`
    charm, which may be in a different Juju model. However, it is
    essential that `MetricsEndpointAggregator` itself resides in the same
    model as its scrape targets, as this is currently the only way to
    ensure in Juju that the `MetricsEndpointAggregator` will be able to
    determine the model name and uuid of the scrape targets.

    `MetricsEndpointAggregator` should be used in place of
    `MetricsEndpointProvider` in the following two use cases:

    1. Integrating one or more scrape targets that do not support the
    `prometheus_scrape` interface.

    2. Integrating one or more scrape targets through cross model
    relations (although the [Scrape Config Operator](https://charmhub.io/cos-configuration-k8s)
    may also be used for the purpose of supporting cross model
    relations).

    Using `MetricsEndpointAggregator` to build a Prometheus charm client
    only requires instantiating it. Instantiating
    `MetricsEndpointAggregator` is similar to `MetricsEndpointProvider` except
    that it requires specifying the names of three relations: the
    relation with scrape targets, the relation for alert rules, and
    that with the Prometheus charms. For example

        self._aggregator = MetricsEndpointAggregator(
            self,
            {
                "prometheus": "monitoring",
                "scrape_target": "prometheus-target",
                "alert_rules": "prometheus-rules",
            },
        )

    `MetricsEndpointAggregator` assumes that each unit of a scrape target
    sets in its unit-level relation data two entries with keys
    "hostname" and "port". If it is required to integrate with charms
    that do not honor these assumptions, it is always possible to
    derive from `MetricsEndpointAggregator`, overriding the `_get_targets()`
    method, which is responsible for aggregating the unit name, host
    address ("hostname") and port of the scrape target.
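
    For instance, a conforming scrape target charm could set this data in its
    own relation handler (a minimal sketch; the handler name and port are
    illustrative):

        def _on_prometheus_target_relation_joined(self, event):
            event.relation.data[self.unit]["hostname"] = socket.getfqdn()
            event.relation.data[self.unit]["port"] = "9100"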

    `MetricsEndpointAggregator` also assumes that each unit of a
    scrape target sets in its unit-level relation data a key named
    "groups". The value of this key is expected to be the string
    representation of a list of Prometheus alert rules in YAML format.
    An example of a single such alert rule is

        - alert: HighRequestLatency
          expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
          for: 10m
          labels:
            severity: page
          annotations:
            summary: High request latency

    Once again if it is required to integrate with charms that do not
    honour these assumptions about alert rules then an object derived
    from `MetricsEndpointAggregator` may be used by overriding the
    `_get_alert_rules()` method.
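
    For example (a minimal sketch; the "alert-rules" key is an assumption
    about such a non-conforming charm):

        class CustomRuleAggregator(MetricsEndpointAggregator):
            def _get_alert_rules(self, relation) -> dict:
                rules = {}
                for unit in relation.units:
                    unit_rules = yaml.safe_load(relation.data[unit].get("alert-rules", ""))
                    if unit_rules:
                        rules.update({unit.name: unit_rules})
                return rules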

    `MetricsEndpointAggregator` ensures that Prometheus scrape job
    specifications and alert rules are annotated with Juju topology
    information, just like `MetricsEndpointProvider` and
    `MetricsEndpointConsumer` do.

    By default, `MetricsEndpointAggregator` ensures that Prometheus
    "instance" labels refer to Juju topology. This ensures that
    instance labels are stable over unit recreation. While it is not
    advisable to change this option, if required it can be done by
    setting the "relabel_instance" keyword argument to `False` when
    constructing an aggregator object.
    """

    _stored = StoredState()

    def __init__(
        self,
        charm,
        relation_names: Optional[dict] = None,
        relabel_instance=True,
        resolve_addresses=False,
    ):
        """Construct a `MetricsEndpointAggregator`.

        Args:
            charm: a `CharmBase` object that manages this
                `MetricsEndpointAggregator` object. Typically, this is
                `self` in the instantiating class.
            relation_names: a dictionary with three keys. The value
                of the "scrape_target" and "alert_rules" keys are
                the relation names over which scrape job and alert rule
                information is gathered by this `MetricsEndpointAggregator`.
                And the value of the "prometheus" key is the name of
                the relation with a `MetricsEndpointConsumer` such as
                the Prometheus charm.
            relabel_instance: A boolean flag indicating if Prometheus
                scrape job "instance" labels must refer to Juju Topology.
            resolve_addresses: A boolean flag indicating if the aggregator
                should attempt to perform DNS lookups of targets and append
                a `dns_name` label.
        """
        self._charm = charm

        relation_names = relation_names or {}

        self._prometheus_relation = relation_names.get(
            "prometheus", "downstream-prometheus-scrape"
        )
        self._target_relation = relation_names.get("scrape_target", "prometheus-target")
        self._alert_rules_relation = relation_names.get("alert_rules", "prometheus-rules")

        super().__init__(charm, self._prometheus_relation)
        self._stored.set_default(jobs=[], alert_rules=[])

        self._relabel_instance = relabel_instance
        self._resolve_addresses = resolve_addresses

        # manage Prometheus charm relation events
        prometheus_events = self._charm.on[self._prometheus_relation]
        self.framework.observe(prometheus_events.relation_joined, self._set_prometheus_data)

        # manage list of Prometheus scrape jobs from related scrape targets
        target_events = self._charm.on[self._target_relation]
        self.framework.observe(
            target_events.relation_changed, self._on_prometheus_targets_changed
        )
        self.framework.observe(
            target_events.relation_departed, self._on_prometheus_targets_departed
        )

        # manage alert rules for Prometheus from related scrape targets
        alert_rule_events = self._charm.on[self._alert_rules_relation]
        self.framework.observe(alert_rule_events.relation_changed, self._on_alert_rules_changed)
        self.framework.observe(
            alert_rule_events.relation_departed, self._on_alert_rules_departed
        )

    def _set_prometheus_data(self, event):
        """Ensure every new Prometheus instance is updated.

        Any time a new Prometheus unit joins the relation with
        `MetricsEndpointAggregator`, that Prometheus unit is provided
        with the complete set of existing scrape jobs and alert rules.
        """
        if not self._charm.unit.is_leader():
            return

        jobs = [] + _type_convert_stored(
            self._stored.jobs
        )  # list of scrape jobs, one per relation
        for relation in self.model.relations[self._target_relation]:
            targets = self._get_targets(relation)
            if targets and relation.app:
                jobs.append(self._static_scrape_job(targets, relation.app.name))

        groups = [] + _type_convert_stored(self._stored.alert_rules)  # list of alert rule groups
        for relation in self.model.relations[self._alert_rules_relation]:
            unit_rules = self._get_alert_rules(relation)
            if unit_rules and relation.app:
                appname = relation.app.name
                rules = self._label_alert_rules(unit_rules, appname)
                group = {"name": self.group_name(appname), "rules": rules}
                groups.append(group)

        event.relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs)
        event.relation.data[self._charm.app]["alert_rules"] = json.dumps({"groups": groups})

    def _on_prometheus_targets_changed(self, event):
        """Update scrape jobs in response to scrape target changes.

        When there is any change in relation data with any scrape
        target, the Prometheus scrape job for that specific target is
        updated.
        """
        targets = self._get_targets(event.relation)
        if not targets:
            return

        # new scrape job for the relation that has changed
        self.set_target_job_data(targets, event.relation.app.name)

    def set_target_job_data(self, targets: dict, app_name: str, **kwargs) -> None:
        """Update scrape jobs in response to scrape target changes.

        When there is any change in relation data with any scrape
        target, the Prometheus scrape job for that specific target is
        updated. Additionally, if this method is called manually, do the
        same.

        Args:
            targets: a `dict` containing target information
            app_name: a `str` identifying the application
            kwargs: a `dict` of the extra arguments passed to the function
        """
        if not self._charm.unit.is_leader():
            return

        # new scrape job for the relation that has changed
        updated_job = self._static_scrape_job(targets, app_name, **kwargs)

        for relation in self.model.relations[self._prometheus_relation]:
            jobs = json.loads(relation.data[self._charm.app].get("scrape_jobs", "[]"))
            # list of scrape jobs that have not changed
            jobs = [job for job in jobs if updated_job["job_name"] != job["job_name"]]
            jobs.append(updated_job)
            relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs)

            if not _type_convert_stored(self._stored.jobs) == jobs:
                self._stored.jobs = jobs

    def _on_prometheus_targets_departed(self, event):
        """Remove scrape jobs when a target departs.

        Any time a scrape target departs, any Prometheus scrape job
        associated with that specific scrape target is removed.
        """
        job_name = self._job_name(event.relation.app.name)
        unit_name = event.unit.name
        self.remove_prometheus_jobs(job_name, unit_name)

    def remove_prometheus_jobs(self, job_name: str, unit_name: Optional[str] = ""):
        """Given a job name and unit name, remove scrape jobs associated.

        The `unit_name` parameter is used for automatic, relation data bag-based
        generation, where the unit name in labels can be used to ensure that jobs with
        similar names (which are generated via the app name when scanning relation data
        bags) are not accidentally removed, as their unit name labels will differ.
        For NRPE, the job name is calculated from an ID sent via the NRPE relation, and is
        sufficient to uniquely identify the target.
        """
        if not self._charm.unit.is_leader():
            return

        for relation in self.model.relations[self._prometheus_relation]:
            jobs = json.loads(relation.data[self._charm.app].get("scrape_jobs", "[]"))
            if not jobs:
                continue

            changed_job = [j for j in jobs if j.get("job_name") == job_name]
            if not changed_job:
                continue
            changed_job = changed_job[0]

            # list of scrape jobs that have not changed
            jobs = [job for job in jobs if job.get("job_name") != job_name]

            # list of scrape jobs for units of the same application that still exist
            configs_kept = [
                config
                for config in changed_job["static_configs"]  # type: ignore
                if config.get("labels", {}).get("juju_unit") != unit_name
            ]

            if configs_kept:
                changed_job["static_configs"] = configs_kept  # type: ignore
                jobs.append(changed_job)

            relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs)

            if not _type_convert_stored(self._stored.jobs) == jobs:
                self._stored.jobs = jobs

    def _job_name(self, appname) -> str:
        """Construct a scrape job name.

        Each relation has its own unique scrape job name. All units in
        the relation are scraped as part of the same scrape job.

        Args:
            appname: string name of a related application.

        Returns:
            a string Prometheus scrape job name for the application.
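
            For example, for an application named "myapp" in a model named
            "mymodel" (values illustrative), the job name resembles
            "juju_mymodel_1234567_myapp_prometheus_scrape".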
        """
        return "juju_{}_{}_{}_prometheus_scrape".format(
            self.model.name, self.model.uuid[:7], appname
        )

    def _get_targets(self, relation) -> dict:
        """Fetch scrape targets for a relation.

        Scrape target information is returned for each unit in the
        relation. This information contains the unit name, network
        hostname (or address) for that unit, and port on which a
        metrics endpoint is exposed in that unit.

        Args:
            relation: an `ops.model.Relation` object for which scrape
                targets are required.

        Returns:
            a dictionary whose keys are names of the units in the
            relation. The value associated with each key is itself
            a dictionary of the form

                {"hostname": hostname, "port": port}
        """
        targets = {}
        for unit in relation.units:
            port = relation.data[unit].get("port", 80)
            hostname = relation.data[unit].get("hostname")
            if hostname:
                targets.update({unit.name: {"hostname": hostname, "port": port}})

        return targets

    def _static_scrape_job(self, targets, application_name, **kwargs) -> dict:
        """Construct a static scrape job for an application.

        Args:
            targets: a dictionary providing hostname and port for all
                scrape targets. The keys of this dictionary are unit
                names. Values corresponding to these keys are
                themselves a dictionary with keys "hostname" and
                "port".
            application_name: a string name of the application for
                which this static scrape job is being constructed.
            kwargs: a `dict` of the extra arguments passed to the function

        Returns:
            A dictionary corresponding to a Prometheus static scrape
            job configuration for one application. The returned
            dictionary may be transformed into YAML and appended to
            any existing list of Prometheus static configs.
        """
        juju_model = self.model.name
        juju_model_uuid = self.model.uuid

        job = {
            "job_name": self._job_name(application_name),
            "static_configs": [
                {
                    "targets": ["{}:{}".format(target["hostname"], target["port"])],
                    "labels": {
                        "juju_model": juju_model,
                        "juju_model_uuid": juju_model_uuid,
                        "juju_application": application_name,
                        "juju_unit": unit_name,
                        "host": target["hostname"],
                        # Expanding this will merge the dicts and replace the
                        # topology labels if any were present/found
                        **self._static_config_extra_labels(target),
                    },
                }
                for unit_name, target in targets.items()
            ],
            "relabel_configs": self._relabel_configs + kwargs.get("relabel_configs", []),
        }
        job.update(kwargs.get("updates", {}))

        return job

    def _static_config_extra_labels(self, target: Dict[str, str]) -> Dict[str, str]:
        """Build a list of extra static config parameters, if specified."""
        extra_info = {}

        if self._resolve_addresses:
            try:
                dns_name = socket.gethostbyaddr(target["hostname"])[0]
            except OSError:
                logger.debug("Could not perform DNS lookup for %s", target["hostname"])
                dns_name = target["hostname"]
            extra_info["dns_name"] = dns_name
            label_re = re.compile(r'(?P<label>juju.*?)="(?P<value>.*?)",?')
            try:
                with urlopen(f'http://{target["hostname"]}:{target["port"]}/metrics') as resp:
                    data = resp.read().decode("utf-8").splitlines()
                    for metric in data:
                        for match in label_re.finditer(metric):
                            extra_info[match.group("label")] = match.group("value")
            except (HTTPError, URLError, OSError, ConnectionResetError, Exception) as e:
                logger.debug("Could not scrape target: %s", e)

        return extra_info

    @property
    def _relabel_configs(self) -> list:
        """Create Juju topology relabeling configuration.

        Using Juju topology for instance labels ensures that these
        labels are stable across unit recreation.

        Returns:
            a list of Prometheus relabeling configurations. Each item in
            this list is one relabel configuration.
        """
        return (
            [
                {
                    "source_labels": [
                        "juju_model",
                        "juju_model_uuid",
                        "juju_application",
                        "juju_unit",
                    ],
                    "separator": "_",
                    "target_label": "instance",
                    "regex": "(.*)",
                }
            ]
            if self._relabel_instance
            else []
        )

    def _on_alert_rules_changed(self, event):
        """Update alert rules in response to scrape target changes.

        When there is any change in alert rule relation data for any
        scrape target, the list of alert rules for that specific
        target is updated.
        """
        unit_rules = self._get_alert_rules(event.relation)
        if not unit_rules:
            return

        app_name = event.relation.app.name
        self.set_alert_rule_data(app_name, unit_rules)

    def set_alert_rule_data(self, name: str, unit_rules: dict, label_rules: bool = True) -> None:
        """Update alert rule data.

        The unit rules should be a dict, which has additional Juju topology labels added. For
        rules generated by the NRPE exporter, they are pre-labeled so lookups can be performed.
        """
        if not self._charm.unit.is_leader():
            return

        if label_rules:
            rules = self._label_alert_rules(unit_rules, name)
        else:
            rules = [unit_rules]
        updated_group = {"name": self.group_name(name), "rules": rules}

        for relation in self.model.relations[self._prometheus_relation]:
            alert_rules = json.loads(relation.data[self._charm.app].get("alert_rules", "{}"))
            groups = alert_rules.get("groups", [])
            # list of alert rule groups that have not changed
            for group in groups:
                if group["name"] == updated_group["name"]:
                    group["rules"] = [r for r in group["rules"] if r not in updated_group["rules"]]
                    group["rules"].extend(updated_group["rules"])

            if updated_group["name"] not in [g["name"] for g in groups]:
                groups.append(updated_group)
            relation.data[self._charm.app]["alert_rules"] = json.dumps({"groups": groups})

            if not _type_convert_stored(self._stored.alert_rules) == groups:
                self._stored.alert_rules = groups

    def _on_alert_rules_departed(self, event):
        """Remove alert rules for departed targets.

        Any time a scrape target departs, any alert rules associated
        with that specific scrape target are removed.
        """
        group_name = self.group_name(event.relation.app.name)
        unit_name = event.unit.name
        self.remove_alert_rules(group_name, unit_name)

    def remove_alert_rules(self, group_name: str, unit_name: str) -> None:
        """Remove an alert rule group from relation data."""
        if not self._charm.unit.is_leader():
            return

        for relation in self.model.relations[self._prometheus_relation]:
            alert_rules = json.loads(relation.data[self._charm.app].get("alert_rules", "{}"))
            if not alert_rules:
                continue

            groups = alert_rules.get("groups", [])
            if not groups:
                continue

            changed_group = [group for group in groups if group["name"] == group_name]
            if not changed_group:
                continue
            changed_group = changed_group[0]

            # list of alert rule groups that have not changed
            groups = [group for group in groups if group["name"] != group_name]

            # list of alert rules not associated with departing unit
            rules_kept = [
                rule
                for rule in changed_group.get("rules")  # type: ignore
                if rule.get("labels").get("juju_unit") != unit_name
            ]

            if rules_kept:
                changed_group["rules"] = rules_kept  # type: ignore
                groups.append(changed_group)

            relation.data[self._charm.app]["alert_rules"] = (
                json.dumps({"groups": groups}) if groups else "{}"
            )

            if not _type_convert_stored(self._stored.alert_rules) == groups:
                self._stored.alert_rules = groups

    def _get_alert_rules(self, relation) -> dict:
        """Fetch alert rules for a relation.

        Each unit of the related scrape target may have its own
        associated alert rules. Alert rules for all units are returned
        indexed by unit name.

        Args:
            relation: an `ops.model.Relation` object for which alert
                rules are required.

        Returns:
            a dictionary whose keys are names of the units in the
            relation. The value associated with each key is a list
            of alert rules. Each rule is in dictionary format. The
            structure of each "rule dictionary" corresponds to a single
            Prometheus alert rule.
        """
        rules = {}
        for unit in relation.units:
            unit_rules = yaml.safe_load(relation.data[unit].get("groups", ""))
            if unit_rules:
                rules.update({unit.name: unit_rules})

        return rules

    def group_name(self, unit_name: str) -> str:
        """Construct name for an alert rule group.

        Each unit in a relation may define its own alert rules. All
        rules, for all units in a relation are grouped together and
        given a single alert rule group name.

        Args:
            unit_name: string name of a related application.

        Returns:
            a string Prometheus alert rules group name for the unit.
        """
        unit_name = re.sub(r"/", "_", unit_name)
        return "juju_{}_{}_{}_alert_rules".format(self.model.name, self.model.uuid[:7], unit_name)

    def _label_alert_rules(self, unit_rules, app_name: str) -> list:
        """Apply juju topology labels to alert rules.

        Args:
            unit_rules: a dictionary of alert rules indexed by unit name,
                where each rule is in dictionary format.
            app_name: a string name of the application to which the
                alert rules belong.

        Returns:
            a list of alert rules with Juju topology labels.
        """
        labeled_rules = []
        for unit_name, rules in unit_rules.items():
            for rule in rules:
                # the new JujuTopology removed this, so build it up by hand
                matchers = {
                    "juju_{}".format(k): v
                    for k, v in JujuTopology(self.model.name, self.model.uuid, app_name, unit_name)
                    .as_dict(excluded_keys=["charm_name"])
                    .items()
                }
                rule["labels"].update(matchers.items())
                labeled_rules.append(rule)

        return labeled_rules
2470 """Uses cos-tool to inject label matchers into alert rule expressions and validate rules."""
2475 def __init__(self
, charm
):
2480 """Lazy lookup of the path of cos-tool."""
2484 self
._path
= self
._get
_tool
_path
()
2486 logger
.debug("Skipping injection of juju topology as label matchers")
2487 self
._disabled
= True

    def apply_label_matchers(self, rules) -> dict:
        """Will apply label matchers to the expression of all alerts in all supplied groups."""
        if not self.path:
            return rules
        for group in rules["groups"]:
            rules_in_group = group.get("rules", [])
            for rule in rules_in_group:
                topology = {}
                # if the user for some reason has provided juju_unit, we'll need to honor it
                # in most cases, however, this will be empty
                for label in [
                    "juju_model",
                    "juju_model_uuid",
                    "juju_application",
                    "juju_charm",
                    "juju_unit",
                ]:
                    if label in rule["labels"]:
                        topology[label] = rule["labels"][label]

                rule["expr"] = self.inject_label_matchers(rule["expr"], topology)
        return rules

    def validate_alert_rules(self, rules: dict) -> Tuple[bool, str]:
        """Will validate correctness of alert rules, returning a boolean and any errors."""
        if not self.path:
            logger.debug("`cos-tool` unavailable. Not validating alert correctness.")
            return True, ""

        with tempfile.TemporaryDirectory() as tmpdir:
            rule_path = Path(tmpdir + "/validate_rule.yaml")
            rule_path.write_text(yaml.dump(rules))

            args = [str(self.path), "validate", str(rule_path)]
            # noinspection PyBroadException
            try:
                self._exec(args)
                return True, ""
            except subprocess.CalledProcessError as e:
                logger.debug("Validating the rules failed: %s", e.output)
                return False, ", ".join(
                    [
                        line
                        for line in e.output.decode("utf8").splitlines()
                        if "error validating" in line
                    ]
                )

    def validate_scrape_jobs(self, jobs: list) -> bool:
        """Validate scrape jobs using cos-tool."""
        if not self.path:
            logger.debug("`cos-tool` unavailable. Not validating scrape jobs.")
            return True
        conf = {"scrape_configs": jobs}
        with tempfile.NamedTemporaryFile() as tmpfile:
            with open(tmpfile.name, "w") as f:
                f.write(yaml.safe_dump(conf))
            try:
                self._exec([str(self.path), "validate-config", tmpfile.name])
            except subprocess.CalledProcessError as e:
                logger.error("Validating scrape jobs failed: {}".format(e.output))
                raise
        return True

    def inject_label_matchers(self, expression, topology) -> str:
        """Add label matchers to an expression."""
        if not topology:
            return expression
        if not self.path:
            logger.debug("`cos-tool` unavailable. Leaving expression unchanged: %s", expression)
            return expression
        args = [str(self.path), "transform"]
        args.extend(
            ["--label-matcher={}={}".format(key, value) for key, value in topology.items()]
        )

        args.extend(["{}".format(expression)])
        # noinspection PyBroadException
        try:
            return self._exec(args)
        except subprocess.CalledProcessError as e:
            logger.debug('Applying the expression failed: "%s", falling back to the original', e)
            return expression
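
    # For reference, the transform invocation built above resembles (values
    # illustrative):
    #
    #     cos-tool-amd64 transform \
    #         --label-matcher=juju_model=mymodel \
    #         --label-matcher=juju_application=myapp \
    #         'up < 1'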

    def _get_tool_path(self) -> Optional[Path]:
        arch = platform.machine()
        arch = "amd64" if arch == "x86_64" else arch
        res = "cos-tool-{}".format(arch)
        try:
            path = Path(res).resolve()
            path.chmod(0o777)
            return path
        except NotImplementedError:
            logger.debug("System lacks support for chmod")
        except FileNotFoundError:
            logger.debug('Could not locate cos-tool at: "{}"'.format(res))
        return None

    def _exec(self, cmd) -> str:
        result = subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        return result.stdout.decode("utf-8").strip()