installers/charm/osm-mon/lib/charms/prometheus_k8s/v0/prometheus_scrape.py
# Copyright 2023 Canonical Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Prometheus Scrape Library.

## Overview

This document explains how to integrate with the Prometheus charm
for the purpose of providing a metrics endpoint to Prometheus. It
also explains how alternative implementations of the Prometheus charm
may maintain the same interface and remain backward compatible with all
currently integrated charms. Finally, this document is the
authoritative reference on the structure of relation data that is
shared between Prometheus charms and any other charm that intends to
provide a scrape target for Prometheus.

## Source code

Source code can be found on GitHub at:
https://github.com/canonical/prometheus-k8s-operator/tree/main/lib/charms/prometheus_k8s

## Dependencies

Using this library requires you to fetch the juju_topology library from
[observability-libs](https://charmhub.io/observability-libs/libraries/juju_topology):

`charmcraft fetch-lib charms.observability_libs.v0.juju_topology`

## Provider Library Usage

The Prometheus charm interacts with its scrape targets using this
charm library. Charms seeking to expose metrics endpoints for the
Prometheus charm must do so using the `MetricsEndpointProvider`
object from this charm library. For the simplest use cases, using the
`MetricsEndpointProvider` object only requires instantiating it,
typically in the constructor of your charm (the one which exposes a
metrics endpoint). The `MetricsEndpointProvider` constructor requires
the name of the relation over which a scrape target (metrics endpoint)
is exposed to the Prometheus charm. This relation must use the
`prometheus_scrape` interface. By default, the address of the metrics
endpoint is set to the unit IP address by each unit of the
`MetricsEndpointProvider` charm. These units set their address in
response to the `PebbleReady` event of each container in the unit,
since container restarts of Kubernetes charms can result in a change
of IP address. The default name for the metrics endpoint relation is
`metrics-endpoint`. It is strongly recommended to use the same
relation name for consistency across charms; doing so also obviates
the need for an additional constructor argument. The
`MetricsEndpointProvider` object may be instantiated as follows:

    from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointProvider

    def __init__(self, *args):
        super().__init__(*args)
        ...
        self.metrics_endpoint = MetricsEndpointProvider(self)
        ...

Note that the first argument (`self`) to `MetricsEndpointProvider` is
always a reference to the parent (scrape target) charm.

An instantiated `MetricsEndpointProvider` object will ensure that each
unit of its parent charm is a scrape target for the
`MetricsEndpointConsumer` (Prometheus) charm. By default,
`MetricsEndpointProvider` assumes each unit of the provider charm
exports its metrics at the path `/metrics` on port 80. These
defaults may be changed by providing the `MetricsEndpointProvider`
constructor an optional argument (`jobs`) that represents a
Prometheus scrape job specification using Python standard data
structures. This job specification is a subset of Prometheus' own
[scrape
configuration](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)
format but represented using Python data structures. More than one job
may be provided using the `jobs` argument. Hence `jobs` accepts a list
of dictionaries where each dictionary represents one `<scrape_config>`
object as described in the Prometheus documentation. The currently
supported configuration subset is: `job_name`, `metrics_path`,
`static_configs`.

Suppose it is required to change the port on which scraped metrics are
exposed to 8000. This may be done by providing the following data
structure as the value of `jobs`.

```
[
    {
        "static_configs": [
            {
                "targets": ["*:8000"]
            }
        ]
    }
]
```

The wildcard ("*") host specification implies that the scrape targets
will automatically be set to the host addresses advertised by each
unit of the provider charm.

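Conceptually, wildcard expansion replaces the "*" with each unit's advertised
address, producing one concrete target list per unit. The following
self-contained sketch (not the library's actual code; unit names and addresses
are hypothetical) illustrates the idea:

```python
# Standalone sketch of the wildcard-expansion behaviour described above.
# It is NOT the library's implementation; unit names and addresses are made up.
def expand_wildcards(targets, unit_addresses):
    """Replace the "*" in each wildcard target with every unit's address."""
    return {
        unit_name: [t.replace("*", address) for t in targets if t.startswith("*")]
        for unit_name, address in unit_addresses.items()
    }

# Each unit advertising an address becomes its own concrete scrape target.
print(expand_wildcards(["*:8000"], {"my-app/0": "10.1.32.5", "my-app/1": "10.1.32.6"}))
```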
It is also possible to change the metrics path and scrape multiple
ports, for example:

```
[
    {
        "metrics_path": "/my-metrics-path",
        "static_configs": [
            {
                "targets": ["*:8000", "*:8081"],
            }
        ]
    }
]
```

More complex scrape configurations are possible. For example:

```
[
    {
        "static_configs": [
            {
                "targets": ["10.1.32.215:7000", "*:8000"],
                "labels": {
                    "some_key": "some-value"
                }
            }
        ]
    }
]
```

This example scrapes the target "10.1.32.215" at port 7000 in addition
to scraping each unit at port 8000. There is however one difference
between wildcard targets (specified using "*") and fully qualified
targets (such as "10.1.32.215"). The Prometheus charm automatically
associates labels with metrics generated by each target. These labels
localise the source of metrics within the Juju topology by specifying
its "model name", "model UUID", "application name" and "unit
name". However, the unit name is associated only with wildcard
targets, not with fully qualified targets.

Multiple jobs with different metrics paths and labels are allowed, but
each job must be given a unique name:

```
[
    {
        "job_name": "my-first-job",
        "metrics_path": "one-path",
        "static_configs": [
            {
                "targets": ["*:7000"],
                "labels": {
                    "some_key": "some-value"
                }
            }
        ]
    },
    {
        "job_name": "my-second-job",
        "metrics_path": "another-path",
        "static_configs": [
            {
                "targets": ["*:8000"],
                "labels": {
                    "some_other_key": "some-other-value"
                }
            }
        ]
    }
]
```

**Important:** `job_name` should be a fixed string (e.g. a hardcoded literal).
If you include variable elements, like your `unit.name`, it may break
the continuity of the metrics time series gathered by Prometheus when the leader unit
changes (e.g. on upgrade or rescale).

Additionally, it is also technically possible, but **strongly discouraged**, to
configure the following scrape-related settings, which behave as described in the
[Prometheus documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config):

- `static_configs`
- `scrape_interval`
- `scrape_timeout`
- `proxy_url`
- `relabel_configs`
- `metrics_relabel_configs`
- `sample_limit`
- `label_limit`
- `label_name_length_limit`
- `label_value_length_limit`

The settings above are supported by the `prometheus_scrape` library only for the sake of
specialized facilities like the [Prometheus Scrape Config](https://charmhub.io/prometheus-scrape-config-k8s)
charm. Virtually no charms should use these settings, and charmers definitely **should not**
expose them to the Juju administrator via configuration options.

## Consumer Library Usage

The `MetricsEndpointConsumer` object may be used by Prometheus
charms to manage relations with their scrape targets. For this
purpose a Prometheus charm needs to do two things:

1. Instantiate the `MetricsEndpointConsumer` object by providing it a
reference to the parent (Prometheus) charm and, optionally, the name of
the relation that the Prometheus charm uses to interact with scrape
targets. This relation must conform to the `prometheus_scrape`
interface, and it is strongly recommended that this relation be named
`metrics-endpoint`, which is its default value.

For example a Prometheus charm may instantiate the
`MetricsEndpointConsumer` in its constructor as follows:

    from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointConsumer

    def __init__(self, *args):
        super().__init__(*args)
        ...
        self.metrics_consumer = MetricsEndpointConsumer(self)
        ...

2. A Prometheus charm also needs to respond to the
`TargetsChangedEvent` event of the `MetricsEndpointConsumer` by adding itself as
an observer for these events, as in:

    self.framework.observe(
        self.metrics_consumer.on.targets_changed,
        self._on_scrape_targets_changed,
    )

In responding to the `TargetsChangedEvent` event, the Prometheus
charm must update the Prometheus configuration so that any new scrape
targets are added and/or old ones removed from the list of scraped
endpoints. For this purpose the `MetricsEndpointConsumer` object
exposes a `jobs()` method that returns a list of scrape jobs. Each
element of this list is the Prometheus scrape configuration for that
job. In order to update the Prometheus configuration, the Prometheus
charm needs to replace the current list of jobs with the list provided
by `jobs()` as follows:

    def _on_scrape_targets_changed(self, event):
        ...
        scrape_jobs = self.metrics_consumer.jobs()
        for job in scrape_jobs:
            prometheus_scrape_config.append(job)
        ...

## Alerting Rules

This charm library also supports gathering alerting rules from all
related `MetricsEndpointProvider` charms and enabling corresponding alerts within the
Prometheus charm. Alert rules are automatically gathered by `MetricsEndpointProvider`
charms when using this library, from a directory conventionally named
`prometheus_alert_rules`. This directory must reside at the top level
in the `src` folder of the provider charm. Each file in this directory
is assumed to be in one of two formats:
- the official Prometheus alert rule format, conforming to the
[Prometheus docs](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
- a single rule format, which is a simplified subset of the official format,
comprising a single alert rule per file, using the same YAML fields.

The file name must have one of the following extensions:
- `.rule`
- `.rules`
- `.yml`
- `.yaml`

An example of the contents of such a file in the custom single rule
format is shown below.

```
alert: HighRequestLatency
expr: job:request_latency_seconds:mean5m{my_key=my_value} > 0.5
for: 10m
labels:
  severity: Medium
  type: HighLatency
annotations:
  summary: High request latency for {{ $labels.instance }}.
```

The `MetricsEndpointProvider` will read all available alert rules and
also inject "filtering labels" into the alert expressions. The
filtering labels ensure that alert rules are localised to the metrics
provider charm's Juju topology (application, model and its UUID). Such
a topology filter is essential to ensure that alert rules submitted by
one provider charm generate alerts only for that same charm. When
alert rules are embedded in a charm, and the charm is deployed as a
Juju application, the alert rules from that application have their
expressions automatically updated to filter for metrics coming from
the units of that application alone. This removes the risk of spurious
evaluation, e.g., when you have multiple deployments of the same charm
monitored by the same Prometheus.

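For illustration, an alert expression embedded in a charm as `up < 1` would be
rewritten by topology filtering along these lines (the label values shown are
hypothetical):

```
# Rule expression as written in the charm:
expr: up < 1

# Expression after filtering labels are injected (values are hypothetical):
expr: up{juju_application="my-app", juju_model="my-model", juju_model_uuid="f2c1b2a6-e006-11eb-ba80-0242ac130004"} < 1
```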
Not all alerts one may want to specify can be embedded in a
charm. Some alert rules will be specific to a user's use case. This is
the case, for example, for alert rules that are based on business
constraints, like expecting a certain number of requests to a specific
API every five minutes. Such alert rules can be specified via the
[COS Config Charm](https://charmhub.io/cos-configuration-k8s),
which allows importing alert rules and other settings like dashboards
from a Git repository.

Gathering alert rules and generating rule files within the Prometheus
charm is easily done using the `alerts()` method of
`MetricsEndpointConsumer`. Alerts generated by Prometheus will
automatically include Juju topology labels. These labels
indicate the source of the alert. The following labels are
automatically included with each alert:

- `juju_model`
- `juju_model_uuid`
- `juju_application`

## Relation Data

The Prometheus charm uses both application and unit relation data to
obtain information regarding its scrape jobs, alert rules and scrape
targets. This relation data is in JSON format and closely resembles
the YAML structure of the Prometheus
[scrape configuration](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config).

Units of metrics provider charms advertise their names and addresses
over unit relation data using the `prometheus_scrape_unit_name` and
`prometheus_scrape_unit_address` keys, while the `scrape_metadata`,
`scrape_jobs` and `alert_rules` keys in the application relation data
of metrics provider charms hold the eponymous information.

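As a purely illustrative example (the key names are those listed above; the
unit name and address are hypothetical), a provider unit's relation data might
look like:

```
{
    "prometheus_scrape_unit_name": "my-app/0",
    "prometheus_scrape_unit_address": "10.1.32.5"
}
```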
""" # noqa: W505

import copy
import hashlib
import ipaddress
import json
import logging
import os
import platform
import re
import socket
import subprocess
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
from urllib.error import HTTPError, URLError
from urllib.parse import urlparse
from urllib.request import urlopen

import yaml
from charms.observability_libs.v0.juju_topology import JujuTopology
from ops.charm import CharmBase, RelationRole
from ops.framework import (
    BoundEvent,
    EventBase,
    EventSource,
    Object,
    ObjectEvents,
    StoredDict,
    StoredList,
    StoredState,
)
from ops.model import Relation

# The unique Charmhub library identifier, never change it
LIBID = "bc84295fef5f4049878f07b131968ee2"

# Increment this major API version when introducing breaking changes
LIBAPI = 0

# Increment this PATCH version before using `charmcraft publish-lib` or reset
# to 0 if you are raising the major API version
LIBPATCH = 36

logger = logging.getLogger(__name__)


ALLOWED_KEYS = {
    "job_name",
    "metrics_path",
    "static_configs",
    "scrape_interval",
    "scrape_timeout",
    "proxy_url",
    "relabel_configs",
    "metrics_relabel_configs",
    "sample_limit",
    "label_limit",
    "label_name_length_limit",
    "label_value_length_limit",
    "scheme",
    "basic_auth",
    "tls_config",
}
DEFAULT_JOB = {
    "metrics_path": "/metrics",
    "static_configs": [{"targets": ["*:80"]}],
}


DEFAULT_RELATION_NAME = "metrics-endpoint"
RELATION_INTERFACE_NAME = "prometheus_scrape"

DEFAULT_ALERT_RULES_RELATIVE_PATH = "./src/prometheus_alert_rules"


class PrometheusConfig:
    """A namespace for utility functions for manipulating the prometheus config dict."""

    # relabel instance labels so that instance identifiers are globally unique and
    # stable over unit recreation
    topology_relabel_config = {
        "source_labels": ["juju_model", "juju_model_uuid", "juju_application"],
        "separator": "_",
        "target_label": "instance",
        "regex": "(.*)",
    }

    topology_relabel_config_wildcard = {
        "source_labels": ["juju_model", "juju_model_uuid", "juju_application", "juju_unit"],
        "separator": "_",
        "target_label": "instance",
        "regex": "(.*)",
    }

    @staticmethod
    def sanitize_scrape_config(job: dict) -> dict:
        """Restrict permissible scrape configuration options.

        If job is empty then a default job is returned. The
        default job is

        ```
        {
            "metrics_path": "/metrics",
            "static_configs": [{"targets": ["*:80"]}],
        }
        ```

        Args:
            job: a dict containing a single Prometheus job
                specification.

        Returns:
            a dictionary containing a sanitized job specification.
        """
        sanitized_job = DEFAULT_JOB.copy()
        sanitized_job.update({key: value for key, value in job.items() if key in ALLOWED_KEYS})
        return sanitized_job

    @staticmethod
    def sanitize_scrape_configs(scrape_configs: List[dict]) -> List[dict]:
        """A vectorized version of `sanitize_scrape_config`."""
        return [PrometheusConfig.sanitize_scrape_config(job) for job in scrape_configs]

    @staticmethod
    def prefix_job_names(scrape_configs: List[dict], prefix: str) -> List[dict]:
        """Add the given prefix to all the job names in the given scrape_configs list."""
        modified_scrape_configs = []
        for scrape_config in scrape_configs:
            job_name = scrape_config.get("job_name")
            modified = scrape_config.copy()
            modified["job_name"] = prefix + "_" + job_name if job_name else prefix
            modified_scrape_configs.append(modified)

        return modified_scrape_configs

    @staticmethod
    def expand_wildcard_targets_into_individual_jobs(
        scrape_jobs: List[dict],
        hosts: Dict[str, Tuple[str, str]],
        topology: Optional[JujuTopology] = None,
    ) -> List[dict]:
        """Extract wildcard hosts from the given scrape_configs list into separate jobs.

        Args:
            scrape_jobs: list of scrape jobs.
            hosts: a dictionary mapping host names to host addresses for
                all units of the relation for which this job configuration
                must be constructed.
            topology: optional arg for adding topology labels to scrape targets.
        """
        modified_scrape_jobs = []
        for job in scrape_jobs:
            static_configs = job.get("static_configs")
            if not static_configs:
                continue

            # When a single unit specifies more than one wildcard target, they are expanded
            # into a static_config per target
            non_wildcard_static_configs = []

            for static_config in static_configs:
                targets = static_config.get("targets")
                if not targets:
                    continue

                # All non-wildcard targets remain in the same static_config
                non_wildcard_targets = []

                # All wildcard targets are extracted to a job per unit. If multiple wildcard
                # targets are specified, they remain in the same static_config (per unit).
                wildcard_targets = []

                for target in targets:
                    match = re.compile(r"\*(?:(:\d+))?").match(target)
                    if match:
                        # This is a wildcard target.
                        # Need to expand into separate jobs and remove it from this job here
                        wildcard_targets.append(target)
                    else:
                        # This is not a wildcard target. Copy it over into its own static_config.
                        non_wildcard_targets.append(target)

                # All non-wildcard targets remain in the same static_config
                if non_wildcard_targets:
                    non_wildcard_static_config = static_config.copy()
                    non_wildcard_static_config["targets"] = non_wildcard_targets

                    if topology:
                        # When non-wildcard targets (aka fully qualified hostnames) are specified,
                        # there is no reliable way to determine the name (Juju topology unit name)
                        # for such a target. Therefore label with Juju topology, excluding the
                        # unit name.
                        non_wildcard_static_config["labels"] = {
                            **non_wildcard_static_config.get("labels", {}),
                            **topology.label_matcher_dict,
                        }

                    non_wildcard_static_configs.append(non_wildcard_static_config)

                # Extract wildcard targets into individual jobs
                if wildcard_targets:
                    for unit_name, (unit_hostname, unit_path) in hosts.items():
                        modified_job = job.copy()
                        modified_job["static_configs"] = [static_config.copy()]
                        modified_static_config = modified_job["static_configs"][0]
                        modified_static_config["targets"] = [
                            target.replace("*", unit_hostname) for target in wildcard_targets
                        ]

                        unit_num = unit_name.split("/")[-1]
                        job_name = modified_job.get("job_name", "unnamed-job") + "-" + unit_num
                        modified_job["job_name"] = job_name
                        modified_job["metrics_path"] = unit_path + (
                            job.get("metrics_path") or "/metrics"
                        )

                        if topology:
                            # Add topology labels
                            modified_static_config["labels"] = {
                                **modified_static_config.get("labels", {}),
                                **topology.label_matcher_dict,
                                **{"juju_unit": unit_name},
                            }

                            # Instance relabeling for topology should be last in order.
                            modified_job["relabel_configs"] = modified_job.get(
                                "relabel_configs", []
                            ) + [PrometheusConfig.topology_relabel_config_wildcard]

                        modified_scrape_jobs.append(modified_job)

            if non_wildcard_static_configs:
                modified_job = job.copy()
                modified_job["static_configs"] = non_wildcard_static_configs
                modified_job["metrics_path"] = modified_job.get("metrics_path") or "/metrics"

                if topology:
                    # Instance relabeling for topology should be last in order.
                    modified_job["relabel_configs"] = modified_job.get("relabel_configs", []) + [
                        PrometheusConfig.topology_relabel_config
                    ]

                modified_scrape_jobs.append(modified_job)

        return modified_scrape_jobs

    @staticmethod
    def render_alertmanager_static_configs(alertmanagers: List[str]):
        """Render the alertmanager static_configs section from a list of URLs.

        Each target must be in the hostname:port format, and prefixes are specified in a
        separate key. Therefore, with ingress in place, we would need to extract the path
        into the `path_prefix` key, which is higher up in the config hierarchy.

        https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config

        Args:
            alertmanagers: List of alertmanager URLs.

        Returns:
            A dict representation for the static_configs section.
        """
        # Make sure it's a valid url so urlparse could parse it.
        scheme = re.compile(r"^https?://")
        sanitized = [am if scheme.search(am) else "http://" + am for am in alertmanagers]

        # Create a mapping from paths to netlocs
        # Group alertmanager targets into a dictionary of lists:
        # {path: [netloc1, netloc2]}
        paths = defaultdict(list)  # type: Dict[str, List[str]]
        for parsed in map(urlparse, sanitized):
            path = parsed.path or "/"
            paths[path].append(parsed.netloc)

        return {
            "alertmanagers": [
                {"path_prefix": path_prefix, "static_configs": [{"targets": netlocs}]}
                for path_prefix, netlocs in paths.items()
            ]
        }


class RelationNotFoundError(Exception):
    """Raised if no relation with the given name is found."""

    def __init__(self, relation_name: str):
        self.relation_name = relation_name
        self.message = "No relation named '{}' found".format(relation_name)

        super().__init__(self.message)


class RelationInterfaceMismatchError(Exception):
    """Raised if the relation with the given name has a different interface."""

    def __init__(
        self,
        relation_name: str,
        expected_relation_interface: str,
        actual_relation_interface: str,
    ):
        self.relation_name = relation_name
        self.expected_relation_interface = expected_relation_interface
        self.actual_relation_interface = actual_relation_interface
        self.message = (
            "The '{}' relation has '{}' as interface rather than the expected '{}'".format(
                relation_name, actual_relation_interface, expected_relation_interface
            )
        )

        super().__init__(self.message)


class RelationRoleMismatchError(Exception):
    """Raised if the relation with the given name has a different role."""

    def __init__(
        self,
        relation_name: str,
        expected_relation_role: RelationRole,
        actual_relation_role: RelationRole,
    ):
        self.relation_name = relation_name
        self.expected_relation_interface = expected_relation_role
        self.actual_relation_role = actual_relation_role
        self.message = "The '{}' relation has role '{}' rather than the expected '{}'".format(
            relation_name, repr(actual_relation_role), repr(expected_relation_role)
        )

        super().__init__(self.message)


class InvalidAlertRuleEvent(EventBase):
    """Event emitted when alert rule files are not parsable.

    Enables us to set a clear status on the provider.
    """

    def __init__(self, handle, errors: str = "", valid: bool = False):
        super().__init__(handle)
        self.errors = errors
        self.valid = valid

    def snapshot(self) -> Dict:
        """Save alert rule information."""
        return {
            "valid": self.valid,
            "errors": self.errors,
        }

    def restore(self, snapshot):
        """Restore alert rule information."""
        self.valid = snapshot["valid"]
        self.errors = snapshot["errors"]


class InvalidScrapeJobEvent(EventBase):
    """Event emitted when scrape jobs are not valid."""

    def __init__(self, handle, errors: str = ""):
        super().__init__(handle)
        self.errors = errors

    def snapshot(self) -> Dict:
        """Save error information."""
        return {"errors": self.errors}

    def restore(self, snapshot):
        """Restore error information."""
        self.errors = snapshot["errors"]


class MetricsEndpointProviderEvents(ObjectEvents):
    """Events emitted by a `MetricsEndpointProvider`."""

    alert_rule_status_changed = EventSource(InvalidAlertRuleEvent)
    invalid_scrape_job = EventSource(InvalidScrapeJobEvent)


def _type_convert_stored(obj):
    """Convert Stored* to their appropriate types, recursively."""
    if isinstance(obj, StoredList):
        return list(map(_type_convert_stored, obj))
    if isinstance(obj, StoredDict):
        rdict = {}  # type: Dict[Any, Any]
        for k in obj.keys():
            rdict[k] = _type_convert_stored(obj[k])
        return rdict
    return obj


def _validate_relation_by_interface_and_direction(
    charm: CharmBase,
    relation_name: str,
    expected_relation_interface: str,
    expected_relation_role: RelationRole,
):
    """Verify that a relation has the necessary characteristics.

    Verifies that the `relation_name` provided: (1) exists in metadata.yaml,
    (2) declares as its interface the interface name passed as
    `expected_relation_interface` and (3) has the right "direction", i.e., it
    is a relation that `charm` provides or requires.

    Args:
        charm: a `CharmBase` object to scan for the matching relation.
        relation_name: the name of the relation to be verified.
        expected_relation_interface: the interface name to be matched by the
            relation named `relation_name`.
        expected_relation_role: whether the `relation_name` must be either
            provided or required by `charm`.

    Raises:
        RelationNotFoundError: If there is no relation in the charm's metadata.yaml
            with the same name as provided via the `relation_name` argument.
        RelationInterfaceMismatchError: If the relation with the same name as provided
            via the `relation_name` argument does not have the same relation interface
            as specified via the `expected_relation_interface` argument.
        RelationRoleMismatchError: If the relation with the same name as provided
            via the `relation_name` argument does not have the same role as specified
            via the `expected_relation_role` argument.
    """
    if relation_name not in charm.meta.relations:
        raise RelationNotFoundError(relation_name)

    relation = charm.meta.relations[relation_name]

    actual_relation_interface = relation.interface_name
    if actual_relation_interface != expected_relation_interface:
        raise RelationInterfaceMismatchError(
            relation_name, expected_relation_interface, actual_relation_interface
        )

    if expected_relation_role == RelationRole.provides:
        if relation_name not in charm.meta.provides:
            raise RelationRoleMismatchError(
                relation_name, RelationRole.provides, RelationRole.requires
            )
    elif expected_relation_role == RelationRole.requires:
        if relation_name not in charm.meta.requires:
            raise RelationRoleMismatchError(
                relation_name, RelationRole.requires, RelationRole.provides
            )
    else:
        raise Exception("Unexpected RelationDirection: {}".format(expected_relation_role))


class InvalidAlertRulePathError(Exception):
    """Raised if the alert rules folder cannot be found or is otherwise invalid."""

    def __init__(
        self,
        alert_rules_absolute_path: Path,
        message: str,
    ):
        self.alert_rules_absolute_path = alert_rules_absolute_path
        self.message = message

        super().__init__(self.message)


def _is_official_alert_rule_format(rules_dict: dict) -> bool:
    """Check whether alert rules are in the upstream format supported by Prometheus.

    Alert rules in dictionary format are in the "official" form if they
    contain a "groups" key, since this implies they contain a list of
    alert rule groups.

    Args:
        rules_dict: a set of alert rules in Python dictionary format

    Returns:
        True if alert rules are in official Prometheus file format.
    """
    return "groups" in rules_dict


def _is_single_alert_rule_format(rules_dict: dict) -> bool:
    """Check whether alert rules are in the single rule format.

    The Prometheus charm library supports reading alert rules in a
    custom format that consists of a single alert rule per file. This
    does not conform to the official Prometheus alert rule file format,
    which requires that each alert rules file consist of a list of
    alert rule groups, where each group consists of a list of alert
    rules.

    Alert rules in dictionary form are considered to be in single rule
    format if, at a minimum, they contain the two keys corresponding to
    the alert rule name and alert expression.

    Returns:
        True if the alert rule is in single rule file format.
    """
    # one alert rule per file
    return set(rules_dict) >= {"alert", "expr"}


class AlertRules:
    """Utility class for amalgamating prometheus alert rule files and injecting juju topology.

    An `AlertRules` object supports aggregating alert rules from files and directories in both
    official and single rule file formats using the `add_path()` method. All the alert rules
    read are annotated with Juju topology labels and amalgamated into a single data structure
    in the form of a Python dictionary using the `as_dict()` method. Such a dictionary can be
    easily dumped into JSON format and exchanged over relation data. The dictionary can also
    be dumped into YAML format and written directly into an alert rules file that is read by
    Prometheus. Note that multiple `AlertRules` objects must not be written into the same file,
    since Prometheus allows only a single list of alert rule groups per alert rules file.

    The official Prometheus format is a YAML file conforming to the Prometheus documentation
    (https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/).
    The custom single rule format is a subsection of the official YAML, having a single alert
    rule, effectively "one alert per file".
    """

    # This class uses the following terminology for the various parts of a rule file:
    # - alert rules file: the entire groups[] yaml, including the "groups:" key.
    # - alert groups (plural): the list of groups[] (a list, i.e. no "groups:" key) - it is a list
    #   of dictionaries that have the "name" and "rules" keys.
    # - alert group (singular): a single dictionary that has the "name" and "rules" keys.
    # - alert rules (plural): all the alerts in a given alert group - a list of dictionaries with
    #   the "alert" and "expr" keys.
    # - alert rule (singular): a single dictionary that has the "alert" and "expr" keys.

    def __init__(self, topology: Optional[JujuTopology] = None):
        """Build an alert rules object.

        Args:
            topology: an optional `JujuTopology` instance that is used to annotate all alert rules.
        """
        self.topology = topology
        self.tool = CosTool(None)
        self.alert_groups = []  # type: List[dict]

    def _from_file(self, root_path: Path, file_path: Path) -> List[dict]:
        """Read a rules file from path, injecting juju topology.

        Args:
            root_path: full path to the root rules folder (used only for generating group names)
            file_path: full path to a *.rule file.

        Returns:
            A list of dictionaries representing the rules file, if the file is valid (the
            structure is formed by `yaml.safe_load` of the file); an empty list otherwise.
        """
        with file_path.open() as rf:
            # Load a list of rules from file then add labels and filters
            try:
                rule_file = yaml.safe_load(rf)

            except Exception as e:
                logger.error("Failed to read alert rules from %s: %s", file_path.name, e)
                return []

            if not rule_file:
                logger.warning("Empty rules file: %s", file_path.name)
                return []
            if not isinstance(rule_file, dict):
                logger.error("Invalid rules file (must be a dict): %s", file_path.name)
                return []
            if _is_official_alert_rule_format(rule_file):
                alert_groups = rule_file["groups"]
            elif _is_single_alert_rule_format(rule_file):
                # convert to list of alert groups
                # group name is made up from the file name
                alert_groups = [{"name": file_path.stem, "rules": [rule_file]}]
914 else:
915 # invalid/unsupported
916 logger.error("Invalid rules file: %s", file_path.name)
917 return []
918
919 # update rules with additional metadata
920 for alert_group in alert_groups:
921 # update group name with topology and sub-path
922 alert_group["name"] = self._group_name(
923 str(root_path),
924 str(file_path),
925 alert_group["name"],
926 )
927
928 # add "juju_" topology labels
929 for alert_rule in alert_group["rules"]:
930 if "labels" not in alert_rule:
931 alert_rule["labels"] = {}
932
933 if self.topology:
934 alert_rule["labels"].update(self.topology.label_matcher_dict)
935 # insert juju topology filters into a prometheus alert rule
936 alert_rule["expr"] = self.tool.inject_label_matchers(
937 re.sub(r"%%juju_topology%%,?", "", alert_rule["expr"]),
938 self.topology.label_matcher_dict,
939 )
940
941 return alert_groups
942
943 def _group_name(self, root_path: str, file_path: str, group_name: str) -> str:
944 """Generate group name from path and topology.
945
946 The group name is made up of the topology identifier and the relative
947 path between the root rules directory and the rule file.
948
949 Args:
950 root_path: path to the root rules dir.
951 file_path: path to rule file.
952 group_name: original group name to keep as part of the new augmented group name
953
954 Returns:
955 New group name, augmented by juju topology and relative path.
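
As an illustrative sketch of the naming scheme (the paths and
topology identifier below are made up):

```
import os

def make_group_name(root_path, file_path, group_name, topology_id=""):
    # Relative sub-path of the rule file, with separators flattened.
    rel_path = os.path.relpath(os.path.dirname(file_path), root_path)
    rel_path = "" if rel_path == "." else rel_path.replace(os.path.sep, "_")
    parts = [topology_id] if topology_id else []
    parts.extend([rel_path, group_name, "alerts"])
    # Empty strings are filtered out before joining.
    return "_".join(filter(None, parts))

name = make_group_name("/rules", "/rules/nested/cpu.rule", "cpu", "model_uuid_app")
# name == "model_uuid_app_nested_cpu_alerts"
```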
956 """
957 rel_path = os.path.relpath(os.path.dirname(file_path), root_path)
958 rel_path = "" if rel_path == "." else rel_path.replace(os.path.sep, "_")
959
960 # Generate group name:
961 # - name, from juju topology
962 # - suffix, from the relative path of the rule file;
963 group_name_parts = [self.topology.identifier] if self.topology else []
964 group_name_parts.extend([rel_path, group_name, "alerts"])
965 # filter to remove empty strings
966 return "_".join(filter(None, group_name_parts))
967
968 @classmethod
969 def _multi_suffix_glob(
970 cls, dir_path: Path, suffixes: List[str], recursive: bool = True
971 ) -> list:
972 """Helper function for getting all files in a directory that have a matching suffix.
973
974 Args:
975 dir_path: path to the directory to glob from.
976 suffixes: list of suffixes to include in the glob (items should begin with a period).
977 recursive: a flag indicating whether a glob is recursive (nested) or not.
978
979 Returns:
980 List of files in `dir_path` that have one of the suffixes specified in `suffixes`.
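
A self-contained sketch of the globbing behaviour (file names are
made up):

```
import tempfile
from pathlib import Path

def multi_suffix_glob(dir_path, suffixes, recursive=True):
    all_files = dir_path.glob("**/*" if recursive else "*")
    return [f for f in all_files if f.is_file() and f.suffix in suffixes]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.rule").touch()
    (root / "sub").mkdir()
    (root / "sub" / "b.yaml").touch()
    (root / "c.txt").touch()
    matched = sorted(f.name for f in multi_suffix_glob(root, [".rule", ".yaml"]))
# matched == ["a.rule", "b.yaml"]
```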
981 """
982 all_files_in_dir = dir_path.glob("**/*" if recursive else "*")
983 return list(filter(lambda f: f.is_file() and f.suffix in suffixes, all_files_in_dir))
984
985 def _from_dir(self, dir_path: Path, recursive: bool) -> List[dict]:
986 """Read all rule files in a directory.
987
988 All rules from files for the same directory are loaded into a single
989 group. The generated name of this group includes juju topology.
990 By default, only the top directory is scanned; for nested scanning, pass `recursive=True`.
991
992 Args:
993 dir_path: directory containing *.rule files (alert rules without groups).
994 recursive: flag indicating whether to scan for rule files recursively.
995
996 Returns:
997 a list of dictionaries representing prometheus alert rule groups, each dictionary
998 representing an alert group (structure determined by `yaml.safe_load`).
999 """
1000 alert_groups = [] # type: List[dict]
1001
1002 # Gather all alerts into a list of groups
1003 for file_path in self._multi_suffix_glob(
1004 dir_path, [".rule", ".rules", ".yml", ".yaml"], recursive
1005 ):
1006 alert_groups_from_file = self._from_file(dir_path, file_path)
1007 if alert_groups_from_file:
1008 logger.debug("Reading alert rule from %s", file_path)
1009 alert_groups.extend(alert_groups_from_file)
1010
1011 return alert_groups
1012
1013 def add_path(self, path: str, *, recursive: bool = False) -> None:
1014 """Add rules from a dir path.
1015
1016 All rules from files are aggregated into a data structure representing a single rule file.
1017 All group names are augmented with juju topology.
1018
1019 Args:
1020 path: either a rules file or a dir of rules files.
1021 recursive: whether to read files recursively or not (no impact if `path` is a file).
1025 """
1026 path = Path(path) # type: Path
1027 if path.is_dir():
1028 self.alert_groups.extend(self._from_dir(path, recursive))
1029 elif path.is_file():
1030 self.alert_groups.extend(self._from_file(path.parent, path))
1031 else:
1032 logger.debug("Alert rules path does not exist: %s", path)
1033
1034 def as_dict(self) -> dict:
1035 """Return standard alert rules file in dict representation.
1036
1037 Returns:
1038 a dictionary containing a single list of alert rule groups.
1039 The list of alert rule groups is provided as value of the
1040 "groups" dictionary key.
1041 """
1042 return {"groups": self.alert_groups} if self.alert_groups else {}
1043
1044
1045 class TargetsChangedEvent(EventBase):
1046 """Event emitted when Prometheus scrape targets change."""
1047
1048 def __init__(self, handle, relation_id):
1049 super().__init__(handle)
1050 self.relation_id = relation_id
1051
1052 def snapshot(self):
1053 """Save scrape target relation information."""
1054 return {"relation_id": self.relation_id}
1055
1056 def restore(self, snapshot):
1057 """Restore scrape target relation information."""
1058 self.relation_id = snapshot["relation_id"]
1059
1060
1061 class MonitoringEvents(ObjectEvents):
1062 """Event descriptor for events raised by `MetricsEndpointConsumer`."""
1063
1064 targets_changed = EventSource(TargetsChangedEvent)
1065
1066
1067 class MetricsEndpointConsumer(Object):
1068 """A Prometheus based Monitoring service."""
1069
1070 on = MonitoringEvents()
1071
1072 def __init__(self, charm: CharmBase, relation_name: str = DEFAULT_RELATION_NAME):
1073 """A Prometheus based Monitoring service.
1074
1075 Args:
1076 charm: a `CharmBase` instance that manages this
1077 instance of the Prometheus service.
1078 relation_name: an optional string name of the relation between `charm`
1079 and the Prometheus charmed service. The default is "metrics-endpoint".
1080 It is strongly advised not to change the default, so that people
1081 deploying your charm will have a consistent experience with all
1082 other charms that consume metrics endpoints.
1083
1084 Raises:
1085 RelationNotFoundError: If there is no relation in the charm's metadata.yaml
1086 with the same name as provided via `relation_name` argument.
1087 RelationInterfaceMismatchError: The relation with the same name as provided
1088 via `relation_name` argument does not have the `prometheus_scrape` relation
1089 interface.
1090 RelationRoleMismatchError: If the relation with the same name as provided
1091 via `relation_name` argument does not have the `RelationRole.requires`
1092 role.
1093 """
1094 _validate_relation_by_interface_and_direction(
1095 charm, relation_name, RELATION_INTERFACE_NAME, RelationRole.requires
1096 )
1097
1098 super().__init__(charm, relation_name)
1099 self._charm = charm
1100 self._relation_name = relation_name
1101 self._tool = CosTool(self._charm)
1102 events = self._charm.on[relation_name]
1103 self.framework.observe(events.relation_changed, self._on_metrics_provider_relation_changed)
1104 self.framework.observe(
1105 events.relation_departed, self._on_metrics_provider_relation_departed
1106 )
1107
1108 def _on_metrics_provider_relation_changed(self, event):
1109 """Handle changes with related metrics providers.
1110
1111 Anytime there are changes in relations between Prometheus
1112 and metrics provider charms the Prometheus charm is informed,
1113 through a `TargetsChangedEvent` event. The Prometheus charm can
1114 then choose to update its scrape configuration.
1115
1116 Args:
1117 event: a `CharmEvent` in response to which the Prometheus
1118 charm must update its scrape configuration.
1119 """
1120 rel_id = event.relation.id
1121
1122 self.on.targets_changed.emit(relation_id=rel_id)
1123
1124 def _on_metrics_provider_relation_departed(self, event):
1125 """Update job config when a metrics provider departs.
1126
1127 When a metrics provider departs, the Prometheus charm is informed
1128 through a `TargetsChangedEvent` event so that it can update its
1129 scrape configuration to ensure that the departed metrics provider
1130 is removed from the list of scrape jobs.
1131
1132 Args:
1133 event: a `CharmEvent` that indicates a metrics provider
1134 unit has departed.
1135 """
1136 rel_id = event.relation.id
1137 self.on.targets_changed.emit(relation_id=rel_id)
1138
1139 def jobs(self) -> list:
1140 """Fetch the list of scrape jobs.
1141
1142 Returns:
1143 A list consisting of all the static scrape configurations
1144 for each related `MetricsEndpointProvider` that has specified
1145 its scrape targets.
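
Each returned job is a standard Prometheus static scrape
configuration; an illustrative (made-up) entry might look like:

```
job = {
    "job_name": "juju_mymodel_1234abcd_myapp_prometheus_scrape_0",
    "metrics_path": "/metrics",
    "static_configs": [{"targets": ["10.1.2.3:8080"]}],
}
```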
1146 """
1147 scrape_jobs = []
1148
1149 for relation in self._charm.model.relations[self._relation_name]:
1150 static_scrape_jobs = self._static_scrape_config(relation)
1151 if static_scrape_jobs:
1152 # Duplicate job names will cause validate_scrape_jobs to fail.
1153 # Therefore we need to dedupe here and after all jobs are collected.
1154 static_scrape_jobs = _dedupe_job_names(static_scrape_jobs)
1155 try:
1156 self._tool.validate_scrape_jobs(static_scrape_jobs)
1157 except subprocess.CalledProcessError as e:
1158 if self._charm.unit.is_leader():
1159 data = json.loads(relation.data[self._charm.app].get("event", "{}"))
1160 data["scrape_job_errors"] = str(e)
1161 relation.data[self._charm.app]["event"] = json.dumps(data)
1162 else:
1163 scrape_jobs.extend(static_scrape_jobs)
1164
1165 scrape_jobs = _dedupe_job_names(scrape_jobs)
1166
1167 return scrape_jobs
1168
1169 @property
1170 def alerts(self) -> dict:
1171 """Fetch alerts for all relations.
1172
1173 A Prometheus alert rules file consists of a list of "groups". Each
1174 group consists of a list of alerts (`rules`) that are sequentially
1175 executed. This method returns all the alert rules provided by each
1176 related metrics provider charm. These rules may be used to generate a
1177 separate alert rules file for each relation, since the returned alert
1178 groups are indexed by that relation's Juju topology identifier.
1179 The Juju topology identifier string includes substrings that identify
1180 alert rule related metadata such as the Juju model, model UUID and the
1181 application name from where the alert rule originates. Since this
1182 topology identifier is globally unique, it may be used, for instance, as
1183 the name for the file into which the list of alert rule groups are
1184 written. For each relation, the structure of data returned is a dictionary
1185 representation of a standard prometheus rules file:
1186
1187 {"groups": [{"name": ...}, ...]}
1188
1189 per official prometheus documentation
1190 https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
1191
1192 The value of the `groups` key may be used to generate a Prometheus
1193 alert rules file directly using `yaml.dump`, but the `groups` key
1194 itself must be included, as it is required by Prometheus.
1195
1196 For example, the list of alert rule groups returned by this method may
1197 be written into files consumed by Prometheus as follows:
1198
1199 ```
1200 for topology_identifier, alert_rule_groups in self.metrics_consumer.alerts().items():
1201 filename = "juju_" + topology_identifier + ".rules"
1202 path = os.path.join(PROMETHEUS_RULES_DIR, filename)
1203 rules = yaml.safe_dump(alert_rule_groups)
1204 container.push(path, rules, make_dirs=True)
1205 ```
1206
1207 Returns:
1208 A dictionary mapping the Juju topology identifier of the source charm to
1209 its list of alert rule groups.
1210 """
1211 alerts = {} # type: Dict[str, dict] # mapping b/w juju identifiers and alert rule files
1212 for relation in self._charm.model.relations[self._relation_name]:
1213 if not relation.units or not relation.app:
1214 continue
1215
1216 alert_rules = json.loads(relation.data[relation.app].get("alert_rules", "{}"))
1217 if not alert_rules:
1218 continue
1219
1220 alert_rules = self._inject_alert_expr_labels(alert_rules)
1221
1222 identifier, topology = self._get_identifier_by_alert_rules(alert_rules)
1223 if not topology:
1224 try:
1225 scrape_metadata = json.loads(relation.data[relation.app]["scrape_metadata"])
1226 identifier = JujuTopology.from_dict(scrape_metadata).identifier
1227 alerts[identifier] = self._tool.apply_label_matchers(alert_rules) # type: ignore
1228
1229 except KeyError as e:
1230 logger.debug(
1231 "Relation %s has no 'scrape_metadata': %s",
1232 relation.id,
1233 e,
1234 )
1235
1236 if not identifier:
1237 logger.error(
1238 "Alert rules were found but no usable group or identifier was present."
1239 )
1240 continue
1241
1242 alerts[identifier] = alert_rules
1243
1244 _, errmsg = self._tool.validate_alert_rules(alert_rules)
1245 if errmsg:
1246 if alerts[identifier]:
1247 del alerts[identifier]
1248 if self._charm.unit.is_leader():
1249 data = json.loads(relation.data[self._charm.app].get("event", "{}"))
1250 data["errors"] = errmsg
1251 relation.data[self._charm.app]["event"] = json.dumps(data)
1252 continue
1253
1254 return alerts
1255
1256 def _get_identifier_by_alert_rules(
1257 self, rules: dict
1258 ) -> Tuple[Union[str, None], Union[JujuTopology, None]]:
1259 """Determine an appropriate dict key for alert rules.
1260
1261 The key is used as the filename when writing alerts to disk, so its structure
1262 and uniqueness are important.
1263
1264 Args:
1265 rules: a dict of alert rules
1266 Returns:
1267 A tuple containing an identifier, if found, and a JujuTopology, if it could
1268 be constructed.
1269 """
1270 if "groups" not in rules:
1271 logger.debug("No alert groups were found in relation data")
1272 return None, None
1273
1274 # Construct an ID based on what's in the alert rules if they have labels
1275 for group in rules["groups"]:
1276 try:
1277 labels = group["rules"][0]["labels"]
1278 topology = JujuTopology(
1279 # Don't try to safely get required constructor fields. There's already
1280 # a handler for KeyErrors
1281 model_uuid=labels["juju_model_uuid"],
1282 model=labels["juju_model"],
1283 application=labels["juju_application"],
1284 unit=labels.get("juju_unit", ""),
1285 charm_name=labels.get("juju_charm", ""),
1286 )
1287 return topology.identifier, topology
1288 except KeyError:
1289 logger.debug("Alert rules were found but no usable labels were present")
1290 continue
1291
1292 logger.warning(
1293 "No labeled alert rules were found, and no 'scrape_metadata' "
1294 "was available. Using the alert group name as filename."
1295 )
1296 try:
1297 for group in rules["groups"]:
1298 return group["name"], None
1299 except KeyError:
1300 logger.debug("No group name was found to use as identifier")
1301
1302 return None, None
1303
1304 def _inject_alert_expr_labels(self, rules: Dict[str, Any]) -> Dict[str, Any]:
1305 """Iterate through alert rules and inject topology into expressions.
1306
1307 Args:
1308 rules: a dict of alert rules
1309 """
1310 if "groups" not in rules:
1311 return rules
1312
1313 modified_groups = []
1314 for group in rules["groups"]:
1315 # Copy off rules, so we don't modify an object we're iterating over
1316 rules_copy = group["rules"]
1317 for idx, rule in enumerate(rules_copy):
1318 labels = rule.get("labels")
1319
1320 if labels:
1321 try:
1322 topology = JujuTopology(
1323 # Don't try to safely get required constructor fields. There's already
1324 # a handler for KeyErrors
1325 model_uuid=labels["juju_model_uuid"],
1326 model=labels["juju_model"],
1327 application=labels["juju_application"],
1328 unit=labels.get("juju_unit", ""),
1329 charm_name=labels.get("juju_charm", ""),
1330 )
1331
1332 # Inject topology and put it back in the list
1333 rule["expr"] = self._tool.inject_label_matchers(
1334 re.sub(r"%%juju_topology%%,?", "", rule["expr"]),
1335 topology.label_matcher_dict,
1336 )
1337 except KeyError:
1338 # Some required JujuTopology key is missing. Just move on.
1339 pass
1340
1341 group["rules"][idx] = rule
1342
1343 modified_groups.append(group)
1344
1345 rules["groups"] = modified_groups
1346 return rules
1347
1348 def _static_scrape_config(self, relation) -> list:
1349 """Generate the static scrape configuration for a single relation.
1350
1351 If the relation data includes `scrape_metadata` then the value
1352 of this key is used to annotate the scrape jobs with Juju
1353 Topology labels before returning them.
1354
1355 Args:
1356 relation: an `ops.model.Relation` object whose static
1357 scrape configuration is required.
1358
1359 Returns:
1360 A list (possibly empty) of scrape jobs. Each job is a
1361 valid Prometheus scrape configuration for that job,
1362 represented as a Python dictionary.
1363 """
1364 if not relation.units:
1365 return []
1366
1367 scrape_jobs = json.loads(relation.data[relation.app].get("scrape_jobs", "[]"))
1368
1369 if not scrape_jobs:
1370 return []
1371
1372 scrape_metadata = json.loads(relation.data[relation.app].get("scrape_metadata", "{}"))
1373
1374 if not scrape_metadata:
1375 return scrape_jobs
1376
1377 topology = JujuTopology.from_dict(scrape_metadata)
1378
1379 job_name_prefix = "juju_{}_prometheus_scrape".format(topology.identifier)
1380 scrape_jobs = PrometheusConfig.prefix_job_names(scrape_jobs, job_name_prefix)
1381 scrape_jobs = PrometheusConfig.sanitize_scrape_configs(scrape_jobs)
1382
1383 hosts = self._relation_hosts(relation)
1384
1385 scrape_jobs = PrometheusConfig.expand_wildcard_targets_into_individual_jobs(
1386 scrape_jobs, hosts, topology
1387 )
1388
1389 return scrape_jobs
1390
1391 def _relation_hosts(self, relation: Relation) -> Dict[str, Tuple[str, str]]:
1392 """Returns a mapping from unit names to (address, path) tuples, for the given relation."""
1393 hosts = {}
1394 for unit in relation.units:
1395 # TODO deprecate and remove unit.name
1396 unit_name = relation.data[unit].get("prometheus_scrape_unit_name") or unit.name
1397 # TODO deprecate and remove "prometheus_scrape_host"
1398 unit_address = relation.data[unit].get(
1399 "prometheus_scrape_unit_address"
1400 ) or relation.data[unit].get("prometheus_scrape_host")
1401 unit_path = relation.data[unit].get("prometheus_scrape_unit_path", "")
1402 if unit_name and unit_address:
1403 hosts.update({unit_name: (unit_address, unit_path)})
1404
1405 return hosts
1406
1407 def _target_parts(self, target) -> list:
1408 """Extract host and port from a wildcard target.
1409
1410 Args:
1411 target: a string specifying a scrape target. A
1412 scrape target is expected to have the format
1413 "host:port". The host part may be a wildcard
1414 "*" and the port part can be missing (along
1415 with ":") in which case port is set to 80.
1416
1417 Returns:
1418 a list with target host and port as in [host, port]
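
Illustrative behaviour of the split described above:

```
def target_parts(target):
    # Default the port to 80 when ":" is absent.
    return target.split(":") if ":" in target else [target, "80"]

parts_a = target_parts("*:8080")    # ["*", "8080"]
parts_b = target_parts("10.1.2.3")  # ["10.1.2.3", "80"]
```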
1419 """
1420 if ":" in target:
1421 parts = target.split(":")
1422 else:
1423 parts = [target, "80"]
1424
1425 return parts
1426
1427
1428 def _dedupe_job_names(jobs: List[dict]):
1429 """Deduplicate a list of dicts by appending a hash to the value of the 'job_name' key.
1430
1431 Additionally, fully de-duplicate any identical jobs.
1432
1433 Args:
1434 jobs: A list of prometheus scrape jobs
1435 """
1436 jobs_copy = copy.deepcopy(jobs)
1437
1438 # Convert to a dict with job names as keys
1439 # This is O(n^2), but it should be okay given the expected list sizes
1440 jobs_dict = {
1441 job["job_name"]: list(filter(lambda x: x["job_name"] == job["job_name"], jobs_copy))
1442 for job in jobs_copy
1443 }
1444
1445 # If multiple jobs have the same name, convert the name to "name_<hash-of-job>"
1446 for key in jobs_dict:
1447 if len(jobs_dict[key]) > 1:
1448 for job in jobs_dict[key]:
1449 job_json = json.dumps(job)
1450 hashed = hashlib.sha256(job_json.encode()).hexdigest()
1451 job["job_name"] = "{}_{}".format(job["job_name"], hashed)
1452 new_jobs = []
1453 for key in jobs_dict:
1454 new_jobs.extend(list(jobs_dict[key]))
1455
1456 # Deduplicate jobs which are equal
1457 # Again this is O(n^2), but it should be okay
1458 deduped_jobs = []
1459 seen = []
1460 for job in new_jobs:
1461 job_json = json.dumps(job)
1462 hashed = hashlib.sha256(job_json.encode()).hexdigest()
1463 if hashed in seen:
1464 continue
1465 seen.append(hashed)
1466 deduped_jobs.append(job)
1467
1468 return deduped_jobs
1469
1470
1471 def _resolve_dir_against_charm_path(charm: CharmBase, *path_elements: str) -> str:
1472 """Resolve the provided path items against the directory of the main file.
1473
1474 Look up the directory of the `main.py` file being executed. This is normally
1475 going to be the charm.py file of the charm including this library. Then, resolve
1476 the provided path elements and, if the result path exists and is a directory,
1477 return its absolute path; otherwise, raise an exception.
1478
1479 Raises:
1480 InvalidAlertRulePathError, if the path does not exist or is not a directory.
1481 """
1482 charm_dir = Path(str(charm.charm_dir))
1483 if not charm_dir.exists() or not charm_dir.is_dir():
1484 # Operator Framework does not currently expose a robust
1485 # way to determine the top level charm source directory
1486 # that is consistent across deployed charms and unit tests
1487 # Hence for unit tests the current working directory is used
1488 # TODO: update this logic when the following ticket is resolved
1489 # https://github.com/canonical/operator/issues/643
1490 charm_dir = Path(os.getcwd())
1491
1492 alerts_dir_path = charm_dir.absolute().joinpath(*path_elements)
1493
1494 if not alerts_dir_path.exists():
1495 raise InvalidAlertRulePathError(alerts_dir_path, "directory does not exist")
1496 if not alerts_dir_path.is_dir():
1497 raise InvalidAlertRulePathError(alerts_dir_path, "is not a directory")
1498
1499 return str(alerts_dir_path)
1500
1501
1502 class MetricsEndpointProvider(Object):
1503 """A metrics endpoint for Prometheus."""
1504
1505 on = MetricsEndpointProviderEvents()
1506
1507 def __init__(
1508 self,
1509 charm,
1510 relation_name: str = DEFAULT_RELATION_NAME,
1511 jobs=None,
1512 alert_rules_path: str = DEFAULT_ALERT_RULES_RELATIVE_PATH,
1513 refresh_event: Optional[Union[BoundEvent, List[BoundEvent]]] = None,
1514 external_url: str = "",
1515 lookaside_jobs_callable: Optional[Callable] = None,
1516 ):
1517 """Construct a metrics provider for a Prometheus charm.
1518
1519 If your charm exposes a Prometheus metrics endpoint, the
1520 `MetricsEndpointProvider` object enables your charm to easily
1521 communicate how to reach that metrics endpoint.
1522
1523 By default, a charm instantiating this object has the metrics
1524 endpoints of each of its units scraped by the related Prometheus
1525 charms. The scraped metrics are automatically tagged by the
1526 Prometheus charms with Juju topology data via the
1527 `juju_model_name`, `juju_model_uuid`, `juju_application_name`
1528 and `juju_unit` labels. To support such tagging `MetricsEndpointProvider`
1529 automatically forwards scrape metadata to a `MetricsEndpointConsumer`
1530 (Prometheus charm).
1531
1532 Scrape targets provided by `MetricsEndpointProvider` can be
1533 customized when instantiating this object. For example in the
1534 case of a charm exposing the metrics endpoint for each of its
1535 units on port 8080 and the `/metrics` path, the
1536 `MetricsEndpointProvider` can be instantiated as follows:
1537
1538 self.metrics_endpoint_provider = MetricsEndpointProvider(
1539 self,
1540 jobs=[{
1541 "static_configs": [{"targets": ["*:8080"]}],
1542 }])
1543
1544 The notation `*:<port>` means "scrape each unit of this charm on port
1545 `<port>`".
1546
1547 In case the metrics endpoints are not on the standard `/metrics` path,
1548 a custom path can be specified as follows:
1549
1550 self.metrics_endpoint_provider = MetricsEndpointProvider(
1551 self,
1552 jobs=[{
1553 "metrics_path": "/my/strange/metrics/path",
1554 "static_configs": [{"targets": ["*:8080"]}],
1555 }])
1556
1557 Note how the `jobs` argument is a list: this allows you to expose multiple
1558 combinations of "metrics_path" and "static_configs" in case your charm
1559 exposes multiple endpoints, which could happen, for example, when you have
1560 multiple workload containers, with applications in each needing to be scraped.
1561 The structure of the objects in the `jobs` list is one-to-one with the
1562 `scrape_config` configuration item of Prometheus' own configuration (see
1563 https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
1564 ), but with only a subset of the fields allowed. The permitted fields are
1565 listed in `ALLOWED_KEYS` object in this charm library module.
1566
1567 It is also possible to specify alert rules. By default, this library will look
1568 into `<charm_parent_dir>/prometheus_alert_rules`, which in a standard charm
1569 layout resolves to `src/prometheus_alert_rules`. Each alert rule goes into a
1570 separate `*.rule` file. If the syntax of a rule is invalid,
1571 the `MetricsEndpointProvider` logs an error and does not load the particular
1572 rule.
1573
1574 To avoid false positives and negatives in the evaluation of alert rules,
1575 all ingested alert rule expressions are automatically qualified using Juju
1576 Topology filters. This ensures that alert rules provided by your charm trigger
1577 alerts based only on data scraped from your charm. For example, an alert rule
1578 such as the following
1579
1580 alert: UnitUnavailable
1581 expr: up < 1
1582 for: 0m
1583
1584 will be automatically transformed into something along the lines of the following
1585
1586 alert: UnitUnavailable
1587 expr: up{juju_model=<model>, juju_model_uuid=<uuid-prefix>, juju_application=<app>} < 1
1588 for: 0m
1589
1590 An attempt will be made to validate alert rules prior to loading them into Prometheus.
1591 If they are invalid, an event will be emitted from this object which charms can respond
1592 to in order to set a meaningful status for administrators.
1593
1594 This can be observed via `consumer.on.alert_rule_status_changed` which contains:
1595 - The error(s) encountered when validating as `errors`
1596 - A `valid` attribute, which can be used to reset the state of charms if alert rules
1597 are updated via another mechanism (e.g. `cos-config`) and refreshed.
1598
1599 Args:
1600 charm: a `CharmBase` object that manages this
1601 `MetricsEndpointProvider` object. Typically, this is
1602 `self` in the instantiating class.
1603 relation_name: an optional string name of the relation between `charm`
1604 and the Prometheus charmed service. The default is "metrics-endpoint".
1605 It is strongly advised not to change the default, so that people
1606 deploying your charm will have a consistent experience with all
1607 other charms that provide metrics endpoints.
1608 jobs: an optional list of dictionaries where each
1609 dictionary represents the Prometheus scrape
1610 configuration for a single job. When not provided, a
1611 default scrape configuration is provided for the
1612 `/metrics` endpoint polling all units of the charm on port `80`
1613 using the `MetricsEndpointProvider` object.
1614 alert_rules_path: an optional path for the location of alert rules
1615 files. Defaults to "./prometheus_alert_rules",
1616 resolved relative to the directory hosting the charm entry file.
1617 The alert rules are automatically updated on charm upgrade.
1618 refresh_event: an optional bound event or list of bound events which
1619 will be observed to re-set scrape job data (IP address and others)
1620 external_url: an optional argument that represents an external url that
1621 can be generated by an Ingress or a Proxy.
1622 lookaside_jobs_callable: an optional `Callable` which should be invoked
1623 when the job configuration is built as a secondary mapping. The callable
1624 should return a `List[Dict]` which is syntactically identical to the
1625 `jobs` parameter, but can be updated out of step with the initialization of
1626 this library without disrupting the 'global' job spec.
1627
1628 Raises:
1629 RelationNotFoundError: If there is no relation in the charm's metadata.yaml
1630 with the same name as provided via `relation_name` argument.
1631 RelationInterfaceMismatchError: The relation with the same name as provided
1632 via `relation_name` argument does not have the `prometheus_scrape` relation
1633 interface.
1634 RelationRoleMismatchError: If the relation with the same name as provided
1635 via `relation_name` argument does not have the `RelationRole.provides`
1636 role.
1637 """
1638 _validate_relation_by_interface_and_direction(
1639 charm, relation_name, RELATION_INTERFACE_NAME, RelationRole.provides
1640 )
1641
1642 try:
1643 alert_rules_path = _resolve_dir_against_charm_path(charm, alert_rules_path)
1644 except InvalidAlertRulePathError as e:
1645 logger.debug(
1646 "Invalid Prometheus alert rules folder at %s: %s",
1647 e.alert_rules_absolute_path,
1648 e.message,
1649 )
1650
1651 super().__init__(charm, relation_name)
1652 self.topology = JujuTopology.from_charm(charm)
1653
1654 self._charm = charm
1655 self._alert_rules_path = alert_rules_path
1656 self._relation_name = relation_name
1657 # sanitize job configurations to the supported subset of parameters
1658 jobs = [] if jobs is None else jobs
1659 self._jobs = PrometheusConfig.sanitize_scrape_configs(jobs)
1660
1661 if external_url:
1662 external_url = (
1663 external_url if urlparse(external_url).scheme else ("http://" + external_url)
1664 )
1665 self.external_url = external_url
1666 self._lookaside_jobs = lookaside_jobs_callable
1667
1668 events = self._charm.on[self._relation_name]
1669 self.framework.observe(events.relation_changed, self._on_relation_changed)
1670
1671 if not refresh_event:
1672 # FIXME remove once podspec charms are verified.
1673 # `self.set_scrape_job_spec()` is called every re-init so this should not be needed.
1674 if len(self._charm.meta.containers) == 1:
1675 if "kubernetes" in self._charm.meta.series:
1676 # This is a podspec charm
1677 refresh_event = [self._charm.on.update_status]
1678 else:
1679 # This is a sidecar/pebble charm
1680 container = list(self._charm.meta.containers.values())[0]
1681 refresh_event = [self._charm.on[container.name.replace("-", "_")].pebble_ready]
1682 else:
1683 logger.warning(
1684 "%d containers are present in metadata.yaml and "
1685 "refresh_event was not specified. Defaulting to update_status. "
1686 "Metrics IP may not be set in a timely fashion.",
1687 len(self._charm.meta.containers),
1688 )
1689 refresh_event = [self._charm.on.update_status]
1690
1691 else:
1692 if not isinstance(refresh_event, list):
1693 refresh_event = [refresh_event]
1694
1695 self.framework.observe(events.relation_joined, self.set_scrape_job_spec)
1696 for ev in refresh_event:
1697 self.framework.observe(ev, self.set_scrape_job_spec)
1698
1699 def _on_relation_changed(self, event):
1700 """Check for alert rule messages in the relation data before moving on."""
1701 if self._charm.unit.is_leader():
1702 ev = json.loads(event.relation.data[event.app].get("event", "{}"))
1703
1704 if ev:
1705 valid = bool(ev.get("valid", True))
1706 errors = ev.get("errors", "")
1707
1708 if valid and not errors:
1709 self.on.alert_rule_status_changed.emit(valid=valid)
1710 else:
1711 self.on.alert_rule_status_changed.emit(valid=valid, errors=errors)
1712
1713 scrape_errors = ev.get("scrape_job_errors", None)
1714 if scrape_errors:
1715 self.on.invalid_scrape_job.emit(errors=scrape_errors)
1716
1717 def update_scrape_job_spec(self, jobs):
1718 """Update scrape job specification."""
1719 self._jobs = PrometheusConfig.sanitize_scrape_configs(jobs)
1720 self.set_scrape_job_spec()
1721
1722 def set_scrape_job_spec(self, _=None):
1723 """Ensure scrape target information is made available to prometheus.
1724
1725 When a metrics provider charm is related to a prometheus charm, the
1726 metrics provider sets specification and metadata related to its own
1727 scrape configuration. This information is set using Juju application
1728 data. In addition, each of the consumer units also sets its own
1729 host address in Juju unit relation data.
1730 """
1731 self._set_unit_ip()
1732
1733 if not self._charm.unit.is_leader():
1734 return
1735
1736 alert_rules = AlertRules(topology=self.topology)
1737 alert_rules.add_path(self._alert_rules_path, recursive=True)
1738 alert_rules_as_dict = alert_rules.as_dict()
1739
1740 for relation in self._charm.model.relations[self._relation_name]:
1741 relation.data[self._charm.app]["scrape_metadata"] = json.dumps(self._scrape_metadata)
1742 relation.data[self._charm.app]["scrape_jobs"] = json.dumps(self._scrape_jobs)
1743
1744 if alert_rules_as_dict:
1745 # Update relation data with the string representation of the rule file.
1746 # Juju topology is already included in the "scrape_metadata" field above.
1747 # The consumer side of the relation uses this information to name the rules file
1748 # that is written to the filesystem.
1749 relation.data[self._charm.app]["alert_rules"] = json.dumps(alert_rules_as_dict)
1750
1751 def _set_unit_ip(self, _=None):
1752 """Set unit host address.
1753
1754 Each time a metrics provider charm container is restarted it updates its own
1755 host address in the unit relation data for the prometheus charm.
1756
1757 The only argument specified is an event, and it is ignored. This is for expediency
1758 to be able to use this method as an event handler, although no access to the
1759 event is actually needed.
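
    The split of an external URL into the hostname and path parts stored in
    relation data (as done in the body below) can be sketched with `urlparse`;
    the URL here is a made-up example:

    ```python
    from urllib.parse import urlparse

    # A made-up external URL; a scheme is required for urlparse to
    # populate hostname (the constructor prepends "http://" if absent).
    parsed = urlparse("http://charm.example.com/model-prometheus-0")
    print(parsed.hostname)  # charm.example.com
    print(parsed.path)      # /model-prometheus-0
    ```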
1760 """
1761 for relation in self._charm.model.relations[self._relation_name]:
1762 unit_ip = str(self._charm.model.get_binding(relation).network.bind_address)
1763
1764 # TODO store entire url in relation data, instead of only select url parts.
1765
1766 if self.external_url:
1767 parsed = urlparse(self.external_url)
1768 unit_address = parsed.hostname
1769 path = parsed.path
1770 elif self._is_valid_unit_address(unit_ip):
1771 unit_address = unit_ip
1772 path = ""
1773 else:
1774 unit_address = socket.getfqdn()
1775 path = ""
1776
1777 relation.data[self._charm.unit]["prometheus_scrape_unit_address"] = unit_address
1778 relation.data[self._charm.unit]["prometheus_scrape_unit_path"] = path
1779 relation.data[self._charm.unit]["prometheus_scrape_unit_name"] = str(
1780 self._charm.model.unit.name
1781 )
1782
1783 def _is_valid_unit_address(self, address: str) -> bool:
1784 """Validate a unit address.
1785
1786 At present only IP address validation is supported, but
1787 this may be extended to DNS addresses also, as needed.
1788
1789 Args:
1790 address: a string representing a unit address
1791 """
1792 try:
1793 _ = ipaddress.ip_address(address)
1794 except ValueError:
1795 return False
1796
1797 return True
1798
1799 @property
1800 def _scrape_jobs(self) -> list:
1801 """Fetch list of scrape jobs.
1802
1803 Returns:
1804 A list of dictionaries, where each dictionary specifies a
1805 single scrape job for Prometheus.
1806 """
1807 jobs = self._jobs if self._jobs else [DEFAULT_JOB]
1808 if callable(self._lookaside_jobs):
1809 return jobs + PrometheusConfig.sanitize_scrape_configs(self._lookaside_jobs())
1810 return jobs
1811
1812 @property
1813 def _scrape_metadata(self) -> dict:
1814 """Generate scrape metadata.
1815
1816 Returns:
1817 Scrape configuration metadata for this metrics provider charm.
1818 """
1819 return self.topology.as_dict()
1820
1821
1822 class PrometheusRulesProvider(Object):
1823 """Forward rules to Prometheus.
1824
1825 This object may be used to forward rules to Prometheus. At present it only supports
1826 forwarding alert rules. This is unlike :class:`MetricsEndpointProvider`, which
1827 is used for forwarding both scrape targets and associated alert rules. This object
1828 is typically used when there is a desire to forward rules that apply globally (across
1829 all deployed charms and units) rather than to a single charm. All rule files are
1830 forwarded using the same 'prometheus_scrape' interface that is also used by
1831 `MetricsEndpointProvider`.
1832
1833 Args:
1834 charm: A charm instance that `provides` a relation with the `prometheus_scrape` interface.
1835 relation_name: Name of the relation in `metadata.yaml` that
1836 has the `prometheus_scrape` interface.
1837 dir_path: Root directory for the collection of rule files.
1838 recursive: Whether to scan for rule files recursively.
1839 """
1840
1841 def __init__(
1842 self,
1843 charm: CharmBase,
1844 relation_name: str = DEFAULT_RELATION_NAME,
1845 dir_path: str = DEFAULT_ALERT_RULES_RELATIVE_PATH,
1846 recursive=True,
1847 ):
1848 super().__init__(charm, relation_name)
1849 self._charm = charm
1850 self._relation_name = relation_name
1851 self._recursive = recursive
1852
1853 try:
1854 dir_path = _resolve_dir_against_charm_path(charm, dir_path)
1855 except InvalidAlertRulePathError as e:
1856 logger.debug(
1857 "Invalid Prometheus alert rules folder at %s: %s",
1858 e.alert_rules_absolute_path,
1859 e.message,
1860 )
1861 self.dir_path = dir_path
1862
1863 events = self._charm.on[self._relation_name]
1864 event_sources = [
1865 events.relation_joined,
1866 events.relation_changed,
1867 self._charm.on.leader_elected,
1868 self._charm.on.upgrade_charm,
1869 ]
1870
1871 for event_source in event_sources:
1872 self.framework.observe(event_source, self._update_relation_data)
1873
1874 def _reinitialize_alert_rules(self):
1875 """Reloads alert rules and updates all relations."""
1876 self._update_relation_data(None)
1877
1878 def _update_relation_data(self, _):
1879 """Update application relation data with alert rules for all relations."""
1880 if not self._charm.unit.is_leader():
1881 return
1882
1883 alert_rules = AlertRules()
1884 alert_rules.add_path(self.dir_path, recursive=self._recursive)
1885 alert_rules_as_dict = alert_rules.as_dict()
1886
1887 logger.info("Updating relation data with rule files from disk")
1888 for relation in self._charm.model.relations[self._relation_name]:
1889 relation.data[self._charm.app]["alert_rules"] = json.dumps(
1890 alert_rules_as_dict,
1891 sort_keys=True, # sort, to prevent unnecessary relation_changed events
1892 )
1893
1894
1895 class MetricsEndpointAggregator(Object):
1896 """Aggregate metrics from multiple scrape targets.
1897
1898 `MetricsEndpointAggregator` collects scrape target information from one
1899 or more related charms and forwards this to a `MetricsEndpointConsumer`
1900 charm, which may be in a different Juju model. However, it is
1901 essential that `MetricsEndpointAggregator` itself resides in the same
1902 model as its scrape targets, as this is currently the only way to
1903 ensure in Juju that the `MetricsEndpointAggregator` will be able to
1904 determine the model name and uuid of the scrape targets.
1905
1906 `MetricsEndpointAggregator` should be used in place of
1907 `MetricsEndpointProvider` in the following two use cases:
1908
1909 1. Integrating one or more scrape targets that do not support the
1910 `prometheus_scrape` interface.
1911
1912 2. Integrating one or more scrape targets through cross model
1913 relations, although the [Scrape Config Operator](https://charmhub.io/cos-configuration-k8s)
1914 may also be used for the purpose of supporting cross model
1915 relations.
1916
1917 Using `MetricsEndpointAggregator` to build a Prometheus charm client
1918 only requires instantiating it. Instantiating
1919 `MetricsEndpointAggregator` is similar to `MetricsEndpointProvider` except
1920 that it requires specifying the names of three relations: the
1921 relation with scrape targets, the relation for alert rules, and
1922 that with the Prometheus charms. For example
1923
1924 ```python
1925 self._aggregator = MetricsEndpointAggregator(
1926 self,
1927 {
1928 "prometheus": "monitoring",
1929 "scrape_target": "prometheus-target",
1930 "alert_rules": "prometheus-rules"
1931 }
1932 )
1933 ```
1934
1935 `MetricsEndpointAggregator` assumes that each unit of a scrape target
1936 sets in its unit-level relation data two entries with keys
1937 "hostname" and "port". If it is required to integrate with charms
1938 that do not honor these assumptions, it is always possible to
1939 derive from `MetricsEndpointAggregator` overriding the `_get_targets()`
1940 method, which is responsible for aggregating the unit name, host
1941 address ("hostname") and port of the scrape target.
1942 `MetricsEndpointAggregator` also assumes that each unit of a
1943 scrape target sets in its unit-level relation data a key named
1944 "groups". The value of this key is expected to be the string
1945 representation of a list of Prometheus alert rules in YAML format.
1946 An example of a single such alert rule is
1947
1948 ```yaml
1949 - alert: HighRequestLatency
1950 expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
1951 for: 10m
1952 labels:
1953 severity: page
1954 annotations:
1955 summary: High request latency
1956 ```
1957
1958 Once again if it is required to integrate with charms that do not
1959 honour these assumptions about alert rules then an object derived
1960 from `MetricsEndpointAggregator` may be used by overriding the
1961 `_get_alert_rules()` method.
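
As a rough sketch of such an override, the data-bag parsing reduces to a loop
of this shape, written here over plain dicts standing in for relation unit
data bags (the `ingress-address` key is hypothetical):

```python
def parse_targets(unit_data):
    """Build the {unit_name: {"hostname": ..., "port": ...}} mapping that
    an overridden _get_targets() would return, from plain dicts keyed by
    unit name (a stand-in for relation unit data bags)."""
    targets = {}
    for unit_name, data in unit_data.items():
        address = data.get("ingress-address")  # hypothetical key
        if address:
            targets[unit_name] = {"hostname": address, "port": data.get("port", "80")}
    return targets

print(parse_targets({"target-app/0": {"ingress-address": "10.1.2.3", "port": "9100"}}))
# {'target-app/0': {'hostname': '10.1.2.3', 'port': '9100'}}
```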
1962
1963 `MetricsEndpointAggregator` ensures that Prometheus scrape job
1964 specifications and alert rules are annotated with Juju topology
1965 information, just like `MetricsEndpointProvider` and
1966 `MetricsEndpointConsumer` do.
1967
1968 By default, `MetricsEndpointAggregator` ensures that Prometheus
1969 "instance" labels refer to Juju topology. This ensures that
1970 instance labels are stable over unit recreation. While it is not
1971 advisable to change this option, if required it can be done by
1972 setting the "relabel_instance" keyword argument to `False` when
1973 constructing an aggregator object.
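
The effect of that relabeling can be sketched as a plain join of the four
topology labels (all label values below are made up):

```python
# Prometheus applies the relabel configuration produced by this library
# by joining the four juju_* source labels with the "_" separator into
# the "instance" label. Equivalent plain-Python sketch:
labels = {
    "juju_model": "lma",
    "juju_model_uuid": "91b21b07",
    "juju_application": "zinc-k8s",
    "juju_unit": "zinc-k8s/0",
}
instance = "_".join(
    labels[k] for k in ("juju_model", "juju_model_uuid", "juju_application", "juju_unit")
)
print(instance)  # lma_91b21b07_zinc-k8s_zinc-k8s/0
```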
1974 """
1975
1976 _stored = StoredState()
1977
1978 def __init__(
1979 self,
1980 charm,
1981 relation_names: Optional[dict] = None,
1982 relabel_instance=True,
1983 resolve_addresses=False,
1984 ):
1985 """Construct a `MetricsEndpointAggregator`.
1986
1987 Args:
1988 charm: a `CharmBase` object that manages this
1989 `MetricsEndpointAggregator` object. Typically, this is
1990 `self` in the instantiating class.
1991 relation_names: a dictionary with three keys. The value
1992 of the "scrape_target" and "alert_rules" keys are
1993 the relation names over which scrape job and alert rule
1994 information is gathered by this `MetricsEndpointAggregator`.
1995 And the value of the "prometheus" key is the name of
1996 the relation with a `MetricsEndpointConsumer` such as
1997 the Prometheus charm.
1998 relabel_instance: A boolean flag indicating if Prometheus
1999 scrape job "instance" labels must refer to Juju Topology.
2000 resolve_addresses: A boolean flag indicating if the aggregator
2001 should attempt to perform DNS lookups of targets and append
2002 a `dns_name` label.
2003 """
2004 self._charm = charm
2005
2006 relation_names = relation_names or {}
2007
2008 self._prometheus_relation = relation_names.get(
2009 "prometheus", "downstream-prometheus-scrape"
2010 )
2011 self._target_relation = relation_names.get("scrape_target", "prometheus-target")
2012 self._alert_rules_relation = relation_names.get("alert_rules", "prometheus-rules")
2013
2014 super().__init__(charm, self._prometheus_relation)
2015 self._stored.set_default(jobs=[], alert_rules=[])
2016
2017 self._relabel_instance = relabel_instance
2018 self._resolve_addresses = resolve_addresses
2019
2020 # manage Prometheus charm relation events
2021 prometheus_events = self._charm.on[self._prometheus_relation]
2022 self.framework.observe(prometheus_events.relation_joined, self._set_prometheus_data)
2023
2024 # manage list of Prometheus scrape jobs from related scrape targets
2025 target_events = self._charm.on[self._target_relation]
2026 self.framework.observe(target_events.relation_changed, self._on_prometheus_targets_changed)
2027 self.framework.observe(
2028 target_events.relation_departed, self._on_prometheus_targets_departed
2029 )
2030
2031 # manage alert rules for Prometheus from related scrape targets
2032 alert_rule_events = self._charm.on[self._alert_rules_relation]
2033 self.framework.observe(alert_rule_events.relation_changed, self._on_alert_rules_changed)
2034 self.framework.observe(alert_rule_events.relation_departed, self._on_alert_rules_departed)
2035
2036 def _set_prometheus_data(self, event):
2037 """Ensure every new Prometheus instance is updated.
2038
2039 Any time a new Prometheus unit joins the relation with
2040 `MetricsEndpointAggregator`, that Prometheus unit is provided
2041 with the complete set of existing scrape jobs and alert rules.
2042 """
2043 if not self._charm.unit.is_leader():
2044 return
2045
2046 jobs = [] + _type_convert_stored(
2047 self._stored.jobs
2048 ) # list of scrape jobs, one per relation
2049 for relation in self.model.relations[self._target_relation]:
2050 targets = self._get_targets(relation)
2051 if targets and relation.app:
2052 jobs.append(self._static_scrape_job(targets, relation.app.name))
2053
2054 groups = [] + _type_convert_stored(self._stored.alert_rules) # list of alert rule groups
2055 for relation in self.model.relations[self._alert_rules_relation]:
2056 unit_rules = self._get_alert_rules(relation)
2057 if unit_rules and relation.app:
2058 appname = relation.app.name
2059 rules = self._label_alert_rules(unit_rules, appname)
2060 group = {"name": self.group_name(appname), "rules": rules}
2061 groups.append(group)
2062
2063 event.relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs)
2064 event.relation.data[self._charm.app]["alert_rules"] = json.dumps({"groups": groups})
2065
2066 def _on_prometheus_targets_changed(self, event):
2067 """Update scrape jobs in response to scrape target changes.
2068
2069 When there is any change in relation data with any scrape
2070 target, the Prometheus scrape job for that specific target is
2071 updated.
2072 """
2073 targets = self._get_targets(event.relation)
2074 if not targets:
2075 return
2076
2077 # new scrape job for the relation that has changed
2078 self.set_target_job_data(targets, event.relation.app.name)
2079
2080 def set_target_job_data(self, targets: dict, app_name: str, **kwargs) -> None:
2081 """Update scrape jobs in response to scrape target changes.
2082
2083 When there is any change in relation data with any scrape
2084 target, the Prometheus scrape job for that specific target is
2085 updated. Additionally, if this method is called manually, do the
2086 same.
2087
2088 Args:
2089 targets: a `dict` containing target information
2090 app_name: a `str` identifying the application
2091 kwargs: a `dict` of the extra arguments passed to the function
2092 """
2093 if not self._charm.unit.is_leader():
2094 return
2095
2096 # new scrape job for the relation that has changed
2097 updated_job = self._static_scrape_job(targets, app_name, **kwargs)
2098
2099 for relation in self.model.relations[self._prometheus_relation]:
2100 jobs = json.loads(relation.data[self._charm.app].get("scrape_jobs", "[]"))
2101 # list of scrape jobs that have not changed
2102 jobs = [job for job in jobs if updated_job["job_name"] != job["job_name"]]
2103 jobs.append(updated_job)
2104 relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs)
2105
2106 if _type_convert_stored(self._stored.jobs) != jobs:
2107 self._stored.jobs = jobs
2108
2109 def _on_prometheus_targets_departed(self, event):
2110 """Remove scrape jobs when a target departs.
2111
2112 Any time a scrape target departs, any Prometheus scrape job
2113 associated with that specific scrape target is removed.
2114 """
2115 job_name = self._job_name(event.relation.app.name)
2116 unit_name = event.unit.name
2117 self.remove_prometheus_jobs(job_name, unit_name)
2118
2119 def remove_prometheus_jobs(self, job_name: str, unit_name: Optional[str] = ""):
2120 """Given a job name and unit name, remove associated scrape jobs.
2121
2122 The `unit_name` parameter is used for automatic, relation data bag-based
2123 generation, where the unit name in labels can be used to ensure that jobs with
2124 similar names (which are generated via the app name when scanning relation data
2125 bags) are not accidentally removed, as their unit name labels will differ.
2126 For NRPE, the job name is calculated from an ID sent via the NRPE relation, and is
2127 sufficient to uniquely identify the target.
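
The unit-scoped filtering applied below can be sketched over plain dicts
(all job and label values here are made up):

```python
# Only static configs labeled with the departing unit are dropped;
# configs for other units of the same application are kept, so the
# job survives as long as any unit remains.
changed_job = {
    "job_name": "juju_lma_91b21b0_app_prometheus_scrape",
    "static_configs": [
        {"targets": ["10.0.0.1:80"], "labels": {"juju_unit": "app/0"}},
        {"targets": ["10.0.0.2:80"], "labels": {"juju_unit": "app/1"}},
    ],
}
departing_unit = "app/0"
configs_kept = [
    config
    for config in changed_job["static_configs"]
    if config.get("labels", {}).get("juju_unit") != departing_unit
]
print(len(configs_kept))  # 1
```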
2128 """
2129 if not self._charm.unit.is_leader():
2130 return
2131
2132 for relation in self.model.relations[self._prometheus_relation]:
2133 jobs = json.loads(relation.data[self._charm.app].get("scrape_jobs", "[]"))
2134 if not jobs:
2135 continue
2136
2137 changed_job = [j for j in jobs if j.get("job_name") == job_name]
2138 if not changed_job:
2139 continue
2140 changed_job = changed_job[0]
2141
2142 # list of scrape jobs that have not changed
2143 jobs = [job for job in jobs if job.get("job_name") != job_name]
2144
2145 # list of scrape jobs for units of the same application that still exist
2146 configs_kept = [
2147 config
2148 for config in changed_job["static_configs"] # type: ignore
2149 if config.get("labels", {}).get("juju_unit") != unit_name
2150 ]
2151
2152 if configs_kept:
2153 changed_job["static_configs"] = configs_kept # type: ignore
2154 jobs.append(changed_job)
2155
2156 relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs)
2157
2158 if _type_convert_stored(self._stored.jobs) != jobs:
2159 self._stored.jobs = jobs
2160
2161 def _job_name(self, appname) -> str:
2162 """Construct a scrape job name.
2163
2164 Each relation has its own unique scrape job name. All units in
2165 the relation are scraped as part of the same scrape job.
2166
2167 Args:
2168 appname: string name of a related application.
2169
2170 Returns:
2171 a string Prometheus scrape job name for the application.
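
For instance, with a made-up model name, uuid prefix, and application
name, the format below yields:

```python
# Same format string as used in the method body.
job_name = "juju_{}_{}_{}_prometheus_scrape".format("lma", "91b21b0", "zinc-k8s")
print(job_name)  # juju_lma_91b21b0_zinc-k8s_prometheus_scrape
```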
2172 """
2173 return "juju_{}_{}_{}_prometheus_scrape".format(
2174 self.model.name, self.model.uuid[:7], appname
2175 )
2176
2177 def _get_targets(self, relation) -> dict:
2178 """Fetch scrape targets for a relation.
2179
2180 Scrape target information is returned for each unit in the
2181 relation. This information contains the unit name, network
2182 hostname (or address) for that unit, and port on which a
2183 metrics endpoint is exposed in that unit.
2184
2185 Args:
2186 relation: an `ops.model.Relation` object for which scrape
2187 targets are required.
2188
2189 Returns:
2190 a dictionary whose keys are names of the units in the
2191 relation. The value associated with each key is itself
2192 a dictionary of the form
2193 ```
2194 {"hostname": hostname, "port": port}
2195 ```
2196 """
2197 targets = {}
2198 for unit in relation.units:
2199 port = relation.data[unit].get("port", 80)
2200 hostname = relation.data[unit].get("hostname")
2201 if hostname:
2202 targets.update({unit.name: {"hostname": hostname, "port": port}})
2203
2204 return targets
2205
2206 def _static_scrape_job(self, targets, application_name, **kwargs) -> dict:
2207 """Construct a static scrape job for an application.
2208
2209 Args:
2210 targets: a dictionary providing hostname and port for all
2211 scrape targets. The keys of this dictionary are unit
2212 names. Values corresponding to these keys are
2213 themselves a dictionary with keys "hostname" and
2214 "port".
2215 application_name: a string name of the application for
2216 which this static scrape job is being constructed.
2217 kwargs: a `dict` of the extra arguments passed to the function
2218
2219 Returns:
2220 A dictionary corresponding to a Prometheus static scrape
2221 job configuration for one application. The returned
2222 dictionary may be transformed into YAML and appended to
2223 the list of any existing list of Prometheus static configs.
2224 """
2225 juju_model = self.model.name
2226 juju_model_uuid = self.model.uuid
2227
2228 job = {
2229 "job_name": self._job_name(application_name),
2230 "static_configs": [
2231 {
2232 "targets": ["{}:{}".format(target["hostname"], target["port"])],
2233 "labels": {
2234 "juju_model": juju_model,
2235 "juju_model_uuid": juju_model_uuid,
2236 "juju_application": application_name,
2237 "juju_unit": unit_name,
2238 "host": target["hostname"],
2239 # Expanding this will merge the dicts and replace the
2240 # topology labels if any were present/found
2241 **self._static_config_extra_labels(target),
2242 },
2243 }
2244 for unit_name, target in targets.items()
2245 ],
2246 "relabel_configs": self._relabel_configs + kwargs.get("relabel_configs", []),
2247 }
2248 job.update(kwargs.get("updates", {}))
2249
2250 return job
2251
2252 def _static_config_extra_labels(self, target: Dict[str, str]) -> Dict[str, str]:
2253 """Build a list of extra static config parameters, if specified."""
2254 extra_info = {}
2255
2256 if self._resolve_addresses:
2257 try:
2258 dns_name = socket.gethostbyaddr(target["hostname"])[0]
2259 except OSError:
2260 logger.debug("Could not perform DNS lookup for %s", target["hostname"])
2261 dns_name = target["hostname"]
2262 extra_info["dns_name"] = dns_name
2263 label_re = re.compile(r'(?P<label>juju.*?)="(?P<value>.*?)",?')
2264
2265 try:
2266 with urlopen(f'http://{target["hostname"]}:{target["port"]}/metrics') as resp:
2267 data = resp.read().decode("utf-8").splitlines()
2268 for metric in data:
2269 for match in label_re.finditer(metric):
2270 extra_info[match.group("label")] = match.group("value")
2271 except Exception as e:  # Exception subsumes HTTPError, URLError, OSError, ConnectionResetError
2272 logger.debug("Could not scrape target: %s", e)
2273 return extra_info
2274
2275 @property
2276 def _relabel_configs(self) -> list:
2277 """Create Juju topology relabeling configuration.
2278
2279 Using Juju topology for instance labels ensures that these
2280 labels are stable across unit recreation.
2281
2282 Returns:
2283 a list of Prometheus relabeling configurations. Each item in
2284 this list is one relabel configuration.
2285 """
2286 return (
2287 [
2288 {
2289 "source_labels": [
2290 "juju_model",
2291 "juju_model_uuid",
2292 "juju_application",
2293 "juju_unit",
2294 ],
2295 "separator": "_",
2296 "target_label": "instance",
2297 "regex": "(.*)",
2298 }
2299 ]
2300 if self._relabel_instance
2301 else []
2302 )
2303
2304 def _on_alert_rules_changed(self, event):
2305 """Update alert rules in response to scrape target changes.
2306
2307 When there is any change in alert rule relation data for any
2308 scrape target, the list of alert rules for that specific
2309 target is updated.
2310 """
2311 unit_rules = self._get_alert_rules(event.relation)
2312 if not unit_rules:
2313 return
2314
2315 app_name = event.relation.app.name
2316 self.set_alert_rule_data(app_name, unit_rules)
2317
2318 def set_alert_rule_data(self, name: str, unit_rules: dict, label_rules: bool = True) -> None:
2319 """Update alert rule data.
2320
2321 The unit rules should be a dict, to which additional Juju topology labels are added. For
2322 rules generated by the NRPE exporter, they are pre-labeled so lookups can be performed.
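
The merge performed below (replace matching rules inside an existing group,
then extend) can be sketched over plain lists; the rule contents here are
made up:

```python
# Rules already in the group that also appear in the update are removed
# first, then the updated rules are appended, so re-setting the same
# rules is idempotent and does not duplicate entries.
group = {"name": "g", "rules": [{"alert": "A", "expr": "up == 0"}]}
updated_group = {
    "name": "g",
    "rules": [{"alert": "A", "expr": "up == 0"}, {"alert": "B", "expr": "up == 1"}],
}
group["rules"] = [r for r in group["rules"] if r not in updated_group["rules"]]
group["rules"].extend(updated_group["rules"])
print(len(group["rules"]))  # 2
```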
2323 """
2324 if not self._charm.unit.is_leader():
2325 return
2326
2327 if label_rules:
2328 rules = self._label_alert_rules(unit_rules, name)
2329 else:
2330 rules = [unit_rules]
2331 updated_group = {"name": self.group_name(name), "rules": rules}
2332
2333 for relation in self.model.relations[self._prometheus_relation]:
2334 alert_rules = json.loads(relation.data[self._charm.app].get("alert_rules", "{}"))
2335 groups = alert_rules.get("groups", [])
2336 # list of alert rule groups that have not changed
2337 for group in groups:
2338 if group["name"] == updated_group["name"]:
2339 group["rules"] = [r for r in group["rules"] if r not in updated_group["rules"]]
2340 group["rules"].extend(updated_group["rules"])
2341
2342 if updated_group["name"] not in [g["name"] for g in groups]:
2343 groups.append(updated_group)
2344 relation.data[self._charm.app]["alert_rules"] = json.dumps({"groups": groups})
2345
2346 if _type_convert_stored(self._stored.alert_rules) != groups:
2347 self._stored.alert_rules = groups
2348
2349 def _on_alert_rules_departed(self, event):
2350 """Remove alert rules for departed targets.
2351
2352 Any time a scrape target departs, any alert rules associated
2353 with that specific scrape target are removed.
2354 """
2355 group_name = self.group_name(event.relation.app.name)
2356 unit_name = event.unit.name
2357 self.remove_alert_rules(group_name, unit_name)
2358
2359 def remove_alert_rules(self, group_name: str, unit_name: str) -> None:
2360 """Remove an alert rule group from relation data."""
2361 if not self._charm.unit.is_leader():
2362 return
2363
2364 for relation in self.model.relations[self._prometheus_relation]:
2365 alert_rules = json.loads(relation.data[self._charm.app].get("alert_rules", "{}"))
2366 if not alert_rules:
2367 continue
2368
2369 groups = alert_rules.get("groups", [])
2370 if not groups:
2371 continue
2372
2373 changed_group = [group for group in groups if group["name"] == group_name]
2374 if not changed_group:
2375 continue
2376 changed_group = changed_group[0]
2377
2378 # list of alert rule groups that have not changed
2379 groups = [group for group in groups if group["name"] != group_name]
2380
2381 # list of alert rules not associated with departing unit
2382 rules_kept = [
2383 rule
2384 for rule in changed_group.get("rules") # type: ignore
2385 if rule.get("labels").get("juju_unit") != unit_name
2386 ]
2387
2388 if rules_kept:
2389 changed_group["rules"] = rules_kept # type: ignore
2390 groups.append(changed_group)
2391
2392 relation.data[self._charm.app]["alert_rules"] = (
2393 json.dumps({"groups": groups}) if groups else "{}"
2394 )
2395
2396 if _type_convert_stored(self._stored.alert_rules) != groups:
2397 self._stored.alert_rules = groups
2398
2399 def _get_alert_rules(self, relation) -> dict:
2400 """Fetch alert rules for a relation.
2401
2402 Each unit of the related scrape target may have its own
2403 associated alert rules. Alert rules for all units are returned
2404 indexed by unit name.
2405
2406 Args:
2407 relation: an `ops.model.Relation` object for which alert
2408 rules are required.
2409
2410 Returns:
2411 a dictionary whose keys are names of the units in the
2412 relation. The value associated with each key is a list
2413 of alert rules. Each rule is in dictionary format and
2414 corresponds to a single Prometheus alert rule.
2416 """
2417 rules = {}
2418 for unit in relation.units:
2419 unit_rules = yaml.safe_load(relation.data[unit].get("groups", ""))
2420 if unit_rules:
2421 rules.update({unit.name: unit_rules})
2422
2423 return rules
2424
2425 def group_name(self, unit_name: str) -> str:
2426 """Construct name for an alert rule group.
2427
2428 Each unit in a relation may define its own alert rules. All
2429 rules, for all units in a relation are grouped together and
2430 given a single alert rule group name.
2431
2432 Args:
2433 unit_name: string name of a related application.
2434
2435 Returns:
2436 a string Prometheus alert rules group name for the unit.
2437 """
2438 unit_name = re.sub(r"/", "_", unit_name)
2439 return "juju_{}_{}_{}_alert_rules".format(self.model.name, self.model.uuid[:7], unit_name)
2440
2441 def _label_alert_rules(self, unit_rules, app_name: str) -> list:
2442 """Apply juju topology labels to alert rules.
2443
2444 Args:
2445 unit_rules: a list of alert rules, where each rule is in
2446 dictionary format.
2447 app_name: a string name of the application to which the
2448 alert rules belong.
2449
2450 Returns:
2451 a list of alert rules with Juju topology labels.
2452 """
2453 labeled_rules = []
2454 for unit_name, rules in unit_rules.items():
2455 for rule in rules:
2456 # the new JujuTopology removed this, so build it up by hand
2457 matchers = {
2458 "juju_{}".format(k): v
2459 for k, v in JujuTopology(self.model.name, self.model.uuid, app_name, unit_name)
2460 .as_dict(excluded_keys=["charm_name"])
2461 .items()
2462 }
2463 rule["labels"].update(matchers.items())
2464 labeled_rules.append(rule)
2465
2466 return labeled_rules
2467
2468
2469 class CosTool:
2470 """Uses cos-tool to inject label matchers into alert rule expressions and validate rules."""
2471
2472 _path = None
2473 _disabled = False
2474
2475 def __init__(self, charm):
2476 self._charm = charm
2477
2478 @property
2479 def path(self):
2480 """Lazy lookup of the path of cos-tool."""
2481 if self._disabled:
2482 return None
2483 if not self._path:
2484 self._path = self._get_tool_path()
2485 if not self._path:
2486 logger.debug("Skipping injection of juju topology as label matchers")
2487 self._disabled = True
2488 return self._path
2489
2490 def apply_label_matchers(self, rules) -> dict:
2491 """Will apply label matchers to the expression of all alerts in all supplied groups."""
2492 if not self.path:
2493 return rules
2494 for group in rules["groups"]:
2495 rules_in_group = group.get("rules", [])
2496 for rule in rules_in_group:
2497 topology = {}
2498 # if the user for some reason has provided juju_unit, we'll need to honor it
2499 # in most cases, however, this will be empty
2500 for label in [
2501 "juju_model",
2502 "juju_model_uuid",
2503 "juju_application",
2504 "juju_charm",
2505 "juju_unit",
2506 ]:
2507 if label in rule["labels"]:
2508 topology[label] = rule["labels"][label]
2509
2510 rule["expr"] = self.inject_label_matchers(rule["expr"], topology)
2511 return rules
2512
    def validate_alert_rules(self, rules: dict) -> Tuple[bool, str]:
        """Validate the correctness of alert rules, returning a boolean and any errors."""
        if not self.path:
            logger.debug("`cos-tool` unavailable. Not validating alert correctness.")
            return True, ""

        with tempfile.TemporaryDirectory() as tmpdir:
            rule_path = Path(tmpdir + "/validate_rule.yaml")
            rule_path.write_text(yaml.dump(rules))

            args = [str(self.path), "validate", str(rule_path)]
            # noinspection PyBroadException
            try:
                self._exec(args)
                return True, ""
            except subprocess.CalledProcessError as e:
                logger.debug("Validating the rules failed: %s", e.output)
                return False, ", ".join(
                    [
                        line
                        for line in e.output.decode("utf8").splitlines()
                        if "error validating" in line
                    ]
                )

    def validate_scrape_jobs(self, jobs: list) -> bool:
        """Validate scrape jobs using cos-tool."""
        if not self.path:
            logger.debug("`cos-tool` unavailable. Not validating scrape jobs.")
            return True
        conf = {"scrape_configs": jobs}
        with tempfile.NamedTemporaryFile() as tmpfile:
            with open(tmpfile.name, "w") as f:
                f.write(yaml.safe_dump(conf))
            try:
                self._exec([str(self.path), "validate-config", tmpfile.name])
            except subprocess.CalledProcessError as e:
                logger.error("Validating scrape jobs failed: %s", e.output)
                raise
        return True

    def inject_label_matchers(self, expression, topology) -> str:
        """Add label matchers to an expression."""
        if not topology:
            return expression
        if not self.path:
            logger.debug("`cos-tool` unavailable. Leaving expression unchanged: %s", expression)
            return expression
        args = [str(self.path), "transform"]
        args.extend(
            ["--label-matcher={}={}".format(key, value) for key, value in topology.items()]
        )

        args.append(str(expression))
        # noinspection PyBroadException
        try:
            return self._exec(args)
        except subprocess.CalledProcessError as e:
            logger.debug('Applying the expression failed: "%s", falling back to the original', e)
            return expression

    def _get_tool_path(self) -> Optional[Path]:
        arch = platform.machine()
        arch = "amd64" if arch == "x86_64" else arch
        res = "cos-tool-{}".format(arch)
        try:
            path = Path(res).resolve()
            path.chmod(0o777)
            return path
        except NotImplementedError:
            logger.debug("System lacks support for chmod")
        except FileNotFoundError:
            logger.debug('Could not locate cos-tool at: "{}"'.format(res))
        return None

    def _exec(self, cmd) -> str:
        result = subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        return result.stdout.decode("utf-8").strip()