From bfe7baf46317bd9e8ca3a37370d5fca068dc1b46 Mon Sep 17 00:00:00 2001
From: aguilard
Date: Tue, 16 May 2023 13:59:55 +0200
Subject: [PATCH 1/3] Add section for debugging in NG-SA

Change-Id: Ifb24ef9616244deda5df61c263d6854e08596e71
Signed-off-by: aguilard
---
 09-troubleshooting.md | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/09-troubleshooting.md b/09-troubleshooting.md
index 81d3379..78bcd28 100644
--- a/09-troubleshooting.md
+++ b/09-troubleshooting.md
@@ -464,7 +464,7 @@

- **CAUSE**: Normally a migration from release FIVE does not properly set the environment for LCM
- **SOLUTION**: Ensure the variable **OSMLCM_VCA_PUBKEY** is properly set in the file `/etc/osm/docker/lcm.env`. The value must match the output of the command `cat $HOME/.local/share/juju/ssh/juju_id_rsa.pub`. If not, add or change it. Restart OSM, or just the LCM service, with `docker service update osm_lcm --force --env-add OSMLCM_VCA_PUBKEY=""`

-## Common issues whwn interacting with NBI
+## Common issues when interacting with NBI

### SSL certificate problem

@@ -640,3 +640,39 @@ Please provide some context to your questions. As an example, find below some gu
- NAT rules in the machine where OSM is running

Common sense applies here, so you don't need to send everything, but just enough information to diagnose the issue and find a proper solution.

## Debugging errors in NG-SA

Since OSM release 14, the VNF monitoring architecture is based on Apache Airflow and Prometheus. The Airflow DAGs, in addition to periodically collecting metrics from VIMs and storing them into Prometheus, implement auto-scaling and auto-healing closed-loop operations which are triggered by Prometheus alerts. These alerts are managed by AlertManager and forwarded to Webhook Translator, which re-formats them to adapt them to Airflow. So the alert workflow is this: DAGs collect metrics => Prometheus => AlertManager => Webhook Translator => Alarm driven DAG.

In case of any kind of error related to monitoring, the first thing to check should be the metrics stored in Prometheus. Its graphical interface can be visited at the URL . Some useful metrics to review are the following:

- `ns_topology`: metric generated by a DAG with the current topology (VNFs and NSs) of instantiated VDUs in OSM.
- `vm_status`: status (1: ok, 0: error) of the VMs in the VIMs registered in OSM.
- `vm_status_extended`: metric enriched from the two previous ones, so it includes data about VIM and OSM in the labels.
- `osm_*`: resource consumption metrics. Only instantiated VNFs that include monitoring parameters have these kinds of metrics in Prometheus.

In case you need to debug closed-loop operations, you will also need to check the Prometheus alerts here . On this page you can see the alerts rules and their status: inactive, pending or active. When an alert is fired (its status changes from pending to active) or is marked as resolved (from active to inactive), the appropriate DAG is run on Airflow. There are three types of alerting rules:

- `vdu_down`: this alert is fired when a VDU remains in a non-OK state for several minutes and triggers the `alert_vdu` DAG. Its labels include information about NS, VNF, VIM, etc.
- `scalein_*`: these rules manage scale-in operations based on the resource consumption metrics and the number of VDU instances. They trigger the `scalein_vdu` DAG.
- `scaleout_*`: these rules manage scale-out operations based on the resource consumption metrics and the number of VDU instances. They trigger the `scaleout_vdu` DAG.
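Both the metrics and these alerting rules can also be checked from the command line through Prometheus' standard HTTP API. The following is a minimal sketch, assuming `jq` is installed; the address is a placeholder for wherever your deployment exposes Prometheus:

```bash
# Placeholder: replace with the address where your OSM Prometheus is exposed
PROM_URL="http://<prometheus-host>:<prometheus-port>"

# Current value of vm_status for every VM (1: ok, 0: error), with its labels
curl -sG "${PROM_URL}/api/v1/query" --data-urlencode 'query=vm_status' |
  jq '.data.result[] | {labels: .metric, value: .value[1]}'

# Alerting rules with their current state (reported by the API as inactive,
# pending or firing)
curl -s "${PROM_URL}/api/v1/rules?type=alert" |
  jq '.data.groups[].rules[] | {name: .name, state: .state}'

# Currently active alerts with the labels that will reach the DAG (NS, VNF, VIM, ...)
curl -s "${PROM_URL}/api/v1/alerts" |
  jq '.data.alerts[] | {labels: .labels, state: .state}'
```

If an alert is firing but the corresponding DAG did not run, the next components to check are AlertManager and Webhook Translator, in that order along the workflow described above.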
Finally, it is also useful for debugging to view the logs of DAG executions. To do this, you must visit the Airflow web UI, which is accessible on the NodePort exposed by the `airflow-webserver` service in OSM's cluster (not a fixed port):

```bash
kubectl -n osm get svc airflow-webserver
NAME                TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
airflow-webserver   NodePort   10.100.57.168   <none>        8080:19371/TCP   12d
```

When you open the URL (`19371` in the example above) in a browser, you will be prompted for the user and password (`admin`/`admin` by default). After that you will see the dashboard with the list of DAGs:

- `alert_vdu`: it is executed when a VDU down alarm is fired or resolved.
- `scalein_vdu`, `scaleout_vdu`: executed when auto-scaling conditions in a VNF are met.
- `ns_topology`: this DAG is executed periodically to update the topology metric of the instantiated NSs in Prometheus.
- `vim_status_*`: there is one such DAG for each VIM in OSM. It checks the VIM's reachability every few minutes.
- `vm_status_vim_*`: these DAGs (one per VIM) get the VM status from the VIM and store it in Prometheus.
- `vm_metrics_vim_*`: these DAGs (one per VIM) store in Prometheus the resource consumption metrics collected from the VIM.
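The runs of these DAGs can also be listed without the web UI. The sketch below is only an illustration: it assumes the `osm` namespace and service name shown above, and that Airflow's stable REST API is enabled with the basic-auth backend (this depends on the Airflow configuration of your deployment):

```bash
# Expose the Airflow webserver locally as an alternative to the NodePort
kubectl -n osm port-forward svc/airflow-webserver 8080:8080 &

# List the latest runs of the alert_vdu DAG and their state; the credentials
# are the web UI ones (admin/admin by default)
curl -s -u admin:admin \
  "http://localhost:8080/api/v1/dags/alert_vdu/dagRuns?limit=5&order_by=-execution_date" |
  jq '.dag_runs[] | {run: .dag_run_id, state: .state}'
```

The same listing works for any of the DAGs above by changing the DAG id in the path.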
The logs of the executions can be accessed by clicking on the corresponding DAG in the dashboard and then selecting the required date and time in the grid. Each task that makes up the DAG has its own logs.
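The same logs can be retrieved over the REST API, which is handy when sharing debugging information. Another sketch under the same assumptions as above, where `<dag-run-id>` and `<task-id>` are placeholders taken from the dashboard or from the previous listing:

```bash
# Fetch the log of one task of an alert_vdu run; the trailing 1 is the try
# number of the task attempt
curl -s -u admin:admin -H "Accept: text/plain" \
  "http://localhost:8080/api/v1/dags/alert_vdu/dagRuns/<dag-run-id>/taskInstances/<task-id>/logs/1"
```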
--
GitLab

From a71f9865f61ea5531e3d734ac85f0d748e5f4f86 Mon Sep 17 00:00:00 2001
From: garciadeblas
Date: Wed, 21 Jun 2023 14:30:57 +0000
Subject: [PATCH 2/3] Update 09-troubleshooting.md

---
 09-troubleshooting.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/09-troubleshooting.md b/09-troubleshooting.md
index 78bcd28..7632db4 100644
--- a/09-troubleshooting.md
+++ b/09-troubleshooting.md
@@ -641,18 +641,18 @@ Please provide some context to your questions. As an example, find below some gu

Common sense applies here, so you don't need to send everything, but just enough information to diagnose the issue and find a proper solution.

-## Debugging errors in NG-SA
+## How to debug errors in the new Service Assurance architecture

-Since OSM release 14, the VNF monitoring architecture is based on Apache Airflow and Prometheus. The Airflow DAGs, in addition to periodically collecting metrics from VIMs and storing them into Prometheus, implement auto-scaling and auto-healing closed-loop operations which are triggered by Prometheus alerts. These alerts are managed by AlertManager and forwarded to Webhook Translator, which re-formats them to adapt them to Airflow. So the alert workflow is this: DAGs collect metrics => Prometheus => AlertManager => Webhook Translator => Alarm driven DAG.
+Since OSM Release FOURTEEN, the Service Assurance architecture is based on Apache Airflow and Prometheus. The Airflow DAGs, in addition to periodically collecting metrics from VIMs and storing them into Prometheus, implement auto-scaling and auto-healing closed-loop operations which are triggered by Prometheus alerts. These alerts are managed by AlertManager and forwarded to Webhook Translator, which re-formats them to adapt them to the webhook endpoints expected by Airflow. So the alert workflow is this: `DAGs collect metrics => Prometheus => AlertManager => Webhook Translator => Alarm driven DAG`

In case of any kind of error related to monitoring, the first thing to check should be the metrics stored in Prometheus. Its graphical interface can be visited at the URL . Some useful metrics to review are the following:

- `ns_topology`: metric generated by a DAG with the current topology (VNFs and NSs) of instantiated VDUs in OSM.
- `vm_status`: status (1: ok, 0: error) of the VMs in the VIMs registered in OSM.
-- `vm_status_extended`: metric enriched from the two previous ones, so it includes data about VIM and OSM in the labels.
+- `vm_status_extended`: metric enriched from the two previous ones, so it includes data about the VNF and NS the VM belongs to as part of the metric labels.
- `osm_*`: resource consumption metrics. Only instantiated VNFs that include monitoring parameters have these kinds of metrics in Prometheus.

-In case you need to debug closed-loop operations, you will also need to check the Prometheus alerts here . On this page you can see the alerts rules and their status: inactive, pending or active. When an alert is fired (its status changes from pending to active) or is marked as resolved (from active to inactive), the appropriate DAG is run on Airflow. There are three types of alerting rules:
+In case you need to debug closed-loop operations, you will also need to check the Prometheus alerts here . On this page you can see the alerting rules and their status: inactive, pending or active. When an alert is fired (its status changes from pending to active) or is marked as resolved (from active to inactive), the appropriate DAG is run on Airflow. There are three types of alerting rules:

- `vdu_down`: this alert is fired when a VDU remains in a non-OK state for several minutes and triggers the `alert_vdu` DAG. Its labels include information about NS, VNF, VIM, etc.
- `scalein_*`: these rules manage scale-in operations based on the resource consumption metrics and the number of VDU instances. They trigger the `scalein_vdu` DAG.

@@ -675,4 +675,4 @@ When you open the URL (`19371` in the example above) in a brow
- `vm_status_vim_*`: these DAGs (one per VIM) get the VM status from the VIM and store it in Prometheus.
- `vm_metrics_vim_*`: these DAGs (one per VIM) store in Prometheus the resource consumption metrics collected from the VIM.

-The logs of the executions can be accessed by clicking on the corresponding DAG in the dashboard and then selecting the required date and time in the grid. Each task that makes up the DAG has its own logs.
+The logs of the executions can be accessed by clicking on the corresponding DAG in the dashboard and then selecting the required date and time in the grid. Each DAG has a set of tasks, and each task has its own logs.

--
GitLab

From f4ec5d12b9a3f69eb9154c6a1a2dec93847b1425 Mon Sep 17 00:00:00 2001
From: garciadeblas
Date: Wed, 21 Jun 2023 14:31:29 +0000
Subject: [PATCH 3/3] Update 09-troubleshooting.md

---
 09-troubleshooting.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/09-troubleshooting.md b/09-troubleshooting.md
index 7632db4..925e7c8 100644
--- a/09-troubleshooting.md
+++ b/09-troubleshooting.md
@@ -641,7 +641,7 @@ Please provide some context to your questions. As an example, find below some gu

Common sense applies here, so you don't need to send everything, but just enough information to diagnose the issue and find a proper solution.

-## How to debug errors in the new Service Assurance architecture
+## How to troubleshoot issues in the new Service Assurance architecture

Since OSM Release FOURTEEN, the Service Assurance architecture is based on Apache Airflow and Prometheus. The Airflow DAGs, in addition to periodically collecting metrics from VIMs and storing them into Prometheus, implement auto-scaling and auto-healing closed-loop operations which are triggered by Prometheus alerts. These alerts are managed by AlertManager and forwarded to Webhook Translator, which re-formats them to adapt them to the webhook endpoints expected by Airflow. So the alert workflow is this: `DAGs collect metrics => Prometheus => AlertManager => Webhook Translator => Alarm driven DAG`

--
GitLab