diff --git a/09-troubleshooting.md b/09-troubleshooting.md
index 81d33799d88c99057e09c40f34eda73dc966f9a5..925e7c83ff245bcb45f2e6269737291dad30731b 100644
--- a/09-troubleshooting.md
+++ b/09-troubleshooting.md
@@ -464,7 +464,7 @@ You can make deployment of charms quicker by:
 - **CAUSE**: Normally a migration from release FIVE do not set properly the env for LCM
 - **SOLUTION**: Ensure variable **OSMLCM_VCA_PUBKEY** is properly set at file `/etc/osm/docker/lcm.env`. The value must match with the output of this command `cat $HOME/.local/share/juju/ssh/juju_id_rsa.pub`. If not, add or change it. Restart OSM, or just LCM service with `docker service update osm_lcm --force --env-add OSMLCM_VCA_PUBKEY=""`
 
-## Common issues whwn interacting with NBI
+## Common issues when interacting with NBI
 
 ### SSL certificate problem
 
@@ -640,3 +640,39 @@ Please provide some context to your questions. As an example, find below some gu
 - NAT rules in the machine where OSM is running
 
 Common sense applies here, so you don't need to send everything, but just enough information to diagnose the issue and find a proper solution.
+
+## How to troubleshoot issues in the new Service Assurance architecture
+
+Since OSM Release FOURTEEN, the Service Assurance architecture is based on Apache Airflow and Prometheus. The Airflow DAGs, in addition to periodically collecting metrics from VIMs and storing them in Prometheus, implement auto-scaling and auto-healing closed-loop operations that are triggered by Prometheus alerts. These alerts are managed by AlertManager and forwarded to Webhook Translator, which reformats them to match the webhook endpoints that Airflow expects. The alert workflow is therefore: `DAGs collect metrics => Prometheus => AlertManager => Webhook Translator => Alarm driven DAG`
+
+If any monitoring-related error occurs, first check the metrics stored in Prometheus. Its graphical interface can be visited at the URL .
Some useful metrics to review are the following:
+
+- `ns_topology`: metric generated by a DAG with the current topology (VNFs and NSs) of instantiated VDUs in OSM.
+- `vm_status`: status (1: ok, 0: error) of the VMs in the VIMs registered in OSM.
+- `vm_status_extended`: metric built by enriching the two previous ones, so its labels include data about the VNF and NS that the VM belongs to.
+- `osm_*`: resource consumption metrics. Only instantiated VNFs that include monitoring parameters have this kind of metric in Prometheus.
+
+If you need to debug closed-loop operations, you will also need to check the Prometheus alerts here . On this page you can see the alerting rules and their status: inactive, pending or active. When an alert is fired (its status changes from pending to active) or is marked as resolved (from active to inactive), the appropriate DAG is run on Airflow. There are three types of alerting rules:
+
+- `vdu_down`: this alert fires when a VDU remains in a non-OK state for several minutes, and it triggers the `alert_vdu` DAG. Its labels include information about NS, VNF, VIM, etc.
+- `scalein_*`: these rules manage scale-in operations based on the resource consumption metrics and the number of VDU instances. They trigger the `scalein_vdu` DAG.
+- `scaleout_*`: these rules manage scale-out operations based on the resource consumption metrics and the number of VDU instances. They trigger the `scaleout_vdu` DAG.
+
+Finally, it is also useful for debugging to view the logs of the DAG executions.
To do this, visit the Airflow website, which is accessible on the node port exposed by the `airflow-webserver` service in OSM's cluster (the port is not fixed):
+
+```bash
+kubectl -n osm get svc airflow-webserver
+NAME                TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
+airflow-webserver   NodePort   10.100.57.168                 8080:19371/TCP   12d
+```
+
+When you open the URL (port `19371` in the example above) in a browser, you will be prompted for the user and password (`admin`/`admin` by default). After that you will see the dashboard with the list of DAGs:
+
+- `alert_vdu`: executed when a VDU down alarm is fired or resolved.
+- `scalein_vdu`, `scaleout_vdu`: executed when auto-scaling conditions in a VNF are met.
+- `ns_topology`: executed periodically to update the topology metric of the instantiated NSs in Prometheus.
+- `vim_status_*`: there is one such DAG for each VIM in OSM. It checks the VIM's reachability every few minutes.
+- `vm_status_vim_*`: these DAGs (one per VIM) get the VM status from the VIM and store it in Prometheus.
+- `vm_metrics_vim_*`: these DAGs (one per VIM) store resource consumption metrics from the VIM in Prometheus.
+
+The logs of the executions can be accessed by clicking on the corresponding DAG in the dashboard and then selecting the required date and time in the grid. Each DAG has a set of tasks, and each task has its own logs.
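
Reviewer suggestion (not part of the patch): because the Airflow port is not fixed, the lookup above could be scripted. The following is a minimal sketch, not an official OSM tool. The function names `node_port_from_ports` and `airflow_node_port` are hypothetical, and it assumes that the first port entry of the `airflow-webserver` service is the web UI port, as in the example output.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: these function names are illustrative, not part of OSM.

# Extract the node port from a PORT(S) value such as "8080:19371/TCP".
node_port_from_ports() {
  local ports="$1"
  ports="${ports#*:}"            # drop the service port and the colon
  printf '%s\n' "${ports%%/*}"   # drop the protocol suffix
}

# Ask kubectl for the node port directly (requires access to the OSM cluster).
# Assumption: the first port entry of the service is the Airflow web UI.
airflow_node_port() {
  kubectl -n osm get svc airflow-webserver \
    -o jsonpath='{.spec.ports[0].nodePort}'
}
```

`node_port_from_ports` only parses text that was already copied from the `kubectl` output, so it works without cluster access. `airflow_node_port` queries the service itself and prints the port to append to the address of a cluster node.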