9. ANNEX 1: Troubleshooting

9.1. How to know the version of your current OSM installation

Run the following command to know the version of OSM client and OSM NBI:

osm version

In some circumstances, it can be useful to query the osm-devops package installed in your system, since osm-devops is the package used to drive installations:

dpkg -l osm-devops

||/ Name                     Version            Architecture          Description
+++-======================-=================-=====================-=====================================
ii  osm-devops             8.0.0-1            all

To know the current version of the OSM client, you can also query the python3-osmclient package:

dpkg -l python3-osmclient
||/ Name                     Version            Architecture          Description
+++-======================-=================-=====================-=====================================
ii  python3-osmclient       8.0.0-1            all

9.2. Troubleshooting installation

9.2.1. Recommended installation to facilitate troubleshooting

It is highly recommended to save a log of your installation:

./install_osm.sh 2>&1 | tee osm_install_log.txt

9.2.2. Recommended checks after installation

9.2.2.1. Checking whether all processes/services are running in K8s

kubectl -n osm get all

All the deployments and statefulsets should have 1 replica, shown as 1/1 in the READY column.
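As a quick way to spot workloads that are not fully ready, the READY column can be filtered with a small helper. This is only a sketch: the function name list_not_ready is an illustrative assumption, and it relies on READY being the second column of the kubectl output.

```shell
#!/bin/bash
# Print the workloads whose READY column (<ready>/<desired>) is not satisfied.
# Input: lines in the form "<name> <ready>/<desired> ...", as printed by
#   kubectl -n osm get deployments,statefulsets --no-headers
list_not_ready() {
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1, $2 }'
}

# Usage (requires access to the cluster):
#   kubectl -n osm get deployments,statefulsets --no-headers | list_not_ready
```

An empty output means that every deployment and statefulset reports all its replicas as ready.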

9.2.3. Issues on standard installation

9.2.3.1. Juju

9.2.3.1.1. Juju bootstrap hangs

If the Juju bootstrap takes a long time, stuck at this status…

Installing Juju agent on bootstrap instance
Fetching Juju GUI 2.14.0
Waiting for address
Attempting to connect to 10.71.22.78:22
Connected to 10.71.22.78
Running machine configuration script...

…it usually indicates that the LXD container with the Juju controller is having trouble connecting to the internet.

Get the name of the LXD container. It will begin with ‘juju-’ and end with ‘-0’.

lxc list
+-----------------+---------+---------------------+------+------------+-----------+
|      NAME       |  STATE  |        IPV4         | IPV6 |    TYPE    | SNAPSHOTS |
+-----------------+---------+---------------------+------+------------+-----------+
| juju-0383f2-0   | RUNNING | 10.195.8.57 (eth0)  |      | PERSISTENT |           |
+-----------------+---------+---------------------+------+------------+-----------+

Next, tail the output of cloud-init to see where the bootstrap is stuck.

lxc exec juju-0383f2-0 -- tail -f /var/log/cloud-init-output.log
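The two steps above can be combined so that the container name does not have to be copied by hand. This is a minimal sketch, assuming an LXD version whose lxc list supports the -c n and --format csv options; the function name find_juju_controller is hypothetical.

```shell
#!/bin/bash
# Keep only the LXD container names that look like a Juju controller:
# they begin with "juju-" and end with "-0".
# Input: one container name per line, for example from
#   lxc list -c n --format csv
find_juju_controller() {
    grep -E '^juju-.+-0$'
}

# Usage (on the OSM host):
#   name=$(lxc list -c n --format csv | find_juju_controller | head -n 1)
#   lxc exec "$name" -- tail -f /var/log/cloud-init-output.log
```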

9.2.3.1.2. Is Juju running?

If running, you should see something like this:

$ juju status

Model    Controller  Cloud/Region         Version  SLA
default  osm         localhost/localhost  2.3.7    unsupported

9.2.3.1.3. ERROR controller osm already exists

Did the OSM installation fail during the Juju installation with an error like “ERROR controller osm already exists”?

$ ./install_osm.sh
...
ERROR controller "osm" already exists
ERROR try was stopped

### Jum Agu 24 15:19:33 WIB 2018 install_juju: FATAL error: Juju installation failed
BACKTRACE:
### FATAL /usr/share/osm-devops/jenkins/common/logging 39
### install_juju /usr/share/osm-devops/installers/full_install_osm.sh 564
### install_lightweight /usr/share/osm-devops/installers/full_install_osm.sh 741
### main /usr/share/osm-devops/installers/full_install_osm.sh 1033

Try to destroy the Juju controller and run the installation again:

$ juju destroy-controller osm --destroy-all-models -y
$ ./install_osm.sh

If that does not work, you can destroy the Juju container and run the installation again:

#Destroy the Juju container
lxc stop juju-*          #replace "*" by the right values
lxc delete juju-*        #replace "*" by the right values
#Unregister the controller since we’ve manually freed the resources associated with it
juju unregister -y osm
#Verify that there are no controllers
juju list-controllers
#Run the installation again
./install_osm.sh

9.2.3.1.4. No controllers registered

The following error appears when the user used for installation does not belong to some groups:

Finished installation of juju
Password: sg: failed to crypt password with previous salt: Invalid argument
ERROR No controllers registered.

To fix it, add the non-root user used for the installation to the sudo, lxd and docker groups.
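The group membership can be checked before running the installer. The helper below is a sketch under the assumption that the three groups named above are the only ones required; the function name missing_groups is hypothetical.

```shell
#!/bin/bash
# Print the required groups (sudo, lxd, docker) that are missing from a
# space-separated list of group names.
# $1: group list, as printed by `id -nG <user>`
missing_groups() {
    local have=" $1 " g
    for g in sudo lxd docker; do
        case "$have" in
            *" $g "*) ;;          # group already present
            *) echo "$g" ;;       # group missing
        esac
    done
}

# Usage:
#   for g in $(missing_groups "$(id -nG "$USER")"); do
#       sudo usermod -aG "$g" "$USER"
#   done
```

New group memberships only take effect in a new login session, so log out and back in before re-running the installer.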

9.2.3.2. LXD

9.2.3.2.1. ERROR profile default: `/etc/default/lxd-bridge` has IPv6 enabled

Make sure that you follow the instructions in the Quickstart.

When asked if you want to proceed with the installation and configuration of LXD, juju, docker CE and the initialization of a local docker swarm, as pre-requirements, please answer “y”.

When dialog messages related to LXD configuration are shown, please answer in the following way:

Do you want to configure the LXD bridge? Yes
Do you want to setup an IPv4 subnet? Yes
<< Default values apply for next questions >>
Do you want to setup an IPv6 subnet? No

9.2.3.3. Docker Swarm

9.2.3.3.1. `network netosm could not be found`

The error is: network "netosm" is declared as external, but could not be found. You need to create a swarm-scoped network before the stack is deployed.

It usually happens when a docker system prune is done with the stack stopped. The following script will create it:

 #!/bin/bash
 # Create OSM Docker Network ...
 [ -z "$OSM_STACK_NAME" ] && OSM_STACK_NAME=osm
 OSM_NETWORK_NAME=net${OSM_STACK_NAME}
 echo Creating OSM Docker Network
 DEFAULT_INTERFACE=$(route -n | awk '$1~/^0.0.0.0/ {print $8}')
 DEFAULT_MTU=$(ip addr show $DEFAULT_INTERFACE | perl -ne 'if (/mtu\s(\d+)/) {print $1;}')
 echo \# OSM_STACK_NAME = $OSM_STACK_NAME
 echo \# OSM_NETWORK_NAME = $OSM_NETWORK_NAME
 echo \# DEFAULT_INTERFACE = $DEFAULT_INTERFACE
 echo \# DEFAULT_MTU = $DEFAULT_MTU
 sg docker -c "docker network create --driver=overlay --attachable \
                --opt com.docker.network.driver.mtu=${DEFAULT_MTU} \
                ${OSM_NETWORK_NAME}"

9.2.4. Issues on advanced installation (manual build of docker images)

9.2.4.1. Manual build of images. Were all docker images successfully built?

Although controlled by the installer, you can check that the following images exist:

$ docker image ls

REPOSITORY               TAG                 IMAGE ID            CREATED             SIZE
osm/ng-ui                latest              1988aa262a97        18 hours ago        710MB
osm/lcm                  latest              c9ad59bf96aa        46 hours ago        667MB
osm/ro                   latest              812c987fcb16        46 hours ago        791MB
osm/nbi                  latest              584b4e0084a7        46 hours ago        497MB
osm/pm                   latest              1ad1e4099f52        46 hours ago        462MB
osm/mon                  latest              b17efa3412e3        46 hours ago        725MB
wurstmeister/kafka       latest              7cfc4e57966c        10 days ago         293MB
mysql                    5                   0d16d0a97dd1        2 weeks ago         372MB
mongo                    latest              14c497d5c758        3 weeks ago         366MB
wurstmeister/zookeeper   latest              351aa00d2fe9        18 months ago       478MB

9.2.4.2. Docker image failed to build

9.2.4.2.1. Err:1 `http://archive.ubuntu.com/ubuntu xenial InRelease`

In some cases, DNS resolution works on the host but fails when building the Docker container. This happens when Docker does not automatically determine the DNS server to use.

Check if the following works:

docker run busybox nslookup archive.ubuntu.com

If it does not work, you have to configure Docker to use the available DNS.

# Get the IP address you’re using for DNS:
nmcli dev show | grep 'IP4.DNS'
# Create a new file, /etc/docker/daemon.json, that contains the following (replace the DNS IP address with the output from the previous step):
{
   "dns": ["192.168.24.10"]
}
# Restart docker
sudo service docker restart
# Re-run
docker run busybox nslookup archive.ubuntu.com
# Now you should be able to re-run the installer and move past the DNS issue.
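The content of daemon.json can also be derived from the nmcli output instead of being typed by hand. This is a sketch only: the function name dns_daemon_json is hypothetical, and it assumes that nmcli prints lines of the form "IP4.DNS[1]: <address>".

```shell
#!/bin/bash
# Build the content of /etc/docker/daemon.json from the DNS servers
# reported by NetworkManager.
# Input: output of `nmcli dev show`
dns_daemon_json() {
    awk '/^IP4\.DNS/ { ips = ips (ips ? ", " : "") "\"" $2 "\"" }
         END { if (ips) printf "{\n   \"dns\": [%s]\n}\n", ips }'
}

# Usage (review the result before writing it):
#   nmcli dev show | dns_daemon_json
#   nmcli dev show | dns_daemon_json | sudo tee /etc/docker/daemon.json
```

Writing the file this way replaces any existing /etc/docker/daemon.json, so merge the "dns" entry by hand if the file already holds other settings.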

9.2.4.2.2. TypeError: `unsupported operand type(s) for -=: 'Retry' and 'int'`

In some cases, an MTU mismatch between the host and docker interfaces will cause this error while running pip. You can check this by running ifconfig and comparing the MTU of your host interface with that of the docker_gwbridge interface.

# Create a new file, /etc/docker/daemon.json, that contains the following (replace the MTU value with that of your host interface from the previous step):
{
   "mtu": 1458
}
# Restart docker
sudo service docker restart
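The MTU comparison described above can be scripted. The helper below is a sketch; the function name mtu_of is hypothetical and eth0 in the usage lines is only an example host interface name.

```shell
#!/bin/bash
# Extract the MTU value from the description of one network interface.
# Input: output of `ip link show <interface>` or `ifconfig <interface>`
mtu_of() {
    sed -n 's/.* mtu \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage:
#   host_mtu=$(ip link show eth0 | mtu_of)
#   docker_mtu=$(ip link show docker_gwbridge | mtu_of)
#   [ "$host_mtu" = "$docker_mtu" ] || echo "MTU mismatch: $host_mtu vs $docker_mtu"
```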

9.3. Common issues with VIMs

9.3.1. Is the VIM URL reachable and operational?

When there are problems accessing the VIM URL, an error message similar to the following is shown after attempts to instantiate network services:

Error: "VIM Exception vimmconnConnectionException ConnectFailure: Unable to establish connection to <URL>"

In order to debug potential issues with the connection, in the case of an OpenStack VIM, you can install the OpenStack client in the OSM VM and run some basic tests. For example:

# Install the OpenStack client
sudo apt-get install python-openstackclient
# Load your OpenStack credentials. For instance, if your credentials are saved in a file named 'myVIM-openrc.sh', you can load them with:
source myVIM-openrc.sh
# Test if the VIM API is operational with a simple command. For instance:
openstack image list

If the openstack client works, then make sure that you can reach the VIM from the RO container:

# If running OSM on top of docker swarm, go to the container in docker swarm
docker exec -it osm_ro.1.xxxxx bash
# If running OSM on top of K8s, go to the RO deployment in kubernetes
kubectl -n osm exec -it deployment/ro -- bash
curl <URL_CONTROLLER>

In some cases, the errors come from the fact that the VIM was added to OSM using names in the URL that are not Fully Qualified Domain Names (FQDN).

When adding a VIM to OSM, you must always use an FQDN or an IP address. Note that “controller” or similar names are not proper FQDNs (the suffix should be added). Non-FQDN names might be understood by docker’s dnsmasq as a docker container name to be resolved, which is not the case. In addition, all the VIM endpoints should also be FQDNs or IP addresses, thus guaranteeing that all subsequent API calls can reach the appropriate endpoint.

Think of an NFV infrastructure with tens of VIMs: first you would have to use a different name for each controller (controller1, controller2, etc.), and then you would have to add all those entries to the /etc/hosts file of every machine that interacts with the different VIMs, not only OSM. This is bad practice.
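A VIM URL can be checked for a non-qualified host name before it is registered in OSM. This is a rough sketch, not an exhaustive validator: the function name url_host_is_qualified is hypothetical, it treats any host containing a dot as acceptable, and it does not handle IPv6 literals or user information in the URL.

```shell
#!/bin/bash
# Succeed if the host part of a URL is an IPv4 address or a dotted name;
# fail for bare names such as "controller".
# $1: URL, e.g. http://controller:5000/v3
url_host_is_qualified() {
    local host
    # Strip the scheme, then everything from the first ":" or "/" onwards
    host=$(printf '%s\n' "$1" | sed -E 's#^[a-zA-Z]+://##; s#[:/].*$##')
    case "$host" in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Usage:
#   url_host_is_qualified "http://controller:5000/v3" || echo "use an FQDN or an IP address"
```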

However, it is useful to have a means to work with lab environments using non-FQDN names. There are three options. You are probably looking for the third one, but we recommend the first one:

Option 1. Change the admin URL and/or public URL of the endpoints to use an IP address or an FQDN. You might find this interesting if you want to bring your Openstack setup to production.
Option 2. Modify /etc/hosts in the docker RO container. This is not persistent after reboots or restarts.
Option 3a (for docker swarm). Modify /etc/osm/docker/docker-compose.yaml in the host, adding extra_hosts in the ro section with the entries that you want to add to /etc/hosts in the RO container.
Option 3b (for kubernetes). Modify /etc/osm/docker/osm_pods/ro.yaml in the host, adding hostAliases to the pod spec of the RO deployment with the entries that you want to add to /etc/hosts in the RO container.

With docker swarm, the modification of /etc/osm/docker/docker-compose.yaml would be:

ro:
  extra_hosts:
    controller: 1.2.3.4

Then:

docker stack rm osm
docker stack deploy -c /etc/osm/docker/docker-compose.yaml osm

With kubernetes, the procedure is very similar. The modification of /etc/osm/docker/osm_pods/ro.yaml would be:

...
spec:
  ...
  hostAliases:
  - ip: "1.2.3.4"
    hostnames:
    - "controller"
  ...

Then:

kubectl -n osm apply -f /etc/osm/docker/osm_pods/ro.yaml

This is persistent after reboots and restarts.

9.3.2. VIM authentication

What should I check if the VIM authentication is failing?

Typically, you will get the following error message:

Error: "VIM Exception vimconnUnexpectedResponse Unauthorized: The request you have made requires authentication. (HTTP 401)"

If your OpenStack URL is based on HTTPS, OSM will check by default the authenticity of your VIM using the appropriate public certificate. The recommended way to solve this is by modifying /etc/osm/docker/docker-compose.yaml in the host, sharing the host file (e.g. /home/ubuntu/cafile.crt) by adding a volume to the ro section as follows:

 ro:
   ...
   volumes:
     - /home/ubuntu/cafile.crt:/etc/osm/cafile.crt

Then, when creating the VIM, you should use the config option ca_cert as follows:

$ # Create the VIM with all the usual options, and add the config option to specify the certificate
$ osm vim-create VIM-NAME ... --config '{ca_cert: /etc/osm/cafile.crt}'

For casual testing, when adding the VIM account to OSM, you can use 'insecure: True' (without quotes) as part of the VIM config parameters:

$ osm vim-create VIM-NAME ... --config '{insecure: True}'

9.3.3. Issues when trying to access VM from OSM

Is the VIM management network reachable from OSM (e.g. via ssh, port 22)?

The simplest check consists of deploying a VM attached to the management network and trying to access it, e.g. via ssh, from the OSM host.

For instance, in the case of an OpenStack VIM you could try something like this:

$ openstack server create --image ubuntu --flavor m1.small --network mgmtnet test

If this does not work, typically it is due to one of these issues:

- Security group policy in your VIM is blocking your traffic (contact your admin to fix it).
- IP address space in the management network is not routable from outside (or in the reverse direction, for the ACKs).

9.4. Common issues with VCA/Juju

9.4.1. Juju status shows pending objects after deleting a NS

In extraordinary situations, the output of juju status could show pending units that should have been removed when deleting a NS. In those situations, you can clean up VCA by following the procedure below:

juju status -m <NS_ID>
juju remove-application -m <NS_ID> <application>
# You will likely have to run the next command several times, as the next queued hook will probably fail as well.
# Once the last hook is marked resolved, the charm will continue its removal.
juju resolved -m <NS_ID> <unit> --no-retry

The following page also shows how to remove different Juju objects

9.4.2. Dump Juju Logs

To dump the Juju debug-logs, run these commands:

juju debug-log --replay --no-tail > juju-debug.log
juju debug-log --replay --no-tail -m <NS_ID>
juju debug-log --replay --no-tail -m <NS_ID> --include <UNIT>

9.4.3. Manual recovery of Juju

If juju gets into a corrupt state and you cannot run juju status or contact the juju controller, you might need to manually remove the controller and register it again, making OSM aware of the new controller.

# Stop and delete all juju containers, then unregister the controller
lxc list
lxc stop juju-*          #replace "*" by the right values
lxc delete juju-*        #replace "*" by the right values
juju unregister -y osm

# Create the controller again
sg lxd -c "juju bootstrap --bootstrap-series=xenial localhost osm"

# Get controller IP and update it in relevant OSM env files
controller_ip=$(juju show-controller osm|grep api-endpoints|awk -F\' '{print $2}'|awk -F\: '{print $1}')
sudo sed -i 's/^OSMMON_VCA_HOST.*$/OSMMON_VCA_HOST='$controller_ip'/' /etc/osm/docker/mon.env
sudo sed -i 's/^OSMLCM_VCA_HOST.*$/OSMLCM_VCA_HOST='$controller_ip'/' /etc/osm/docker/lcm.env

#Get juju password and feed it to OSM env files
function parse_juju_password {
   password_file="${HOME}/.local/share/juju/accounts.yaml"
   local controller_name=$1
   local s='[[:space:]]*' w='[a-zA-Z0-9_-]*' fs=$(echo @|tr @ '\034')
   sed -ne "s|^\($s\):|\1|" \
        -e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
        -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" $password_file |
   awk -F$fs -v controller=$controller_name '{
      indent = length($1)/2;
      vname[indent] = $2;
      for (i in vname) {if (i > indent) {delete vname[i]}}
      if (length($3) > 0) {
         vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
         if (match(vn,controller) && match($2,"password")) {
             printf("%s",$3);
         }
      }
   }'
}
juju_password=$(parse_juju_password osm)
sudo sed -i 's/^OSMMON_VCA_SECRET.*$/OSMMON_VCA_SECRET='$juju_password'/' /etc/osm/docker/mon.env
sudo sed -i 's/^OSMLCM_VCA_SECRET.*$/OSMLCM_VCA_SECRET='$juju_password'/' /etc/osm/docker/lcm.env

juju_pubkey=$(cat "$HOME/.local/share/juju/ssh/juju_id_rsa.pub")
# The key contains spaces and may contain "/", so quote it and use "|" as the sed delimiter
sudo sed -i "s|^OSMLCM_VCA_PUBKEY.*$|OSMLCM_VCA_PUBKEY=${juju_pubkey}|" /etc/osm/docker/mon.env
sudo sed -i "s|^OSMLCM_VCA_PUBKEY.*$|OSMLCM_VCA_PUBKEY=${juju_pubkey}|" /etc/osm/docker/lcm.env

#Restart OSM stack
docker stack rm osm
docker stack deploy -c /etc/osm/docker/docker-compose.yaml osm
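The repeated sed substitutions in the procedure above only change a variable that is already present in the env file. As an alternative sketch, the helper below replaces the line or appends it when it is missing; the function name set_env_var is hypothetical and is not part of the OSM tooling.

```shell
#!/bin/bash
# Set KEY=VALUE in an env file: replace the existing line, or append a new
# one if the key is not there yet. The value is passed through the
# environment, so "/", "+", "=" and spaces (all common in SSH public keys)
# need no escaping.
# $1: env file   $2: variable name   $3: value
set_env_var() {
    local file="$1" key="$2" value="$3" tmp rc
    tmp=$(mktemp) || return 1
    KEY="$key" VALUE="$value" awk -F= '
        $1 == ENVIRON["KEY"] { print ENVIRON["KEY"] "=" ENVIRON["VALUE"]; found = 1; next }
        { print }
        END { if (!found) print ENVIRON["KEY"] "=" ENVIRON["VALUE"] }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Usage (from a root shell, since the files live under /etc/osm/docker):
#   set_env_var /etc/osm/docker/lcm.env OSMLCM_VCA_HOST "$controller_ip"
#   set_env_var /etc/osm/docker/lcm.env OSMLCM_VCA_SECRET "$juju_password"
```

The file is rewritten in place with cat, so its ownership and permissions are preserved.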

9.4.4. Slow deployment of charms

You can make deployment of charms quicker by:

- Upgrading your LXD installation to use ZFS: LXD configuration for OSM Release FIVE
  - After LXD re-installation, you might need to reinstall the juju controller: Reinstall Juju controller
- Preventing Juju from running apt-get update && apt-get upgrade when starting a machine: Disable OS upgrades in charms
- Periodically building a custom image that will be used as base image for all the charms: Custom base image for charms

9.5. Common instantiation errors

9.5.1. File juju_id_rsa.pub not found

ERROR: ERROR creating VCA model name 'xxxx': Traceback (most recent call last): File "/usr/lib/python3/dist-packages/osm_lcm/ns.py", line 822, in instantiate await ... [Errno 2] No such file or directory: '/root/.local/share/juju/ssh/juju_id_rsa.pub'
CAUSE: Normally, a migration from release FIVE does not properly set the env for LCM.
SOLUTION: Ensure that the variable OSMLCM_VCA_PUBKEY is properly set in the file /etc/osm/docker/lcm.env. Its value must match the output of the command cat $HOME/.local/share/juju/ssh/juju_id_rsa.pub. If it does not, add or change it. Then restart OSM, or just the LCM service, with docker service update osm_lcm --force --env-add OSMLCM_VCA_PUBKEY="" (placing the key value between the quotes).
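The comparison described in the solution can be done with a short helper. This is a sketch that assumes the value is stored unquoted in the env file; the function name lcm_pubkey_matches is hypothetical.

```shell
#!/bin/bash
# Succeed if OSMLCM_VCA_PUBKEY in the env file is set and equal to the
# content of the public key file.
# $1: env file (e.g. /etc/osm/docker/lcm.env)   $2: public key file
lcm_pubkey_matches() {
    local env_file="$1" key_file="$2" current
    current=$(sed -n 's/^OSMLCM_VCA_PUBKEY=//p' "$env_file" | head -n 1)
    [ -n "$current" ] && [ "$current" = "$(cat "$key_file")" ]
}

# Usage:
#   lcm_pubkey_matches /etc/osm/docker/lcm.env \
#       "$HOME/.local/share/juju/ssh/juju_id_rsa.pub" || echo "OSMLCM_VCA_PUBKEY needs fixing"
```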

9.6. Common issues when interacting with NBI

9.6.1. SSL certificate problem

By default, OSM installer uses a self-signed certificate for HTTPS. That might lead to the error ‘SSL certificate problem: self signed certificate’ on the client side. For testing environments, you might want to ignore this error just by using the appropriate options to skip certificate validation (e.g. --insecure for curl, --no-check-certificate for wget, etc.). However, for more stable setups you might prefer to address this issue by installing the appropriate certificate in your client system.

These are the steps to install NBI certificate on the client side (tested for Ubuntu):

Get the certificate file cert.pem by any of these means:

From running docker container:

docker ps | grep nbi
docker cp <docker-id>:/app/NBI/osm_nbi/http/cert.pem .

From source code: NBI-folder/osm_nbi/http/cert.pem

From ETSI’s git:

wget -O cert.pem "https://osm.etsi.org/gitweb/?p=osm/NBI.git;a=blob_plain;f=osm_nbi/http/cert.pem;hb=refs/heads/v8.0"

Then, you should install this certificate:

  sudo cp cert.pem /usr/local/share/ca-certificates/osm_nbi_cert.pem.crt
  sudo update-ca-certificates
  # 1 added, 0 removed; done

Add to the list of /etc/hosts a host called “nbi” with the IP address where OSM is running.
- It can be localhost if client and server are the same machine.
- For localhost, you would need to add (or edit) these lines:
```
  127.0.0.1     localhost       nbi
  OSM-ip        nbi
```
Finally, for the URL, use nbi as the host name (i.e. https://nbi:9999/osm).
- Do not use localhost or 127.0.0.1.
- You can run a quick test with curl by:
```
curl https://nbi:9999/osm/version
```
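Before running the curl test, it can help to confirm that the “nbi” entry is really present in the hosts file. The helper below is a sketch; the function name has_host_entry is hypothetical.

```shell
#!/bin/bash
# Succeed if a hosts file maps some address to the given host name.
# $1: host name to look for   $2: hosts file (defaults to /etc/hosts)
has_host_entry() {
    awk -v h="$1" '
        /^[[:space:]]*#/ { next }                               # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }    # names follow the address
        END { exit !found }
    ' "${2:-/etc/hosts}"
}

# Usage:
#   has_host_entry nbi || echo "add an nbi entry to /etc/hosts"
```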

9.6.2. Cannot login after migration to 6.0.2

ERROR: NBI always returns “UNAUTHORIZED”. You cannot log in with either the UI or the CLI. The CLI shows the error “can't find a default project for this user” or “project admin not allowed for this user”.
CAUSE: This normally happens after a migration to release 6.0.2, because there is a slight incompatibility with users created in older versions.
SOLUTION: Delete the admin user and restart NBI so that a new compatible user is created, by running these commands:

curl --insecure https://localhost:9999/osm/test/db-clear/users
docker service update  osm_nbi --force

9.7. Other operational issues

9.7.1. Running out of disk space

If you upgrade your OSM installation frequently, your disk might run out of space, because the previous containers and docker images keep consuming disk space. Running the following two commands should be enough to clean up your docker setup:

docker system prune
docker image prune

If you are still experiencing issues with disk space, the logs of one of the containers could be the cause. Check which containers are consuming the most space (typically kafka-exporter):

du -sk /var/lib/docker/containers/* |sort -n
docker ps |grep <CONTAINER_ID>
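Since docker ps shows shortened container IDs while the directory names are the full IDs, the du output can be reduced to the largest entries with their 12-character IDs. This is a sketch; the function name largest_containers is hypothetical.

```shell
#!/bin/bash
# Print the N largest container directories as "<KiB> <12-char container ID>".
# Input: output of `du -sk /var/lib/docker/containers/*`
# $1: number of entries to print (defaults to 3)
largest_containers() {
    sort -rn | head -n "${1:-3}" |
        awk '{ n = split($2, p, "/"); print $1, substr(p[n], 1, 12) }'
}

# Usage:
#   sudo du -sk /var/lib/docker/containers/* | largest_containers 3
#   docker ps | grep <CONTAINER_ID>
```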

Then, remove the stack and redeploy it after doing a prune:

docker stack rm osm_metrics
docker system prune
docker image prune
docker stack deploy -c /etc/osm/docker/osm_metrics/docker-compose.yml osm_metrics

9.8. Logs

9.8.1. Checking the logs of OSM in Kubernetes

You can check the logs of any container with the following commands:

kubectl -n osm logs deployment/mon --all-containers=true
kubectl -n osm logs deployment/pol --all-containers=true
kubectl -n osm logs deployment/lcm --all-containers=true
kubectl -n osm logs deployment/nbi --all-containers=true
kubectl -n osm logs deployment/ng-ui --all-containers=true
kubectl -n osm logs deployment/ro --all-containers=true
kubectl -n osm logs deployment/grafana --all-containers=true
kubectl -n osm logs deployment/keystone --all-containers=true
kubectl -n osm logs statefulset/mysql --all-containers=true
kubectl -n osm logs statefulset/mongo --all-containers=true
kubectl -n osm logs statefulset/kafka --all-containers=true
kubectl -n osm logs statefulset/zookeeper --all-containers=true
kubectl -n osm logs statefulset/prometheus --all-containers=true

For live debugging, the following commands can be useful to save the log output to a file and show it on the screen:

kubectl -n osm logs -f deployment/mon --all-containers=true 2>&1 | tee mon-log.txt
kubectl -n osm logs -f deployment/pol --all-containers=true 2>&1 | tee pol-log.txt
kubectl -n osm logs -f deployment/lcm --all-containers=true 2>&1 | tee lcm-log.txt
kubectl -n osm logs -f deployment/nbi --all-containers=true 2>&1 | tee nbi-log.txt
kubectl -n osm logs -f deployment/ng-ui --all-containers=true 2>&1 | tee ng-log.txt
kubectl -n osm logs -f deployment/ro --all-containers=true 2>&1 | tee ro-log.txt
kubectl -n osm logs -f deployment/grafana --all-containers=true 2>&1 | tee grafana-log.txt
kubectl -n osm logs -f deployment/keystone --all-containers=true 2>&1 | tee keystone-log.txt
kubectl -n osm logs -f statefulset/mysql --all-containers=true 2>&1 | tee mysql-log.txt
kubectl -n osm logs -f statefulset/mongo --all-containers=true 2>&1 | tee mongo-log.txt
kubectl -n osm logs -f statefulset/kafka --all-containers=true 2>&1 | tee kafka-log.txt
kubectl -n osm logs -f statefulset/zookeeper --all-containers=true 2>&1 | tee zookeeper-log.txt
kubectl -n osm logs -f statefulset/prometheus --all-containers=true 2>&1 | tee prometheus-log.txt
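When a one-off snapshot of all the logs is needed (for instance, to attach to a bug report), the per-workload commands can be wrapped in a loop. This is a sketch: the function name dump_osm_logs and the KUBECTL override variable are hypothetical, and the workload list simply mirrors the commands above.

```shell
#!/bin/bash
# Save the current logs of every OSM workload, one file per workload.
# $1: output directory (defaults to the current directory)
# KUBECTL can be set to use a different kubectl binary or wrapper.
dump_osm_logs() {
    local outdir="${1:-.}" w
    for w in deployment/mon deployment/pol deployment/lcm deployment/nbi \
             deployment/ng-ui deployment/ro deployment/grafana deployment/keystone \
             statefulset/mysql statefulset/mongo statefulset/kafka \
             statefulset/zookeeper statefulset/prometheus; do
        # ${w##*/} strips the "deployment/" or "statefulset/" prefix
        ${KUBECTL:-kubectl} -n osm logs "$w" --all-containers=true \
            > "$outdir/${w##*/}-log.txt" 2>&1
    done
}

# Usage:
#   mkdir -p osm-logs && dump_osm_logs osm-logs
```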

9.8.2. Changing the log level

You can change the log level of any container by updating it with the right LOG_LEVEL environment variable.

Log levels are:

ERROR
WARNING
INFO
DEBUG

For instance, to set the log level to INFO for the MON in a deployment of OSM over K8s:

kubectl -n osm set env deployment mon OSMMON_GLOBAL_LOGLEVEL=INFO

For instance, to increase the log level to DEBUG for the NBI in a deployment of OSM over docker swarm:

docker service update --env-add OSMNBI_LOG_LEVEL=DEBUG osm_nbi

9.9. How to report an issue

If you have bugs or issues to report, please use Bugzilla.

If you have questions or feedback, feel free to contact us through:

the mailing list OSM_TECH@list.etsi.org
the Slack work space

Please be patient. Answers may take a few days.

Please provide some context to your questions. As an example, find below some guidelines:

In case of an installation issue:
- The full command used to run the installer and the full output of the installer (or at least enough context) might help in finding the solution.
- It is highly recommended to run the installer command capturing standard output and standard error, so that you can send them for analysis if needed. E.g.:

./install_osm.sh 2>&1 | tee osm_install.log

In case of operational issues, the following information might help:
- Version of OSM that you are using
- Logs of the system. Check https://osm.etsi.org/wikipub/index.php/Common_issues_and_troubleshooting to know how to get them.
- Details on the actions you took to get that error so that we can reproduce it.
- IP network details in order to help troubleshooting potential network issues. For instance:
  - Client IP address (browser, command line client, etc.) from where you are trying to access OSM
  - IP address of the machine where OSM is running
  - IP addresses of the containers
  - NAT rules in the machine where OSM is running

Common sense applies here, so you don’t need to send everything, but just enough information to diagnose the issue and find a proper solution.

9.10. How to troubleshoot issues in the new Service Assurance architecture

Since OSM Release FOURTEEN, the Service Assurance architecture is based on Apache Airflow and Prometheus. The Airflow DAGs, in addition to periodically collecting metrics from VIMs and storing them into Prometheus, implement auto-scaling and auto-healing closed-loop operations which are triggered by Prometheus alerts. These alerts are managed by AlertManager and forwarded to Webhook Translator, which re-formats them to adapt to Airflow expected webhook endpoints. So the alert workflow is this: DAGs collect metrics => Prometheus => AlertManager => Webhook Translator => Alarm driven DAG

In case of any kind of error related to monitoring, the first thing to check should be the metrics stored in Prometheus. Its graphical interface can be visited at the URL http://$IP:9091/. Some useful metrics to review are the following:

ns_topology: metric generated by a DAG with the current topology (VNFs and NSs) of instantiated VDUs in OSM.
vm_status: status (1: ok, 0: error) of the VMs in the VIMs registered in OSM.
vm_status_extended: metric enriched from the two previous ones, so it includes data about VNF and NS the VM belongs to as part of the metric labels.
osm_*: resource consumption metrics. Only instantiated VNFs that include monitoring parameters have this kind of metric in Prometheus.

In case you need to debug closed-loop operations you will also need to check the Prometheus alerts here http://$IP:9091/alerts. On this page you can see the alerting rules and their state: inactive, pending or firing. When an alert is fired (its state changes from pending to firing) or is marked as resolved (from firing to inactive), the appropriate DAG is run on Airflow. There are three types of alerting rules:

vdu_down: this alert is fired when a VDU remains in a not OK state for several minutes and triggers alert_vdu DAG. Its labels include information about NS, VNF, VIM, etc.
scalein_*: these rules manage scale-in operations based on the resource consumption metrics and the number of VDU instances. They trigger scalein_vdu DAG.
scaleout_*: these rules manage scale-out operations based on the resource consumption metrics and the number of VDU instances. They trigger scaleout_vdu DAG.
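The same metrics and alerts can be fetched from the command line through the standard Prometheus HTTP API (/api/v1/query and /api/v1/alerts). This is a sketch that assumes the API is served on the same port 9091 as the graphical interface; the function names are hypothetical, and the query helper does not URL-encode its argument, so it is only suitable for plain metric names such as vm_status or ns_topology.

```shell
#!/bin/bash
# Build Prometheus HTTP API URLs for the Prometheus instance of OSM.
# $1: IP address of the OSM host   $2: metric name (query helper only)
prom_query_url() {
    printf 'http://%s:9091/api/v1/query?query=%s\n' "$1" "$2"
}
prom_alerts_url() {
    printf 'http://%s:9091/api/v1/alerts\n' "$1"
}

# Usage:
#   curl -s "$(prom_query_url "$IP" vm_status)"
#   curl -s "$(prom_alerts_url "$IP")"
```

Both endpoints return JSON, which can be inspected directly or piped to a JSON formatter.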

Finally, it is also interesting for debugging to be able to view the logs of the execution of the DAGs. To do this, you must visit the Airflow website, which will be accessible on the port pointed by the airflow-webserver service in OSM’s cluster (not a fixed port):

kubectl -n osm get svc airflow-webserver
NAME                TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
airflow-webserver   NodePort   10.100.57.168   <none>        8080:19371/TCP   12d
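Because the port is not fixed, it can be convenient to extract it from that output in scripts. The helper below is a sketch that assumes the default table layout shown above, with PORT(S) as the fifth column; the function name node_port_of is hypothetical.

```shell
#!/bin/bash
# Extract the node port from the table printed by `kubectl get svc <name>`.
# The PORT(S) column has the form <port>:<nodePort>/<protocol>.
node_port_of() {
    awk 'NR > 1 { split($5, a, "[:/]"); print a[2] }'
}

# Usage:
#   port=$(kubectl -n osm get svc airflow-webserver | node_port_of)
#   echo "Airflow UI: http://$IP:$port"
```

Alternatively, kubectl can print the value directly with -o jsonpath='{.spec.ports[0].nodePort}'.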

When you open the URL http://$IP:port (19371 in the example above) in a browser, you will be prompted for the user and password (admin/admin by default). After that you will see the dashboard with the list of DAGs:

alert_vdu: it is executed when a VDU down alarm is fired or resolved.
scalein_vdu, scaleout_vdu: executed when auto-scaling conditions in a VNF are met.
ns_topology: this DAG is executed periodically for updating the topology metric in Prometheus of the instantiated NS.
vim_status_*: there is one such DAG for each VIM in OSM. It checks VIM’s reachability every few minutes.
vm_status_vim_*: these DAGs (one per VIM) get VM status from VIM and store them in Prometheus.
vm_metrics_vim_*: these DAGs (one per VIM) store in Prometheus resource consumption metrics from VIM.

The logs of the executions can be accessed by clicking on the corresponding DAG in dashboard and then selecting the required date and time in the grid. Each DAG has a set of tasks, and each task has its own logs.