New feature request: Predictive threshold

Advanced monitoring to support AI Agents

Proposers

Subhankar Pal (subhankar.pal@altran.com)
Atul Agarwal (atul.agarwal@altran.com)
Felipe Vicens (felipe.vicens@atos.net)
Fabian Bravo (fbravo@whitestack.com)
Gianpietro Lavado (glavado@whitestack.com)
Ignacio Labrador (ignacio.labrador@atos.net)
Guillermo Gomez (guillermo.gomez.external@atos.net)
Paula Encinar (paula.encinar.external@atos.net)

Type

Feature

Target MDG/TF

MON

Description

The Service Assurance module in OSM (MON+POL) manages the monitoring closed loop reactively: thresholds are defined per metric, and the system reacts when a threshold is surpassed.

With this Feature we want to prepare MON to leverage metrics provided by external services accessible via PrometheusDB. The objective is to evaluate the threshold of an existing metric not only against its real-time value but also against an associated predicted value, which shall be stored in PrometheusDB under a known identifier (e.g. a _predicted suffix, to be discussed in the PAD). This process is initially envisioned as a sequence of two evaluations against the same threshold. That way, POL remains untouched and works seamlessly with both reactive and proactive alarms.

In order to have this feature implemented, the following changes are devised:

MON: For one of the existing metrics (CPU utilization), a flag will be made available to indicate that a predicted version of the metric is pushed to PrometheusDB by an AIAgent running in the EE. PrometheusDB will not be used to store historical data for prediction purposes.

Given that the flag is ON and CPU utilization values are available in PrometheusDB as both real-time and predicted entries, MON will first evaluate the real-time value and, only if the alarm was not triggered, will then evaluate the predicted value.
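The two-step evaluation described above can be sketched as follows. This is only an illustration and not MON code: the function names, the "reactive"/"proactive" labels, the operation codes and the _predicted suffix are assumptions (the suffix in particular is still to be agreed in the PAD).

```python
import operator
from typing import Optional

# Assumption: the identifier of the predicted metric is still to be agreed in the PAD.
PREDICTED_SUFFIX = "_predicted"

# Illustrative relational operations for comparing a value with the threshold.
_OPERATIONS = {
    "GT": operator.gt,
    "GE": operator.ge,
    "LT": operator.lt,
    "LE": operator.le,
    "EQ": operator.eq,
}


def predicted_metric_name(metric_name: str) -> str:
    """Build the (hypothetical) name of the predicted version of a metric."""
    return metric_name + PREDICTED_SUFFIX


def threshold_crossed(value: float, threshold: float, operation: str = "GT") -> bool:
    """Return True when the value satisfies the alarm condition."""
    return _OPERATIONS[operation](value, threshold)


def evaluate(
    real_time: Optional[float],
    predicted: Optional[float],
    threshold: float,
    flag_on: bool,
    operation: str = "GT",
) -> Optional[str]:
    """Evaluate the real-time value first, then the predicted one.

    Returns "reactive" if the real-time value triggers the alarm,
    "proactive" if only the predicted value does, and None otherwise.
    """
    if real_time is not None and threshold_crossed(real_time, threshold, operation):
        return "reactive"
    if flag_on and predicted is not None and threshold_crossed(predicted, threshold, operation):
        return "proactive"
    return None
```

Because both evaluations use the same threshold and produce the same kind of alarm, POL would not need to distinguish between them; the labels returned here are for illustration only.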

The flag shall correctly identify the VNFi to which it applies.

If the flag is ON but no predicted data is found in the DB, this could be due to a boot-up delay of the AI Agent in the EE. How many evaluation attempts will be tolerated in this state is to be confirmed in the PAD of the Feature.
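One possible way to tolerate the boot-up delay is a simple counter of consecutive evaluations without predicted data. The sketch below is hypothetical: the class name, the returned states and the max_missing parameter are assumptions, and the actual number of attempts is to be confirmed in the PAD.

```python
from dataclasses import dataclass


@dataclass
class PredictionAvailability:
    """Tracks consecutive evaluations in which no predicted data was found."""

    max_missing: int  # hypothetical limit; the real value is to be confirmed in the PAD
    missing: int = 0

    def observe(self, predicted_found: bool) -> str:
        """Record one evaluation and report the resulting state.

        "ok"          -> predicted data is present
        "waiting"     -> data is missing but still within the tolerated attempts
        "unavailable" -> data has been missing for more than max_missing attempts
        """
        if predicted_found:
            self.missing = 0
            return "ok"
        self.missing += 1
        if self.missing <= self.max_missing:
            return "waiting"
        return "unavailable"
```

In every state the real-time evaluation would still run as usual; only the predicted evaluation is skipped while no predicted data is available.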

Predicted metric values to be stored in PrometheusDB will be normalized to address the following:

Predicted metrics have a time index with values in the future, so an offset is required to comply with PrometheusDB requirements. The final mechanism for storing the data in the DB will be discussed in the PAD of the Feature.
The CollectionPeriod of the real-time variable shall be equal to the prediction time window, so that the same Aggregation and Scaling configurations can be used seamlessly.
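The two normalization points can be illustrated with a minimal sketch. It assumes that the offset is applied by subtracting the prediction horizon from the future timestamp; this is only one possible approach, since the final storage mechanism is still to be discussed in the PAD. All names (Sample, normalize_prediction, windows_match, horizon_s) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since the epoch
    value: float      # e.g. CPU utilization


def normalize_prediction(sample: Sample, horizon_s: float) -> Sample:
    """Shift a predicted sample back by the prediction horizon.

    Assumption: the predicted sample is indexed horizon_s seconds in the
    future, so subtracting the horizon yields a timestamp that is not in
    the future when the sample is stored.
    """
    return Sample(timestamp=sample.timestamp - horizon_s, value=sample.value)


def windows_match(collection_period_s: float, prediction_window_s: float) -> bool:
    """Check that the real-time CollectionPeriod equals the prediction window."""
    return collection_period_s == prediction_window_s
```

For example, with a 30-second horizon, a prediction indexed at t = 1030 would be stored at t = 1000, aligned with the real-time sample collected at that instant.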

Such normalization will be performed by an AIAgent running in the EE of the VNF. The proposed Feature will only cover the VNF level, since the EE is not yet provided for NS. More details about the latest AIAgent design are available in its feature: https://osm.etsi.org/gerrit/#/c/osm/Features/+/9063/

Design details with visual examples are posted in the pad: https://osm.etsi.org/pad/p/PredictiveThreshold

Demo or definition of done

This Feature shall be considered done when a use case (UC) leveraging CPU threshold prediction is able to yield an early scaling action on a VNF. To test this, a CPU predictor model will be trained to predict rush-hour overloads, which put the CPU under very high stress. That way, when the VNF is instantiated near a rush hour and the real-time CPU utilization grows (manually injected), a scaling action should be triggered before the real-time metric exceeds the threshold.