Feature: AI in OSM for proactive monitoring
AI in OSM for proactive monitoring based on CPU predictions
PAD: https://osm.etsi.org/pad/p/feature9063
Files (Plenary slides): https://portal.etsi.org/ngppapp/ContributionCreation.aspx?primarykeys=226673
Diagram (https://app.diagrams.net/): ProactiveMonitoringOSM-TSCMeeting.drawio
Proposer
- Subhankar Pal (Altran)
- Ignacio Labrador (ATOS)
- Guillermo Gomez (ATOS)
- Paula Encinar (ATOS)
Type
Feature
Target MDG/TF
MON, Mongo, NG-UI, osmclient, Devops, tests, NBI
Description
AI/ML continues to grow in the telecommunications sector, enabling proactive decision making. However, the current approach to Service Assurance (SA) in OSM is purely reactive: scaling actions are triggered by what is currently happening, as measured by a set of normalized metrics such as cpu_utilization or average_memory_utilization. MON collects the metrics from the VIM, makes them available to Prometheus and then checks for threshold-crossing events at fixed intervals. Prometheus has some limitations that matter for forecasting:
- Prometheus is a pull-based system, so no write operations are available.
- Prometheus assigns the timestamps itself, so only real-time samples are supported.
- Prometheus is not intended for long-term analysis, since it compacts data.
The proposed feature will introduce proactive monitoring in OSM, targeting cpu_utilization. This means that a new normalized metric called cpu_utilization_prediction will be made available for users to use in VNF descriptors. The feature will be implemented by adding a predictor process to MON, alongside the collector, the evaluator and the dashboard. The proposal will enable the following:
- MON will include a way to collect and process forecasts.
- MON will have a new sub-process, called Predictor, which will provide a way of scheduling the generation of forecasts.
- The MON evaluator process will be able to check for threshold crossings both on a single real-time value and on an entire forecast time series.
- A new DB will be introduced. The chosen one is InfluxDB, a high-performance time series database that is compatible with Prometheus. This will allow for high availability and push-based metric storage, so MON will be able to store all forecasts and plot them in Grafana against the data collected in real time.
- MON's dashboard will show forecasts on top of real-time metrics, providing evidence for the decisions made and allowing comparison.
- Predictions will not be entered into Prometheus and evaluated one at a time. Instead, they will be entered directly into the long-term TSDB and evaluated all at once.
- OSM will include a new first-class citizen, the AI Model Server, which requires a URL and auth credentials, an available AI/ML model and static settings (required collection period and endpoint). It will be a new entry in the OSM GUI: an AI Model Server will be added in the same way as a new VIM is added today, and its parameters need to be saved in MongoDB just like VIM parameters.
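The AI Model Server entry described in the last item could be modelled roughly as follows. This is a minimal sketch, not the proposed schema: the class name `AIModelServer`, every field name, the example values and the choice of the name as `_id` are assumptions made for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass
class AIModelServer:
    """Hypothetical record for an AI Model Server, registered like a VIM account."""
    name: str
    url: str                  # base URL of the model server
    user: str                 # auth credentials
    password: str             # plain text only in this sketch
    model: str                # identifier of the AI/ML model to query
    collection_period_s: int  # static setting: required collection period
    endpoint: str             # static setting: path used to request forecasts

    def to_document(self) -> dict:
        """Return a dict that could be inserted into a MongoDB collection."""
        doc = asdict(self)
        doc["_id"] = self.name  # assumption: the name is the unique key
        return doc


server = AIModelServer(
    name="cpu-forecaster",
    url="https://ai.example.org",
    user="osm",
    password="secret",
    model="cpu-rush-hour",
    collection_period_s=30,
    endpoint="/predict",
)
document = server.to_document()
```

The resulting `document` is a plain dict, so it can be handed to whatever MongoDB access layer already stores VIM parameters.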
With these additions, reactive control will continue to work in the same way: it is evaluated actively at fixed intervals. Proactive monitoring, on the other hand, only evaluates when a forecast is requested: it evaluates all received forecasts and then sleeps until the next scheduled request.
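The difference between the two evaluation modes can be sketched as below. The function names, the `(timestamp, value)` sample layout and the injected `fetch_forecast` and `sleep` callables are assumptions for illustration, not the MON evaluator's actual interface.

```python
from typing import Callable, List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)


def crosses_now(value: float, threshold: float) -> bool:
    """Reactive check: compare the latest real-time sample with the threshold."""
    return value > threshold


def first_forecast_crossing(forecast: List[Sample], threshold: float) -> Optional[float]:
    """Proactive check: scan the whole forecast and return the timestamp of the
    first predicted sample above the threshold, or None if there is none."""
    for timestamp, value in forecast:
        if value > threshold:
            return timestamp
    return None


def proactive_cycle(
    fetch_forecast: Callable[[], List[Sample]],
    threshold: float,
    sleep: Callable[[float], None],
    period_s: float,
) -> Optional[float]:
    """One proactive cycle: request a forecast, evaluate all of it, then sleep."""
    crossing = first_forecast_crossing(fetch_forecast(), threshold)
    sleep(period_s)
    return crossing


# Illustrative data: the current value is below the threshold,
# but the forecast predicts a crossing two samples ahead.
forecast = [(1000.0, 55.0), (1060.0, 72.0), (1120.0, 85.0), (1180.0, 91.0)]
reactive_alarm = crosses_now(value=55.0, threshold=80.0)
proactive_alarm_at = first_forecast_crossing(forecast, threshold=80.0)
```

In a real process `sleep` would be `time.sleep` or an asyncio equivalent. Passing it in as a parameter keeps the cycle testable without waiting.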
As reactive monitoring will remain intact, the descriptors for reactive monitoring and the scaling settings will remain intact as well. In proactive monitoring, scaling will be configured in the same way as in reactive monitoring, using the name of the real-time CPU metric plus a suffix (_prediction). Both configurations will live in the same descriptor without any change to the Information Model. Proactive monitoring is enabled at instantiation time, and it can be enabled or disabled, or the AI/ML model changed, via instantiation parameters.
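The naming convention and the instantiation-time switch might look like this in code. This is a sketch only: the parameter keys `proactive_monitoring`, `enabled` and `ai_model` are hypothetical, since the proposal does not define the instantiation parameter names.

```python
from typing import Dict, List

PREDICTION_SUFFIX = "_prediction"


def prediction_metric(real_time_metric: str) -> str:
    """Derive the forecast metric name from the real-time metric name."""
    return real_time_metric + PREDICTION_SUFFIX


def is_prediction_metric(metric: str) -> bool:
    """Tell forecast metrics apart from real-time ones by their suffix."""
    return metric.endswith(PREDICTION_SUFFIX)


def active_metrics(descriptor_metrics: List[str], params: Dict) -> List[str]:
    """Return the metrics to evaluate. Prediction metrics are dropped
    when proactive monitoring is not enabled in the parameters."""
    enabled = params.get("proactive_monitoring", {}).get("enabled", False)
    return [m for m in descriptor_metrics if enabled or not is_prediction_metric(m)]


# Both metrics sit in the same descriptor; only the parameters differ.
descriptor_metrics = ["cpu_utilization", prediction_metric("cpu_utilization")]

# Hypothetical instantiation parameters (key names are assumptions).
params_on = {"proactive_monitoring": {"enabled": True, "ai_model": "cpu-rush-hour"}}
params_off = {"proactive_monitoring": {"enabled": False}}

metrics_on = active_metrics(descriptor_metrics, params_on)
metrics_off = active_metrics(descriptor_metrics, params_off)
```

Because the descriptor is the same in both cases, turning proactive monitoring off leaves the reactive configuration untouched.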
In order for this feature to be implemented, the following changes are envisaged:
- MON
  1.1 A new process has to be added for the AI-Agent and its prediction algorithms.
  1.2 Make changes to the MON evaluator to take proactive monitoring into account.
  1.3 Make changes to the MON dashboarder so that it can display the forecasts.
- NG-UI + osmclient + NBI: Add a new first-class citizen, namely the option to add a new AI Model Server in the same way as a VIM.
- Devops: Add a new TSDB to store forecasts and to provide high availability for Prometheus access and storage.
- Documentation: Step-by-step documentation will be provided on how to use this functionality.
- Test: Unit tests in each repo shall be complemented with a set of Robot tests.
Demo or definition of done
This feature shall be considered done when a use case (UC) leveraging CPU threshold prediction yields an early scaling action on a VNF. To test this, a CPU predictor model will be trained to predict rush-hour overloads, which put the CPU under very high stress. Then, when the VNF is instantiated near a rush hour and the real-time CPU utilization grows (manually injected), a scaling action should be triggered before the real-time metric exceeds the threshold.
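The acceptance criterion can be illustrated with a toy simulation. Everything here is assumed for the sake of the example: the threshold, the horizon, the injected CPU ramp, and above all `naive_forecast`, a linear extrapolation that merely stands in for the trained predictor model.

```python
THRESHOLD = 80.0  # percent CPU; illustrative value
HORIZON = 3       # how many steps ahead the stand-in predictor looks

# Manually injected real-time CPU utilization, one sample per step.
real_time = [40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0]


def naive_forecast(history, horizon):
    """Stand-in predictor: extend the last observed slope. NOT the trained model."""
    slope = history[-1] - history[-2]
    return [history[-1] + slope * k for k in range(1, horizon + 1)]


def first_step(condition):
    """Return the first step (starting at 1) for which the condition holds."""
    return next(step for step in range(1, len(real_time)) if condition(step))


# Reactive trigger: the real-time metric itself exceeds the threshold.
reactive_step = first_step(lambda s: real_time[s] > THRESHOLD)

# Proactive trigger: any forecast sample exceeds the threshold.
proactive_step = first_step(
    lambda s: any(v > THRESHOLD for v in naive_forecast(real_time[: s + 1], HORIZON))
)

# The definition of done corresponds to a positive lead.
lead_steps = reactive_step - proactive_step
```

With this ramp the proactive trigger fires a few steps before the reactive one. A real test would replace the synthetic ramp and the extrapolation with injected load on the VNF and forecasts from the AI Model Server.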