From 1e7aea5746e2785d6541d9619ed413f14ba5f97d Mon Sep 17 00:00:00 2001
From: beierlm
Date: Thu, 29 Jul 2021 13:27:30 +0000
Subject: [PATCH] HA documentation

Things to consider for HA, fault tolerance, geo redundancy, security, etc.
---
 18-production-considerations.md | 95 +++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)
 create mode 100644 18-production-considerations.md

diff --git a/18-production-considerations.md b/18-production-considerations.md
new file mode 100644
index 0000000..47a12c4
--- /dev/null
+++ b/18-production-considerations.md
@@ -0,0 +1,95 @@
+# Highly Available, Fault Tolerant OSM
+
+OSM is capable of running in a small footprint with all services on the same server. While this is great for proofs of concept or research, it is not suitable for production use. This annex covers the failure domains in OSM and how to create a highly available, fault tolerant deployment.
+
+# Fault Domains
+
+In general terms, a fault domain is a group of software components that share a single point of failure. For example, the introduction described running all of OSM in a single VM. That VM is a single point of failure: losing it results in a complete loss of all OSM services. When designing an OSM deployment, multiple fault domains need to be considered:
+
+* K8s infrastructure
+* LXD infrastructure
+* Databases
+* OSM software
+* Juju
+* Service ingress
+
+## K8s Infrastructure
+
+The first layer to consider is how Kubernetes is installed. Nodes that perform work should be separated from those that provide management, with a minimum of three of each. K8s also requires etcd, which can be deployed as a separate fault domain using three additional nodes.
+
+As this type of K8s installation can take many forms and require custom configuration, OSM provides the ability to install its services into an existing K8s cluster by simply specifying the Kubernetes configuration file (`kubeconfig`) for the cluster as follows:
+
+```bash
+./install_osm.sh --charmed --k8s ~/.kube/config
+```
+
+Further reading:
+
+* https://ubuntu.com/kubernetes/docs/high-availability
+* https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology
+
+## LXD Infrastructure
+
+OSM uses LXD for the execution of primitives and actions for deployed network services. A standalone LXD server is a single point of failure, so for production use LXD should be deployed as a cluster, which requires a minimum of three nodes.
+
+Further reading:
+
+* https://lxd.readthedocs.io/en/latest/clustering/
+* https://ubuntu.com/blog/lxd-clusters-a-primer
+* [LXD Cluster](16-lxd-cluster.md)
+
+## Databases
+
+OSM uses two databases for storing its state: MongoDB and MariaDB. Both of these technologies provide HA and geographically redundant solutions. By isolating the databases and keeping replicas, it is possible to recover very quickly from a failed site by simply starting a new OSM instance in a new location against the replicated databases. The databases can be installed in any configuration: bare metal instances, VMs, K8s clusters, or whatever suits the production environment best.
+
+OSM can be installed to point at these databases directly as part of a custom bundle, or the charmed installer can be run and the relations to the self-installed databases removed afterwards.
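+
+As a minimal sketch of the geographically distributed replica set pattern described in the links below, the following initiates a three-member MongoDB replica set with two members in a primary site and one in a second site. The hostnames are hypothetical; adapt them to the actual environment.
+
+```bash
+# Run once against one of the members; the higher priority keeps the
+# primary in site 1 during normal operation.
+mongo --host mongo-site1-a.example.com --eval '
+rs.initiate({
+  _id: "rs0",
+  members: [
+    { _id: 0, host: "mongo-site1-a.example.com:27017", priority: 2 },
+    { _id: 1, host: "mongo-site1-b.example.com:27017", priority: 1 },
+    { _id: 2, host: "mongo-site2-a.example.com:27017", priority: 1 }
+  ]
+})'
+```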
+
+Further reading:
+
+* https://docs.mongodb.com/manual/core/replica-set-architecture-geographically-distributed/
+* https://docs.mongodb.com/manual/tutorial/deploy-geographically-distributed-replica-set/
+* https://severalnines.com/database-blog/using-mysql-galera-cluster-replication-create-geo-distributed-cluster-part-one
+
+## OSM Software
+
+OSM itself comprises a number of different services. Each of these should be set to a minimum replica count of three, with anti-affinity rules specified in the K8s infrastructure to ensure that no two replicas of the same service run on the same worker node. The core OSM services are as follows:
+
+* Grafana
+* Kafka
+* Keystone
+* LCM
+* MON
+* NBI
+* NG-UI
+* PLA (optional, if installed)
+* POL
+* Prometheus
+* RO
+* Zookeeper
+
+Note that MariaDB and MongoDB are not listed here, as they should be isolated in a separate fault domain.
+
+OSM provides an HA bundle that can be used to deploy OSM with replicas already configured.
+
+## Juju
+
+More than one Juju controller (VCA) can be registered with OSM. This provides a number of benefits:
+
+* More fault domains: in the event of a failure of Juju in one location, the others are still able to operate
+* Actions closer to the workload: with Juju and either an LXD or K8s cloud colocated with the workloads, all actions can occur in the same location instead of being distributed across an L3 network
+
+## Service Ingress
+
+OSM exposes interfaces for the following components:
+
+* Grafana
+* NBI
+* NG-UI
+* Prometheus
+
+While these can be exposed directly from the pods themselves, doing so prevents HA and load balancing. From a resiliency perspective, these services should be placed behind an ingress controller, and clients should talk to the ingress instead. OSM itself does not have any specific requirements for ingress controllers, and nginx has been shown to work well. The ingress controller should also be secured with a properly signed certificate for HTTPS access. A sketch is shown below.
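+
+As a minimal sketch of this pattern, the following deploys the NGINX ingress controller and exposes the NBI behind it. The `osm` namespace, `nbi` service name, hostname, and `osm-tls` TLS secret are assumptions for illustration; adapt them to the actual deployment.
+
+```bash
+# Install the NGINX ingress controller into its own namespace.
+helm upgrade --install ingress-nginx ingress-nginx \
+  --repo https://kubernetes.github.io/ingress-nginx \
+  --namespace ingress-nginx --create-namespace
+
+# Route https://nbi.example.com to the NBI service (port 9999) and
+# terminate TLS with the certificate in the "osm-tls" secret.
+kubectl create ingress nbi --namespace osm \
+  --rule="nbi.example.com/*=nbi:9999,tls=osm-tls"
+```
+
+The same pattern applies to the Grafana, NG-UI, and Prometheus endpoints.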