# Observability and Alerting

## Installation
**Important**

This page assumes that the Seldon Enterprise Platform component is installed. The Prometheus integration, and in particular deployment usage monitoring, must be configured in production clusters.

The analytics component is configured through the Prometheus integration. Monitoring for Seldon Enterprise Platform is based on the Prometheus Operator and the related `PodMonitor` and `PrometheusRule` resources.
### Prepare the seldon-monitoring namespace

We start by creating a namespace to hold the monitoring components, conventionally called `seldon-monitoring`:

```shell
kubectl create ns seldon-monitoring || echo "Namespace seldon-monitoring already exists"
```
### Installing Prometheus Operator

**Note**

You can use your existing installation of Prometheus Operator and configure it with the `PodMonitor` and `PrometheusRule` resources provided later in this document.
**Warning**

We strongly recommend using a Prometheus setup in the same Kubernetes cluster as Enterprise Platform. Running a monitoring stack outside your Kubernetes cluster, whether Prometheus or something else, creates a number of difficulties:

- The monitoring tool needs to be configured to talk to the Kubernetes API.
- The monitoring tool needs appropriate access rights, both for the API and to monitor resources in the cluster.
- Reaching those resources becomes a challenge, as they are not exposed to the outside world by default, and exposing them presents potential security risks.
- Not only does Prometheus require access to in-cluster resources, but Enterprise Platform also needs access to Prometheus so that it can query metrics. Enabling this second flow of traffic may require further work from a network and security standpoint.
- Scraping from outside the cluster can be less reliable than from within, depending on your exact setup. This may lead, for example, to lost data and false-positive alerts.
We'll install the Prometheus Operator, as packaged by Bitnami. For that purpose, we prepare a `prometheus-values.yaml` file:

```yaml
fullnameOverride: seldon-monitoring
kube-state-metrics:
  extraArgs:
    metric-labels-allowlist: pods=[*]
```

and use it to perform the `helm` installation:

```shell
helm upgrade --install prometheus kube-prometheus \
  --version 8.3.6 \
  --namespace seldon-monitoring \
  --values prometheus-values.yaml \
  --repo https://charts.bitnami.com/bitnami
```
**Important**

Note the presence of `metric-labels-allowlist: pods=[*]` in the Helm values file. If you are using your own Prometheus Operator installation, you need to make sure that pod labels, especially `app.kubernetes.io/managed-by=seldon-core`, are included in the collected metrics, as they are used to compute deployment usage rules.
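As an illustration, with the allowlist in place kube-state-metrics exposes pod labels through its `kube_pod_labels` metric, sanitizing label names so that `app.kubernetes.io/managed-by` becomes `label_app_kubernetes_io_managed_by`. A query along these lines then selects pods managed by Seldon Core:

```promql
# Pods whose labels identify them as managed by Seldon Core
kube_pod_labels{label_app_kubernetes_io_managed_by="seldon-core"}
```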
Wait for Prometheus Operator to be ready:

```shell
kubectl rollout status -n seldon-monitoring deployment/seldon-monitoring-operator
```
## Configure Monitoring

To configure monitoring, we need to create dedicated `PodMonitor` and `PrometheusRule` resources.
Copy the default resources (and edit if required):

```shell
cp seldon-deploy-install/reference-configuration/metrics/seldon-monitor.yaml seldon-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/drift-monitor.yaml drift-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/deploy-monitor.yaml deploy-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/metrics-server-monitor.yaml metrics-server-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/deployment-usage-rules.yaml deployment-usage-rules.yaml
```
Apply the configurations:

```shell
kubectl apply -n seldon-monitoring -f seldon-monitor.yaml
kubectl apply -n seldon-monitoring -f drift-monitor.yaml
kubectl apply -n seldon-monitoring -f deploy-monitor.yaml
kubectl apply -n seldon-monitoring -f metrics-server-monitor.yaml
kubectl apply -n seldon-monitoring -f deployment-usage-rules.yaml
```
## Configure Alerting

### Configuring Alertmanager

Create an `alertmanager.yaml` configuration for Alertmanager:
```yaml
kind: Secret
apiVersion: v1
metadata:
  name: alertmanager-seldon-monitoring-alertmanager
stringData:
  alertmanager.yaml: |
    receivers:
    - name: default-receiver
    - name: deploy-webhook
      webhook_configs:
      - url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
    route:
      group_wait: 10s
      group_by: ['alertname']
      group_interval: 5m
      receiver: default-receiver
      repeat_interval: 3h
      routes:
      - receiver: deploy-webhook
        matchers:
        - severity =~ "warning|critical"
        - type =~ "user|infra"
```
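To make the routing behaviour concrete, here is a minimal Python sketch (not Alertmanager's actual implementation) of how the `=~` matchers decide which alerts reach the `deploy-webhook` receiver. Alertmanager anchors these regexes against the whole label value, which `re.fullmatch` mimics:

```python
import re

# Matchers from the deploy-webhook route; an alert must satisfy all of them.
MATCHERS = {"severity": "warning|critical", "type": "user|infra"}

def route(alert_labels: dict) -> str:
    """Return the receiver a given alert's labels would be routed to."""
    if all(re.fullmatch(pattern, alert_labels.get(label, ""))
           for label, pattern in MATCHERS.items()):
        return "deploy-webhook"
    return "default-receiver"  # the top-level route's receiver

print(route({"severity": "critical", "type": "infra"}))  # deploy-webhook
print(route({"severity": "info", "type": "user"}))       # default-receiver
```

Alerts missing either label, or carrying values outside the two alternations, fall through to `default-receiver`.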
If you are using App Level Authentication, you need to add `http_config` to the `webhook_configs` section of `alertmanager.yaml`. This requires a client that has been configured to access the Seldon Enterprise Platform API. The `token_url` value may vary, depending on your OIDC provider.
```yaml
webhook_configs:
- url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
  http_config:
    oauth2:
      client_id: "${OIDC_CLIENT_ID}"
      client_secret: "${OIDC_CLIENT_SECRET}"
      scopes: [openid]
      token_url: "${OIDC_HOST}/auth/realms/${OIDC_REALM}/protocol/openid-connect/token"
      # Note: only needed if using a self-signed certificate on your OIDC provider
      tls_config:
        insecure_skip_verify: true
```
If you are using a self-signed certificate on your OIDC provider, you will need to set `insecure_skip_verify` in the `tls_config` of the `oauth2` block as shown above. Alternatively, you can mount your CA certificate onto the Alertmanager instance and validate the server certificate using `ca_file`, as documented in the Alertmanager configuration reference.
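For reference, the `oauth2` block corresponds to a standard OAuth2 client-credentials token request against `token_url`. The sketch below uses placeholder values standing in for the `${OIDC_*}` variables and only builds the request; depending on the provider, the credentials may instead be sent via HTTP basic auth:

```python
import urllib.parse
import urllib.request

# Placeholder token endpoint standing in for ${OIDC_HOST}/${OIDC_REALM} above.
token_url = "https://oidc.example.com/auth/realms/deploy-realm/protocol/openid-connect/token"

# Form-encoded body of a client-credentials grant.
body = urllib.parse.urlencode({
    "grant_type": "client_credentials",
    "client_id": "example-client-id",          # placeholder
    "client_secret": "example-client-secret",  # placeholder
    "scope": "openid",
}).encode()

request = urllib.request.Request(token_url, data=body, method="POST")
request.add_header("Content-Type", "application/x-www-form-urlencoded")
print(request.get_method(), request.full_url)
```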
Now, apply the Alertmanager configuration:

```shell
kubectl delete secret -n seldon-monitoring alertmanager-seldon-monitoring-alertmanager || echo "Does not yet exist"
kubectl apply -f alertmanager.yaml -n seldon-monitoring
```
### Creating default alerts

Copy the default alert configurations:

```shell
cp seldon-deploy-install/reference-configuration/metrics/user-alerts.yaml user-alerts.yaml
cp seldon-deploy-install/reference-configuration/metrics/infra-alerts.yaml infra-alerts.yaml
cp seldon-deploy-install/reference-configuration/metrics/drift-alerts.yaml drift-alerts.yaml
```
Apply the default alert configurations:

```shell
kubectl apply -n seldon-monitoring -f infra-alerts.yaml
kubectl apply -n seldon-monitoring -f user-alerts.yaml
kubectl apply -n seldon-monitoring -f drift-alerts.yaml
```
We recommend installing these rules as they have useful alerts for both Enterprise Platform and your models. You can also extend them with your own rules.
## Configure Seldon Enterprise Platform

Update the `deploy-values.yaml` file containing your Helm values for the Seldon Enterprise Platform installation:

```yaml
prometheus:
  seldon:
    namespaceMetricName: namespace
    activeModelsNamespaceMetricName: exported_namespace
    serviceMetricName: service
    url: http://seldon-monitoring-prometheus.seldon-monitoring:9090/api/v1/
env:
  ALERTMANAGER_URL: http://seldon-monitoring-alertmanager.seldon-monitoring:9093/api/v1/alerts
```
Execute this command to upgrade and apply the new configuration:

```shell
helm upgrade seldon-deploy seldon-charts/seldon-deploy \
  -f deploy-values.yaml \
  --namespace=seldon-system \
  --version 2.3.1 \
  --install
```
## Alerting
Seldon Enterprise Platform, and any deployed models, expose metrics to Prometheus. By default, some alerting rules are configured along with an Alertmanager instance.
Alertmanager will notify the Enterprise Platform frontend when an SLO is breached on the Platform infrastructure or a deployed model.
Enterprise Platform also has an alerts page where you can view all currently firing alerts.
See the alerting demo to try it out, once installed.
### Alerting integration

Alertmanager must be version 0.24.0 or greater. The default installation described above provides an alerting setup out of the box, but if you are not using the default installation, changes to the Prometheus and Alertmanager configuration will be required.

Key configuration elements in `alertmanager.yml` from the installation section:

- OAuth2 config
- Webhook receiver, which notifies the Enterprise Platform frontend when an alert fires
- Grouping by `alertname`
### Alerting API
The alerting service currently provides endpoints to list all firing alerts and to initiate a test of the alerting flow. See the API reference page for more details.
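As a sketch, the alerting flow test can also be triggered programmatically rather than with `curl`. The host and token below are placeholders; the endpoint path matches the one used elsewhere in this page:

```python
import urllib.request

BASE = "http://ENTERPRISE_PLATFORM_IP/seldon-deploy/api/v1alpha1"  # placeholder host
TOKEN = "example-token"  # placeholder bearer token from your auth provider

def alerting_test_request(base: str = BASE, token: str = TOKEN) -> urllib.request.Request:
    """Build the POST request that initiates a test of the alerting flow."""
    return urllib.request.Request(
        f"{base}/alerting/test",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )

req = alerting_test_request()
print(req.get_method(), req.full_url)  # send with urllib.request.urlopen(req) against a real cluster
```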
### Configuring an external incident response tool

Alertmanager can be configured to also send alerts elsewhere, such as via email or Slack. It can also be integrated with an incident response tool.
### PagerDuty integration

To set up alerts in PagerDuty, follow these steps:

1. Follow PagerDuty's guide to get your integration key. Make sure you follow the *Integrating With a PagerDuty Service* instructions rather than the ones for Global Event Routing.

2. Add the following receiver to the `alertmanager.yml` file:

   ```yaml
   - name: pagerduty-deploy
     pagerduty_configs:
     - service_key: <YOUR_INTEGRATION_KEY_HERE>
   ```

3. Add the following route to the provided `alertmanager.yml` file:

   ```yaml
   - receiver: pagerduty-deploy
     match_re:
       severity: critical
     continue: true
   ```

   **Warning**

   If the new route is not at the top of your list of routes, you will need to add `continue: true` to the route that comes before your new entry.

4. Create the secret using:

   ```shell
   kubectl create secret generic -n seldon-monitoring alertmanager-seldon-monitoring-alertmanager \
     --from-file=alertmanager.yml --dry-run=client -o yaml | kubectl apply -f -
   ```

5. If Alertmanager is already running, you will need to port-forward Alertmanager and reload the config; it may take a minute or so to complete:

   ```shell
   kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
   # In a new terminal run the following:
   curl -X POST http://localhost:9093/-/reload
   ```

6. Use the alerting flow test API endpoint to ensure it's working:

   ```shell
   curl http://<ENTERPRISE_PLATFORM_IP>/seldon-deploy/api/v1alpha1/alerting/test -X POST -H "Authorization: Bearer $TOKEN"
   ```
### Opsgenie integration

To set up alerts in Opsgenie, follow these steps:

1. Follow Opsgenie's guide to get your API key.

2. Add the following receiver to the provided `alertmanager.yml` file:

   ```yaml
   - name: opsgenie-deploy
     opsgenie_configs:
     - api_key: <YOUR_API_KEY_HERE>
       teams: <YOUR_TEAM_HERE>
   ```

3. Add the following route to the provided `alertmanager.yml` file:

   ```yaml
   - receiver: opsgenie-deploy
     match_re:
       severity: critical
     continue: true
   ```

   **Warning**

   If the new route is not at the top of your list of routes, you will need to add `continue: true` to the route that comes before your new entry.

4. Create the secret using:

   ```shell
   kubectl create secret generic -n seldon-monitoring alertmanager-seldon-monitoring-alertmanager \
     --from-file=alertmanager.yml --dry-run=client -o yaml | kubectl apply -f -
   ```

5. If Alertmanager is already running, you will need to port-forward Alertmanager and reload the config; it may take a minute or so to complete:

   ```shell
   kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
   # In a new terminal run the following:
   curl -X POST http://localhost:9093/-/reload
   ```

6. Use the alerting flow test API endpoint to ensure it's working:

   ```shell
   curl http://<ENTERPRISE_PLATFORM_IP>/seldon-deploy/api/v1alpha1/alerting/test -X POST -H "Authorization: Bearer $TOKEN"
   ```
### Custom alerts

You can also define your own custom alerting rules in Prometheus.

Create a file called `custom-alert.yaml` that contains your new rule(s). There are examples to follow in the file `seldon-deploy-install/reference-configuration/metrics/user-alerts.yaml`.

Afterwards, the PrometheusRule(s) can be added using:

```shell
kubectl create -f custom-alert.yaml
```
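For illustration, a minimal `custom-alert.yaml` might look like the following. The metric comes from kube-state-metrics; the namespace, threshold, and alert labels are assumptions to adapt to your own SLOs and to your Prometheus instance's rule selection:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: seldon-monitoring
spec:
  groups:
  - name: custom.rules
    rules:
    - alert: ModelPodRestarting
      # Hypothetical threshold: more than 3 restarts in 10 minutes
      expr: increase(kube_pod_container_status_restarts_total{namespace="seldon"}[10m]) > 3
      for: 5m
      labels:
        severity: warning
        type: user
      annotations:
        summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} is restarting frequently"
```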
You can port-forward Prometheus and reload its configuration like this:

```shell
kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090
# In a new terminal run the following:
curl -X POST http://localhost:9090/-/reload
```

Visit the Prometheus UI at http://localhost:9090 and check that your new alerting rule is visible.
## Verification and Troubleshooting

Using the Prometheus web UI, we can inspect the state of the running system. To access the web UI, use port-forwarding:

```shell
kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090
```

Then go to `localhost:9090` in the browser.

To confirm the recording rules are present, go to *Status > Rules* and search for `deployment-usage`.
If you have a Seldon model running, go to *Status > Targets* and search for `seldon_app`, or just `seldon`. Any targets for Seldon models should be green.

On the `/graph` page, if you open the *insert metric at cursor* drop-down, there should be metric names that begin with `seldon`.
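A quick PromQL check on the `/graph` page: the query below should return an empty result; any series it does return corresponds to a target that Prometheus is currently failing to scrape:

```promql
up == 0
```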