Observability & Alerting¶
Important
Before starting the installation procedure, please download installation resources as explained here and make sure that all pre-requisites are satisfied.
This page also assumes that main Seldon Core and Seldon Deploy components are installed.
Important
The Prometheus integration and especially the model usage monitoring must be configured in production clusters.
Installation¶
The analytics component is configured through the Prometheus integration.
Monitoring for Seldon Deploy is based on the Prometheus Operator and the related PodMonitor and PrometheusRule resources.
Installing Prometheus Operator¶
Note
You can use your existing installation of Prometheus Operator and configure it with the PodMonitor and PrometheusRule resources provided later in this document.
We install the Prometheus Operator packaged by Bitnami. For that purpose we prepare the values-kube-prometheus.yaml file:
fullnameOverride: seldon-monitoring
kube-state-metrics:
  extraArgs:
    metric-labels-allowlist: pods=[*]
and use it to perform the Helm installation:
helm upgrade --install prometheus kube-prometheus \
--version 8.0.9 \
--namespace seldon-system \
--values values-kube-prometheus.yaml \
--repo https://charts.bitnami.com/bitnami
Important
Note the presence of metric-labels-allowlist: pods=[*] in the Helm values file. If you are using your own Prometheus Operator installation, you need to make sure that pod labels, especially app.kubernetes.io/managed-by=seldon-core, are included in the collected metrics, as they are used to compute the model usage rules.
Wait for Prometheus Operator to be ready:
kubectl rollout status -n seldon-system deployment/seldon-monitoring-operator
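As an optional sanity check (not a required step), you can confirm that the operator has registered the CRDs used in the next section and that its pods are running; the label selector below is an assumption based on the release name used above and may differ for other chart versions:
kubectl get crd podmonitors.monitoring.coreos.com prometheusrules.monitoring.coreos.com alertmanagers.monitoring.coreos.com
# List the pods created by the chart (label selector assumed from the "prometheus" release name)
kubectl get pods -n seldon-system -l app.kubernetes.io/instance=prometheus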
Configure Monitoring¶
To configure monitoring we need to create dedicated PodMonitor and PrometheusRule resources.
Copy the default resources (and edit if required):
cp seldon-deploy-install/reference-configuration/metrics/seldon-monitor.yaml seldon-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/deploy-monitor.yaml deploy-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/metrics-server-monitor.yaml metrics-server-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/model-usage-rules.yaml model-usage-rules.yaml
Apply configurations:
kubectl apply -n seldon-system -f seldon-monitor.yaml
kubectl apply -n seldon-system -f deploy-monitor.yaml
kubectl apply -n seldon-system -f metrics-server-monitor.yaml
kubectl apply -n seldon-system -f model-usage-rules.yaml
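As a quick check (a sketch, not part of the official procedure), you can verify that the monitors and rules were created:
kubectl get podmonitors,prometheusrules -n seldon-system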
Configure Alerting¶
Create an alertmanager.yaml configuration for Alertmanager:
kind: Secret
apiVersion: v1
metadata:
  name: alertmanager-seldon-monitoring-alertmanager
stringData:
  alertmanager.yaml: |
    receivers:
    - name: default-receiver
    - name: deploy-webhook
      webhook_configs:
      - url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
    route:
      group_wait: 10s
      group_by: ['alertname']
      group_interval: 5m
      receiver: default-receiver
      repeat_interval: 3h
      routes:
      - receiver: deploy-webhook
        matchers:
        - severity =~ "warning|critical"
        - type =~ "user|infra"
Note: if you are using App Level Authentication, you need to add http_config in the webhook_configs section of alertmanager.yaml:
webhook_configs:
- url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
  http_config:
    oauth2:
      client_id: "${OIDC_CLIENT_ID}"
      client_secret: "${OIDC_CLIENT_SECRET}"
      scopes: [openid]
      token_url: "${OIDC_HOST}/auth/realms/${OIDC_REALM}/protocol/openid-connect/token"
      # Note: only needed if using a self-signed certificate on your OIDC provider
      tls_config:
        insecure_skip_verify: true
with the configuration of a client that can access the Seldon Deploy API. Note that the token_url value may depend on your OIDC provider.
If you are using a self-signed certificate on your OIDC provider, you will need to set insecure_skip_verify in the tls_config of the oauth2 block as shown above. Alternatively, you can mount your CA certificate onto the Alertmanager instance and validate the server certificate using ca_file, as documented here.
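For reference, a ca_file based configuration might look like the sketch below; the mount path is an assumption and depends on how you attach the certificate to the Alertmanager pod:
webhook_configs:
- url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
  http_config:
    oauth2:
      client_id: "${OIDC_CLIENT_ID}"
      client_secret: "${OIDC_CLIENT_SECRET}"
      scopes: [openid]
      token_url: "${OIDC_HOST}/auth/realms/${OIDC_REALM}/protocol/openid-connect/token"
      tls_config:
        # Path where your CA certificate is mounted inside the Alertmanager container (assumed; adjust to your setup)
        ca_file: /etc/alertmanager/certs/ca.crt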
Now, apply the Alertmanager configuration:
kubectl delete secret -n seldon-system alertmanager-seldon-monitoring-alertmanager || echo "Does not yet exist"
kubectl apply -f alertmanager.yaml -n seldon-system
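If you want to confirm that Alertmanager has picked up the new configuration, one option (a sketch; the service name and port follow from the ALERTMANAGER_URL used later in this guide) is to port-forward the service and query its status endpoint, which includes the loaded configuration:
kubectl port-forward -n seldon-system svc/seldon-monitoring-alertmanager 9093:9093
# In a new terminal run the following:
curl -s http://localhost:9093/api/v2/status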
Copy the default alert configurations:
cp seldon-deploy-install/reference-configuration/metrics/user-alerts.yaml user-alerts.yaml
cp seldon-deploy-install/reference-configuration/metrics/infra-alerts.yaml infra-alerts.yaml
Apply default alert configurations:
kubectl apply -n seldon-system -f infra-alerts.yaml
kubectl apply -n seldon-system -f user-alerts.yaml
Configure Seldon Deploy¶
Update the deploy-values.yaml file containing your Helm values for the Seldon Deploy installation:
prometheus:
  seldon:
    namespaceMetricName: namespace
    activeModelsNamespaceMetricName: exported_namespace
    serviceMetricName: service
    url: http://seldon-monitoring-prometheus.seldon-system:9090/api/v1/
  knative:
    url: http://seldon-monitoring-prometheus.seldon-system:9090/api/v1/
env:
  ALERTMANAGER_URL: http://seldon-monitoring-alertmanager.seldon-system:9093/api/v1/alerts
and execute the helm upgrade ... command to apply the new configuration.
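For illustration only, the upgrade command might look like the sketch below; the release name and chart location are assumptions based on the paths used elsewhere in this guide, so substitute the values from your original installation:
helm upgrade seldon-deploy ./seldon-deploy-install/helm-charts/seldon-deploy \
  --namespace seldon-system \
  --values deploy-values.yaml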
Alerting¶
Seldon Deploy and any deployed models record metrics to Prometheus. By default, some alerting rules are also configured, along with a default Alertmanager.
Alertmanager will notify the Deploy frontend when an SLO is breached on the Deploy infrastructure or a deployed model.
Deploy also has an alerts page where you can view all currently firing alerts.
See the alerting demo to try it out once installed.
Alerting integration¶
The default installation described above provides the configmaps and setup required for this to work out of the box, but changes to Prometheus and Alertmanager configuration will be required if not using the default installation.
Alertmanager must be version 0.24.0 or greater:
alertmanager:
  image:
    tag: v0.24.0
Default alerting rules are provided with the downloaded resources in the deploy-alerts.rules.yml file referenced in the installation section. We recommend installing these rules as they contain useful alerts for both Deploy and your models, but you can also extend them with your own rules.
The related configmap, named deploy-alerts-rules in the seldon-system namespace, is referenced in the extraConfigmapMounts section of the analytics-values.yaml file in the installation section.
Key configuration elements in alertmanager.yml from the installation section:
- OAuth2 config
- The Deploy webhook receiver, which notifies the Deploy frontend when an alert fires
- Grouping by alertname
Alerting API¶
The alerting service currently provides an endpoint to list all firing alerts, and the ability to initiate a test of the alerting flow. See the API reference page for more details.
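For example, the alerting flow test can be triggered directly; DEPLOY_IP and TOKEN are placeholders for your Deploy address and an access token, and the same command appears in the integration steps below:
curl http://<DEPLOY_IP>/seldon-deploy/api/v1alpha1/alerting/test -X POST -H "Authorization: Bearer $TOKEN"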
Configuring external incident response tool¶
Alertmanager can be configured to send alerts elsewhere as well, such as to email or Slack, and it can be integrated with an incident response tool.
PagerDuty integration¶
To set up alerts in PagerDuty, follow these steps:
1. Follow PagerDuty’s guide to get your integration key. Make sure you follow the Integrating With a PagerDuty Service instructions rather than the ones for Global Event Routing.
2. Retrieve the contents of your alertmanager.yml from your running cluster by running:
kubectl describe configmap -n seldon-system seldon-core-analytics-alertmanager-configuration
3. Save the contents of alertmanager.yml locally.
4. Add the following receiver to the alertmanager.yml file:
- name: pagerduty-deploy
  pagerduty_configs:
  - service_key: <YOUR_INTEGRATION_KEY_HERE>
5. Add the following route to the provided alertmanager.yml file:
- receiver: pagerduty-deploy
  match_re:
    severity: critical
  continue: true
Warning
If the receiver is not at the top of your list of receivers, you will need to add continue: true to the receiver that comes before your new entry.
6. Create the configmap using:
kubectl create configmap -n seldon-system seldon-core-analytics-alertmanager-configuration --from-file=alertmanager.yml --dry-run=client -o yaml | kubectl apply -f -
7. If Alertmanager is already running you will need to port-forward Alertmanager and reload the configuration; it may take a minute or so to complete:
kubectl port-forward -n seldon-system svc/seldon-core-analytics-prometheus-alertmanager 9090:80
# In a new terminal run the following:
curl -X POST http://localhost:9090/-/reload
8. Use the alerting flow test API endpoint to ensure it’s working:
curl http://<DEPLOY_IP>/seldon-deploy/api/v1alpha1/alerting/test -X POST -H "Authorization: Bearer $TOKEN"
Opsgenie integration¶
To set up alerts in Opsgenie, follow these steps:
1. Follow Opsgenie’s guide to get your API key.
2. Retrieve the contents of your alertmanager.yml from your running cluster by running:
kubectl describe configmap -n seldon-system seldon-core-analytics-alertmanager-configuration
3. Save the contents of alertmanager.yml locally.
4. Add the following receiver to the provided alertmanager.yml file:
- name: opsgenie-deploy
  opsgenie_configs:
  - api_key: <YOUR_API_KEY_HERE>
    teams: <YOUR_TEAM_HERE>
5. Add the following route to the provided alertmanager.yml file:
- receiver: opsgenie-deploy
  match_re:
    severity: critical
  continue: true
Warning
If the receiver is not at the top of your list of receivers, you will need to add continue: true to the receiver that comes before your new entry.
6. Create the configmap using:
kubectl create configmap -n seldon-system seldon-core-analytics-alertmanager-configuration --from-file=alertmanager.yml --dry-run=client -o yaml | kubectl apply -f -
7. If Alertmanager is already running you will need to port-forward Alertmanager and reload the configuration; it may take a minute or so to complete:
kubectl port-forward -n seldon-system svc/seldon-core-analytics-prometheus-alertmanager 9090:80
# In a new terminal run the following:
curl -X POST http://localhost:9090/-/reload
8. Use the alerting flow test API endpoint to ensure it’s working:
curl http://<DEPLOY_IP>/seldon-deploy/api/v1alpha1/alerting/test -X POST -H "Authorization: Bearer $TOKEN"
Custom alerts¶
You can also define your own custom alerting rules in Prometheus.
1. Modify the deploy-alerts.rules.yml file to add your new rule; there are examples to follow in the file.
2. Create the configmap using:
kubectl create configmap -n seldon-system deploy-alerts-rules --from-file=deploy-alerts.rules.yml --dry-run=client -o yaml | kubectl apply -f -
3. If Prometheus is already running you will need to port-forward Prometheus and reload the configuration; it may take a minute or so to complete:
kubectl port-forward -n seldon-system svc/seldon-core-analytics-prometheus-seldon 8090:80
# In a new terminal run the following:
curl -X POST http://localhost:8090/-/reload
4. Visit the Prometheus UI at http://localhost:8090 and check that your new alerting rule is visible.
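As an illustration of what a custom rule added to deploy-alerts.rules.yml might look like, here is a sketch; the alert name, metric, and threshold are assumptions for demonstration (the severity and type labels follow the route matchers shown earlier), so adapt them to metrics that exist in your Prometheus:
groups:
- name: custom-deploy-alerts
  rules:
  - alert: SeldonModelHighErrorRate   # hypothetical alert name
    # Metric name assumed from the Seldon Core executor metrics; verify it exists in your Prometheus
    expr: sum(rate(seldon_api_executor_server_requests_seconds_count{code!="200"}[5m])) by (deployment_name) > 0
    for: 5m
    labels:
      severity: warning
      type: user
    annotations:
      summary: "Deployment {{ $labels.deployment_name }} is returning non-200 responses"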
Bringing your own Prometheus¶
It is possible to use your own Prometheus instance.
Warning
We strongly recommend using a Prometheus setup in the same Kubernetes cluster as Deploy.
Running a monitoring stack outside your Kubernetes cluster, whether Prometheus or something else, creates a number of difficulties:
The monitoring tool needs to be configured to talk to the Kubernetes API.
The monitoring tool will need appropriate access rights for both the API and to monitor resources in the cluster.
Reaching those resources becomes a challenge, as they are not exposed to the outside world by default, and doing so presents potential security risks.
Not only does Prometheus require access to in-cluster resources, but Deploy also needs access to Prometheus so that it can query metrics. Enabling this second flow of traffic may require further work from a network and security standpoint.
Scraping from outside the cluster can be less reliable than from within, depending on your exact setup. This may lead to lost data and false-positive alerts, for example.
Configuring Prometheus Operator¶
Configuring your own Prometheus Operator to scrape Seldon metrics is the recommended approach. Follow the installation steps discussed above to create all required PodMonitor and PrometheusRule resources, then adjust the provided Helm values accordingly.
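In particular, if your Prometheus custom resource uses restrictive selectors it may ignore PodMonitor and PrometheusRule resources created in the seldon-system namespace. Below is a minimal sketch of the relevant selector fields only (not a complete Prometheus resource); the resource name is an assumption, and whether you want selectors this permissive is a policy decision for your cluster:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: seldon-monitoring-prometheus   # assumed name; use the name of your existing Prometheus resource
spec:
  # Empty selectors match PodMonitors and PrometheusRules in all namespaces
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  ruleNamespaceSelector: {}
  ruleSelector: {}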
Configuring vanilla Prometheus¶
Scraping model metrics¶
Models created through Seldon are automatically given the following annotations:
"prometheus.io/scrape": "true"
"prometheus.io/path": "/prometheus"
and define ports named metrics. These are used together to tell Prometheus that the model should be scraped for metrics, and how to access those metrics.
These three pieces of information correspond to the meta labels
__meta_kubernetes_pod_annotation_prometheus_io_scrape
__meta_kubernetes_pod_annotation_prometheus_io_path
__meta_kubernetes_pod_container_port_name
in the scrape_configs section of the Prometheus configuration.
These values match the filters used in the seldon-core-analytics installation’s default configuration. Ensure your own custom scraping config is compatible with these settings, or Deploy will not be able to display model metrics.
Note that the path /prometheus differs from the default /metrics. If you have explicitly configured Prometheus to always expect a particular path, you may need to update your scraping rules to allow the /prometheus path as well.
Note also that Seldon containers typically expose more than one port, with only one being used for metrics. The Status > Targets page in the Prometheus UI may show these extra ports as being down, even though the service is running normally. You can use the __meta_kubernetes_pod_container_port_name relabel config shown below to remove this noise.
Example Prometheus config¶
The relevant section of your Prometheus config needs to include a scraping job like the following seldon_models config:
global:
  ...
scrape_configs:
- job_name: seldon_models
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape
    regex: true
    action: keep
  - source_labels:
    - __meta_kubernetes_pod_container_port_name
    regex: metrics(-.*)?
    action: keep
  - source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    regex: (.+)
    action: replace
    target_label: __metrics_path__
You could add this as a new scrape_config and use the job_name in recording rules and alerts to ignore duplicated entries from any existing scraping jobs. Alternatively, you could use the above to replace an existing config that targets role: pod.
If you already have the seldon-core-analytics chart installed and are looking to migrate to a different chart, you can retrieve the current scraping config with the commands below, assuming you have jq installed:
export PROMETHEUS_NAMESPACE='seldon-system'
export CHART_NAME='seldon-core-analytics'
helm get values --all -n $PROMETHEUS_NAMESPACE $CHART_NAME -o json \
| jq '.prometheus.serverFiles."prometheus.yml".scrape_configs'
If you are using a different chart, you can adjust the preceding commands to retrieve your existing configuration.
Remember to apply any changes you make with Helm.
Configuring Seldon Deploy¶
To configure Deploy to query Prometheus, see the prometheus section in the default values file:
./seldon-deploy-install/helm-charts/seldon-deploy/values.yaml
In particular, see the following:
prometheus.seldon.url
prometheus.seldon.resourceMetricsUrl
prometheus.knative.url
Seldon Deploy queries the recording rules defined in model-usage.rules.yml. Ensure that these have been installed.
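For example, pointing Deploy at your own Prometheus might look like the sketch below; the hostname is a placeholder, the value shown for resourceMetricsUrl is an assumption, and the exact keys you need depend on which features you use (see the default values file above for the full set):
prometheus:
  seldon:
    url: http://<YOUR_PROMETHEUS_HOST>:9090/api/v1/
    resourceMetricsUrl: http://<YOUR_PROMETHEUS_HOST>:9090/api/v1/
  knative:
    url: http://<YOUR_PROMETHEUS_HOST>:9090/api/v1/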
Verification / Troubleshooting¶
We can port-forward Prometheus in order to check it. With seldon-core-analytics, we can port-forward the Prometheus service with:
kubectl port-forward -n seldon-system svc/seldon-core-analytics-prometheus-seldon 9090:80
Then go to localhost:9090 in the browser.
To confirm the recording rules are present, go to Status > Rules and search for model-usage.
If you have a Seldon model running, go to Status > Targets and search for seldon_app or just seldon. Any targets for Seldon models should be green.
On the /graph page, if you select from the insert metric at cursor drop-down, there should be metrics whose names begin with seldon.
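For example, a query along the lines of the one below should return time series for a running model; the metric name is an assumption based on the Seldon Core executor metrics and may differ across versions, so substitute one of the seldon metrics you see in the drop-down:
sum(rate(seldon_api_executor_server_requests_seconds_count[1m])) by (deployment_name)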