Observability & Alerting

Important

Before starting the installation procedure, please download the installation resources as explained here and make sure that all prerequisites are satisfied.

This page also assumes that the main Seldon Core and Seldon Deploy components are already installed.

Important

The Prometheus integration, and in particular the model usage monitoring, must be configured in production clusters.

Installation

The analytics component is configured with the Prometheus integration. Monitoring for Seldon Deploy is based on the Prometheus Operator and the related PodMonitor and PrometheusRule resources.

Installing Prometheus Operator

Note

You can use your existing installation of Prometheus Operator and configure it with the PodMonitor and PrometheusRule resources provided later in this document.

We install the Prometheus Operator packaged by Bitnami.

For that purpose, we prepare the values-kube-prometheus.yaml file:

fullnameOverride: seldon-monitoring
kube-state-metrics:
  extraArgs:
    metric-labels-allowlist: pods=[*]

and use it to run the Helm installation:

helm upgrade --install prometheus kube-prometheus \
    --version 8.0.9 \
    --namespace seldon-system \
    --values values-kube-prometheus.yaml \
    --repo https://charts.bitnami.com/bitnami

Important

Note the presence of metric-labels-allowlist: pods=[*] in the Helm values file. If you are using your own Prometheus Operator installation, you need to make sure that pod labels, especially app.kubernetes.io/managed-by=seldon-core, are included in the collected metrics, as they are used to compute the model usage rules.

Wait for Prometheus Operator to be ready:

kubectl rollout status -n seldon-system deployment/seldon-monitoring-operator
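
As a quick sanity check, you can also confirm that the operator has created the Prometheus and Alertmanager instances themselves (a sketch; the resource names assume the fullnameOverride: seldon-monitoring set above):

# Lists the Prometheus and Alertmanager custom resources and the pods backing them.
kubectl get prometheus,alertmanager -n seldon-system
kubectl get pods -n seldon-system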

Configure Monitoring

To configure monitoring, we need to create dedicated PodMonitor and PrometheusRule resources.

Copy the default resources (and edit if required):

cp seldon-deploy-install/reference-configuration/metrics/seldon-monitor.yaml seldon-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/deploy-monitor.yaml deploy-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/metrics-server-monitor.yaml metrics-server-monitor.yaml

cp seldon-deploy-install/reference-configuration/metrics/model-usage-rules.yaml model-usage-rules.yaml

Apply configurations:

kubectl apply -n seldon-system -f seldon-monitor.yaml
kubectl apply -n seldon-system -f deploy-monitor.yaml
kubectl apply -n seldon-system -f metrics-server-monitor.yaml

kubectl apply -f model-usage-rules.yaml -n seldon-system
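
For orientation, a PodMonitor for Seldon model pods might look like the minimal sketch below. It is illustrative only (the name, selector, and endpoint are assumptions based on the defaults described on this page); the shipped seldon-monitor.yaml remains the authoritative version:

# Hypothetical PodMonitor: scrapes the /prometheus path on the metrics port
# of pods labelled app.kubernetes.io/managed-by=seldon-core, in all namespaces.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: seldon-podmonitor
  namespace: seldon-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/managed-by: seldon-core
  namespaceSelector:
    any: true
  podMetricsEndpoints:
    - port: metrics
      path: /prometheus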

Configure Alerting

Create an alertmanager.yaml configuration for Alertmanager:

kind: Secret
apiVersion: v1
metadata:
  name: alertmanager-seldon-monitoring-alertmanager
stringData:
  alertmanager.yaml: |
    receivers:
      - name: default-receiver
      - name: deploy-webhook
        webhook_configs:
          - url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
    route:
      group_wait: 10s
      group_by: ['alertname']
      group_interval: 5m
      receiver: default-receiver
      repeat_interval: 3h
      routes:
        - receiver: deploy-webhook
          matchers:
            - severity =~ "warning|critical"
            - type =~ "user|infra"

Note: if you are using App Level Authentication, you need to add an http_config block in the webhook_configs section of alertmanager.yaml:

          webhook_configs:
          - url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
            http_config:
              oauth2:
                client_id: "${OIDC_CLIENT_ID}"
                client_secret: "${OIDC_CLIENT_SECRET}"
                scopes: [openid]
                token_url: "${OIDC_HOST}/auth/realms/${OIDC_REALM}/protocol/openid-connect/token"
                # Note: only needed if using a self-signed certificate on your OIDC provider
                tls_config:
                  insecure_skip_verify: true

using the credentials of a client that can access the Seldon Deploy API. Note that the token_url value may depend on your OIDC provider.

If you are using a self-signed certificate on your OIDC provider, then you will need to set insecure_skip_verify in the tls_config of the oauth2 block, as shown above. Alternatively, you can mount your CA certificate onto the Alertmanager instance and validate the server certificate using ca_file, as documented here.

Now apply the Alertmanager configuration:

kubectl delete secret -n seldon-system alertmanager-seldon-monitoring-alertmanager || echo "Does not yet exist"
kubectl apply -f alertmanager.yaml -n seldon-system
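
To verify that Alertmanager has picked up the new configuration, you can port-forward its service (the service name and port match the ALERTMANAGER_URL used later on this page) and inspect the loaded config:

kubectl port-forward -n seldon-system svc/seldon-monitoring-alertmanager 9093:9093
# In a new terminal run the following; the response includes the active configuration:
curl -s http://localhost:9093/api/v2/status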

Copy the default alert configurations:

cp seldon-deploy-install/reference-configuration/metrics/user-alerts.yaml user-alerts.yaml
cp seldon-deploy-install/reference-configuration/metrics/infra-alerts.yaml infra-alerts.yaml

Apply default alert configurations:

kubectl apply -n seldon-system -f infra-alerts.yaml
kubectl apply -n seldon-system -f user-alerts.yaml
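
You can add further alerts alongside these defaults as extra PrometheusRule resources. The following is a hypothetical sketch: the alert name, expression, and threshold are illustrative, and seldon_api_executor_server_requests_seconds_count is the executor request counter in typical Seldon Core installations, so adjust it to your metric names. Note the severity and type labels, which must match the route matchers configured above for the alert to reach the Deploy webhook:

# Hypothetical alert: fires when over 10% of requests to a deployment return 5xx.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-user-alerts
  namespace: seldon-system
spec:
  groups:
    - name: my-user-alerts.rules
      rules:
        - alert: DeploymentHighErrorRate
          expr: |
            (sum by (deployment_name) (rate(seldon_api_executor_server_requests_seconds_count{code=~"5.."}[5m]))
              / sum by (deployment_name) (rate(seldon_api_executor_server_requests_seconds_count[5m]))) > 0.1
          for: 5m
          labels:
            severity: warning
            type: user
          annotations:
            summary: "More than 10% of requests to {{ $labels.deployment_name }} are failing"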

Configure Seldon Deploy

Update the deploy-values.yaml file with your Helm values for the Seldon Deploy installation:

prometheus:
  seldon:
    namespaceMetricName: namespace
    activeModelsNamespaceMetricName: exported_namespace
    serviceMetricName: service
    url: http://seldon-monitoring-prometheus.seldon-system:9090/api/v1/
  knative:
    url: http://seldon-monitoring-prometheus.seldon-system:9090/api/v1/
env:
  ALERTMANAGER_URL: http://seldon-monitoring-alertmanager.seldon-system:9093/api/v1/alerts

and execute the helm upgrade ... command to apply the new configuration.
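
For example, assuming the chart bundled with the installation resources (the path referenced later on this page) and a release named seldon-deploy, the upgrade might look like:

# Sketch only: the release name and chart path are assumptions; adjust to your installation.
helm upgrade seldon-deploy ./seldon-deploy-install/helm-charts/seldon-deploy \
    --namespace seldon-system \
    --values deploy-values.yaml \
    --install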

Alerting

Seldon Deploy and any deployed models record metrics to Prometheus. By default, some alerting rules are also configured, along with a default Alertmanager.

Alertmanager will notify the Deploy frontend when an SLO is breached on the Deploy infrastructure or a deployed model.

Deploy also has an alerts page where you can view all currently firing alerts.

See the alerting demo to try it out once installed.

Alerting integration

  • The default installation described above provides the configmaps and setup required for this to work out of the box, but changes to the Prometheus and Alertmanager configuration will be required if you are not using the default installation.

  • Alertmanager must be version 0.24.0 or greater:

    alertmanager:
      image:
        tag: v0.24.0
  • Default alerting rules are provided with the downloaded resources in the deploy-alerts.rules.yml file referenced in the installation section:

    • We recommend installing these rules, as they provide useful alerts for both Deploy and your models, but you can also extend them with your own rules.

    • The related configmap, named deploy-alerts-rules in the seldon-system namespace, is referenced in the extraConfigmapMounts section of the analytics-values.yaml file in the installation section.

  • Key configuration elements in alertmanager.yml from the installation section:

    • OAuth2 config

    • The Deploy webhook receiver, which notifies the Deploy frontend when an alert fires

    • Grouping by alertname

Alerting API

The alerting service currently provides an endpoint to list all firing alerts, and the ability to initiate a test of the alerting flow. See the API reference page for more details.

Configuring an external incident response tool

Alertmanager can be configured to also send alerts elsewhere, such as email or Slack. It can also be integrated with an incident response tool.
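
For example, a Slack receiver can be added to the Alertmanager configuration using the standard slack_configs block (the webhook URL and channel below are placeholders):

receivers:
  - name: slack-deploy
    slack_configs:
      # Placeholder values: substitute your own incoming-webhook URL and channel.
      - api_url: "https://hooks.slack.com/services/<YOUR_WEBHOOK_HERE>"
        channel: "#seldon-alerts"
        send_resolved: true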

PagerDuty integration


To set up alerts in PagerDuty, you need to follow these steps:

  1. Follow PagerDuty's guide to get your integration key. Make sure you follow the Integrating With a PagerDuty Service instructions rather than the ones for Global Event Routing.

  2. Retrieve the contents of your alertmanager.yml from your running cluster by running:

      kubectl describe configmap -n seldon-system seldon-core-analytics-alertmanager-configuration
    

    Save the contents of alertmanager.yml locally.

  3. Add the following receiver to the alertmanager.yml file.

    - name: pagerduty-deploy
      pagerduty_configs:
      - service_key: <YOUR_INTEGRATION_KEY_HERE>
    
  4. Add the following route to the provided alertmanager.yml file.

    - receiver: pagerduty-deploy
      match_re:
        severity: critical
      continue: true
    

    Warning

    If the route is not at the top of your list of routes, you will need to add continue: true to the route that comes before your new entry.

  5. Create the configmap using:

    kubectl create configmap -n seldon-system seldon-core-analytics-alertmanager-configuration --from-file=alertmanager.yml --dry-run=client -o yaml | kubectl apply -f -
    
  6. If Alertmanager is already running, you will need to port-forward Alertmanager and reload the configmap. This may take a minute or so to complete:

    kubectl port-forward -n seldon-system svc/seldon-core-analytics-prometheus-alertmanager 9090:80
    # In a new terminal run the following:
    curl -X POST http://localhost:9090/-/reload
    
  7. Use the alerting flow test API endpoint to ensure it’s working.

    curl http://<DEPLOY_IP>/seldon-deploy/api/v1alpha1/alerting/test -X POST -H "Authorization: Bearer $TOKEN"
    
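After steps 3 and 4, the routes section of your alertmanager.yml should look roughly like the sketch below (your existing routes may differ; note the continue: true on the route preceding the new entry, as described in the warning above):

route:
  ...
  routes:
    - receiver: deploy-webhook
      # Existing route: continue: true lets evaluation fall through to the new route.
      continue: true
    - receiver: pagerduty-deploy
      match_re:
        severity: critical
      continue: true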

Opsgenie integration


To set up alerts in Opsgenie, you need to follow these steps:

  1. Follow Opsgenie's guide to get your API key.

  2. Retrieve the contents of your alertmanager.yml from your running cluster by running:

      kubectl describe configmap -n seldon-system seldon-core-analytics-alertmanager-configuration
    

    Save the contents of alertmanager.yml locally.

  3. Add the following receiver to the provided alertmanager.yml file.

    - name: opsgenie-deploy
      opsgenie_configs:
      - api_key: <YOUR_API_KEY_HERE>
        teams: <YOUR_TEAM_HERE>
    
  4. Add the following route to the provided alertmanager.yml file.

    - receiver: opsgenie-deploy
      match_re:
        severity: critical
      continue: true
    

    Warning

    If the route is not at the top of your list of routes, you will need to add continue: true to the route that comes before your new entry.

  5. Create the configmap using:

    kubectl create configmap -n seldon-system seldon-core-analytics-alertmanager-configuration --from-file=alertmanager.yml --dry-run=client -o yaml | kubectl apply -f -
    
  6. If Alertmanager is already running, you will need to port-forward Alertmanager and reload the configmap. This may take a minute or so to complete:

    kubectl port-forward -n seldon-system svc/seldon-core-analytics-prometheus-alertmanager 9090:80
    # In a new terminal run the following:
    curl -X POST http://localhost:9090/-/reload
    
  7. Use the alerting flow test API endpoint to ensure it’s working.

    curl http://<DEPLOY_IP>/seldon-deploy/api/v1alpha1/alerting/test -X POST -H "Authorization: Bearer $TOKEN"
    

Custom alerts

You can also define your own custom alerting rules in Prometheus.

  1. Modify the deploy-alerts.rules.yml file to add your new rule; there are examples to follow in the file.

  2. Create the configmap using:

    kubectl create configmap -n seldon-system deploy-alerts-rules --from-file=deploy-alerts.rules.yml --dry-run=client -o yaml | kubectl apply -f -

  3. If Prometheus is already running, you will need to port-forward Prometheus and reload the configmap. This may take a minute or so to complete:

    kubectl port-forward -n seldon-system svc/seldon-core-analytics-prometheus-seldon 8090:80
    # In a new terminal run the following:
    curl -X POST http://localhost:8090/-/reload
    
  4. Visit the Prometheus UI at http://localhost:8090 and check that your new alerting rule is visible.
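
As a hypothetical example, a rule group added to deploy-alerts.rules.yml might look like this (the alert name and expression are placeholders to adapt):

groups:
  - name: custom.rules
    rules:
      # Placeholder rule: fires if no executor request metrics have been scraped for 10 minutes.
      - alert: ModelMetricsAbsent
        expr: absent(seldon_api_executor_server_requests_seconds_count)
        for: 10m
        labels:
          severity: warning
          type: user
        annotations:
          summary: "No model request metrics have been scraped for 10 minutes"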

Bringing your own Prometheus

It is possible to use your own Prometheus instance.

Warning

We strongly recommend using a Prometheus setup in the same Kubernetes cluster as Deploy.

Running a monitoring stack outside your Kubernetes cluster, whether Prometheus or something else, creates a number of difficulties:

  • The monitoring tool needs to be configured to talk to the Kubernetes API.

  • The monitoring tool needs appropriate access rights both for the API and for monitoring resources in the cluster.

  • Reaching those resources becomes a challenge, as they are not exposed to the outside world by default, and doing so presents potential security risks.

  • Not only does Prometheus require access to in-cluster resources, but Deploy also needs access to Prometheus so that it can query metrics. Enabling this second flow of traffic may require further work from a network and security standpoint.

  • Scraping from outside the cluster can be less reliable than from within, depending on your exact setup. This may lead to lost data and false-positive alerts, for example.

Configuring Prometheus Operator

We recommend configuring your own Prometheus Operator to scrape Seldon metrics. Follow the installation steps discussed above to create all required PodMonitor and PrometheusRule resources, and then adjust the provided Helm values accordingly.

Configuring vanilla Prometheus

Scraping model metrics

Models created through Seldon are automatically given the following annotations:

  • "prometheus.io/scrape": "true"

  • "prometheus.io/path": "/prometheus"

and define ports named metrics. Together, these tell Prometheus that the model should be scraped for metrics and how to access those metrics.

These three pieces of information correspond to the following meta labels in the Prometheus configuration's scrape_configs section:

  • __meta_kubernetes_pod_annotation_prometheus_io_scrape

  • __meta_kubernetes_pod_annotation_prometheus_io_path

  • __meta_kubernetes_pod_container_port_name

These match the filters used in the default configuration of the seldon-core-analytics installation. Ensure your own custom scraping config is compatible with these settings, or Deploy will not be able to display model metrics.

Note that the path /prometheus differs from the default /metrics. If you have explicitly configured Prometheus to always expect a particular path, you may need to update your scraping rules to allow the /prometheus path as well.

Note also that Seldon containers typically expose more than one port, with only one being used for metrics. The Status > Targets page in the Prometheus UI may show these extra ports as being down, even though the service is running normally. You can use the __meta_kubernetes_pod_container_port_name relabel config shown below to remove this noise.

Example Prometheus config

The relevant section of your Prometheus config needs to include a scraping job that looks like this seldon_models config:

global:
  ...
scrape_configs:
- job_name: seldon_models
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape
    regex: true
    action: keep
  - source_labels:
    - __meta_kubernetes_pod_container_port_name
    regex: metrics(-.*)?
    action: keep
  - source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    regex: (.+)
    action: replace
    target_label: __metrics_path__

You could add this as a new scrape_config and use the job_name in recording rules and alerts to ignore duplicated entries from any existing scraping jobs. Alternatively, you could use the above to replace an existing config which targets role: pod.

If you already have the seldon-core-analytics chart installed and are looking to migrate to a different chart, you can retrieve the current scraping config with the commands below, assuming you have jq installed:

export PROMETHEUS_NAMESPACE='seldon-system'
export CHART_NAME='seldon-core-analytics'

helm get values --all -n $PROMETHEUS_NAMESPACE $CHART_NAME -o json \
  | jq '.prometheus.serverFiles."prometheus.yml".scrape_configs'

If you are using a different chart, you can adjust the preceding commands to retrieve your existing configuration.

Remember to apply any changes you make with Helm.

Configuring Seldon Deploy

For configuring Deploy to query Prometheus, see the prometheus section in the default values file:

./seldon-deploy-install/helm-charts/seldon-deploy/values.yaml

In particular, see the following:

  • prometheus.seldon.url

  • prometheus.seldon.resourceMetricsUrl

  • prometheus.knative.url

Seldon Deploy queries the recording rules defined in model-usage.rules.yml. Ensure that these have been installed.

Verification / Troubleshooting

We can port-forward Prometheus in order to check it. With seldon-core-analytics providing the Prometheus service, we can do this with:

kubectl port-forward -n seldon-system svc/seldon-core-analytics-prometheus-seldon 9090:80

Then go to localhost:9090 in the browser.

To confirm the recording rules are present, go to Status > Rules and search for model-usage.
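
Alternatively, you can check for the rules from the command line via the Prometheus rules API (assuming the port-forward above and jq installed):

# Lists all loaded rule group names; the model-usage groups should appear among them.
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'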

If you have a Seldon model running, go to Status > Targets and search for seldon_app or just seldon. Any targets for Seldon models should be green.

On the /graph page, if you select from the insert metric at cursor drop-down, there should be metrics whose names begin with seldon.
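
You can also confirm that model metrics are being scraped by querying one of them over the HTTP API (a sketch; seldon_api_executor_server_requests_seconds_count is assumed from typical Seldon Core installations, so adjust if your metric names differ):

# POSTs a PromQL query to the Prometheus HTTP API via the port-forward above.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (deployment_name) (rate(seldon_api_executor_server_requests_seconds_count[5m]))'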