Observability & Alerting

Installation

Important

This page assumes that the Seldon Core v1 and Seldon Deploy components are installed. In production clusters you must configure the Prometheus integration and, in particular, deployment usage monitoring.

The analytics component is configured through the Prometheus integration. Monitoring for Seldon Deploy is based on the Prometheus Operator and its PodMonitor and PrometheusRule custom resources.

Prepare seldon-monitoring namespace

We start by creating a namespace to hold the monitoring components, conventionally called seldon-monitoring.

kubectl create ns seldon-monitoring || echo "Namespace seldon-monitoring already exists"

Installing Prometheus Operator

Note

You can use your existing installation of Prometheus Operator and configure it with the PodMonitor and PrometheusRules provided later in this document.

Warning

We strongly recommend using a Prometheus setup in the same Kubernetes cluster as Deploy.

Running a monitoring stack outside your Kubernetes cluster, whether Prometheus or something else, creates a number of difficulties:

  • The monitoring tool needs to be configured to talk to the Kubernetes API.

  • The monitoring tool will need appropriate access rights both for the API and for monitoring resources in the cluster.

  • Reaching those resources becomes a challenge, as they are not exposed to the outside world by default, and doing so presents potential security risks.

  • Not only does Prometheus require access to in-cluster resources, but Deploy also needs access to Prometheus so that it can query metrics. Enabling this second flow of traffic may require further work from a network and security standpoint.

  • Scraping from outside the cluster can be less reliable than from within, depending on your exact setup. This may lead to lost data and false-positive alerts, for example.

We install the Prometheus Operator packaged by Bitnami.

For that purpose we prepare a values-kube-prometheus.yaml file

fullnameOverride: seldon-monitoring
kube-state-metrics:
  extraArgs:
    metric-labels-allowlist: pods=[*]

and use it to perform the Helm installation:

helm upgrade --install prometheus kube-prometheus \
    --version 8.3.6 \
    --namespace seldon-monitoring \
    --values values-kube-prometheus.yaml \
    --repo https://charts.bitnami.com/bitnami

Important

Note the presence of metric-labels-allowlist: pods=[*] in the Helm values file. If you are using your own Prometheus Operator installation, you need to make sure that pod labels, especially app.kubernetes.io/managed-by=seldon-core, are included in the collected metrics, as they are used to compute the deployment usage rules.

Wait for Prometheus Operator to be ready:

kubectl rollout status -n seldon-monitoring deployment/seldon-monitoring-operator

Configure Monitoring

To configure monitoring we need to create dedicated PodMonitor and PrometheusRule resources.

Copy the default resources (and edit if required):

cp seldon-deploy-install/reference-configuration/metrics/seldon-monitor.yaml seldon-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/deploy-monitor.yaml deploy-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/metrics-server-monitor.yaml metrics-server-monitor.yaml

cp seldon-deploy-install/reference-configuration/metrics/deployment-usage-rules.yaml deployment-usage-rules.yaml

Apply configurations:

kubectl apply -n seldon-monitoring -f seldon-monitor.yaml
kubectl apply -n seldon-monitoring -f deploy-monitor.yaml
kubectl apply -n seldon-monitoring -f metrics-server-monitor.yaml

kubectl apply -f deployment-usage-rules.yaml -n seldon-monitoring
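As an illustration, a PodMonitor along the lines of the reference seldon-monitor.yaml could look like the sketch below. The selector matches the app.kubernetes.io/managed-by=seldon-core label mentioned earlier; the port name and metrics path are assumptions, so treat the reference files as the source of truth.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: seldon-podmonitor
  namespace: seldon-monitoring
spec:
  selector:
    matchLabels:
      # Label applied by Seldon Core v1 to model pods
      app.kubernetes.io/managed-by: seldon-core
  namespaceSelector:
    any: true
  podMetricsEndpoints:
    - port: metrics        # assumed container port name
      path: /prometheus    # assumed metrics path for Seldon Core v1
```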

Configure Alerting

Create an alertmanager.yaml configuration for Alertmanager:

kind: Secret
apiVersion: v1
metadata:
  name: alertmanager-seldon-monitoring-alertmanager
stringData:
  alertmanager.yaml: |
    receivers:
      - name: default-receiver
      - name: deploy-webhook
        webhook_configs:
          - url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
    route:
      group_wait: 10s
      group_by: ['alertname']
      group_interval: 5m
      receiver: default-receiver
      repeat_interval: 3h
      routes:
        - receiver: deploy-webhook
          matchers:
            - severity =~ "warning|critical"
            - type =~ "user|infra"

Note: if you are using App Level Authentication, you need to add an http_config block in the webhook_configs section of alertmanager.yaml:

          webhook_configs:
          - url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
            http_config:
              oauth2:
                client_id: "${OIDC_CLIENT_ID}"
                client_secret: "${OIDC_CLIENT_SECRET}"
                scopes: [openid]
                token_url: "${OIDC_HOST}/auth/realms/${OIDC_REALM}/protocol/openid-connect/token"
                # Note: only needed if using a self-signed certificate on your OIDC provider
                tls_config:
                  insecure_skip_verify: true

using the configuration of a client that can access the Seldon Deploy API. Note that the token_url value may depend on your OIDC provider.

If you are using a self-signed certificate on your OIDC provider, you will need to set insecure_skip_verify in the tls_config of the oauth2 block as shown above. Alternatively, you can mount your CA certificate onto the Alertmanager instance and validate the server certificate using ca_file, as documented in the Alertmanager configuration reference.
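For example, the oauth2 block could reference a mounted CA bundle instead of disabling verification; the file path below is hypothetical and depends on how you mount the certificate:

```yaml
              oauth2:
                client_id: "${OIDC_CLIENT_ID}"
                client_secret: "${OIDC_CLIENT_SECRET}"
                scopes: [openid]
                token_url: "${OIDC_HOST}/auth/realms/${OIDC_REALM}/protocol/openid-connect/token"
                tls_config:
                  # Validate the OIDC provider's certificate against a mounted CA
                  # bundle; this path is hypothetical and depends on your volume mount.
                  ca_file: /etc/alertmanager/certs/ca.crt
```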

Now, apply the Alertmanager configuration:

kubectl delete secret -n seldon-monitoring alertmanager-seldon-monitoring-alertmanager || echo "Does not yet exist"
kubectl apply -f alertmanager.yaml -n seldon-monitoring

Copy the default alert configurations:

cp seldon-deploy-install/reference-configuration/metrics/user-alerts.yaml user-alerts.yaml
cp seldon-deploy-install/reference-configuration/metrics/infra-alerts.yaml infra-alerts.yaml

Apply default alert configurations:

kubectl apply -n seldon-monitoring -f infra-alerts.yaml
kubectl apply -n seldon-monitoring -f user-alerts.yaml

We recommend installing these rules as they have useful alerts for both Deploy and your models. You can also extend them with your own rules.

Configure Seldon Deploy

Update the deploy-values.yaml file with your Helm values for the Seldon Deploy installation:

prometheus:
  seldon:
    namespaceMetricName: namespace
    activeModelsNamespaceMetricName: exported_namespace
    serviceMetricName: service
    url: http://seldon-monitoring-prometheus.seldon-monitoring:9090/api/v1/
env:
  ALERTMANAGER_URL: http://seldon-monitoring-alertmanager.seldon-monitoring:9093/api/v1/alerts

and execute the helm upgrade ... command to apply the new configuration.

Alerting

Seldon Deploy and any deployed models expose metrics to Prometheus. By default, some alerting rules are configured along with an Alertmanager instance.

Alertmanager will notify the Deploy frontend when an SLO is breached on the Deploy infrastructure or a deployed model.

Deploy also has an alerts page where you can view all currently firing alerts.

See the alerting demo to try it out once installed.

Alerting integration

  • Alertmanager must be version 0.24.0 or greater

  • The default installation described above provides an alerting setup out of the box, but changes to Prometheus and Alertmanager configuration will be required if not using the default installation.

  • Key configuration elements in alertmanager.yaml from the installation section:

    • OAuth2 config (when using App Level Authentication)

    • Deploy webhook receiver, which notifies the Deploy frontend when an alert fires

    • Grouping by alertname

Alerting API

The alerting service currently provides an endpoint to list all firing alerts, and the ability to initiate a test of the alerting flow. See the API reference page for more details.

Configuring external incident response tool

Alertmanager can be configured to also send alerts elsewhere, such as to email or Slack. It can also be integrated with an incident response tool.
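As a sketch, a Slack receiver and route could be added to the Alertmanager configuration alongside the existing entries; the webhook URL and channel below are placeholders:

```yaml
    receivers:
      - name: slack-deploy
        slack_configs:
          - api_url: "https://hooks.slack.com/services/<YOUR_WEBHOOK_HERE>"  # placeholder
            channel: "#seldon-alerts"                                        # placeholder
            send_resolved: true
    route:
      routes:
        - receiver: slack-deploy
          matchers:
            - severity =~ "warning|critical"
          continue: true
```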

PagerDuty integration


To set up alerts in PagerDuty you need to follow these steps:

  1. Follow PagerDuty’s guide to get your integration key. Make sure you follow the Integrating With a PagerDuty Service instructions rather than the ones for Global Event Routing.

  2. Add the following receiver to the alertmanager.yaml file.

    - name: pagerduty-deploy
      pagerduty_configs:
      - service_key: <YOUR_INTEGRATION_KEY_HERE>
    
  3. Add the following route to the provided alertmanager.yaml file.

    - receiver: pagerduty-deploy
      match_re:
        severity: critical
      continue: true
    

    Warning

    If the route is not at the top of your list of routes, you will need to add continue: true to the route that comes before your new entry.

  4. Create the secret using:

    kubectl create secret generic -n seldon-monitoring alertmanager-seldon-monitoring-alertmanager --from-file=alertmanager.yaml --dry-run=client -o yaml | kubectl apply -f -
    
  5. If Alertmanager is already running, you will need to port-forward Alertmanager and reload the config; it may take a minute or so to complete.

    kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
    # In a new terminal run the following:
    curl -X POST http://localhost:9093/-/reload
    
  6. Use the alerting flow test API endpoint to ensure it’s working.

    curl http://<DEPLOY_IP>/seldon-deploy/api/v1alpha1/alerting/test -X POST -H "Authorization: Bearer $TOKEN"
    

Opsgenie integration


To set up alerts in Opsgenie you need to follow these steps:

  1. Follow Opsgenie’s guide to get your API key.

  2. Add the following receiver to the provided alertmanager.yaml file.

    - name: opsgenie-deploy
      opsgenie_configs:
      - api_key: <YOUR_API_KEY_HERE>
        teams: <YOUR_TEAM_HERE>
    
  3. Add the following route to the provided alertmanager.yaml file.

    - receiver: opsgenie-deploy
      match_re:
        severity: critical
      continue: true
    

    Warning

    If the route is not at the top of your list of routes, you will need to add continue: true to the route that comes before your new entry.

  4. Create the secret using:

    kubectl create secret generic -n seldon-monitoring alertmanager-seldon-monitoring-alertmanager --from-file=alertmanager.yaml --dry-run=client -o yaml | kubectl apply -f -
    
  5. If Alertmanager is already running, you will need to port-forward Alertmanager and reload the config; it may take a minute or so to complete.

    kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
    # In a new terminal run the following:
    curl -X POST http://localhost:9093/-/reload
    
  6. Use the alerting flow test API endpoint to ensure it’s working.

    curl http://<DEPLOY_IP>/seldon-deploy/api/v1alpha1/alerting/test -X POST -H "Authorization: Bearer $TOKEN"
    

Custom alerts

You can also define your own custom alerting rules in Prometheus.

  1. Create a file called custom-alert.yaml that contains your new rule(s); there are examples to follow in the file user-alerts.yaml.

  2. Afterwards, the PrometheusRule(s) can be added using:

    kubectl create -f custom-alert.yaml -n seldon-monitoring

  3. You can port-forward Prometheus like this:

    kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090
    # In a new terminal run the following:
    curl -X POST http://localhost:9090/-/reload
    
  4. Visit the Prometheus UI at http://localhost:9090 and check that your new alerting rule is visible
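As an illustration of step 1, a custom-alert.yaml might look like the sketch below. The alert name, metric, and threshold are illustrative only; the severity and type labels follow the matchers used in the Alertmanager route from the installation section.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alert
  namespace: seldon-monitoring
spec:
  groups:
    - name: custom.rules
      rules:
        - alert: HighRequestLatency  # illustrative alert name
          # The metric name and threshold are illustrative; adapt them to your models.
          expr: histogram_quantile(0.99, rate(seldon_api_executor_client_requests_seconds_bucket[5m])) > 1
          for: 5m
          labels:
            severity: warning
            type: user
          annotations:
            summary: "p99 request latency above 1s"
```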

Verification / Troubleshooting

We can port-forward Prometheus using:

kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090

Then go to localhost:9090 in your browser.

To confirm the recording rules are present, go to Status > Rules and search for deployment-usage.

If you have a Seldon model running, go to Status > Targets and search for seldon_app or just seldon. Any targets for Seldon models should be green.

On the /graph page, if you open the insert metric at cursor drop-down, there should be metrics whose names begin with seldon.