Observability and Alerting

Installation

Important

This page assumes that the Seldon Enterprise Platform component is already installed. The Prometheus integration, and in particular deployment usage monitoring, must be configured in production clusters.

The analytics component relies on the Prometheus integration. Monitoring for Seldon Enterprise Platform is built on the Prometheus Operator and its related PodMonitor and PrometheusRule resources.

Prepare seldon-monitoring namespace

We start by creating a namespace that will hold the monitoring components, conventionally named seldon-monitoring.

kubectl create ns seldon-monitoring || echo "Namespace seldon-monitoring already exists"

Installing Prometheus Operator

Note

You can use your existing installation of Prometheus Operator and configure it with the PodMonitor and PrometheusRules provided later in this document.

Warning

We strongly recommend using a Prometheus setup in the same Kubernetes cluster as Enterprise Platform.

Running a monitoring stack outside your Kubernetes cluster, whether Prometheus or something else, creates a number of difficulties:

  • The monitoring tool needs to be configured to talk to the Kubernetes API.

  • The monitoring tool will need appropriate access rights for both the API and to monitor resources in the cluster.

  • Reaching those resources becomes a challenge, as they are not exposed to the outside world by default, and doing so presents potential security risks.

  • Not only does Prometheus require access to in-cluster resources, but Enterprise Platform also needs access to Prometheus so that it can query metrics. Enabling this second flow of traffic may require further work from a network and security standpoint.

  • Scraping from outside the cluster can be less reliable than from within, depending on your exact setup. This may lead to lost data and false-positive alerts, for example.

We’ll install the Prometheus Operator, as packaged by Bitnami.

For that purpose, prepare a prometheus-values.yaml file:

fullnameOverride: seldon-monitoring
kube-state-metrics:
  extraArgs:
    metric-labels-allowlist: pods=[*]

and use it to perform the Helm installation:

helm upgrade --install prometheus kube-prometheus \
    --version 8.3.6 \
    --namespace seldon-monitoring \
    --values prometheus-values.yaml \
    --repo https://charts.bitnami.com/bitnami

Important

Note the presence of metric-labels-allowlist: pods=[*] in the Helm values file. If you are using your own Prometheus Operator installation, you need to make sure that pod labels, especially app.kubernetes.io/managed-by=seldon-core, are included in the collected metrics, as they are used to compute the deployment usage rules.
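With that allowlist in place, kube-state-metrics exposes pod labels on its kube_pod_labels metric, with non-alphanumeric characters sanitized to underscores. As an illustrative sketch, a query along these lines selects the pods managed by Seldon Core:

```promql
kube_pod_labels{label_app_kubernetes_io_managed_by="seldon-core"}
```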

Wait for Prometheus Operator to be ready:

kubectl rollout status -n seldon-monitoring deployment/seldon-monitoring-operator

Configure Monitoring

To configure monitoring we need to create dedicated PodMonitor and PrometheusRule resources.

Copy the default resources (and edit if required):

cp seldon-deploy-install/reference-configuration/metrics/seldon-monitor.yaml seldon-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/drift-monitor.yaml drift-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/deploy-monitor.yaml deploy-monitor.yaml
cp seldon-deploy-install/reference-configuration/metrics/metrics-server-monitor.yaml metrics-server-monitor.yaml

cp seldon-deploy-install/reference-configuration/metrics/deployment-usage-rules.yaml deployment-usage-rules.yaml
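The copied PodMonitor resources follow the standard Prometheus Operator schema. The sketch below is illustrative only, to show the shape of such a resource; the selector labels, port name, and metrics path are assumptions, and the actual definitions are in the copied files:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: seldon-podmonitor        # illustrative name
  namespace: seldon-monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/managed-by: seldon-core   # assumed selector
  podMetricsEndpoints:
    - port: metrics              # assumed port name
      path: /prometheus          # assumed metrics path
  namespaceSelector:
    any: true
```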

Apply configurations:

kubectl apply -n seldon-monitoring -f seldon-monitor.yaml
kubectl apply -n seldon-monitoring -f drift-monitor.yaml
kubectl apply -n seldon-monitoring -f deploy-monitor.yaml
kubectl apply -n seldon-monitoring -f metrics-server-monitor.yaml

kubectl apply -f deployment-usage-rules.yaml -n seldon-monitoring

Configure Alerting

Configuring Alertmanager

Create alertmanager.yaml configuration for Alertmanager:

kind: Secret
apiVersion: v1
metadata:
  name: alertmanager-seldon-monitoring-alertmanager
stringData:
  alertmanager.yaml: |
    receivers:
      - name: default-receiver
      - name: deploy-webhook
        webhook_configs:
          - url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
    route:
      group_wait: 10s
      group_by: ['alertname']
      group_interval: 5m
      receiver: default-receiver
      repeat_interval: 3h
      routes:
        - receiver: deploy-webhook
          matchers:
            - severity =~ "warning|critical"
            - type =~ "user|infra"

If you are using App Level Authentication you need to add http_config in the webhook_configs section of alertmanager.yaml. This needs a client that has been configured to access the Seldon Enterprise Platform API. The token_url value may vary, depending on your OIDC provider.

          webhook_configs:
          - url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
            http_config:
              oauth2:
                client_id: "${OIDC_CLIENT_ID}"
                client_secret: "${OIDC_CLIENT_SECRET}"
                scopes: [openid]
                token_url: "${OIDC_HOST}/auth/realms/${OIDC_REALM}/protocol/openid-connect/token"
                # Note: only needed if using a self-signed certificate on your OIDC provider
                tls_config:
                  insecure_skip_verify: true

If you are using a self-signed certificate on your OIDC provider then you will need to set insecure_skip_verify in the tls_config of the oauth2 block as specified above. Alternatively, you can mount your CA certificate onto the Alertmanager instance and validate the server certificate with the ca_file option, as described in the Alertmanager configuration documentation.
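As a sketch of the ca_file alternative, the oauth2 block would reference the mounted certificate instead of skipping verification; the mount path below is an assumption and must match wherever you mount the certificate on the Alertmanager pod:

```yaml
          webhook_configs:
          - url: "http://seldon-deploy.seldon-system:80/seldon-deploy/api/v1alpha1/webhooks/firing-alert"
            http_config:
              oauth2:
                client_id: "${OIDC_CLIENT_ID}"
                client_secret: "${OIDC_CLIENT_SECRET}"
                scopes: [openid]
                token_url: "${OIDC_HOST}/auth/realms/${OIDC_REALM}/protocol/openid-connect/token"
                tls_config:
                  ca_file: /etc/alertmanager/certs/ca.crt  # assumed mount path
```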

Now, apply the Alertmanager configuration:

kubectl delete secret -n seldon-monitoring alertmanager-seldon-monitoring-alertmanager || echo "Does not yet exist"
kubectl apply -f alertmanager.yaml -n seldon-monitoring

Creating default alerts

Copy the default alert configurations:

cp seldon-deploy-install/reference-configuration/metrics/user-alerts.yaml user-alerts.yaml
cp seldon-deploy-install/reference-configuration/metrics/infra-alerts.yaml infra-alerts.yaml
cp seldon-deploy-install/reference-configuration/metrics/drift-alerts.yaml drift-alerts.yaml

Apply default alert configurations:

kubectl apply -n seldon-monitoring -f infra-alerts.yaml
kubectl apply -n seldon-monitoring -f user-alerts.yaml
kubectl apply -n seldon-monitoring -f drift-alerts.yaml

We recommend installing these rules as they have useful alerts for both Enterprise Platform and your models. You can also extend them with your own rules.

Configure Seldon Enterprise Platform

Update the deploy-values.yaml file containing your Helm values for the Seldon Enterprise Platform installation:

prometheus:
  seldon:
    namespaceMetricName: namespace
    activeModelsNamespaceMetricName: exported_namespace
    serviceMetricName: service
    url: http://seldon-monitoring-prometheus.seldon-monitoring:9090/api/v1/
env:
  ALERTMANAGER_URL: http://seldon-monitoring-alertmanager.seldon-monitoring:9093/api/v1/alerts

Execute this command to upgrade and apply the new configuration:

helm upgrade seldon-deploy seldon-charts/seldon-deploy \
    -f deploy-values.yaml \
    --namespace=seldon-system \
    --version 2.3.1 \
    --install

Alerting

Seldon Enterprise Platform, and any deployed models, expose metrics to Prometheus. By default, some alerting rules are configured along with an Alertmanager instance.

Alertmanager will notify the Enterprise Platform frontend when an SLO is breached on the Platform infrastructure or a deployed model.

Enterprise Platform also has an alerts page where you can view all currently firing alerts.

See the alerting demo to try it out, once installed.

Alerting integration

  • Alertmanager must be version 0.24.0 or greater

  • The default installation described above provides an alerting setup out of the box, but changes to Prometheus and Alertmanager configuration will be required if not using the default installation.

  • Key configuration elements in alertmanager.yaml from the installation section:

    • OAuth2 config

    • Webhook receiver, which notifies the Enterprise Platform frontend when an alert fires

    • Grouping by alertname

Alerting API

The alerting service currently provides endpoints to list all firing alerts and to initiate a test of the alerting flow. See the API reference page for more details.

Configuring external incident response tool

Alertmanager can be configured to also send alerts elsewhere, such as via email or Slack. It can also be integrated into an incident response tool.

PagerDuty integration


To set up alerts in PagerDuty, follow these steps:

  1. Follow PagerDuty’s guide to get your integration key. Make sure you follow the Integrating With a PagerDuty Service instructions rather than the ones for Global Event Routing.

  2. Add the following receiver to the provided alertmanager.yaml file.

    - name: pagerduty-deploy
      pagerduty_configs:
      - service_key: <YOUR_INTEGRATION_KEY_HERE>
    
  3. Add the following route to the provided alertmanager.yaml file.

    - receiver: pagerduty-deploy
      match_re:
        severity: critical
      continue: true
    

    Warning

    If the new route is not at the top of your list of routes, you will need to add continue: true to the route that comes before your new entry.

  4. Create the secret using:

    kubectl create secret generic -n seldon-monitoring alertmanager-seldon-monitoring-alertmanager --from-file=alertmanager.yaml --dry-run=client -o yaml | kubectl apply -f -
    
  5. If Alertmanager is already running, you will need to port-forward Alertmanager and reload the config; it may take a minute or so to complete.

    kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
    # In a new terminal run the following:
    curl -X POST http://localhost:9093/-/reload
    
  6. Use the alerting flow test API endpoint to ensure it’s working.

    curl http://<ENTERPRISE_PLATFORM_IP>/seldon-deploy/api/v1alpha1/alerting/test -X POST -H "Authorization: Bearer $TOKEN"
    

Opsgenie integration


To set up alerts in Opsgenie, follow these steps:

  1. Follow Opsgenie’s guide to get your API key.

  2. Add the following receiver to the provided alertmanager.yaml file.

    - name: opsgenie-deploy
      opsgenie_configs:
      - api_key: <YOUR_API_KEY_HERE>
        teams: <YOUR_TEAM_HERE>
    
  3. Add the following route to the provided alertmanager.yaml file.

    - receiver: opsgenie-deploy
      match_re:
        severity: critical
      continue: true
    

    Warning

    If the new route is not at the top of your list of routes, you will need to add continue: true to the route that comes before your new entry.

  4. Create secret using:

    kubectl create secret generic -n seldon-monitoring alertmanager-seldon-monitoring-alertmanager --from-file=alertmanager.yaml --dry-run=client -o yaml | kubectl apply -f -
    
  5. If Alertmanager is already running, you will need to port-forward Alertmanager and reload the config; it may take a minute or so to complete.

    kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
    # In a new terminal run the following:
    curl -X POST http://localhost:9093/-/reload
    
  6. Use the alerting flow test API endpoint to ensure it’s working.

    curl http://<ENTERPRISE_PLATFORM_IP>/seldon-deploy/api/v1alpha1/alerting/test -X POST -H "Authorization: Bearer $TOKEN"
    

Custom alerts

You can also define your own custom alerting rules in Prometheus.

  1. Create a file called custom-alert.yaml that contains your new rule(s). There are examples to follow in the file seldon-deploy-install/reference-configuration/metrics/user-alerts.yaml

  2. Afterwards, the PrometheusRule(s) can be added using:

    kubectl create -f custom-alert.yaml
    
  3. Port-forward Prometheus and trigger a configuration reload:

    kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090
    # In a new terminal run the following:
    curl -X POST http://localhost:9090/-/reload
    
  4. Visit the Prometheus UI at http://localhost:9090 and check that your new alerting rule is visible.
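As a minimal sketch, a custom-alert.yaml could look like the following. The alert name, expression, and metadata labels here are illustrative only; model your real rules on those in user-alerts.yaml, and make sure the metadata labels match your Prometheus ruleSelector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alert               # illustrative name
  namespace: seldon-monitoring
spec:
  groups:
    - name: custom.rules
      rules:
        - alert: ExampleCustomAlert   # illustrative alert
          expr: vector(1) > 0         # replace with a real expression
          for: 5m
          labels:
            severity: warning
            type: user
          annotations:
            summary: "Example custom alert"
```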

Verification and Troubleshooting

Using the Prometheus web UI, we can inspect the state of the running system.

To access the web UI, use port-forwarding:

kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090

Then go to localhost:9090 in your browser.

To confirm the recording rules are present, go to Status > Rules and search for deployment-usage.

If you have a Seldon model running, go to Status > Targets and search for seldon_app or just seldon. Any targets for Seldon models should be green.

On the /graph page, the insert metric at cursor drop-down should list metrics whose names begin with seldon.
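For example, a query along these lines should return per-deployment request rates; the exact metric and label names are assumptions that depend on your Seldon Core version, so adapt them to what the drop-down shows:

```promql
sum by (deployment_name) (
  rate(seldon_api_executor_server_requests_seconds_count[5m])
)
```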