Monitoring/telemetry Working Group

Hey folks!

The Pulp team has been doing some thinking around what improvements Pulp needs to make it more administrator-friendly. One of our gaps there is observability - how to know what’s going on inside Pulp, from outside Pulp.

At our Virtual PulpCon this year, Daniel Alley gave an overview of observability and OpenTelemetry.[0][1] We got some great feedback on, and interest in, adding this to Pulp. To that end, we’re starting up an “OpenTelemetry and Pulp” working group.

This is a call for interest/participation - if you’d like to be involved in helping us work on the right way to add OT, and prioritize the first monitoring probes we add, please respond to this thread.

I have scheduled a 30-min organizational meeting for 6-DEC, 1000 EST/1500 UTC/1600 CET. We can hammer out how often/long to meet and what an initial POC might look like.

We’re especially interested in contributions from folk who are running Pulp for real-world workloads. What would you like to see on an admin’s dashboard? Let us know!

Thanks!

[0] Observability and the OpenTelemetry Project, video, PulpCon 2022
[1] Observability and the OpenTelemetry Project, slides, PulpCon 2022

2 Likes

Morning Grant

Please include me.

1 Like

Hi, I would like to be included here. We are currently running a large-scale deployment on Pulp 2 and are now in the process of rolling out Pulp 3 in a hybrid Cloud/On-Prem setup.

We are always quite in the dark with Pulp and have some of our own tools, and we are now looking at what you have with the Pulp Operator, but it would be good to see:

  • Overall status of each deployment: understanding whether it’s functioning correctly, and capacity metrics around that.
  • The ability to trigger alerts from these metrics
  • Performance metrics as triggers for knowing when to scale up/down,
    e.g. no. of tasks, tasks waiting, etc.
  • Content counts and no. of requests to that content

2 Likes

Hi @woolsgrs,

Here is some info, just to keep you updated on the Operator status:

We are always quite in the dark with Pulp and have some of our own tools, and we are now looking at what you have with the Pulp Operator, but it would be good to see:

  • Overall status of each deployment: understanding whether it’s functioning correctly, and capacity metrics around that.

The current version of the Operator provides the .status.conditions[] field, which can be helpful to get this information. For example, checking this picture we can see that all (api/content/worker) deployments are in a READY state (all of their replicas are running and ready to serve requests):

We are also investigating the possibility of creating Red Hat Insights rules to help with troubleshooting:

  • by using the k8s events generated by the Operator, and/or
  • by using the .status.conditions fields

These rules could be used, for example, by the support team to get an overview of the Operator status and check the possible fixes suggested by Insights.
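
As a rough illustration of reading .status.conditions[] programmatically, here is a minimal Python sketch. It assumes the conditions follow the usual Kubernetes shape (a list of entries with "type" and "status" fields); the condition type names and the example payload are made up for illustration and are not taken from the Operator.

```python
from typing import Any, Iterable, List, Mapping


def unready_conditions(status: Mapping[str, Any], expected_types: Iterable[str]) -> List[str]:
    """Return the expected condition types that are missing or whose status is not "True"."""
    # Kubernetes-style conditions are a list of dicts; index them by "type".
    by_type = {cond.get("type"): cond for cond in status.get("conditions", [])}
    return [t for t in expected_types if by_type.get(t, {}).get("status") != "True"]


# Hypothetical condition type names, for illustration only.
EXPECTED = ["Pulp-API-Ready", "Pulp-Content-Ready", "Pulp-Worker-Ready"]

# Hypothetical .status payload, e.g. as parsed from `kubectl get <resource> -o json`.
example_status = {
    "conditions": [
        {"type": "Pulp-API-Ready", "status": "True"},
        {"type": "Pulp-Content-Ready", "status": "True"},
        {"type": "Pulp-Worker-Ready", "status": "False"},
    ]
}

not_ready = unready_conditions(example_status, EXPECTED)
```

With the example payload above, only the worker condition is reported as not ready; a missing condition is treated the same as one that is not "True".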

  • Performance metrics as triggers for knowing when to scale up/down, e.g. no. of tasks, tasks waiting, etc.
  • Content counts and no. of requests to that content

As soon as we can retrieve these metrics from Pulp, we will start working on creating a k8s HPA (Horizontal Pod Autoscaler) through the Operator (https://github.com/pulp/pulp-operator/issues/761).
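
To give a feel for how a task-based metric could drive scaling, here is a minimal Python sketch of the replica calculation documented for the Kubernetes HPA: desired = ceil(current replicas × current metric / target metric), with a small tolerance band and min/max bounds. The "waiting tasks per worker" metric, the function name, and the default bounds are assumptions for illustration; this is not how the Operator is implemented.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,   # hypothetical: average waiting tasks per worker
    target_metric: float,    # hypothetical: waiting tasks per worker we aim for
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA-style rule: ceil(current_replicas * current / target), clamped."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band, leave the replica count alone to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Keep the result inside the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 2 workers, 30 waiting tasks per worker, target of 10 per worker.
scaled_up = desired_replicas(2, 30, 10)
```

In this example the ratio is 3, so the sketch asks for 6 workers; an idle queue scales back down to the configured minimum.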

  • The ability to trigger alerts from these metrics

We haven’t investigated this in depth yet, but we will look into creating custom OCP monitoring dashboards (which can include alerts) through the Operator.

1 Like

Done!

Welcome @woolsgrs ! Great suggestions there, will be adding them to our working-group doc (once I have one set up). If you’re interested in attending the meeting(s), message me your email and I’ll add you to the invite!

I’d like to join the working group as optional. Please don’t schedule the time around my calendar if it’s at all difficult to schedule.

1 Like

Please include me as optional. I’d like to contribute and learn as time allows.

1 Like

2022-12-06 1000-1030 GMT-5

Attendees: bmbouter, dalley, ggainey

Regrets:

Agenda:

  • Previous AIs:
    • N/A
  • Organizational Meeting
    • What do we want this group to accomplish?
    • How often do we want to meet?
    • What should our next meeting look like?
    • Where would we like to be in 3 months?
  • Comments from woolsgrs on Discourse:
    • currently running a large scale deployment on Pulp 2
    • now in the process of rolling out Pulp 3 in a hybrid Cloud/On-Prem setup.
    • would like to see:
      • Overall status of each deployment, understand its functioning correctly, capacity metrics around that.
      • Able to trigger alerts from these metrics
      • Performance for triggers knowing when to scale up/down
        • e.g. no. of tasks, tasks waiting, etc.
      • Content Counts and no. of requests to that content

Notes

  • dalley: first step: get a POC of “followed the tutorial for a django app (apiserver?) and get some basic instrumentation available”
  • bmbouter: content-app first (smaller surface, more valuable, answers one of woolsgrs’ requests)
  • launch POC as a tech-preview
  • bmbouter: can this group give high-level goal instead of being prescriptive?
    • get POC up very quickly
    • ggainey, dalley approve
  • dalley: let’s get basic infrastructure in place, and then start iterating
  • TIMEFRAME
    • what’s more important than this?
      • satellite support
      • HCaaS
      • AH
      • operator/container work
      • rpm pytest
    • similar importance
      • other pytest conversion
    • do we have a ballpark for “POC as a PR” date?
      • “Q1” would be good
      • how do we understand who is assigned to what, who is doing what, and what their time-commitment is?
      • it “feels like” we have a couple of folk who could be freed up to work on this?
      • let’s bring this up at the core-team mtg next week?
      • or at “sprint” planning?
      • AI: [ggainey] add to team agenda for 12-DEC
      • AI: [ggainey] get this on 3-month planning doc for Q1
  • How often should we meet?
    • Proposal: not before 2nd week in Jan
      • AI: [ggainey] to set up another 30-min mtg that week
      • wing it from there
  • Proposal: need an issue/feature “Add basic telemetry support to Pulp3”
    • AI: [dalley] to open
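
To make the “basic instrumentation” idea in the notes above a bit more concrete, here is a minimal, stdlib-only Python sketch of a WSGI middleware that counts requests and accumulates handling time per path. It is a stand-in for the kind of data an OpenTelemetry instrumentation would collect, not OpenTelemetry’s API and not Pulp code; the class name, the demo app, and the example path are hypothetical.

```python
import time
from collections import Counter
from wsgiref.util import setup_testing_defaults


class RequestMetricsMiddleware:
    """Hypothetical WSGI middleware: per-path request counts and handling time."""

    def __init__(self, app):
        self.app = app
        self.request_count = Counter()
        self.total_seconds = Counter()

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        started = time.perf_counter()
        try:
            # Only the call into the wrapped app is timed, not body iteration.
            return self.app(environ, start_response)
        finally:
            self.request_count[path] += 1
            self.total_seconds[path] += time.perf_counter() - started


def demo_app(environ, start_response):
    """Tiny stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


def call(app, path):
    """Drive a WSGI app once with a minimal test environ and return the body."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    return b"".join(app(environ, lambda status, headers: None))


wrapped = RequestMetricsMiddleware(demo_app)
call(wrapped, "/pulp/content/")
call(wrapped, "/pulp/content/")
```

After the two calls, wrapped.request_count holds a count of 2 for the example path. A real POC would export such measurements through an OpenTelemetry SDK rather than keep them in memory.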

Action Items:

  • AI: [ggainey] add to team agenda for 12-DEC
  • AI: [ggainey] get this on 3-month planning doc for Q1
  • AI: [ggainey] to set up another 30-min mtg for 2nd week Jan
  • AI: [dalley] to open issue/feature to get this work started
  • AI: [ggainey] get minutes into Discourse thread

1 Like

Note that I’ve scheduled a “next meeting” for 2023-01-11 1100-1130 GMT-5 - please ping me if you’d like to be on the invite!