Monitoring/Telemetry Working Group

Hey folks!

The Pulp team has been doing some thinking around what improvements Pulp needs to make it more administrator-friendly. One of our gaps there is observability - how to know what’s going on inside Pulp, from outside Pulp.

At our Virtual PulpCon this year, Daniel Alley gave an overview of observability and OpenTelemetry.[0][1] We got some great feedback/interest in adding this to Pulp. To that end, we’re starting up an “OpenTelemetry and Pulp” working group.

This is a call for interest/participation - if you’d like to be involved in helping us work on the right way to add OT, and prioritize the first monitoring probes we add, please respond to this thread.

I have scheduled a 30-min organizational meeting for 6-DEC, 1000 EST/1500 UTC/1600 CET. We can hammer out how often/long to meet and what an initial POC might look like.

We’re especially interested in contributions from folks who are running Pulp for real-world workloads. What would you like to see on an admin’s dashboard? Let us know!

Thanks!

[0] Observability and the OpenTelemetry Project, video, PulpCon 2022
[1] Observability and the OpenTelemetry Project, slides, PulpCon 2022


Morning Grant

Please include me.


Hi, I’d like to be included here. We’re currently running a large-scale deployment on Pulp 2 and are now in the process of rolling out Pulp 3 in a hybrid Cloud/On-Prem setup.

We’re always quite in the dark with Pulp and have some tools of our own; we’re now looking at what you have with the Pulp Operator, but it would be good to see:

  • Overall status of each deployment, to understand whether it’s functioning correctly, plus capacity metrics around that
  • The ability to trigger alerts from these metrics
  • Performance metrics for knowing when to scale up/down
    e.g. number of tasks, tasks waiting, etc.
  • Content counts and the number of requests to that content

Hi @woolsgrs,

Here is some info, just to keep you updated on the Operator status:

We’re always quite in the dark with Pulp and have some tools of our own; we’re now looking at what you have with the Pulp Operator, but it would be good to see:

  • Overall status of each deployment, to understand whether it’s functioning correctly, plus capacity metrics around that

The current version of the Operator provides the .status.conditions[] field, which can be helpful for getting this information. For example, by inspecting those conditions we can see that all (api/content/worker) deployments are in a READY state (all of their replicas are running and ready to serve requests).
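If you want to poll this programmatically, a minimal sketch with the Python kubernetes client could look like the following; the group/version/plural and object names are assumptions and may differ in your deployment:

```python
# Sketch: read a Pulp CR's .status.conditions[] with the kubernetes Python client.
# The group/version/plural/name values below are assumptions, not guaranteed to
# match your pulp-operator install.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

pulp = api.get_namespaced_custom_object(
    group="repo-manager.pulpproject.org",  # assumed operator API group
    version="v1beta2",                     # assumed CRD version
    namespace="pulp",
    plural="pulps",
    name="example-pulp",
)

# Print each condition's type/status/message, e.g. whether the
# api/content/worker deployments report READY.
for cond in pulp.get("status", {}).get("conditions", []):
    print(f'{cond["type"]}: {cond["status"]} ({cond.get("message", "")})')
```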

We are also investigating the possibility of creating Red Hat Insights rules to help with troubleshooting:

  • by using the k8s events generated by the Operator and/or
  • the .status.conditions fields

These rules could be used, for example, by the support team to get an overview of the Operator status and check the possible fixes suggested by Insights.

  • Performance metrics for knowing when to scale up/down, e.g. number of tasks, tasks waiting, etc.
  • Content counts and the number of requests to that content

As soon as we can retrieve these metrics from Pulp, we will start working on creating a k8s HPA through the Operator (https://github.com/pulp/pulp-operator/issues/761).

  • The ability to trigger alerts from these metrics

This is something we haven’t investigated deeply yet, but we will look into the possibility of creating custom OCP monitoring dashboards (which can include alerts) through the Operator.


Done!

Welcome @woolsgrs! Great suggestions there; I’ll be adding them to our working-group doc (once I have one set up). If you’re interested in attending the meeting(s), message me your email and I’ll add you to the invite!

I’d like to join the working group as optional. Please don’t schedule the time around my calendar if it’s at all difficult to schedule.


Please include me as optional. I’d like to contribute and learn as the time allows.


2022-12-06 1000-1030 GMT-5

Attendees: bmbouter, dalley, ggainey

Regrets:

Agenda:

  • Previous AIs:
    • N/A
  • Organizational Meeting
    • What do we want this group to accomplish?
    • How often do we want to meet?
    • What should our next meeting look like?
    • Where would we like to be in 3 months?
  • Comments from woolsgrs on Discourse:
    • currently running a large scale deployment on Pulp 2
    • now in the process of rolling out Pulp 3 in a hybrid Cloud/On-Prem setup
    • would like to see:
      • Overall status of each deployment, understand its functioning correctly, capacity metrics around that.
      • Able to trigger alerts from these metrics
      • Performance for triggers knowing when to scale up/down
        • e.g. no of tasks, task waiting etc.
      • Content Counts and no. of requests to that content

Notes

  • dalley: first step: get a POC of “followed the tutorial for a django app (apiserver?) and get some basic instrumentation available” (see sketch after these notes)
  • bmbouter: content-app first (smaller surface, more valuable, answers one of woolsgrs’ requests)
  • launch POC as a tech-preview
  • bmbouter: can this group give high-level goal instead of being prescriptive?
    • get POC up very quickly
    • ggainey, dalley approve
  • dalley: let’s get basic infrastructure in place, and then start iterating
  • TIMEFRAME
    • what’s more important than this?
      • satellite support
      • HCaaS
      • AH
      • operator/container work
      • rpm pytest
    • similar importance
      • other pytest conversion
    • do we have a ballpark for “POC as a PR” date?
      • “Q1” would be good
      • how do we understand who is assigned to what, who is doing what, and what their time-commitment is?
      • it “feels like” we have a couple of folk who could be freed up to work on this?
      • let’s bring this up at the core-team mtg next week?
      • or at “sprint” planning?
      • AI: [ggainey] add to team agenda for 12-DEC
      • AI: [ggainey] get this on 3-month planning doc for Q1
  • How often should we meet?
    • Proposal: not before 2nd week in Jan
      • AI: [ggainey] to set up another 30-min mtg that week
      • wing it from there
  • Proposal: need an issue/feature “Add basic telemetry support to Pulp3”
    • AI: [dalley] to open
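For reference, a minimal sketch of what such a first POC could look like, assuming the standard opentelemetry-python packages (opentelemetry-sdk plus opentelemetry-instrumentation-django); it only prints spans to the console and is not the actual Pulp implementation:

```python
# Hedged sketch of a bare-bones Django auto-instrumentation POC.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.django import DjangoInstrumentor


def instrument_api_server():
    # Send every finished span to stdout; a real setup would export to a collector.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    # Auto-instruments Django request handling; call this before the app starts
    # serving traffic (e.g. from wsgi.py), per the upstream tutorial.
    DjangoInstrumentor().instrument()
```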

Action Items:

  • AI: [ggainey] add to team agenda for 12-DEC
  • AI: [ggainey] get this on 3-month planning doc for Q1
  • AI: [ggainey] to set up another 30-min mtg for 2nd week Jan
  • AI: [dalley] to open issue/feature to get this work started
  • AI: [ggainey] get minutes into Discourse thread

Note that I’ve scheduled a “next meeting” for 2023-01-11 1100-1130 GMT-5 - please ping me if you’d like to be on the invite!

We had a brainstorming session for OpenTelemetry in Pulp. @bmbouter is in the midst of a POC and showed us some great results! The plan is to reconvene once a week; let @ggainey or @bmbouter know if you want an invite!

Here are minutes from the session:

2023-02-23 1000-1030 GMT-5

Attendees: ggainey, bmbouter, decko

Regrets:

Agenda:

  • Previous AIs: all handled

Notes

  • bmbouter reports
    • got django auto-instrumentation running in api-workers
    • importing into Jaeger
    • configured oci-env w/ OT visualization
  • Let’s think about what we really want to get out of this effort?
    • AI-all for next mtg
  • Notes from user-discussions
    • users need to be able to turn it off and on
  • practical problems
    • have otel be its own oci-env profile
      • loads Jaeger as a side-container
    • base img needs a way to turn OT off and on
    • what if we had an instrumented img?
      • leads to combinatorics-fun
      • users want their own imgs
    • what if there was an “instrument this” env-var? (OTEL_ENABLED)
      • otel-pieces installed always, just not always “on” unless asked for via this var (see sketch after these notes)
    • let cfg run via env-vars
      • “Here’s the OTEL docs, use their env-vars to control behavior”
  • https://www.aspecto.io/blog/opentelemetry-collector-guide/
    • discussion of “direct to collector” vs “agent to collector” architectures
      • allows batching, allows data-transformation, allows redaction
  • open question: how should a pulp dev-env be configured?
  • example “interesting” metric : https://github.com/pulp/pulpcore/issues/3389
  • let’s think about the specific packages that might need “RPM-izing”
    • may already be RPM’d in Fedora (maybe RHEL?)
  • next step: prob going to meet once/wk
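A hedged sketch of the env-var idea above; OTEL_ENABLED and maybe_instrument() are illustrative names from this discussion, not an existing Pulp setting:

```python
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def maybe_instrument(service_name: str) -> None:
    # The otel packages are always installed, but stay dark unless asked for.
    if os.getenv("OTEL_ENABLED", "").lower() not in ("1", "true", "yes"):
        return
    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    # Endpoint, headers, protocol, etc. come from the standard OTEL_EXPORTER_* env vars.
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)
    DjangoInstrumentor().instrument()
```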

Action Items:

  • add notes to discourse

Minutes from our last OpenTelemetry meeting/discussion. Next meeting will be 9-MAR, ping any of the attendees if you want to be on the invite:

2023-03-02 1000-1030 GMT-5

Attendees: ggainey, bmbouter, decko, dralley

Regrets:

Agenda:

Notes

Action Items:

  • bmbouter to do little perf-test
  • bmbouter/decko to work together to get oci-env to run otel setup
  • decko to go from the above, to instrumenting workers
  • aiohttp/asgi auto-instrument bakeoff
  • ggainey to sched for an hour next Thurs

2023-03-09 1000-1030 GMT-5

Attendees: ggainey, decko, dralley, bmbouter

Regrets:

Agenda:

  • Previous AIs:
    • bmbouter to do little perf-test
    • bmbouter/decko to work together to get oci-env to run otel setup
    • decko to go from the above, to instrumenting workers
    • aiohttp/asgi auto-instrument bakeoff
    • ggainey to sched for an hour next Thurs
  • (insert topics here)

Notes

  • some data on performance-impact of traces : https://github.com/open-telemetry/opentelemetry-python-contrib/issues/1556
    • discussion ensues
    • maybe we want to test more?
  • if our worker span instrumentation is “feature-flippable”, we’re implying a direct dependency on otel being packaged
    • prob want to discuss at pulpcore mtgs
  • oci-env work in progress
    • really close to having an otel-env
    • work continues
  • how’s the worker-instrumentation going to work?
    • can we get a span that covers creation-to-end?
    • dispatch-to-start is a good thing to know
    • just span for run-to-complete is “easy”
    • can we use the correlation-id as a span-id?
      • as opposed to task-uuid?
      • BUT - think about dispatching-a-task-group
  • what about metrics (as opposed to spans)
    • auto-instrumentation setup has its own metrics
    • are there Things we’d like to add to our code specifically?
      • for tasking system, almost certainly
      • per-worker metric(s)
        • fail-rate
        • task-throughput
        • what happens when workers go-away?
          • attach metrics to worker-names?
        • missing-worker-events
          • interpretation is key
        • system-metrics as a whole
          • wait-q-size
          • waiting-lock-evaluation (“concurrency opportunity”)
            • ratio tasks/possible_concurrency
            • discussion around how workers dispatch themselves
    • thinking like an admin
      • do I have too much hardware in use?
      • not enough?
      • how do I know “something is going wrong”?
    • “service-level-indicator”: how much time does a task wait before start (see sketch after these notes)
    • “possible concurrency”: how many could start, assuming enough workers?
    • “utilization”: what percentage of workers are “typically” busy?
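To illustrate the wait-time “service-level-indicator” above, a hand-rolled metric with the OpenTelemetry metrics API might look roughly like this; the meter and metric names are invented for the example:

```python
from datetime import datetime

from opentelemetry import metrics

meter = metrics.get_meter("pulp.tasking")  # illustrative meter name

# "How long did a task sit in the queue before a worker picked it up?"
task_wait_time = meter.create_histogram(
    name="task_wait_time",
    unit="s",
    description="Seconds between task dispatch and task start",
)


def record_task_started(dispatched_at: datetime, started_at: datetime, worker_name: str) -> None:
    # Worker name as an attribute enables per-worker views, with the usual
    # cardinality caveat when workers come and go.
    task_wait_time.record(
        (started_at - dispatched_at).total_seconds(),
        attributes={"worker.name": worker_name},
    )
```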

Action Items:

  • oci-env/otel work to continue to completion
  • decko to go from the above, to instrumenting workers
  • aiohttp/asgi auto-instrument bakeoff
  • add notes to discourse
  • ggainey to schedule next for one week out

2023-03-16 1400-1500 GMT-4

Attendees: ggainey, dalley, decko, bmbouter

Regrets:

Agenda:

  • Previous AIs:
    • oci-env/otel work to continue to completion
    • decko to go from the above, to instrumenting workers
    • aiohttp/asgi auto-instrument bakeoff

Notes

  • discussion RE decko’s experiences
    • worker-trace-example! Woo!
    • looking at metrics in grafana - double wooo!
  • How do we get admins of actual services involved in setting up the kinds of metrics/visualizations we need?
    • AI: [ggainey] invite jsherrill to come demo their Grafana env for us
  • What would be nice:
    • Docs written from an Operational perspective
      • “Here’s a Thing you want to know, here are graphs that will help you answer it”
      • Example: “Is pulp serving content correctly? - visualize content-app status codes”
  • Next-steps sequence
    • finish oci-env profile
    • start working on some “standard” graphs
    • work on how-to docs
    • work on demos
    • how can we merge better w/ pulpcore?
      • right way to merge new libs to project?
      • responding to various installation-scenarios
    • discussion
      • single-container - s6-svc
      • what if users don’t want to spin up otel? What happens to the app?
      • pulp-otel-enabled variable - default to False
        • what does that mean?
        • does not mean that otel-libs aren’t installed (are they direct-deps or not? will be incl in img regardless)
      • multiprocess container - there’s another svc running
      • docs should call out/link to docs RE feature-flip vars for the auto-instr libs
  • “direct dependency vs not” discussion
    • if it is, you can’t uninstall it
    • not everything has to be a hard-dep
    • maybe start with not-required
  • prioritizing aiohttp server PR might be worthwhile
    • acceptance is out of our control
    • will take more time to get an aiohttp-lib w/ the support released
  • auto-instr pkgs aren’t going to include correlation-id-support for pulp’s cids

Action Items:

  • AI: [ggainey] invite jsherrill to come demo their Grafana env for us
  • add notes to discourse

2023-03-22 1330-1400 GMT-4

Attendees: decko, jsherrill, dralley, ggainey

Regrets:

Agenda:

  • jsherrill to show us what his team is doing w/ monitoring/metrics
  • NOTES
    • might be “some” guidance available
    • we’re all making this up as we go along
    • http status/latency
    • msg-latency/error-rate (like tasks?)
    • some “analytics” info mixed in
    • grafana dashboard to visualize
      • started from a template
      • export JSON in order to import into app
    • PR for visualization changes is currently “exciting”
    • having a full-time data visualization expert would be a Good Thing
    • discussion around SLOs (uptime/breach rules/alerting)
    • “best practices” still “up in the air”?
    • there are tests for alerts
    • review your output - sometimes, there are bugs
  • QUESTIONS
    • app implements a /metrics endpoint
    • gathered metrics thrown into prometheus
      • How are metrics produced to make them available to /metrics?
        • prometheus-client in go does the heavy lifting (see sketch after these notes)
    • AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
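The demo app is written in Go, but the same /metrics pattern in Python (shown only for comparison; the metric name and port are made up) would look something like this:

```python
import time

from prometheus_client import Counter, start_http_server

# Labelled counter, scraped by Prometheus from the /metrics endpoint started below.
requests_total = Counter(
    "content_requests_total",
    "Requests served by the content app",
    ["status_code"],
)


def record_response(status_code: int) -> None:
    requests_total.labels(status_code=str(status_code)).inc()


if __name__ == "__main__":
    start_http_server(9000)  # exposes /metrics on :9000
    while True:
        record_response(200)  # stand-in for real request handling
        time.sleep(1)
```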

Action Items:

  • AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
  • add notes to discourse
  • schedule mtg for next week

2023-03-31 1330-1400 GMT-5

Attendees: dalley, decko, ggainey

Regrets:

Agenda:

  • Previous AIs:
    • AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
      • There exists a Red Hat Observability CoP!

Notes

Action Items:

  • add notes to discourse
  • ggainey to sched next mtg for next Thurs

2023-04-06 1000-1030 GMT-5

Attendees: decko, dalley, ggainey

Regrets:

Agenda:

  • Previous AIs:

Notes

  • PRs are in-progress, to be submitted this week
  • HMS mtg taught some things we’ll prob steal 🙂
  • discussion around a plugin-approach to making otel available
    • maybe just hooks in core, that do nothing w/out pulp_otel installed? (see sketch after these notes)
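A tiny sketch of that hook idea; pulp_otel and its instrument() entry point are hypothetical, not an existing package or API:

```python
# Core calls a hook that quietly does nothing unless the (hypothetical)
# pulp_otel package is installed.
try:
    from pulp_otel import instrument  # optional plugin provides the real hook
except ImportError:
    def instrument(*args, **kwargs):
        return None  # no-op when the telemetry plugin isn't present


def start_api_app():
    instrument(service_name="pulp-api")
    # ... continue normal startup ...
```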

Action Items:

  • add notes to discourse

2023-04-13 1300-1330 GMT-5

Attendees: ggainey, decko

Regrets:

Agenda:

Notes

  • review/discussion of some test failures
  • current kind-of-a-plan for aiohttp-metrics-work
    • move tests to pytest, get them running clean (#soon)
    • add metrics taking advantage of this fork
    • think on what tests we prob should have in addition, write them, get them running clean
    • submit otel-aiohttp PR upstream
    • continue adding metrics to “our” fork independently
  • AI: ggainey to start using oci_env profile PR for this
  • AI: review https://github.com/pulp/pulp-oci-images/pull/469
  • AI: decko to add what is missing from the #pulpcore/3632

Action Items:

  • [ggainey] to start using oci_env profile PR for this
  • [decko] get tests running locally to see why docker-side fails when podman-side succeeds
  • [any] review https://github.com/pulp/pulp-oci-images/pull/469
  • [decko] to add what is missing from the #pulpcore/3632
  • [ggainey] schedule next mtg for next week
  • [ggainey] add notes to discourse

2023-04-20 1300-1330 GMT-5

Attendees: decko, dralley, ggainey

Regrets:

Agenda:

  • Previous AIs:

Notes

  • see instructions in the profile-readme in the oci-env PR for a how-to

Action Items:

  • add notes to discourse

2023-04-27 1300-1330 GMT-5

Attendees: decko, bmbouter, dralley, ggainey

Regrets:

Agenda:

  • Previous AIs:

Notes

  • decko showed off his progress
  • discussion around what can we do further?
    • can we set things up to pre-load a Grafana dashboard from prepared JSON at start-time?
  • what are, say, “4 things we want in a Grafana dashboard”?
    • content-app
      • response-codes over time
        • organize by class (400/500/OK)?
        • show my 500s only?
      • req-latency
        • avg
        • P95
        • P99
      • “cost” items
        • how many bytes have been served
          • 202 is diff than 301
        • can we gather metrics per-domain? (see sketch after these notes)
          • where do/can we record that?
          • /pulp/content/DOMAIN/content-URL
          • upstream (aiohttp) vs downstream (in-pulp)
          • can we attach this data as a header to the request, and record that header?
      • proposal: ‘launch’ w/ response-codes/latency, in pretty “basic” graphs, preloaded into oci-env profile
  • discussion: what needs to happen to get aiohttp-instrumentation-PR merged?
  • discussion (brief) around a “pulp_telemetry” ‘shim’
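A rough sketch of the per-domain “cost” idea above, assuming the domain can be parsed out of a /pulp/content/<domain>/... path; the meter and metric names are illustrative:

```python
from opentelemetry import metrics

meter = metrics.get_meter("pulp.content")  # illustrative meter name

bytes_served = meter.create_counter(
    name="content_bytes_served",
    unit="By",
    description="Bytes served by the content app",
)


def record_response(path: str, status: int, body_size: int) -> None:
    # Assumption: paths look like /pulp/content/<domain>/<rest-of-url>
    parts = path.strip("/").split("/")
    domain = parts[2] if len(parts) > 2 and parts[:2] == ["pulp", "content"] else "default"
    bytes_served.add(
        body_size,
        attributes={
            "pulp.domain": domain,
            "http.status_class": f"{status // 100}xx",  # 2xx/3xx/4xx/5xx
        },
    )
```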

Action Items:

  • add notes to discourse