Monitoring/telemetry Working Group

ggainey · May 22, 2023, 2:57pm

Hey folks!

The Pulp team has been doing some thinking around what improvements Pulp needs to make it more administrator-friendly. One of our gaps there is observability - how to know what’s going on inside Pulp, from outside Pulp.

At our Virtual PulpCon this year, Daniel Alley did an overview on observability and OpenTelemetry.[0][1] We got some great feedback/interest in adding this to Pulp. To that end, we’re starting up an “OpenTelemetry and Pulp” working group.

This is a call for interest/participation - if you’d like to be involved in helping us work on the right way to add OT, and prioritize the first monitoring probes we add, please respond to this thread.

I have scheduled a 30-min organizational meeting for 6-DEC, 1000 EST/1500 UTC/1600 CET. We can hammer out how often/long to meet and what an initial POC might look like.

We’re especially interested in contributions from folk who are running Pulp for real-world workloads. What would you like to see on an admin’s dashboard? Let us know!

Thanks!

[0] Observability and the OpenTelemetry Project, video, PulpCon 2022
[1] Observability and the OpenTelemetry Project, slides, PulpCon 2022

wibbit · November 18, 2022, 8:20am

Morning Grant

Please include me.

woolsgrs · November 21, 2022, 11:49am

Hi, I would like to be included here. We are currently running a large-scale deployment on Pulp 2 and are now in the process of rolling out Pulp 3 in a hybrid Cloud/On-Prem setup.

We are always quite in the dark with Pulp and have some of our own tools, and are now looking at what you have with the Pulp Operator, but it would be good to see:

Overall status of each deployment, understand its functioning correctly, capacity metrics around that.
Able to trigger alerts from these metrics
Performance for triggers knowing when to scale up/down
e.g. no of tasks, task waiting etc.
Content Counts and no. of requests to that content

hyagi · November 21, 2022, 7:15pm

Hi @woolsgrs,

Here is some info just to keep you updated on the Operator status:

We are always quite in the dark with Pulp and have some of our own tools, and are now looking at what you have with the Pulp Operator, but it would be good to see:

Overall status of each deployment, understand its functioning correctly, capacity metrics around that.

The current version of the Operator provides the .status.conditions[] field, which can be helpful to get this information. For example, checking this picture we can see that all (api/content/worker) deployments are in a READY state (all of their replicas are running and ready to serve requests):

We are also investigating the possibility of creating Red Hat Insights rules to help with the troubleshooting

by using the k8s events generated by the Operator and/or
the .status.conditions fields

These rules could be used, for example, by the support team to get an overview of the Operator status and check the possible fixes suggested by Insights.

Performance for triggers knowing when to scale up/down e.g. no of tasks, task waiting etc.

Content Counts and no. of requests to that content

As soon as we can retrieve these metrics from Pulp, we will start to work on creating k8s HPA through the Operator (https://github.com/pulp/pulp-operator/issues/761).

Able to trigger alerts from these metrics

This is something we haven’t investigated deeply yet, but we will check the possibility of creating custom OCP monitoring dashboards (which can include alerts) through the Operator.

ggainey · November 29, 2022, 1:00pm

Done!

ggainey · November 29, 2022, 1:02pm

Welcome @woolsgrs ! Great suggestions there, will be adding them to our working-group doc (once I have one set up). If you’re interested in attending the meeting(s), message me your email and I’ll add you to the invite!

bmbouter · November 29, 2022, 4:42pm

I’d like to join the working group as optional. Please don’t schedule the time around my calendar if it’s at all difficult to schedule.

ipanova · November 29, 2022, 5:38pm

Please include me as optional. I’d like to contribute and learn as the time allows.

ggainey · December 6, 2022, 5:12pm

2022-12-06 1000-1030 GMT-5

Attendees: bmbouter, dalley, ggainey

Regrets:

Agenda:

Previous AIs:
- N/A
Organizational Meeting
- What do we want this group to accomplish?
- How often do we want to meet?
- What should our next meeting look like?
- Where would we like to be in 3 months?
Comments from woolsgrs on Discourse:
- currently running a large scale deployment on Pulp 2
- now in the process of rolling out Pulp 3 in a hybrid Cloud/On-Prem setup.
- would like to see:
  - Overall status of each deployment, understand its functioning correctly, capacity metrics around that.
  - Able to trigger alerts from these metrics
  - Performance for triggers knowing when to scale up/down
    - e.g. no of tasks, task waiting etc.
  - Content Counts and no. of requests to that content

Notes

dalley: first step: get a POC of “followed the tutorial for a django app (apiserver?) and get some basic instrumentation available”
bmbouter: content-app first (smaller surface, more valuable, answers one of woolsgrs’ requests)
launch POC as a tech-preview
bmbouter: can this group give high-level goal instead of being prescriptive?
- get POC up very quickly
- ggainey, dalley approve
dalley: let’s get basic infrastructure in place, and then start iterating
TIMEFRAME
- what’s more important than this?
  - satellite support
  - HCaaS
  - AH
  - operator/container work
  - rpm pytest
- similar importance
  - other pytest conversion
- do we have a ballpark for “POC as a PR” date?
  - “Q1” would be good
  - how do we understand who is assigned to what, who is doing what, and what their time-commitment is?
  - it “feels like” we have a couple of folk who could be freed up to work on this?
  - let’s bring this up at the core-team mtg next week?
  - or at “sprint” planning?
  - AI: [ggainey] add to team agenda for 12-DEC
  - AI: [ggainey] get this on 3-month planning doc for Q1
How often should we meet?
- Proposal: not before 2nd week in Jan
  - AI: [ggainey] to set up another 30-min mtg that week
  - wing it from there
Proposal: need an issue/feature “Add basic telemetry support to Pulp3”
- AI: [dalley] to open

Action Items:

AI: [ggainey] add to team agenda for 12-DEC
AI: [ggainey] get this on 3-month planning doc for Q1
AI: [ggainey] to set up another 30-min mtg for 2nd week Jan
AI: [dalley] to open issue/feature to get this work started
AI: [ggainey] get minutes into Discourse thread

ggainey · December 6, 2022, 5:28pm

Note that I’ve scheduled a “next meeting” for 2023-01-11 1100-1130 GMT-5 - please ping me if you’d like to be on the invite!

ggainey · February 23, 2023, 7:26pm

We had a brainstorming session for OpenTelemetry in Pulp. @bmbouter is in the midst of a POC and showed us some great results! Plan is to reconvene once a week; let @ggainey or @bmbouter know if you want an invite!

Here are minutes from the session:

2023-02-23 1000-1030 GMT-5

Attendees: ggainey, bmbouter, decko

Regrets:

Agenda:

Previous AIs: all handled

Notes

bmbouter reports
- got django auto-instrumentation running in api-workers
- importing into Jaeger
- configured oci-env w/ OT visualization
Let’s think about what we really want to get out of this effort?
- AI-all for next mtg
Notes from user-discussions
- users need to be able to turn it off and on
practical problems
- have otel be its own oci-env profile
  - loads Jaeger as a side-container
- base img needs a way to turn OT off and on
- what if we had an instrumented img?
  - leads to combinatorics-fun
  - users want their own imgs
- what if there was an “instrument this” env-var? (OTEL_ENABLED)
  - otel-pieces installed always, just not always “on” unless asked for via this var
- let cfg run via env-vars
  - “Here’s the OTEL docs, use their env-vars to control behavior”
https://www.aspecto.io/blog/opentelemetry-collector-guide/
- discussion of “direct to collector” vs “agent to collector” architectures
  - allows batching, allows data-transformation, allows redaction
open question: how should a pulp dev-env be configured?
example “interesting” metric : https://github.com/pulp/pulpcore/issues/3389
let’s think about the specific packages that might need “RPM-izing”
- may already be RPM’d in Fedora (maybe RHEL?)
next step:
- instrumenting workers (tasking-system)
  - makes optional-otel-dependency more problematic
- what about aiohttp-server side? (content-svc)
  - https://github.com/open-telemetry/opentelemetry-python-contrib/pull/942
  - maybe we just manually instrument?
  - puts more burden on plugin-writers
- decko maybe picks up aiohttpserver tracing?
prob going to meet once/wk
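
The OTEL_ENABLED idea from these notes could look roughly like the sketch below. This is not Pulp's implementation: only the variable name OTEL_ENABLED comes from the minutes, while the set of accepted truthy values and the helper names `otel_enabled` and `maybe_instrument` are assumptions made for illustration.

```python
import os

# Values treated as "on". This set is an assumption; the notes do not specify one.
TRUTHY = {"1", "true", "yes", "on"}


def otel_enabled(environ=None):
    """Return True when OTEL_ENABLED is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get("OTEL_ENABLED", "").strip().lower() in TRUTHY


def maybe_instrument(instrument, environ=None):
    """Call the given instrumentation callable only when the flag is on.

    The OTel packages can stay installed in the image; nothing is
    activated unless the variable asks for it.
    Returns True if instrument() was called, False otherwise.
    """
    if not otel_enabled(environ):
        return False
    instrument()
    return True
```

At startup, a service might call something like `maybe_instrument(lambda: DjangoInstrumentor().instrument())`, leaving all finer-grained behavior to the standard OTel environment variables, as the notes suggest.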

Action Items:

add notes to discourse

ggainey · March 2, 2023, 7:04pm

Minutes from our last OpenTelemetry meeting/discussion. Next meeting will be 9-MAR, ping any of the attendees if you want to be on the invite:

2023-03-02 1000-1030 GMT-5

Attendees: ggainey, bmbouter, decko, dralley

Regrets:

Agenda:

Notes

updates
- experimented using wsgi_autoinstrumentation
  - works better than django-autoinstr
  - correctly nested/subspanned things like postgres-spans
- is there any reason to use django-auto?
  - look at their issues, maybe there’s a known prob
  - we don’t know of anything we’re missing
  - maybe compare the two codebases?
  - how do metrics compare to tracing output?
how will we add this into our container?
- optional vs non-optional dependencies?
- need to id what the new dependencies are?
- what’s the perf-overhead if you’re not gathering trace-info (if any)
  - dralley: there is perf-overhead when tracing
  - bmbouter: is there a perf-impact when you’re not collecting telemetry-output
what do metrics look like (as opposed to tracing)
- bmbouter has gotten Pulp reporting metrics to an otel-container, and then shipping those to Prometheus
- next step is visualizing in grafana
- what are “the right” metrics?
discussion around oci-env work/changes to support
- Prio #1: get oci-env profile in place to support otel
  - needs mikedep’s PR for oci-images #449 to be merged for our images
discussion around django-prometheus
- no traces, just metrics
- is this maybe a path to be getting insight/inspiration for metrics?
- https://github.com/korfuri/django-prometheus
discussion around current-monitoring-use by an actual (large) user of OTel
- detailed metrics-discussion w/ this user on 15-MAR
- bmbouter plans to have a demo available prior
discussion around asgi-otel-instrumentation
- decko/bmbouter to do deeper discussion
awesome links

Action Items:

bmbouter to do little perf-test
bmbouter/decko to work together to get oci-env to run otel setup
decko to go from the above, to instrumenting workers
aiohttp/asgi auto-instrument bakeoff
ggainey to sched for an hour next Thurs

ggainey · March 9, 2023, 7:01pm

2023-03-09 1000-1030 GMT-5

Attendees: ggainey, decko, dralley, bmbouter

Regrets:

Agenda:

Previous AIs:
- bmbouter to do little perf-test
- bmbouter/decko to work together to get oci-env to run otel setup
- decko to go from the above, to instrumenting workers
- aiohttp/asgi auto-instrument bakeoff
- ggainey to sched for an hour next Thurs

Notes

some data on performance-impact of traces : https://github.com/open-telemetry/opentelemetry-python-contrib/issues/1556
- discussion ensues
- maybe we want to test more?
if our worker span instrumentation is “feature-flippable”, we’re implying a direct dependency on otel being packaged
- prob want to discuss at pulpcore mtgs
oci-env work in progress
- really close to having an otel-env
- work continues
how’s the worker-instrumentation going to work?
- can we get a span that covers creation-to-end?
- dispatch-to-start is a good thing to know
- just span for run-to-complete is “easy”
- can we use the correlation-id as a span-id?
  - as opposed to task-uuid?
  - BUT - think about dispatching-a-task-group
what about metrics (as opposed to spans)
- auto-instrumentation setup has its own metrics
- are there Things we’d like to add to our code specifically?
  - for tasking system, almost certainly
  - per-worker metric(s)
    - fail-rate
    - task-throughput
    - what happens when workers go-away?
      - attach metrics to worker-names?
    - missing-worker-events
      - interpretation is key
    - system-metrics as a whole
      - wait-q-size
      - waiting-lock-evaluation (“concurrency opportunity”)
        - ratio tasks/possible_concurrency
        - discussion around how workers dispatch themselves
- thinking like an admin
  - do I have too much hardware in use?
  - not enough?
  - how do I know “something is going wrong”?
- “service-level-indicator”: how much time does a task wait before start
- “possible concurrency”: how many could start, assuming enough workers?
- “utilization”: what percentage of workers are “typically” busy?
  - The USE Method
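
To make the tasking metrics above concrete, here is a minimal sketch of how the "service-level indicator" (dispatch-to-start wait), the wait-queue size, and worker utilization might be computed. The `TaskRecord` shape and all function names are hypothetical; Pulp's real task model and any eventual metric names may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TaskRecord:
    """Hypothetical task timestamps, in seconds since a common epoch."""
    dispatched_at: float
    started_at: Optional[float] = None   # None while the task is still waiting
    finished_at: Optional[float] = None  # None while waiting or running


def wait_seconds(task: TaskRecord) -> Optional[float]:
    """Dispatch-to-start time; None if the task has not started yet."""
    if task.started_at is None:
        return None
    return task.started_at - task.dispatched_at


def mean_wait(tasks: Sequence[TaskRecord]) -> Optional[float]:
    """Average wait over tasks that have started; None if none have."""
    waits = [w for w in map(wait_seconds, tasks) if w is not None]
    return sum(waits) / len(waits) if waits else None


def wait_queue_size(tasks: Sequence[TaskRecord]) -> int:
    """Number of tasks dispatched but not yet started."""
    return sum(1 for t in tasks if t.started_at is None)


def utilization(busy_workers: int, total_workers: int) -> float:
    """Fraction of workers that are busy; 0.0 when there are no workers."""
    return busy_workers / total_workers if total_workers else 0.0
```

A rising mean wait alongside high utilization would suggest too little hardware, while low utilization with an empty queue would suggest too much, which maps onto the admin questions listed above.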

Action Items:

oci-env/otel work to continue to completion
decko to go from the above, to instrumenting workers
aiohttp/asgi auto-instrument bakeoff
add notes to discourse
ggainey to schedule next for one week out

ggainey · March 20, 2023, 8:09pm

2023-03-16 1400-1500 GMT-4

Attendees: ggainey, dalley, decko, bmbouter

Regrets:

Agenda:

Previous AIs:
- oci-env/otel work to continue to completion
- decko to go from the above, to instrumenting workers
- aiohttp/asgi auto-instrument bakeoff

Notes

discussion RE decko’s experiences
- worker-trace-example! Woo!
- looking at metrics in grafana - double wooo!
How do we get actual-services-admins involved in setting up kinds-of metrics/visualizations
- AI: [ggainey] invite jsherrill to come demo their Grafana env for us
What would be nice:
- Docs written from an Operational perspective
  - “Here’s a Thing you want to know, here are graphs that will help you answer it”
  - Example: “Is pulp serving content correctly? - visualize content-app status codes”
Next-steps sequence
- finish oci-env profile
- start working on some “standard” graphs
- work on how-to docs
- work on demos
- how can we merge better w/ pulpcore?
  - right way to merge new libs to project?
  - responding to various installation-scenarios
- discussion
  - single-container - s6-svc
  - what if users don’t want to spin up otel? What happens to the app?
  - pulp-otel-enabled variable - default to False
    - what does that mean?
    - does not mean that otel-libs aren’t installed (are they direct-deps or not? will be incl in img regardless)
  - multiprocess container - there’s another svc running
  - docs should call out/link to docs RE feature-flip vars for the auto-instr libs
    - able to toggle collect-data or not, for various auto-instr pieces
    - example: Django Instrumentation — OpenTelemetry Python documentation
“direct dependency vs not” discussion
- if it is, you can’t uninstall it
- not everything has to be a hard-dep
- maybe start with not-required
prioritizing aiohttp server PR might be worthwhile
- acceptance is out of our control
- will take more time to get an aiohttp-lib w/ the support released
auto-instr pkgs aren’t going to include correlation-id-support for pulp’s cids
- look at (eg) https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/instrumentation/opentelemetry-instrumentation-wsgi/src/opentelemetry/instrumentation/wsgi/__init__.py#L85
- wsgi and aiohttp should be enhanced this way
- what’s the relationship between trace-id and cid? Can we make them the same?
  - spans might end up w/ dup ids? - Prob OK
  - need to experiment/investigate
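
One way to experiment with the trace-id/cid question: an OpenTelemetry trace id is 128 bits, the same width as a UUID, so a UUID-formatted correlation id can be mapped onto a trace id without loss. The sketch below assumes the cid is a valid UUID string, which is an assumption rather than something these notes confirm; the function names are hypothetical.

```python
import uuid


def cid_to_trace_id(cid: str) -> int:
    """Map a UUID-formatted correlation id to a 128-bit integer trace id."""
    value = uuid.UUID(cid).int  # raises ValueError if cid is not a valid UUID
    if value == 0:
        # An all-zero trace id is treated as invalid by OpenTelemetry.
        raise ValueError("correlation id maps to the invalid all-zero trace id")
    return value


def trace_id_hex(trace_id: int) -> str:
    """Render a trace id as 32 lowercase hex characters."""
    return format(trace_id, "032x")
```

Span ids are only 64 bits, so a full UUID would not fit into a span id the same way. A cid that is not a UUID (for example, one supplied by a client) would raise `ValueError` here and would need separate handling, such as recording it as a span attribute instead.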

Action Items:

AI: [ggainey] invite jsherrill to come demo their Grafana env for us
add notes to discourse

ggainey · March 22, 2023, 6:21pm

2023-03-22 1330-1400 GMT-4

Attendees: decko, jsherrill, dralley, ggainey

Regrets:

Agenda:

jsherrill to show us what his team is doing w/ monitoring/metrics
NOTES
- might be “some” guidance available
- we’re all making this up as we go along
- http status/latency
- msg-latency/error-rate (like tasks?)
- some “analytics” info mixed in
- grafana dashboard to visualize
  - started from a template
  - export JSON in order to import into app
- PR for visualization changes is currently “exciting”
- having a full-time data visualization expert would be a Good Thing
- discussion around SLOs (uptime/breach rules/alerting)
- “best practices” still “up in the air”?
- there are tests for alerts
- review your output - sometimes, there are bugs
QUESTIONS
- app implements a /metrics endpoint
- gathered metrics thrown into prometheus
  - How are metrics produced to make them available to /metrics?
    - prometheus-client in go does the heavy lifting
- AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
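
For readers unfamiliar with what a /metrics endpoint serves, the sketch below renders a counter in the Prometheus text exposition format using only the standard library. In practice a client library does this work (the notes mention prometheus-client in Go); the metric name and the `render_counter` helper here are purely illustrative.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the counter's current value.
    Label values are not escaped here; a real exporter must escape
    backslashes, double quotes, and newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Prometheus then scrapes this text on an interval, and Grafana dashboards query Prometheus rather than the application itself.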

Action Items:

AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
add notes to discourse
schedule mtg for next week

ggainey · March 31, 2023, 6:06pm

2023-03-31 1330-1400 GMT-5

Attendees: dalley, decko, ggainey

Regrets:

Agenda:

Previous AIs:
- AI: [decko] investigate if there is a CoP here already around metrics/monitoring/etc, report back to group
  - There exists a Red Hat Observability CoP!

Notes

pulp-content/aiohttp instrumentation demo from decko
- traces working, still trying to get metrics
things in flight
- aiohttp package w/ instrumentation
- metrics labels
- instrumenting workers
getting pulp-api metrics, but not from pulp-content, need to understand why
A Plan:
- finish oci-env profile for otel
- figure out why we’re not getting some wsgi-instr labels
  - https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/http-metrics.md
- get a working aiohttp-instr PR submitted (based on the work of the existing ‘abandoned’ PR)

Action Items:

add notes to discourse
ggainey to sched next mtg for next Thurs

ggainey · April 6, 2023, 5:30pm

2023-04-06 1000-1030 GMT-5

Attendees: decko, dalley, ggainey

Regrets:

Agenda:

Previous AIs:

Notes

PRs are in-progress, to be submitted this week
HMS mtg taught some things we’ll prob steal
discussion around a plugin-approach to making otel available
- maybe just hooks in core, that do nothing w/out pulp_otel installed?
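
The "hooks in core that do nothing without pulp_otel" idea could be sketched as below. The module name `pulp_otel` comes from the discussion above; the `span` attribute it is assumed to expose, and the helper names, are hypothetical and only illustrate the optional-import pattern.

```python
import importlib
from contextlib import contextmanager


@contextmanager
def noop_span(name, **attributes):
    """Fallback hook: accepts the same arguments and does nothing."""
    yield None


def load_span_hook(module_name="pulp_otel"):
    """Return the plugin's span hook if the plugin is importable, else the no-op."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return noop_span
    # 'span' is a hypothetical attribute name for the plugin's hook.
    return getattr(module, "span", noop_span)


span = load_span_hook()


def run_task(func, *args, **kwargs):
    """Example call site in core: same code whether or not the plugin is present."""
    with span("task.run", task_name=func.__name__):
        return func(*args, **kwargs)
```

With this shape, core carries no hard dependency on the OTel libraries, at the cost of core having to define and maintain the hook points.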

Action Items:

add notes to discourse

ggainey · April 14, 2023, 3:59pm

2023-04-13 1300-1330 GMT-5

Attendees: ggainey, decko

Regrets:

Agenda:

Notes

review/discussion of some test failures
current kind-of-a-plan for aiohttp-metrics-work
- move tests to pytest, get them running clean (#soon)
- add metrics taking advantage of this fork
- think on what tests we prob should have in addition, write them, get them running clean
- submit otel-aiohttp PR upstream
- continue adding metrics to “our” fork independently
AI: ggainey to start using oci_env profile PR for this
- https://github.com/pulp/oci_env/pull/98
- AI: [decko] get tests running locally to see why docker-side fails when podman-side succeeds
AI: review https://github.com/pulp/pulp-oci-images/pull/469
AI: decko to add what is missing from the #pulpcore/3632

Action Items:

[ggainey] to start using oci_env profile PR for this
[decko] get tests running locally to see why docker-side fails when podman-side succeeds
[any] review https://github.com/pulp/pulp-oci-images/pull/469
[decko] to add what is missing from the #pulpcore/3632
[ggainey] schedule next mtg for next week
[ggainey] add notes to discourse

ggainey · April 20, 2023, 5:42pm

2023-04-20 1300-1330 GMT-5

Attendees: decko, dralley, ggainey

Regrets:

Agenda:

Previous AIs:
- AI: ggainey to start using oci_env profile PR for this
  - https://github.com/pulp/oci_env/pull/98
  - no progress to report
- ~~AI: review https://github.com/pulp/pulp-oci-images/pull/469~~
  - merged
- Tabled for later investigation:
  - AI: decko to add what is missing from the #pulpcore/3632
  - AI: [decko] get tests running locally to see why docker-side fails when podman-side succeeds
    - still can’t figure out why docker “occasionally” fails

Notes

see instructions in the profile-readme in the oci-env PR for a how-to

Action Items:

add notes to discourse

ggainey · April 27, 2023, 6:41pm

2023-04-27 1300-1330 GMT-5

Attendees: decko, bmbouter, dralley, ggainey

Regrets:

Agenda:

Previous AIs:

Notes

decko showed off his progress
discussion around what can we do further?
- can we set things up to pre-load Grafana dashboard from prepared JSON, at start-time
what are, say, “4 things we want in a Grafana dashboard”?
- content-app
  - response-codes over time
    - organize by class (400/500/OK)?
    - show my 500s only?
  - req-latency
    - avg
    - P95
    - P99
  - “cost” items
    - how many bytes have been served
      - 202 is diff than 301
    - can we gather metrics per-domain?
      - where do/can we record that?
      - /pulp/content/DOMAIN/content-URL
      - upstream (aiohttp) vs downstream (in-pulp)
      - can we attach this data as a header to the request, and record that header?
  - proposal: ‘launch’ w/ response-codes/latency, in pretty “basic” graphs, preloaded into oci-env profile
discussion: what needs to happen to get aiohttp-instrumentation-PR merged?
- open a new PR from decko’s branch w/ orig commits against aiohttp repo?
  - (remember, based on https://github.com/open-telemetry/opentelemetry-python-contrib/pull/942)
  - https://github.com/open-telemetry/opentelemetry-python-contrib/issues/1714
discussion (brief) around a “pulp_telemetry” ‘shim’
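
As a rough illustration of the proposed launch graphs (response codes by class, plus avg/P95/P99 latency), the sketch below computes those values from raw samples. It uses the nearest-rank percentile definition, which is one of several; in a real deployment these would more likely come from Prometheus histogram buckets than from raw lists. All function names are hypothetical.

```python
import math
from collections import Counter


def status_class(code: int) -> str:
    """Bucket an HTTP status code into its class, e.g. 404 -> '4xx'."""
    return f"{code // 100}xx"


def count_by_class(codes):
    """Count responses per status class."""
    return Counter(status_class(code) for code in codes)


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies):
    """Return the avg/P95/P99 trio discussed for the request-latency graph."""
    return {
        "avg": sum(latencies) / len(latencies),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
    }
```

Showing P95 and P99 next to the average matters because a small number of slow requests can hide behind a healthy-looking mean.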

Action Items:

add notes to discourse