Reducing Memory Usage during Sync

I want to collect ideas about how to reduce memory usage during sync. I’ve written up one idea that came up during a brainstorming session with @dralley.

Story #9635: As a user, I can specify the desired maximum amount of memory usage - Pulp ← I’d like to nominate adding this to pulpcore for 3.18 possibly.

What other ideas do we have?

FWIW I’m not 100% sure how much we’ll need to worry about this in practice once 1) David gets that one screwed-up MS repo fixed and 2) I get createrepo_c fixed up to stream packages properly - depends on which other plugins are having these issues if any.

I do feel like there’s a decent chance this would require overall sync pipeline changes that might force us to wait until 3.20, but even still it would be good to start working on it now along with other any other sync pipeline improvements we want to do and let it bake.

First i kind of dislike the idea of specifying the memory to use on the remote. I’d rather limit the amount per worker in config. We don’t want to provide task-worker-memory-quota to our users (way too complicated).
Second, making parts of the pipeline wait (other than the batching mechanism) is really a very complicated thing. Somehow you’d need to be able to release unfinished batches at the end of the pipeline when you run out of memory. What if the next stage closest to the end that can handle an incomplete batch needs an insane amount of fresh memory to actually get those items through?
The other thing, we can do is look closely at first stages and make sure, they handle the metadata in a streaming manner (not reading the whole PULP_MANIFEST in one go into memory).
What i like about this idea is that the whole amount of memory is capped.

There’s one other issue that we should perhaps even focus on first: Issue #9262: Django Lifecycle doubles memory usage of all "Content" objects due to change-tracking - Pulp

I won’t be at pulpcore’s meeting tomorrow so I want to leave some thoughts here. I want to revise the original idea after hearing about the deadlock concerns from @x9c4. Here’s a much simpler version of my original proposal, along with some realted-ish thoughts at the end.

Have the queue that runs after first stage be the only queue that implements memory restriction, and have that restriction occur in the put() call in first stage. The idea is to halt first stage from creating more in-memory objects and let the stages pipeline drain thereby releasing memory. Waiting on the put() will be a deadlock though if the queues and the stages in between the queues are empty, nothing would drain. We should count the objects in all the queues and the stages bathes to determine if we should expect the pipeline to drain or not. In the event the memory would exceed the limit and there are items in the queues we should cause the batching algorithm to no longer batch. In the event the memory limit is exceeded and the queue and batches of the pipeline is empty, we should consider what to do in that situation. The choices I think are, allow the memory to go higher than it should, or fail.

I’m starting to think that failing is maybe the right thing to do in that case for two reasons. (1) If an admin has specified each sync should take a max of e.g. 800MB I think that users shouldn’t be in a position to supercede that expectation. (2) in environments like k8s, they have resource limits too and if pulp were configure to e.g. 800MB per process, then it’s very likely k8s is also configured at which points its going to OOM the process anyway.

Another aspect here to consider is the memory contribution of first stage. It’s very likely that a poorly written first stage could pin so much memory that even the very first object put() would exceed the memory usage. To help identify those cases we should likely put in the failure the amount of memory in the objects of first stage to help identify that the failure here isn’t the data being processed, but the data being processed in addition to the amount of memory first stage is pinning. This will help identify that first stage just has to be more efficient to run withing that kind of memory constraint.

Two other unrelated points, +1 to the idea that an admin is the one setting this, not a user, and that it’s system wide (it’s rare I recommend a setting to work like this). Also from a convo on matrix I believe @dralley and I established that Django lifecycle is using shallow copies so in practice it’s using a “small” amount of memory and isn’t going to save much if we got rid of it. I’m glad about (2) because we use django lifecycle for the roles permissions creation significantly.

This was written hastily, so apologies. I’m in favor of a PoC of any kind (this idea or others). Also I could make a PoC if we can wait a week or two for me to return from PTO and get the other things done I’m supposed to deliver, but I’m in favor of anyone trying anything out.

I did some analysis on this issue and made the PoC. I ran some experiments and recorded their results here:

I have some conclusions and next-steps proposed here:

1 Like

Actually a huge improvement was made here: 27% memory reduction in pulp_rpm for example.