
Memory leak with sync.Pool in JetStreams #6442

Open
chocolacula opened this issue Feb 3, 2025 · 5 comments
Labels
defect Suspected defect such as a bug or regression

Comments


chocolacula commented Feb 3, 2025

Observed behavior

We have a few consumers for a stream with a high volume of messages, each averaging a few KB in size. The stream works perfectly fine when only the producer adds messages. However, when we add a couple of consumers to the stream, it starts leaking memory, about 500MB in 10 minutes.

Other streams with lighter workloads don't cause problems. Sometimes memory consumption either slows down or drops rapidly when we add consumers to other, seemingly random streams. For now we have to disable this stream because it consumes all 128 GB of memory on the host.

We set GOMEMLIMIT, but to be honest we expected it to help more: the memory growth slowed but didn't stop.

After some profiling and investigation, I found interesting behavior in server/stream.go. It seems you intended to reduce garbage generation using sync.Pool, but the implementation confused me a bit: I expected to see New() in the pool definition, but it isn't there. My understanding is that a new object is first Put into the pool and then retrieved with Get elsewhere. This approach is complex and makes it hard to track where objects are created, where they are Get from the pool, and where they are Put back, which might be the source of the bug. See Steps to reproduce: the number of Put operations exceeds the number of Get operations.
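
For context, here is a minimal, hypothetical sketch of the two sync.Pool styles I'm comparing, using a simplified stand-in type rather than the real jsPubMsg:

package main

import (
	"fmt"
	"sync"
)

type msg struct{ data []byte }

// Conventional style: the pool knows how to build entries via New,
// so Get never returns nil.
var poolWithNew = sync.Pool{New: func() any { return new(msg) }}

// Put-first style (roughly what server/stream.go does): no New, so Get
// can return nil and the caller allocates on a miss; finished objects
// are Put back for later reuse.
var poolWithoutNew sync.Pool

func getMsg() *msg {
	if m, ok := poolWithoutNew.Get().(*msg); ok && m != nil {
		return m
	}
	return new(msg) // pool miss: allocate a fresh message
}

func main() {
	m := getMsg()         // pool is empty, so this allocates
	m.data = m.data[:0]   // reset before returning it
	poolWithoutNew.Put(m) // a later getMsg() may now reuse it
	fmt.Printf("%T\n", poolWithNew.Get())
}

Both styles are valid; the put-first style just means allocations happen at the call sites rather than inside the pool, which is what makes them harder to track here.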

If I remove the sync.Pool usage, stubbing the helpers out like this:

// Bypass the pool entirely: always allocate a fresh message...
func getJSPubMsgFromPool() *jsPubMsg {
	return new(jsPubMsg)
}

// ...and never put anything back, leaving cleanup to the GC.
func (pm *jsPubMsg) returnToPool() {
}

then the GC works as expected and memory doesn't leak.

I hope this helps, thanks.

Expected behavior

Memory is collected by GC

Server and client version

github.com/nats-io/nats-server: various versions, including the main branch
github.com/nats-io/nats.go v1.37.0

Host environment

Ubuntu 20.04 x86 without containers.

Steps to reproduce

To investigate this, I ran BenchmarkJetStreamConsumeWithFilters with a few changes (sketched below):

  • added an atomic counter for Put in returnToPool
  • added an atomic counter for Get in getJSPubMsgFromPool
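
Roughly, the instrumentation looks like this simplified, self-contained sketch (the type and pool here are stand-ins and the function bodies are not the exact server code):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type jsPubMsg struct{ subj string } // simplified stand-in

var (
	jsPubMsgPool sync.Pool
	getCount     atomic.Int64
	putCount     atomic.Int64
)

func getJSPubMsgFromPool() *jsPubMsg {
	if pm, ok := jsPubMsgPool.Get().(*jsPubMsg); ok && pm != nil {
		getCount.Add(1) // reused from the pool
		return pm
	}
	return new(jsPubMsg) // pool miss: freshly allocated
}

func (pm *jsPubMsg) returnToPool() {
	putCount.Add(1)
	jsPubMsgPool.Put(pm)
}

func main() {
	for i := 0; i < 1000; i++ {
		pm := getJSPubMsgFromPool()
		pm.returnToPool()
	}
	fmt.Println("gets:", getCount.Load(), "puts:", putCount.Load(),
		"difference:", putCount.Load()-getCount.Load())
}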

After the benchmark finished, I got this result:

Message get counter = 2671222
Message put counter = 3342140
The get/put difference: 670918

The branch with the code.

@chocolacula chocolacula added the defect Suspected defect such as a bug or regression label Feb 3, 2025
@MauriceVanVeen (Member)

Ran BenchmarkJetStreamConsumeWithFilters with the changes from the branch, but some additions are needed:

  • There's also newJSPubMsg that can get from jsPubMsgPool and needs to increment GetCount as well.
  • Should also count all creations in both newJSPubMsg(..) and getJSPubMsgFromPool().

See also: main...maurice/6442

The jsPubMsgPool starts with no entries at all, so some need to be created first. Those entries can then be reused by getting them from the pool, and once the system is done with them it returns them to the pool to be reused again.
So, given there's no leak, we should see creates + gets = puts (i.e. returnToPool calls).

Looking at the output now, that does appear to be the case:

Message get counter = 3130181
Message create counter = 7223
Message put counter = 3137404
The get/put difference: 7223
The create/put difference: 3130181
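
That is, 3,130,181 gets + 7,223 creates = 3,137,404 puts, so the accounting balances as expected.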

Could you share the original reproducer instead of the benchmark? The benchmark likely doesn't cover some part that your reproducer does.

@chocolacula (Author)

I see, thanks for the example. Unfortunately I cannot share the original reproducer because it's part of our production infrastructure, including the services. I can share memory profiles and maybe NATS logs, but the logs are almost empty. Would that be helpful?

@MauriceVanVeen (Member)

Feel free to share those, [email protected]

Could you also share the nats stream info and nats consumer info output for one of the consumers on that stream? What's the number of messages per second?

@chocolacula (Author)

Thanks! I sent the profiles to your email; let me share the statistics here:

╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                    Streams                                                   │
├──────────────────────────────────────┬─────────────┬─────────────────────┬──────────┬─────────┬──────────────┤
│ Name                                 │ Description │ Created             │ Messages │ Size    │ Last Message │
├──────────────────────────────────────┼─────────────┼─────────────────────┼──────────┼─────────┼──────────────┤
│ object_info                          │             │ 2024-12-27 09:28:05 │ 168,399  │ 53 MiB  │ 149ms        │
│ components_info                      │             │ 2025-01-08 08:10:36 │ 88       │ 136 MiB │ 5.90s        │
╰──────────────────────────────────────┴─────────────┴─────────────────────┴──────────┴─────────┴──────────────╯

This is the part of the stream ls output with the most interesting streams. It's a test environment where we can reproduce the problem; in the production environment the number of messages is about 10 times higher.

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                             JetStream Summary                                             │
├───────────┬─────────┬─────────┬───────────┬──────────┬─────────┬────────┬─────────┬─────────┬─────────────┤
│ Server    │ Cluster │ Streams │ Consumers │ Messages │ Bytes   │ Memory │ File    │ API Req │ API Err     │
├───────────┼─────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────────┤
│ nats      │         │ 63      │ 3,132     │ 172,671  │ 254 MiB │ 0 B    │ 254 MiB │ 7,174   │ 18 / 0.250% │
├───────────┼─────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────────┤
│           │         │ 63      │ 3,132     │ 172,671  │ 254 MiB │ 0 B    │ 254 MiB │ 7,174   │ 18          │
╰───────────┴─────────┴─────────┴───────────┴──────────┴─────────┴────────┴─────────┴─────────┴─────────────╯

As you can see, these two streams carry most of our workload. A service reads messages from object_info, aggregates them, and produces to components_info; an aggregated message is about 2-3 MB. The NATS max message size is set to 8 MB.

I attached another profile to the email, captured at the moment we add a new object_info consumer with the DeliverAll policy: a simple echo service that just prints all the messages to stdout. At that moment memory consumption decreases.
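
For reference, such an echo consumer looks roughly like this sketch using nats.go's JetStream API (the subject filter and URL are placeholders, not the real configuration):

package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Ephemeral push consumer that replays the stream from the start
	// (DeliverAll) and just echoes every message to stdout.
	_, err = js.Subscribe("object_info.>", func(m *nats.Msg) {
		fmt.Printf("%s: %d bytes\n", m.Subject, len(m.Data))
		m.Ack()
	}, nats.DeliverAll())
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever while messages are delivered
}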

@MauriceVanVeen (Member)

Could you still share that stream/consumer info? I would like to see the full configuration output from those commands.
Do you perhaps use consumers with DeliverLastPerSubject, for example?
