
Memory leak with sync.Pool in JetStreams #6442

Open
chocolacula opened this issue Feb 3, 2025 · 5 comments
Labels
defect Suspected defect such as a bug or regression

Comments


chocolacula commented Feb 3, 2025

Observed behavior

We have a few consumers for a stream with a high volume of messages, each averaging a few KB in size. The stream works perfectly fine when only the producer adds messages. However, when we add a couple of consumers to the stream, it starts leaking memory, about 500MB in 10 minutes.

Other streams with lighter workloads don't cause problems. Sometimes memory consumption either slows down or drops rapidly when we add consumers to other, seemingly random streams. For now we have to disable this stream because it consumes all 128 GB of memory on the host.

We set GOMEMLIMIT, but to be honest we expected it to help more: the memory growth slowed but didn't stop.

After some profiling and investigation, I found interesting behavior in server/stream.go. It seems you intended to reduce garbage generation using sync.Pool, but the implementation confused me a bit: I expected to see New() in the pool definition, but it isn't there. My understanding is that a new object is first Put into the pool and then retrieved with Get elsewhere. This approach is complex and makes it hard to track where objects are created, where they are Get from the pool, and where they are Put back, which might be the source of the bug. See Steps to reproduce: the number of Put operations exceeds the number of Get operations.
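
For context, here is a minimal, hypothetical sketch of the two sync.Pool styles I'm comparing, using a simplified stand-in type rather than the real jsPubMsg:

package main

import (
	"fmt"
	"sync"
)

type msg struct{ data []byte }

// Conventional style: the pool knows how to build entries via New,
// so Get never returns nil.
var poolWithNew = sync.Pool{New: func() any { return new(msg) }}

// Put-first style (roughly what server/stream.go does): no New, so Get
// can return nil and the caller allocates on a miss; finished objects
// are Put back for later reuse.
var poolWithoutNew sync.Pool

func getMsg() *msg {
	if m, ok := poolWithoutNew.Get().(*msg); ok && m != nil {
		return m
	}
	return new(msg) // pool miss: allocate a fresh message
}

func main() {
	m := getMsg()         // pool is empty, so this allocates
	m.data = m.data[:0]   // reset before returning it
	poolWithoutNew.Put(m) // a later getMsg() may now reuse it
	fmt.Printf("%T\n", poolWithNew.Get())
}

Both styles are valid; the put-first style just means allocations happen at the call sites rather than inside the pool, which is what makes them harder to track here.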

If I remove the sync.Pool usage, stubbing the helpers out like this:

// Bypass the pool entirely: always allocate a fresh message...
func getJSPubMsgFromPool() *jsPubMsg {
	return new(jsPubMsg)
}

// ...and never put anything back, leaving cleanup to the GC.
func (pm *jsPubMsg) returnToPool() {
}

then the GC works as expected and memory doesn't leak.

I hope this helps, thanks.

Expected behavior

Memory is collected by GC

Server and client version

github.com/nats-io/nats-server: various versions, including the main branch
github.com/nats-io/nats.go v1.37.0

Host environment

Ubuntu 20.04 x86 without containers.

Steps to reproduce

To investigate this, I ran BenchmarkJetStreamConsumeWithFilters with a few changes (sketched below):

  • added an atomic counter for Put in returnToPool
  • added an atomic counter for Get in getJSPubMsgFromPool
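
Roughly, the instrumentation looks like this simplified, self-contained sketch (the type and pool here are stand-ins and the function bodies are not the exact server code):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type jsPubMsg struct{ subj string } // simplified stand-in

var (
	jsPubMsgPool sync.Pool
	getCount     atomic.Int64
	putCount     atomic.Int64
)

func getJSPubMsgFromPool() *jsPubMsg {
	if pm, ok := jsPubMsgPool.Get().(*jsPubMsg); ok && pm != nil {
		getCount.Add(1) // reused from the pool
		return pm
	}
	return new(jsPubMsg) // pool miss: freshly allocated
}

func (pm *jsPubMsg) returnToPool() {
	putCount.Add(1)
	jsPubMsgPool.Put(pm)
}

func main() {
	for i := 0; i < 1000; i++ {
		pm := getJSPubMsgFromPool()
		pm.returnToPool()
	}
	fmt.Println("gets:", getCount.Load(), "puts:", putCount.Load(),
		"difference:", putCount.Load()-getCount.Load())
}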

After the benchmark finished, I got this result:

Message get counter = 2671222
Message put counter = 3342140
The get/put difference: 670918

The branch with the code.

@chocolacula chocolacula added the defect Suspected defect such as a bug or regression label Feb 3, 2025
@MauriceVanVeen (Member)

Ran BenchmarkJetStreamConsumeWithFilters with the changes from the branch, but some additions are needed:

  • There's also newJSPubMsg that can get from jsPubMsgPool and needs to increment GetCount as well.
  • Should also count all creations in both newJSPubMsg(..) and getJSPubMsgFromPool().

See also: main...maurice/6442

The jsPubMsgPool starts with no entries at all, so some need to be created first. Those entries can then be reused by getting them from the pool, and once the system is done with them it returns them to the pool to be reused again.
So, given there's no leak, we should see creates + gets = puts (i.e. returnToPool calls).

Looking at the output now, that does appear to be the case:

Message get counter = 3130181
Message create counter = 7223
Message put counter = 3137404
The get/put difference: 7223
The create/put difference: 3130181
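
That is, 3,130,181 gets + 7,223 creates = 3,137,404 puts, so the accounting balances as expected.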

Could you share the original reproducer instead of the benchmark? The benchmark likely doesn't cover some part that your reproducer does.

@chocolacula (Author)

I see, thanks for the example. Unfortunately I cannot share the original reproducer because it's part of our production infrastructure, including the services. I can share memory profiles and maybe NATS logs, but the logs are almost empty. Would that be helpful?

@MauriceVanVeen (Member)

Feel free to share those, [email protected]

Could you also share the nats stream info and nats consumer info output for one of the consumers on that stream? What's the number of messages per second?

@chocolacula (Author)

Thanks! I sent the profiles to your email; let me share the statistics here:

╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                    Streams                                                   │
├──────────────────────────────────────┬─────────────┬─────────────────────┬──────────┬─────────┬──────────────┤
│ Name                                 │ Description │ Created             │ Messages │ Size    │ Last Message │
├──────────────────────────────────────┼─────────────┼─────────────────────┼──────────┼─────────┼──────────────┤
│ object_info                          │             │ 2024-12-27 09:28:05 │ 168,399  │ 53 MiB  │ 149ms        │
│ components_info                      │             │ 2025-01-08 08:10:36 │ 88       │ 136 MiB │ 5.90s        │
╰──────────────────────────────────────┴─────────────┴─────────────────────┴──────────┴─────────┴──────────────╯

This is the part of the stream ls output with the most interesting streams. It's a test environment where we can reproduce the problem; in the production environment the number of messages is about 10 times higher.

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                             JetStream Summary                                             │
├───────────┬─────────┬─────────┬───────────┬──────────┬─────────┬────────┬─────────┬─────────┬─────────────┤
│ Server    │ Cluster │ Streams │ Consumers │ Messages │ Bytes   │ Memory │ File    │ API Req │ API Err     │
├───────────┼─────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────────┤
│ nats      │         │ 63      │ 3,132     │ 172,671  │ 254 MiB │ 0 B    │ 254 MiB │ 7,174   │ 18 / 0.250% │
├───────────┼─────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────────┤
│           │         │ 63      │ 3,132     │ 172,671  │ 254 MiB │ 0 B    │ 254 MiB │ 7,174   │ 18          │
╰───────────┴─────────┴─────────┴───────────┴──────────┴─────────┴────────┴─────────┴─────────┴─────────────╯

As you can see, these two streams carry most of our workload. A service reads messages from object_info, aggregates them, and produces to components_info; an aggregated message is about 2-3 MB. The NATS max message size is set to 8 MB.

I attached another profile to the email, captured at the moment we add a new object_info consumer with the DeliverAll policy: a simple echo service that just prints all the messages to stdout. At that moment memory consumption decreases.
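
For reference, such an echo consumer looks roughly like this sketch using nats.go's JetStream API (the subject filter and URL are placeholders, not the real configuration):

package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Ephemeral push consumer that replays the stream from the start
	// (DeliverAll) and just echoes every message to stdout.
	_, err = js.Subscribe("object_info.>", func(m *nats.Msg) {
		fmt.Printf("%s: %d bytes\n", m.Subject, len(m.Data))
		m.Ack()
	}, nats.DeliverAll())
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever while messages are delivered
}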

@MauriceVanVeen (Member)

Could you still share that stream/consumer info? I would like to see the full configuration output from those commands.
Do you perhaps use consumers with DeliverLastPerSubject, for example?
