Getting timeouts on publishing many events in parallel #113

Open
Dylan-DutchAndBold opened this issue Dec 8, 2023 · 5 comments

Dylan-DutchAndBold commented Dec 8, 2023

We are having issues on our production systems where we utilise Rebus with RabbitMQ.

Problematic scenario

A third-party system posts events over HTTP to our service, which takes each one and publishes a corresponding event using Rebus.

The 3rd party system fires around 100 HTTP calls to our system at once, and unfortunately this results in timeout errors from Rebus/RabbitMQ.

This should not be an uncommon scenario.

The exception

[2023-12-08 13:47:20Z] fail: Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware[1]
      An unhandled exception has occurred while executing the request.
System.TimeoutException: The operation has timed out.
   at RabbitMQ.Util.BlockingCell`1.WaitForValue(TimeSpan timeout)
   at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply(TimeSpan timeout)
   at RabbitMQ.Client.Impl.ModelBase.ModelRpc(MethodBase method, ContentHeaderBase header, Byte[] body)
   at RabbitMQ.Client.Framing.Impl.Model._Private_ChannelOpen(String outOfBand)
   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.CreateNonRecoveringModel()
   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.CreateModel()
   at Rebus.RabbitMq.RabbitMqTransport.CreateChannel()
   at Rebus.Internals.WriterModelPoolPolicy.Create()
   at Rebus.Internals.ModelObjectPool.Get()
   at Rebus.RabbitMq.RabbitMqTransport.SendOutgoingMessages(IEnumerable`1 outgoingMessages, ITransactionContext context)
   at Rebus.Transport.AbstractRebusTransport.<>c__DisplayClass3_1.<<Send>b__1>d.MoveNext()

Sample project for reproduction

We have set up a sample project that reproduces this error. The test scenario needs a little more than 100 simultaneous requests to fail on my local system, so I have set it to 1000. Unfortunately, the failure only occurs in a scenario similar to our production system, that is, while .NET is handling an HTTP call.

We tried to reproduce the error in isolation, outside an HTTP context, but there it does not fail with the timeout. However, these tests still show that publishing 1000 messages in parallel takes a very long time to complete: about 40 seconds with Rebus, compared with about 2 seconds for a similar library (MassTransit).

https://github.com/Dylan-DutchAndBold/demonstrate-rebus-timeout-issue
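
The comparison between publishing in parallel and publishing in a loop could be measured with a small harness like the one below. This is only a sketch and is not taken from the linked repository: the `PublishTiming` class and its method names are hypothetical, and the `publish` delegate stands in for whatever call actually sends the message.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper for timing publish calls; not part of Rebus or the repro project.
public static class PublishTiming
{
    // Starts `count` publish operations without awaiting in between,
    // then waits for all of them to finish.
    public static async Task<TimeSpan> MeasureParallelAsync(int count, Func<int, Task> publish)
    {
        var stopwatch = Stopwatch.StartNew();
        await Task.WhenAll(Enumerable.Range(0, count).Select(publish));
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    // Awaits each publish operation before starting the next one.
    public static async Task<TimeSpan> MeasureSequentialAsync(int count, Func<int, Task> publish)
    {
        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < count; i++)
        {
            await publish(i);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

With a Rebus `IBus` instance named `bus` and a hypothetical event type `SomeEvent`, a call might look like `await PublishTiming.MeasureParallelAsync(1000, i => bus.Publish(new SomeEvent(i)))`. The same delegate can be passed to `MeasureSequentialAsync` to compare the two modes.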

Version information

Software               Version
Rebus                  9.0.1
Rebus.ServiceProvider  10.0.0
Rebus.RabbitMq         9.0.1
RabbitMQ               3.12.10
.NET                   7

@mookid8000 mookid8000 self-assigned this Dec 11, 2023
mookid8000 (Member) commented

Thanks for your detailed report and repro. I will have time to check it out tonight. Meanwhile, could you tell me which delivery guarantee you are using with MassTransit when you get it to send 100 messages in 2 s? Does it use publisher confirms?

Dylan-DutchAndBold (Author) commented

> Thanks for your detailed report and repro. I will have time to check it out tonight. Meanwhile, could you tell me which delivery guarantee you are using with MassTransit when you get it to send 100 messages in 2 s? Does it use publisher confirms?

You're very welcome, thank you for taking the time!

We have left both Rebus and MassTransit at their defaults in the test cases. From what I can find, MassTransit has publisher confirms enabled by default: https://masstransit.io/documentation/configuration/transports/rabbitmq#host-configuration

The MassTransit variant is included in the demonstration project. You can run the unit test for MassTransit and compare it to the unit test for Rebus to see the timing difference. It is actually 1000 messages rather than 100, because my local system needed a few more to reproduce the problem.

The highlighted tests below are the ones that show this significant timing difference when run in parallel. When publishing in a for loop, the difference is not as steep. The API integration tests are what give us the actual timeout, and they most closely resemble our production environment. For this test there is also a MassTransit version, which does not time out.

[Image: screenshot of the test results with the highlighted tests]

simongullberg commented

I'm not sure this is the issue here, but we saw a performance improvement after raising the minimum thread count on the .NET thread pool above its default. You can read more about SetMinThreads here: https://learn.microsoft.com/en-us/dotnet/api/system.threading.threadpool.setminthreads?view=net-8.0

The underlying RabbitMQ.Client library used by Rebus.RabbitMq runs on the .NET thread pool, so it is up to you to make sure you have enough threads to handle your load. RabbitMQ.Client is also synchronous and blocking, so threads just sit waiting during I/O.

Also, the MaxWriterPoolSize setting might come into play here. Maybe you should set it higher than the default value? https://github.com/rebus-org/Rebus.RabbitMq/blob/c4afc55891128aded5f61bb8a4c5c40bdb6e6aa1/Rebus.RabbitMq/Config/RabbitMqOptionsBuilder.cs#L234C12-L234C35
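
As an illustration of the SetMinThreads suggestion, the following sketch raises the thread pool minimums by a multiplier at application startup. The `ThreadPoolTuning` class and the `factor` parameter are hypothetical names chosen for this example; only the `ThreadPool` calls come from .NET itself. A suitable multiplier depends on the workload and is not something this thread establishes. The writer pool size is not shown because the exact configuration call is not quoted here.

```csharp
using System;
using System.Threading;

// Hypothetical startup helper; not part of Rebus or RabbitMQ.Client.
public static class ThreadPoolTuning
{
    // Multiplies the current minimum worker and I/O completion thread counts
    // by `factor`, capped at the pool's configured maximums.
    // Returns false if the runtime rejects the new values.
    public static bool RaiseMinThreads(int factor)
    {
        if (factor < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(factor), "factor must be at least 1");
        }

        ThreadPool.GetMinThreads(out int minWorker, out int minIo);
        ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);

        // Use long arithmetic so a large factor cannot overflow before the cap is applied.
        int newWorker = (int)Math.Min((long)minWorker * factor, maxWorker);
        int newIo = (int)Math.Min((long)minIo * factor, maxIo);

        return ThreadPool.SetMinThreads(newWorker, newIo);
    }
}
```

Calling `ThreadPoolTuning.RaiseMinThreads(2)` early in `Program.cs`, before the burst of requests arrives, would double the defaults. Raising the minimum only lets the pool create threads sooner under a burst; it does not make the blocking calls themselves any faster.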

Dylan-DutchAndBold (Author) commented

> I'm not sure this is the issue here, but we saw a performance improvement after raising the minimum thread count on the .NET thread pool above its default. You can read more about SetMinThreads here: https://learn.microsoft.com/en-us/dotnet/api/system.threading.threadpool.setminthreads?view=net-8.0
>
> The underlying RabbitMQ.Client library used by Rebus.RabbitMq runs on the .NET thread pool, so it is up to you to make sure you have enough threads to handle your load. RabbitMQ.Client is also synchronous and blocking, so threads just sit waiting during I/O.
>
> Also, the MaxWriterPoolSize setting might come into play here. Maybe you should set it higher than the default value? https://github.com/rebus-org/Rebus.RabbitMq/blob/c4afc55891128aded5f61bb8a4c5c40bdb6e6aa1/Rebus.RabbitMq/Config/RabbitMqOptionsBuilder.cs#L234C12-L234C35

Thanks Simon for the suggestion. It's really appreciated. I have tried doubling the defaults in the demo project and it still fails with timeouts.

Also, since MassTransit.RabbitMQ uses the same RabbitMQ library with its defaults and performs so differently from Rebus.RabbitMq, I do think (with some doubt) that the problem is within Rebus.

addouglas commented

We've been experiencing this issue under certain load conditions as well. We are trying the suggested workaround of raising SetMinThreads, but wanted to add that RabbitMQ.Client 7.0 contains changes to address this (unfortunately it is still in RC). A discussion with the results of patching to 7.0 is here: https://groups.google.com/g/rabbitmq-users/c/m2ur2f-foqc and the likely fix in 7.0 is here: rabbitmq/rabbitmq-dotnet-client#650
