Was looking through the backoff logic after we hit some ProvisionedThroughputExceededException errors, and noticed the following (see aws-fluent-plugin-kinesis/lib/fluent/plugin/kinesis_helper/api.rb, lines 154 to 160 at b9ab13a):
scaling_factor is always going to be a number between 0.25 and 0.35. That is not a very wide spread.
Given that the default number of retries is 3 and the scaling factor is relatively narrow, calc(count) for counts {0, 1, 2} is going to return backoffs of approximately 0.5, 1, and 2 seconds, respectively. That is under 5 seconds of total backoff, so a temporary (~5 second) spike in traffic could cause records to be dropped via a ProvisionedThroughputExceededException under the default configuration.
We're getting around this by setting retries_on_batch_request to 7 in order to give ourselves 30+ seconds of retries, but I think one could reasonably argue that the defaults here for backoffs & retries are not great.
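The arithmetic above can be sketched as follows. This is an illustrative reconstruction, not the plugin's actual code: the method names and the exact formula (2**(count + 1) times a random scaling factor) are assumptions chosen to reproduce the ~0.5/1/2-second backoffs described above.

```ruby
# Illustrative sketch of the backoff math discussed in this issue.
# NOT the plugin's actual code: names and the formula are assumptions.

# A random value in roughly (0.25, 0.35] -- a narrow spread.
def scaling_factor
  0.3 + (0.5 - rand) * 0.1
end

# Exponential backoff: roughly 0.5s, 1s, 2s for counts 0, 1, 2.
def calc(count)
  (2 ** (count + 1)) * scaling_factor
end

default_total = (0...3).sum { |n| calc(n) }  # ~3.5-4.9s with the default 3 retries
longer_total  = (0...7).sum { |n| calc(n) }  # well over 30s with 7 retries
```

Under these assumptions, the default 3 retries always finish backing off in under 5 seconds, while 7 retries comfortably exceeds the 30 seconds mentioned above.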
Thank you for your feedback from the real world. I'll consider changing the default value, but it is an incompatible change that will affect all users who rely on the default, so I'll update it when I bump the major version.
By the way, do you need more configurable parameters to adjust the backoff logic, such as base_of_scaling_factor?
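If such a knob existed, it might look like the following sketch. base_of_scaling_factor is only a name floated in this thread, not an existing option, and the formula is an assumption for illustration:

```ruby
# Hypothetical: a configurable exponent base to widen the backoff curve.
# "base" corresponds to the proposed base_of_scaling_factor; the formula
# is assumed for illustration and is not the plugin's actual code.
def calc(count, base: 2, scaling_factor: 0.3)
  scaling_factor * (base ** (count + 1))
end

calc(2)           # base 2 -> ~2.4s
calc(2, base: 3)  # base 3 -> ~8.1s, a much steeper curve
```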
re: changing the behavior w/a major version bump - Yea that makes sense.
re: more configurable parameters - Yea, I was going to suggest something like that. Seems ok to me. Or just clearly document, in the README section for retries_on_batch_request, what the default retry configuration will result in.
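For anyone landing here, a minimal sketch of a fluentd output config that raises the retry count as described above. The match pattern, stream_name, and region are placeholders; retries_on_batch_request is the option discussed in this issue (default 3):

```
<match app.**>
  @type kinesis_streams
  # stream_name and region are placeholder values
  stream_name my-stream
  region us-east-1
  # default is 3; 7 gives 30+ seconds of cumulative backoff
  retries_on_batch_request 7
</match>
```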