Monitor when Kinesis is over capacity #150
@benkap My solution thus far has been to monkey-patch the gem to add instrumentation reporting ourselves, but it would be good if the plugin had built-in hooks for this rather than having to do something custom. FWIW, you can see the ProvisionedThroughputExceededException errors in CloudWatch for the Kinesis stream, if that suits your use case.
@adammw Thanks, we basically did the same.
I would also be glad to monitor this with monitor_agent. Currently, the only way to notice the problem is to set a CloudWatch alarm on ProvisionedThroughputExceededException together with a fairly large retries_on_batch_request on the plugin side. If monitor_agent could monitor it, that would be very useful, because it could be managed centrally like the retry monitoring of other plugins.
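For reference, a minimal sketch of the plugin-side setting mentioned above, assuming a typical fluent-plugin-kinesis match section (the tag pattern, stream name, region, and retry count are placeholders):

```
<match app.**>
  @type kinesis_streams
  stream_name my-stream   # placeholder
  region us-east-1        # placeholder
  # Retry failed batch requests several times with exponential back-off
  # before giving up, so transient throughput spikes are absorbed.
  retries_on_batch_request 8
</match>
```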
FYI, this is one of the issues that https://github.com/atlassian/fluent-plugin-kinesis-aggregation addresses.
Hi, thank you for your feedback! One option is to add a configuration option so that the plugin raises an error to Fluentd instead of retrying itself. However, the Kinesis PutRecords API may return a partially successful response. Another option is to update monitoring metrics from the internal retry mechanism, but for that we have to define more detailed requirements. Do you have more detailed requirements? We would appreciate it if you could give us feedback.
I think that, for monitoring fluentd, back-off and partial success are less important metrics. While you could count each back-off as a retry event, I think that would inflate the metric and generate false positives. The metric that I'm sure would always be helpful is a count of failed records, whether the failure is due to reaching the maximum number of attempts or something else (network, permissions, etc.). I don't think writing the error to the log (which is required regardless) is a good way to monitor fluentd; that's why it has its monitor agent. However, if I'm not mistaken, the monitor agent does not expose or count failed records without code modification.

Regarding the retry metric: if the exponential back-off increments the retry count, then one could use that metric to monitor shard capacity, which can be useful information to have.
We will add a raise_error_on_batch_request_failure configuration option to the Fluent::Plugin::KinesisHelper::API#give_up_retries method. This option lets you choose between giving up the records (current behavior) or raising an error and returning the chunk to Fluentd for retrying (new behavior, which may duplicate records because of Fluentd's retries). We would appreciate your feedback. Thanks!
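A hypothetical sketch of how the proposed option could be enabled; the option name below is the one proposed in this comment and may differ from what was eventually released (see the commit reference below):

```
<match app.**>
  @type kinesis_streams
  stream_name my-stream   # placeholder
  region us-east-1        # placeholder
  retries_on_batch_request 8
  # Proposed behavior: instead of giving up failed records after the
  # retries above, raise an error so Fluentd returns the chunk and
  # retries it (which may duplicate records that were partially written).
  raise_error_on_batch_request_failure true
</match>
```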
…d_records_after_batch_request_retries with default true - #150
We've just published this monitoring feature as v3.4.0.rc1. We will collect a little more feedback on the RC release, and will then release it as a mainstream version.
We published v3.4.0.rc2. We will collect a little more feedback on the RC release, and will then release it as a mainstream version.
We just published v3.4.0, which includes this monitoring feature. Thank you!
Hi,
When our stream is over capacity, we're getting the following errors in the fluentd log (as documented):
2018-06-19 10:09:44 -0400 [error]: #0 Could not put record, Error: ProvisionedThroughputExceededException/Rate exceeded for shard shardId-000000000192 in stream xxx under account xxxxxxxxxxx., Record: ["xxxxxxxx"]
We are trying to monitor those events using fluentd's monitor_agent and fluent-plugin-prometheus.
In both plugins the metrics are always zero and do not indicate any problems.
Is this by design?
What is the preferred way to monitor fluentd when sending records upstream to Kinesis?
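For context, a minimal sketch of the monitoring setup described above, using the standard monitor_agent and fluent-plugin-prometheus sources (bind addresses and ports are the usual defaults; adjust as needed):

```
# Expose fluentd's internal plugin metrics over HTTP.
<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

# Expose a /metrics endpoint for Prometheus scraping.
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
</source>

# Collect fluentd internal metrics (buffer queue length, retry counts, etc.).
<source>
  @type prometheus_monitor
</source>

# Collect per-output metrics such as retry counts for each plugin instance.
<source>
  @type prometheus_output_monitor
</source>
```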