
Monitor when Kinesis is over capacity #150

Closed · opened by benkap on Jun 19, 2018 · 11 comments

benkap commented Jun 19, 2018

Hi,

When our stream is over capacity, we get the following errors in the Fluentd log (as documented):

2018-06-19 10:09:44 -0400 [error]: #0 Could not put record, Error: ProvisionedThroughputExceededException/Rate exceeded for shard shardId-000000000192 in stream xxx under account xxxxxxxxxxx., Record: ["xxxxxxxx"]

We are trying to monitor those events using Fluentd's "monitor_agent" and "fluent-plugin-prometheus".
In both plugins, the following metrics:

  1. retry_count (monitor_agent)
  2. fluentd_output_status_num_errors
  3. fluentd_output_status_retry_count
  4. fluentd_output_status_retry_wait

are always zero and do not indicate any problem.

Is this by design?
What is the preferred way to monitor Fluentd when shipping data upstream to Kinesis?

adammw (Contributor) commented Jul 19, 2018

@benkap my solution thus far has been to monkey-patch the gem to add instrumentation reporting ourselves, but it would be good if the plugin had some built-in hooks for this rather than requiring something custom.

FWIW, you can see the ProvisionedThroughputExceededException errors in CloudWatch for the Kinesis stream, if that suits your use-case.
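
For anyone wanting to poll that CloudWatch signal programmatically, here is a minimal sketch using the aws-sdk-cloudwatch gem and the stream-level WriteProvisionedThroughputExceeded metric; the region and stream name below are placeholders, not values from this thread:

```ruby
# Sketch: sum the stream's throttled writes over the last hour, in 5-minute buckets.
# Region and stream name are placeholders.
require 'aws-sdk-cloudwatch'

cloudwatch = Aws::CloudWatch::Client.new(region: 'us-east-1')
resp = cloudwatch.get_metric_statistics(
  namespace: 'AWS/Kinesis',
  metric_name: 'WriteProvisionedThroughputExceeded',
  dimensions: [{ name: 'StreamName', value: 'your-stream-name' }],
  start_time: Time.now - 3600,
  end_time: Time.now,
  period: 300,
  statistics: ['Sum']
)
resp.datapoints.sort_by(&:timestamp).each do |dp|
  puts "#{dp.timestamp}: #{dp.sum.to_i} throttled records"
end
```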

benkap (Author) commented Jul 22, 2018

@adammw Thanks, we basically did the same.
By adding "@num_errors = retry_count + 1" to the "batch_request_with_retry" function in api.rb, we can update the retry count in the metrics of the built-in in_monitor_agent.rb agent. For errors, you also need to patch core Fluentd or monitor the log file...
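
A rough, hypothetical sketch of the kind of patch described above. The module path and method signature are assumptions based on this thread's references to kinesis_helper/api.rb and may differ between plugin versions, so treat this as an illustration rather than the actual change:

```ruby
# Hypothetical monkey-patch: mirror the plugin's internal retry count into
# @num_errors, the counter that in_monitor_agent / fluent-plugin-prometheus read.
# Module path and signature are assumptions; verify against your installed
# version of lib/fluent/plugin/kinesis_helper/api.rb.
require 'fluent/plugin/kinesis_helper/api'

module KinesisRetryInstrumentation
  def batch_request_with_retry(batch, retry_count = 0, backoff: nil, &block)
    # Surface internal retries as output plugin errors so monitoring can see them.
    @num_errors = retry_count + 1 if retry_count > 0
    super
  end
end

Fluent::Plugin::KinesisHelper::API.prepend(KinesisRetryInstrumentation)
```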

cynipe commented May 17, 2019

I'd also like to be able to monitor this with monitor_agent. Currently, the only way to notice the problem is to set a CloudWatch alarm on ProvisionedThroughputExceededException together with a fairly large retries_on_batch_request on the plugin side. If monitor_agent could observe these failures, it would be very useful, because it could be managed centrally like the retry monitoring of other plugins.

wryun commented Jan 18, 2020

FYI, this is one of the issues that https://github.com/atlassian/fluent-plugin-kinesis-aggregation addresses.

simukappu (Contributor) commented Jan 30, 2020

Hi, thank you for your feedback!
As you know, this plugin has an internal retry mechanism. The plugin itself retries sending failed records with exponential backoff and does not raise an error to Fluentd.
https://github.com/awslabs/aws-fluent-plugin-kinesis/blob/master/lib/fluent/plugin/kinesis_helper/api.rb#L89-L109
That's why monitor_agent cannot observe the retry_count metric.
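
To make that behavior concrete, here is a simplified, self-contained stand-in (not the plugin's actual code; the linked api.rb is the real implementation): only the failed subset of a batch is re-sent with exponential backoff, and nothing is ever raised back to Fluentd, so Fluentd's own retry/error counters stay at zero.

```ruby
BASE_BACKOFF = 0.1
MAX_RETRIES  = 8

# Fake PutRecords call: randomly "throttle" ~20% of records and return them as failed.
def put_records(records)
  records.select { rand < 0.2 }
end

def batch_request_with_retry(records, retry_count = 0)
  failed = put_records(records)
  return if failed.empty?
  if retry_count < MAX_RETRIES
    sleep(BASE_BACKOFF * (2**retry_count))            # exponential backoff
    batch_request_with_retry(failed, retry_count + 1) # re-send only the failed subset
  else
    warn "Could not put #{failed.size} records"       # logged only; never raised to Fluentd
  end
end

batch_request_with_retry((1..500).to_a)
```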

One option is to add a configuration option that makes the plugin raise an error to Fluentd instead of retrying itself. However, the Kinesis PutRecords API may return a partially successful response.
https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html
When the plugin raises the error, Fluentd will retry sending all records in the failed chunk, and already-sent data may be duplicated.

Another option is to update monitoring metrics from the internal retry mechanism. However, we would have to define more detailed requirements.

Do you have more detailed requirements? We would appreciate it if you could give us feedback.
Thank you!

benkap (Author) commented Mar 19, 2020

I think that for monitoring Fluentd, the back-off and partial success are the less important metrics. While you could count each back-off as a retry event, I think it may inflate the metric and generate false positives.

The metric that I'm sure will always be helpful is a "count of failed records", whether from the maximum attempts being reached or from other causes (network, permissions, etc.). I don't think writing the error to the log (which is required regardless) is a good way to monitor Fluentd; that's why it has its monitor agent. However, if I'm not mistaken, the monitor agent does not expose or count "failed records" without code modification.

Regarding the retry metric, if the exponential back-off incremented the retry count, one could use this metric to monitor shard capacity, which can be useful information to have.

simukappu (Contributor) commented Nov 1, 2020

We will add a raise_error_on_batch_request_failure configuration option affecting the Fluent::Plugin::KinesisHelper::API#give_up_retries method. This option lets you choose between giving up on the records (current behavior) or raising an error and returning the chunk to Fluentd for retrying (new behavior, which may produce duplicate records due to Fluentd's retrying).
When raise_error_on_batch_request_failure is false (the default), the give_up_retries method will call @counter_mutex.synchronize { @num_errors += 1 } so that you can monitor batch request failures from "monitor_agent" or "fluent-plugin-prometheus".
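
A minimal sketch of the two behaviors described here, using the option name from this comment (the name in the released version may differ; see the commit referenced below). The failed-record struct, logger, mutex, and counter are stand-ins for the plugin's internals:

```ruby
require 'logger'

FailedRecord = Struct.new(:error_code, :error_message)

class GiveUpRetriesSketch
  attr_reader :num_errors

  def initialize(raise_error_on_batch_request_failure: false)
    @raise_error_on_batch_request_failure = raise_error_on_batch_request_failure
    @counter_mutex = Mutex.new
    @num_errors = 0                  # the counter monitor_agent / prometheus expose
    @log = Logger.new($stderr)
  end

  def give_up_retries(failed_records)
    failed_records.each do |r|
      @log.error("Could not put record, Error: #{r.error_code}/#{r.error_message}")
    end
    if @raise_error_on_batch_request_failure
      # Fluentd keeps the chunk and retries all of it, which may duplicate the
      # records that already succeeded in the partially successful PutRecords call.
      raise "batch request failed for #{failed_records.size} records"
    else
      # Default: give up on the records but make the failure visible to monitoring.
      @counter_mutex.synchronize { @num_errors += 1 }
    end
  end
end

sketch = GiveUpRetriesSketch.new
sketch.give_up_retries([FailedRecord.new('ProvisionedThroughputExceededException', 'Rate exceeded')])
puts sketch.num_errors # => 1
```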

We would appreciate your feedback. Thanks!

simukappu (Contributor) commented

Hi @benkap, @wryun, @cynipe, @adammw

Thank you for your feedback! I've opened pull request #211, which includes a feature enhancement to monitor batch request failures and retries. Could you review it?

simukappu added a commit that referenced this issue Feb 6, 2021
…d_records_after_batch_request_retries with default true - #150
simukappu (Contributor) commented

We've just published this monitoring feature as v3.4.0.rc1. We will collect a little more feedback on the RC release, then release it as a mainstream version.
If you could try v3.4.0.rc1 and give us any feedback, we would appreciate it. Thank you!

simukappu (Contributor) commented

We've published v3.4.0.rc2. We will collect a little more feedback on the RC release, then release it as a mainstream version.
If you could try v3.4.0.rc2 and give us any feedback, we would appreciate it. Thank you!

simukappu (Contributor) commented

We just published v3.4.0 including this monitoring feature. Thank you!
