
Monitor when Kinesis is over capacity #150

Closed · opened by benkap on Jun 19, 2018 · 11 comments

benkap commented Jun 19, 2018

Hi,

When our stream is over capacity, we get the following errors in the Fluentd log (as documented):

2018-06-19 10:09:44 -0400 [error]: #0 Could not put record, Error: ProvisionedThroughputExceededException/Rate exceeded for shard shardId-000000000192 in stream xxx under account xxxxxxxxxxx., Record: ["xxxxxxxx"]

We are trying to monitor those events using Fluentd's "monitor_agent" and "fluent-plugin-prometheus".
In both plugins, the following metrics:

  1. retry_count (monitor_agent)
  2. fluentd_output_status_num_errors
  3. fluentd_output_status_retry_count
  4. fluentd_output_status_retry_wait

are always zero and do not indicate any problem.

Is this by design?
What is the preferred way to monitor Fluentd when shipping data upstream to Kinesis?

adammw (Contributor) commented Jul 19, 2018

@benkap my solution thus far has been to monkey-patch the gem to add instrumentation reporting ourselves, but it would be good if the plugin had some built-in hooks for this rather than requiring something custom.

FWIW, you can see the ProvisionedThroughputExceededException errors in CloudWatch for the Kinesis stream, if that suits your use-case.
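
For anyone wanting to poll that CloudWatch signal programmatically, here is a minimal sketch using the aws-sdk-cloudwatch gem and the stream-level WriteProvisionedThroughputExceeded metric; the region and stream name below are placeholders, not values from this thread:

```ruby
# Sketch: sum the stream's throttled writes over the last hour, in 5-minute buckets.
# Region and stream name are placeholders.
require 'aws-sdk-cloudwatch'

cloudwatch = Aws::CloudWatch::Client.new(region: 'us-east-1')
resp = cloudwatch.get_metric_statistics(
  namespace: 'AWS/Kinesis',
  metric_name: 'WriteProvisionedThroughputExceeded',
  dimensions: [{ name: 'StreamName', value: 'your-stream-name' }],
  start_time: Time.now - 3600,
  end_time: Time.now,
  period: 300,
  statistics: ['Sum']
)
resp.datapoints.sort_by(&:timestamp).each do |dp|
  puts "#{dp.timestamp}: #{dp.sum.to_i} throttled records"
end
```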

benkap (Author) commented Jul 22, 2018

@adammw Thanks, we basically did the same.
By adding "@num_errors = retry_count + 1" to the "batch_request_with_retry" function in api.rb, we can update the retry count in the metrics of the built-in in_monitor_agent.rb agent. For errors, you also need to patch core Fluentd or monitor the log file...
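
A rough, hypothetical sketch of the kind of patch described above. The module path and method signature are assumptions based on this thread's references to kinesis_helper/api.rb and may differ between plugin versions, so treat this as an illustration rather than the actual change:

```ruby
# Hypothetical monkey-patch: mirror the plugin's internal retry count into
# @num_errors, the counter that in_monitor_agent / fluent-plugin-prometheus read.
# Module path and signature are assumptions; verify against your installed
# version of lib/fluent/plugin/kinesis_helper/api.rb.
require 'fluent/plugin/kinesis_helper/api'

module KinesisRetryInstrumentation
  def batch_request_with_retry(batch, retry_count = 0, backoff: nil, &block)
    # Surface internal retries as output plugin errors so monitoring can see them.
    @num_errors = retry_count + 1 if retry_count > 0
    super
  end
end

Fluent::Plugin::KinesisHelper::API.prepend(KinesisRetryInstrumentation)
```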

cynipe commented May 17, 2019

I'd also like to be able to monitor this with monitor_agent. Currently, the only way to notice the problem is to set a CloudWatch alarm on ProvisionedThroughputExceededException together with a fairly large retries_on_batch_request on the plugin side. If monitor_agent could observe these failures, it would be very useful, because it could be managed centrally like the retry monitoring of other plugins.

wryun commented Jan 18, 2020

FYI, this is one of the issues that https://github.com/atlassian/fluent-plugin-kinesis-aggregation addresses.

simukappu (Contributor) commented Jan 30, 2020

Hi, thank you for your feedback!
As you know, this plugin has an internal retry mechanism. The plugin itself retries sending failed records with exponential backoff and does not raise an error to Fluentd.
https://github.com/awslabs/aws-fluent-plugin-kinesis/blob/master/lib/fluent/plugin/kinesis_helper/api.rb#L89-L109
That's why monitor_agent cannot observe the retry_count metric.
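
To make that behavior concrete, here is a simplified, self-contained stand-in (not the plugin's actual code; the linked api.rb is the real implementation): only the failed subset of a batch is re-sent with exponential backoff, and nothing is ever raised back to Fluentd, so Fluentd's own retry/error counters stay at zero.

```ruby
BASE_BACKOFF = 0.1
MAX_RETRIES  = 8

# Fake PutRecords call: randomly "throttle" ~20% of records and return them as failed.
def put_records(records)
  records.select { rand < 0.2 }
end

def batch_request_with_retry(records, retry_count = 0)
  failed = put_records(records)
  return if failed.empty?
  if retry_count < MAX_RETRIES
    sleep(BASE_BACKOFF * (2**retry_count))            # exponential backoff
    batch_request_with_retry(failed, retry_count + 1) # re-send only the failed subset
  else
    warn "Could not put #{failed.size} records"       # logged only; never raised to Fluentd
  end
end

batch_request_with_retry((1..500).to_a)
```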

One option is to add a configuration option that makes the plugin raise an error to Fluentd instead of retrying itself. However, the Kinesis PutRecords API may return a partially successful response.
https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html
When the plugin raises the error, Fluentd will retry sending all records in the failed chunk, and already-sent data may be duplicated.

Another option is to update monitoring metrics from the internal retry mechanism. However, we would have to define more detailed requirements.

Do you have more detailed requirements? We would appreciate it if you could give us feedback.
Thank you!

benkap (Author) commented Mar 19, 2020

I think that for monitoring Fluentd, the back-off and partial success are the less important metrics. While you could count each back-off as a retry event, I think it may inflate the metric and generate false positives.

The metric that I'm sure will always be helpful is a "count of failed records", whether from the maximum attempts being reached or from other causes (network, permissions, etc.). I don't think writing the error to the log (which is required regardless) is a good way to monitor Fluentd; that's why it has its monitor agent. However, if I'm not mistaken, the monitor agent does not expose or count "failed records" without code modification.

Regarding the retry metric, if the exponential back-off incremented the retry count, one could use this metric to monitor shard capacity, which can be useful information to have.

simukappu (Contributor) commented Nov 1, 2020

We will add a raise_error_on_batch_request_failure configuration option affecting the Fluent::Plugin::KinesisHelper::API#give_up_retries method. This option lets you choose between giving up on the records (current behavior) or raising an error and returning the chunk to Fluentd for retrying (new behavior, which may produce duplicate records due to Fluentd's retrying).
When raise_error_on_batch_request_failure is false (the default), the give_up_retries method will call @counter_mutex.synchronize { @num_errors += 1 } so that you can monitor batch request failures from "monitor_agent" or "fluent-plugin-prometheus".
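
A minimal sketch of the two behaviors described here, using the option name from this comment (the name in the released version may differ; see the commit referenced below). The failed-record struct, logger, mutex, and counter are stand-ins for the plugin's internals:

```ruby
require 'logger'

FailedRecord = Struct.new(:error_code, :error_message)

class GiveUpRetriesSketch
  attr_reader :num_errors

  def initialize(raise_error_on_batch_request_failure: false)
    @raise_error_on_batch_request_failure = raise_error_on_batch_request_failure
    @counter_mutex = Mutex.new
    @num_errors = 0                  # the counter monitor_agent / prometheus expose
    @log = Logger.new($stderr)
  end

  def give_up_retries(failed_records)
    failed_records.each do |r|
      @log.error("Could not put record, Error: #{r.error_code}/#{r.error_message}")
    end
    if @raise_error_on_batch_request_failure
      # Fluentd keeps the chunk and retries all of it, which may duplicate the
      # records that already succeeded in the partially successful PutRecords call.
      raise "batch request failed for #{failed_records.size} records"
    else
      # Default: give up on the records but make the failure visible to monitoring.
      @counter_mutex.synchronize { @num_errors += 1 }
    end
  end
end

sketch = GiveUpRetriesSketch.new
sketch.give_up_retries([FailedRecord.new('ProvisionedThroughputExceededException', 'Rate exceeded')])
puts sketch.num_errors # => 1
```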

We would appreciate your feedback. Thanks!

simukappu (Contributor) commented

Hi @benkap, @wryun, @cynipe, @adammw

Thank you for your feedback! I've opened pull request #211, which includes a feature enhancement to monitor batch request failures and retries. Could you review it?

simukappu added a commit that referenced this issue Feb 6, 2021
…d_records_after_batch_request_retries with default true - #150
simukappu (Contributor) commented

We've just published this monitoring feature as v3.4.0.rc1. We will collect a little more feedback on the RC release, then release it as a mainstream version.
If you could try v3.4.0.rc1 and give us any feedback, we would appreciate it. Thank you!

simukappu (Contributor) commented

We've published v3.4.0.rc2. We will collect a little more feedback on the RC release, then release it as a mainstream version.
If you could try v3.4.0.rc2 and give us any feedback, we would appreciate it. Thank you!

simukappu (Contributor) commented

We just published v3.4.0 including this monitoring feature. Thank you!
