Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Startup Retry & Configurable Startup Time-out #166

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

zhiyuanliang-ms
Copy link
Contributor

@zhiyuanliang-ms zhiyuanliang-ms commented Feb 17, 2025

Why this PR?

Support startup retry and timeout.

The retry/backoff behavior is the same as .NET provider.

reference: #488

Usage

const config = await load(connectionString, { 
  startupOptions: { 
    timeoutInMs: 30_000
});

By default the startup timeout will be 100 seconds.

The pseudo code of the whole startup retry + failover logic:

function startupWithRetry():
    // Without a timeout, continuously retry the load operation
    while true:
        try:
            result = loadWithFailover()
            return result  // Load successful; exit the retry loop
        except error:
            if isRetriable(error):
                backoffDuration = calculateBackoffDuration()
                sleep(backoffDuration)  // Wait for the backoff duration before retrying
            else:
                throw error  // Non-retriable error; rethrow immediately

function loadWithFailover():
    // Retrieve the fallback client list via DNS SRV records
    fallbackClients = getFallbackClientsFromDNS()
    
    // Attempt to perform the load operation using each fallback client sequentially
    for client in fallbackClients:
        try:
            result = client.load()
            return result  // Load successful; return result immediately
        except error:
            if not isFailoverable(error):
                // If the error is not failoverable, immediately rethrow the error
                throw error
            // If the error is failoverable, continue with the next client
    
    // If all fallback clients have failed, throw a retriable error to trigger another retry
    throw new Error("All fallback clients failed")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 6 out of 8 changed files in this pull request and generated no comments.

Files not reviewed (2)
  • src/ConfigurationClientWrapper.ts: Evaluated as low risk
  • src/load.ts: Evaluated as low risk

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 6 out of 8 changed files in this pull request and generated no comments.

Files not reviewed (2)
  • src/ConfigurationClientWrapper.ts: Evaluated as low risk
  • src/load.ts: Evaluated as low risk
Comments suppressed due to low confidence (1)

src/failover.ts:35

  • The comment should be 'random value between [-1, 1) * JITTER_RATIO * calculatedBackoffDuration' for consistency.
// jitter: random value between [-1, 1) * jitterRatio * calculatedBackoffMs
@zhenlan
Copy link

zhenlan commented Feb 17, 2025

Is there a reason the retry can be disabled?

@zhiyuanliang-ms
Copy link
Contributor Author

Is there a reason the retry can be disabled?

If people want to throw immediatly when initial load fails, they can disable it. Setting a very short timeout is not feasible for this case. The retryEnabled is true by default. So, there is no behavior difference from other providers.

@zhenlan
Copy link

zhenlan commented Feb 18, 2025

If people want to...

Anyone asked for it? What's the scenario for it?

@zhiyuanliang-ms
Copy link
Contributor Author

zhiyuanliang-ms commented Feb 18, 2025

If people want to...

Anyone asked for it? What's the scenario for it?

@zhenlan No one asks for it. I personally want to disable the retry so that I can know whether there is something wrong immediately.
I have experienced this: I used a wrong connection string (with correct format but wrong content), the load call lasted for 100 sec before fail. Same problem exists in .NET provider.

Anyway, I am ok with removing the retryEnabled option. But I am curious, are you worrying about that even if the default behavior is retry-enabled , user will still choose to disable it and we don't want them to disable it in most cases? Otherwise, I don't see why we cannot keep it.

@zhenlan
Copy link

zhenlan commented Feb 18, 2025

No one asks for it. I personally want to disable the retry so that I can know whether there is something wrong immediately.

If the connection string is malformed, I hope it will throw so we can fail immediately. Otherwise, you can't really tell what's wrong. Set a shorter timeout if you don't want to wait for too long.

are you worrying about that even if the default behavior is retry-enabled , user will still choose to disable it and we don't want them to disable it in most cases? Otherwise, I don't see why we cannot keep it.

Configuration is in critical code path. Most applications can't start without configuration loaded properly. Transient errors can happen for any cloud solutions. We rely on the client retry (and failover etc.) to deliver customers high availability. Please don't design a "feature" so customers can shoot their own feet.

@@ -58,7 +65,7 @@ export class ConfigurationClientManager {
this.#id = regexMatch[2];
this.#secret = regexMatch[3];
} else {
throw new Error(`Invalid connection string. Valid connection strings should match the regex '${ConnectionStringRegex.source}'.`);
throw new RangeError(`Invalid connection string. Valid connection strings should match the regex '${ConnectionStringRegex.source}'.`);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case, call it as a RangeError sounds odd, how about rename it to ArgumentError or ParameterError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RangeError is a built-in error type in JavaScript. There is no built-in ArgumentError. I did consider about create an ArgumentError type by myself. But I think it overlaps with RangeError

Copy link
Member

@avanigupta avanigupta Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like JS doesnt have anything equivalent to ArgumentError. By definition, RangeError is also not a good fit here. I'd vote for custom error that extends Error class.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 6 out of 16 changed files in this pull request and generated 2 comments.

Files not reviewed (10)
  • src/ConfigurationClientWrapper.ts: Evaluated as low risk
  • src/load.ts: Evaluated as low risk
  • test/failover.test.ts: Evaluated as low risk
  • src/keyvault/AzureKeyVaultKeyValueAdapter.ts: Evaluated as low risk
  • src/AzureAppConfigurationOptions.ts: Evaluated as low risk
  • src/refresh/RefreshTimer.ts: Evaluated as low risk
  • test/keyvault.test.ts: Evaluated as low risk
  • src/ConfigurationClientManager.ts: Evaluated as low risk
  • test/requestTracing.test.ts: Evaluated as low risk
  • test/clientOptions.test.ts: Evaluated as low risk
Comments suppressed due to low confidence (1)

src/AzureAppConfigurationImpl.ts:249

  • [nitpick] The error message 'Load operation timed out.' is unclear. Consider making it more descriptive, such as 'Loading configuration settings timed out.'
reject(new Error("Load operation timed out."));

await new Promise(resolve => setTimeout(resolve, backoffDuration));
console.warn("Failed to load configuration settings at startup. Retrying...");
}
} while (!abortSignal.aborted);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the check on abortSignal is at the end of each try operation, so there may be a scenario that when it's still in loading process but timeout happens, program still need to run a little while before abort. May be you can setup timeout for signal like this https://developer.mozilla.org/en-US/docs/Web/API/AbortSignal/timeout_static

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did think about this option, it is an alternative for using Promise.race for timeout.

If we take this option, the code will be

async function load() {
	try {
		await initializeWithRetryPolicy(AbortSignal.timeout(5200)); // 200ms longer than default backoff duration
	} catch (error) {}
}

async function initializeWithRetryPolicy(signal) {
	do {
		try {
			//--> Position A
			// load keyvalue from App Config
		} catch (error) {
		 	//--> Position B
			// wait for 5000ms (default backoff duration)
		}
	} while (signal.abort)

	if (signal.abort) {
		throw new Error("Timeout")
	}
}

Let's say one failure of load kv from App Config (position A) costs 50ms
At T0, we set timeout for the signal (At T0+5200, it should be aborted)
At T0 + 50ms, the first attempt failed and it went to position B and started to wait for 5000ms
At T0+5050ms, it started to retry and it took 50m to fail again
At T0+5100ms, failed at position A again and started to wait for 5000ms or maybe even longer due to exponential backoff duration increase
At T0+5200ms, the signal timeout, but the code is still stuck at await initialize.

program still need to run a little while before abort.

This is half-true. Unless the customer write code like this:

try {
	const setting = await load();
} catch (err) {
	// do not throw here
}

The initializeWithRetryPolicy promise will not be resolved immediately after timeout. Otherwise, error is thrown, the node process is terminated. All promises running no longer exist.

@zhiyuanliang-ms zhiyuanliang-ms force-pushed the zhiyuanliang/startup-timeout branch from 0d91484 to 9e32db4 Compare February 21, 2025 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants