Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix enclave panic reporting so that it always shows the error #3274

Merged
merged 3 commits into from
Mar 23, 2023

Conversation

cbeck88
Copy link
Contributor

@cbeck88 cbeck88 commented Mar 23, 2023

Most of the time when an enclave panics, particularly during local development, the process dies with a hard abort before slog gets a chance to flush it's buffer.

This commit makes it emit the panic message on STDERR first, using eprintln, which means it will be in an OS buffer before the process ends. We continue to write it as well as to slog, which writes to STDOUT.

I see no reason not to do this, it's a massive quality of life improvement for anyone actually working on enclave-impl's.

Without this patch, 100% of the time i do not get panic messages from the enclave, and just a core dumped (aborted) message.

Most of the time when an enclave panics, particularly during local
development, the process dies with a hard abort before slog gets a
chance to flush it's buffer.

This commit makes it emit the panic message on STDERR first, which
means it will be in an OS buffer before the process ends.
We continue to write it as well as to slog, which writes to STDOUT.

I see no reason not to do this, it's a massive quality of life
improvement for anyone actually working on enclave-impl's.
@cbeck88 cbeck88 requested review from jcape, eranrund, a team and varsha888 and removed request for a team March 23, 2023 03:31
Copy link
Contributor

@jcape jcape left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, in sgx-std we're just writing the bytes to stderr, because:

  1. There's no expectation the logging system would get involved in a regular panic, so stderr needs to get captured somehow anyways
  2. It doesn't seem like there's any appetite to do anything more sophisticated in production.

@cbeck88 cbeck88 requested a review from nick-mobilecoin March 23, 2023 03:58
@cbeck88
Copy link
Contributor Author

cbeck88 commented Mar 23, 2023

There's no expectation the logging system would get involved in a regular panic,

I think eran has made it this way for most of our services though:

https://github.com/mobilecoinfoundation/mobilecoin/blob/master/common/src/panic_handler.rs#L18

I am not sure if we capture anything other than the slog output right now in production

@cbeck88
Copy link
Contributor Author

cbeck88 commented Mar 23, 2023

I think everything in production is calling this today:

pub fn setup_panic_handler() {

@jcape
Copy link
Contributor

jcape commented Mar 23, 2023

There's no expectation the logging system would get involved in a regular panic,

I think eran has made it this way for most of our services though:

https://github.com/mobilecoinfoundation/mobilecoin/blob/master/common/src/panic_handler.rs#L18

This is starting to feel like we should be trying to panic in the logging OCALL (i.e. in untrusted). It's not necessary for this PR, but we can bake this in the SGX2/DCAP stuff.

I am not sure if we capture anything other than the slog output right now in production

AFAIK, it's the other way around: we're only capturing stdout/stderr in production. slog is only even involved because the logs it writes go to stdout/stderr. c.f. #2273

@cbeck88
Copy link
Contributor Author

cbeck88 commented Mar 23, 2023

This is starting to feel like we should be trying to panic in the logging OCALL (i.e. in untrusted). It's not necessary for this PR, but we can bake this in the SGX2/DCAP stuff.

Yeah that's a good call -- propagate the panic from the enclave to the outside.

In principle, we could try to keep the server up and bring back a new copy of the enclave? but I think in reality there's no good reason to attempt that.

@cbeck88
Copy link
Contributor Author

cbeck88 commented Mar 23, 2023

In unwind mode it might be fucked up i guess, if rust is trying to get a backtrace through the ocall?

@jcape
Copy link
Contributor

jcape commented Mar 23, 2023

In unwind mode it might be fucked up i guess, if rust is trying to get a backtrace through the ocall?

The nomicon seems to indicate that it should work OK:

... If an unwinding operation does encounter an ABI boundary that is not permitted to unwind, the behavior depends on the source of the unwinding (Rust panic or a foreign exception):

  • panic will cause the process to safely abort.
  • A foreign exception entering Rust will cause undefined behavior.

AFAIK the first rule should apply since we're the one calling panic, and the function is marked extern "C". Alternatively, we could mark the OCALL "C-unwind", and it would unwind up through the generated code. [I don't think that actually will do what we want]

@cbeck88 cbeck88 merged commit 061d4e7 into master Mar 23, 2023
@cbeck88 cbeck88 deleted the fix-enclave-panic-reporting branch March 23, 2023 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants