Support for built-in ORC information #460

Open
wants to merge 7 commits into
base: main

Conversation

brenns10
Contributor

Thanks to your work today with 3ce0fee ("tests: don't clobber file in use by libelf") and a1869f9 ("Make StackFrame.name fall back to symbol/PC and add StackFrame.function_name"), this branch is fully unblocked!

Now that the module API has landed, we can usually rely on having a struct drgn_module available for a kernel module, even if the module debuginfo is not loaded (either because you have only loaded debuginfo for the kernel, or because you are using CTF). My previous ORC support required essentially re-implementing a small portion of the module API, specifically a tree that mapped address ranges to ORC data. Now, I can drop all that complexity and share this branch which I think is ready for consideration.

This branch's main goal is to enable using built-in ORC information for stack unwinding in cases where a module does not have debuginfo. The commit messages should describe everything that's important. I think the testing is okay, though I wish I could test the vmlinux ORC; I think I would need a type finder for the module API to even get initialized.

@brenns10
Contributor Author

So, the error here was that the 4.9 kernel does not support ORC, and I was passing through the lookup error for "num_orcs" in struct mod_arch_specific, rather than swallowing it (since that shouldn't interrupt a stack trace).

The interesting thing is that after correcting that, the test passes, because on 4.9, unwind can be done via frame pointers rather than ORC. So really, my stack tracing test isn't (necessarily) testing the ability to unwind with ORC, it's testing the ability to unwind with anything other than DWARF. I don't really have a knob to test this in the Python API. But unfortunately, this isn't really something easy to test in a C unit test either.

@brenns10
Contributor Author

I made a few changes:

  1. Now, we log whenever we load built-in ORC.
  2. The tests now detect the log message to verify that ORC is getting used for unwinding.
  3. The tests also skip kernel 4.9 by detecting whether the kernel has the __start_orc_unwind symbol.
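
For readers following along, here is a minimal sketch of how items 2 and 3 might look in a Python test, using only the standard library. It is not the code from this branch: `unwind_with_builtin_orc`, the log text, and the `symbols` set are hypothetical stand-ins, and the logger name "drgn" is an assumption.

```python
import logging
import unittest

# Assumption: the logger name and the message text are illustrative only.
log = logging.getLogger("drgn")


def unwind_with_builtin_orc(symbols):
    """Hypothetical stand-in for an unwind that loads built-in ORC."""
    if "__start_orc_unwind" not in symbols:
        return False
    log.debug("Loaded built-in ORC")
    return True


class BuiltinOrcTest(unittest.TestCase):
    # Hypothetical symbol table; a real test would query the kernel under test.
    symbols = {"__start_orc_unwind", "__stop_orc_unwind"}

    def test_orc_is_used(self):
        if "__start_orc_unwind" not in self.symbols:
            self.skipTest("kernel does not have ORC")
        # assertLogs fails the test if no matching record is emitted.
        with self.assertLogs("drgn", level="DEBUG") as captured:
            self.assertTrue(unwind_with_builtin_orc(self.symbols))
        self.assertTrue(any("built-in ORC" in line for line in captured.output))
```

Checking for the log record, rather than only for a successful stack trace, is what distinguishes "unwound with ORC" from "unwound with anything other than DWARF".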

@brenns10
Contributor Author

Interestingly, my test above that tested log messages failed on Python 3.6. I guess there is a difference in how the logging got initialized. I've added a context manager to explicitly set drgn's log level to DEBUG for the duration of the test, which I believe resolves the issue.
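
A context manager of this kind can be sketched with the standard library alone. The helper name `log_level` is hypothetical and this is not necessarily how the branch implements it; the sketch only shows the set-and-restore pattern.

```python
import contextlib
import logging


@contextlib.contextmanager
def log_level(logger_name, level):
    """Temporarily set a logger's level, restoring the old one on exit."""
    logger = logging.getLogger(logger_name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Restore even if the body of the with-block raises.
        logger.setLevel(previous)
```

A test would then wrap the code under test in `with log_level("drgn", logging.DEBUG):`, assuming "drgn" is the logger in question.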

Finally, I've gone ahead and added one more test, which actually does test vmlinux ORC. It creates a Program with no debuginfo, and then copies a struct pt_regs from the Program which has debuginfo, and asks to unwind that. It's quite funky, but it does work, and I tested it on 4.9 as well as 6.12.

With that, I do feel happy about the testing now!

Owner

@osandov osandov left a comment

I perused the ORC changes and they looked sane overall, but I'll have to give them a closer look after the holidays. I looked over the Module.object change more closely, so some comments there.

The Python 3.6 logging issues are probably due to our very sketchy integration with Python logging:

// This is slightly heinous. We need to sync the Python logging configuration
// with libdrgn, but the Python log level and handlers can change at any time,
// and there are no APIs to be notified of this.
//
// To sync the log level, we monkey patch logger._cache.clear() to update the
// libdrgn log level on every live program. This only works since CPython commit
// 78c18a9b9a14 ("bpo-30962: Added caching to Logger.isEnabledFor() (GH-2752)")
// (in v3.7), though. Before that, the best we can do is sync the level at the
// time that the program is created.
//
// We also check handlers in that monkey patch, which isn't the right place to
// hook but should work in practice in most cases.
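
The mechanism that comment describes can be illustrated in pure Python. This is only a sketch of the idea, not drgn's implementation (which lives in its C extension): `HookedCache` and `install_level_hook` are hypothetical names, and the approach relies on the private `Logger._cache` attribute that exists in CPython 3.7 and later.

```python
import logging


class HookedCache(dict):
    """Dict whose clear() also runs a callback (CPython 3.7+ sketch)."""

    def __init__(self, on_clear):
        super().__init__()
        self._on_clear = on_clear

    def clear(self):
        super().clear()
        self._on_clear()


def install_level_hook(logger, callback):
    """Call callback(effective_level) whenever logging clears its level cache."""
    # Relies on the private Logger._cache attribute; this is not a public API.
    logger._cache = HookedCache(lambda: callback(logger.getEffectiveLevel()))


seen = []
logger = logging.getLogger("sketch.sync")
install_level_hook(logger, seen.append)
logger.setLevel(logging.DEBUG)  # setLevel() clears the cache of every logger
```

On interpreters without the cache (before 3.7) nothing ever calls `clear()`, which matches the comment's point that the level can then only be synced once, at program creation.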

@brenns10
Contributor Author

Ahh, thank you! I knew about the loglevel monkey-patching, but I hadn't read the full comment, so I missed the bit about syncing the log level when the Program is created. That makes sense, and I can easily fix it.

And yeah, I should have made it clearer that I wasn't hoping for quick action before the holidays :) All the quick updates have been because I've been nerd-sniped by the tests, which are really fun to make work.

@brenns10
Contributor Author

At your leisure: How do you feel about the API choice of Module.object having an absent object, instead of None, when there is no object present? On the one hand, it seems nice to not have an Optional[Object] type annotation. But on the other, part of me feels like it's overloading the concept of an absent object.

@osandov
Owner

osandov commented Dec 21, 2024

Stick with the absent object, I prefer that over None in this case.

@brenns10
Contributor Author

brenns10 commented Jan 7, 2025

There was a strange issue with GitHub CI (a 504 error fetching kernels). Assuming the checks pass this time, this is ready for review again at your convenience. Hope you had a great holiday!

@brenns10 brenns10 mentioned this pull request Jan 7, 2025
6 tasks
Owner

@osandov osandov left a comment

I'm happy with the overall structure, just some minor comments. Thanks!

This will allow orc_version_from_header() to be reused for upcoming ORC
integration that does not use libelf.

Signed-off-by: Stephen Brennan <[email protected]>
Signed-off-by: Stephen Brennan <[email protected]>
This allows users to get the object which the module was created from.
The primary use case is for Linux kernel modules, to return the "struct
module" associated with the drgn module object.

To simplify the implementation, we've added the restriction that
Program.linux_kernel_loadable_module(), and the associated C APIs, will
only accept objects of type "struct module *".

Signed-off-by: Stephen Brennan <[email protected]>
Owner

@osandov osandov left a comment

A few more minor nitpicks, plus one bug in "orc_info: apply debug file bias to pc_base at load" that I missed last time. Thanks!

@brenns10
Contributor Author

Ok, I've gone ahead and addressed this batch of feedback. This time it didn't require a rebase to pull in recent changes from main. So it should be easier to see the changes from the previous revision.

Thank you for the catch on remove_fdes_from_orc(). In addition to resolving those finds, I did add a case to allow loading built-in ORC when there's a debug_file that does not have the ORC sections. In practice, after the remove_fdes_from_orc() step, it seems like most modules end up with just one ORC entry, so I'm not entirely certain whether it would actually be worth the cost.

ORC has always been loaded from the ELF debug file. However, ORC is
present in the memory pages of kernel core dumps, so it can still be
used when the debug file is unavailable. Implement the ability to load
built-in ORC for vmlinux and kernel modules. We still prefer to load ORC
from the debug file wherever possible, because this is almost certainly
faster.

Signed-off-by: Stephen Brennan <[email protected]>
When looking up CFI rules using ORC, we apply module->debug_file_bias to
pc_base.  This made sense when the ORC was always loaded from an ELF
debug file. However, now that built-in ORC can be loaded, this only
works for ORC when debug_file_bias is zero; that is, when there is no
debug_file.

This is a problem, because it's possible that a debug_file is loaded
when built-in ORC is used. It can happen either when the debug_file has
no ORC sections present, or when the debug_file is loaded after the
built-in ORC is.

To avoid this, we define the pc_base as the biased (runtime) address of
the orc_unwind_ip section. This is already the case for built-in ORC,
but when we load it from the debug_file, we must apply the bias. In
cases such as remove_fdes_from_orc(), which want to compare ORC PCs
against unbiased addresses, they'll need to subtract the bias to match
the relevant file.

Signed-off-by: Stephen Brennan <[email protected]>
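
The address arithmetic in the commit message above can be worked through with a small sketch. All values and function names here are made up for illustration, and the sketch assumes the bias is the amount added to a file address to obtain the runtime address, which is how the message reads.

```python
# All addresses below are made-up example values, not taken from a real kernel.
DEBUG_FILE_BIAS = 0x1000_0000          # runtime address minus file address
ORC_UNWIND_IP_FILE_ADDR = 0x0040_0000  # section address recorded in the debug file


def pc_base_from_debug_file(section_file_addr, bias):
    """ORC loaded from a debug file: apply the bias once, at load time."""
    return section_file_addr + bias


def pc_base_from_memory(section_runtime_addr):
    """Built-in ORC: the section address is already a runtime address."""
    return section_runtime_addr


def orc_pc_to_file_addr(runtime_pc, bias):
    """For comparisons against unbiased (file) addresses, subtract the bias."""
    return runtime_pc - bias
```

With this convention both load paths produce the same `pc_base`, so the lookup code no longer needs to know where the ORC data came from.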
Arrays may not be declared using a "const" variable for the size, so we
need to use a macro for this.

Signed-off-by: Stephen Brennan <[email protected]>
Loading built-in ORC is difficult to test: it is best
tested when there is no debuginfo file. Thus, we add two tests: one
simpler test in which the kernel has debuginfo, but a module does not,
and we must unwind a stack with functions from the module. The second
test is more complex, where we create a program with no debuginfo at
all, and provide it just enough data to initialize the module API and
unwind with built-in ORC.

In both cases, to verify that drgn is actually using ORC, we capture its
log messages.

Signed-off-by: Stephen Brennan <[email protected]>
@brenns10
Contributor Author

brenns10 commented Feb 21, 2025

The failure above was in 4.9, which doesn't have ORC. It was a really good catch as well. I should not return an error from drgn_read_vmlinux_orc() if the "non-optional" symbols are not found - this is an expected case where we don't load anything and just return NULL. The failure only started now because with my recent update, drgn checks for built-in ORC after it fails to find it in the debug file. However, the error would have happened with any non-ORC kernel, if you tried to do a stack trace without having a debug file loaded too.
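
The control flow described here can be modeled in a few lines of Python. This is only a sketch of the behavior, not drgn's C code: `read_builtin_orc`, `load_orc`, and the choice of which symbols count as required are assumptions for illustration.

```python
from typing import Mapping, Optional

# Assumption: which symbols are treated as required is illustrative only.
REQUIRED_SYMBOLS = ("__start_orc_unwind", "__stop_orc_unwind")


def read_builtin_orc(symbols: Mapping[str, int]) -> Optional[dict]:
    """Return ORC section bounds, or None when the kernel has no ORC."""
    if any(name not in symbols for name in REQUIRED_SYMBOLS):
        return None  # expected on non-ORC kernels, so not an error
    return {name: symbols[name] for name in REQUIRED_SYMBOLS}


def load_orc(debug_file_orc: Optional[dict],
             symbols: Mapping[str, int]) -> Optional[dict]:
    """Prefer ORC from the debug file, then fall back to built-in ORC."""
    if debug_file_orc is not None:
        return debug_file_orc
    return read_builtin_orc(symbols)
```

Returning `None` instead of raising lets the caller move on to another unwinding method, such as frame pointers on a 4.9 kernel.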
