unpack and decompress archive formats #111
Comments
Is unpacking recursively really required? It sounds unclean to me. In general, I would have expected the
Are you saying there are packages that unpack recursively during build? Are these one-offs or entire build systems?
For log4j, wouldn’t it be a more promising strategy (in particular because I can’t promise any changes to DCS to happen quickly) to code search for usages of the log4j library (Java imports?) rather than copies of the log4j library?
Debian source packages are not exactly proper pristine clean source
only trees. Even Debian trees aren't proper pristine clean source only
trees. The same applies to other distros and upstream source tarballs
and VCS trees. There are thousands of prebuilt files, compressed files
and archive files themselves containing more prebuilt files, compressed
files or archive files, possibly recursively for multiple levels.
Indexing the files at each of the recursion levels would be useful for
many different situations, including log4j.
For example:
$ apt-file search -I dsc --regex '\.(tar(\.(gz|bz2|xz))|tgz|tbz|txz|iso|zip|jar|rar|gz|bz2|xz)$' | wc -l
42576
For log4j, as I understand it the vulnerability is in log4j itself and
the fix just disables the vulnerable feature, so you first need to find
instances of the vulnerable code in copies of log4j, then determine if
they are built into .jar files and/or exported to binary packages in
other ways, then determine if any source packages build against those
binary packages and then check if they copy files out of the log4j
containing binary packages.
https://opensourcesecurity.io/2021/12/12/log4j-is-hard-to-find-and-harder-to-fix/
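As a rough illustration of the first step above (finding copies of the vulnerable code inside .jar files), here is a minimal Python sketch. It assumes that the presence of the `JndiLookup` class, the class removed by the widely published Log4Shell mitigation, is a usable marker for a log4j-core copy. The function name `jar_has_jndi_lookup` is hypothetical, and a hit only flags a jar for review: it does not tell you whether that copy is a fixed version, and it does not look inside jars nested in other jars.

```python
import io
import zipfile

# Assumption of this sketch: the presence of this class marks a log4j-core copy.
JNDI_LOOKUP = "org/apache/logging/log4j/core/lookup/JndiLookup.class"


def jar_has_jndi_lookup(jar_bytes):
    """Return True if the jar (a zip file) lists the JndiLookup class."""
    try:
        with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
            return JNDI_LOOKUP in jar.namelist()
    except zipfile.BadZipFile:
        # Not a readable zip/jar, so there is nothing to report.
        return False
```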
For the opposite situation, where there is a common vulnerability
caused by an API with a bad design, that is where you need to find
usages of the API, determine if they are dead code or not and then fix
the ones that end up in binary packages.
For both of the above scenarios, the vulnerable code could be inside
archive files, so indexing them is still a useful thing to do.
Many of the archive files are probably unused but there is no way to
tell since any part of the source package including the upstream build
system, test suite or even code from other binary packages could be
triggering unpacking and potentially the files could be used.
I don't think the Debian security team intends to go any further than
fixing the Debian source packages of log4j, so the above feature
request is unlikely to be of assistance to them.
--
bye,
pabs
https://bonedaddy.net/pabs3/
While Debian source packages aren’t always clean source-only trees, they usually are, and I remain unconvinced that just blindly unpacking all archives will result in more valuable data in the search index afterwards.
I scrolled through a number of file names based on the command you provided, and most files look like test data, samples, etc.
When faced with embedded copies of other software, it’s generally in the package maintainer’s interest to get rid of this problem, as all the tooling works to your disadvantage otherwise. Your package will be harder to maintain, trigger more lintian warnings, etc.
I could find just one example in favor of your feature request, which is piespy, where a dependency upstream distributes their software in a jar that contains sources and binary data, and the package rebuilds from source to be DFSG-compliant: https://sources.debian.org/src/piespy/0.4.0-5/debian/rules/. Those sources are not indexed by Debian Code Search, because they’re in a .jar file until build time. Ideally, of course, that dependency would be in a separate Debian source package.
Debian Code Search hasn’t had to take a strong position regarding vendoring of dependency sources thus far (it’s happening so little within the Debian archive that we could just pretend the problem doesn’t exist). Generally, I’ve tried to keep the search results as high-quality as possible, so my first instinct is to avoid vendored sources as much as possible, but I can see that for some use-cases it would be valuable to include all vendored sources. It might be another axis of searching altogether (include/exclude vendored code).
So, to summarize: I can see the point of extracting archives, but given the numbers, I think it’d do more harm than good, and if we wanted to extract archives after all, we should probably allow including/excluding vendored code (and recognizing it as such!) first.
Summary: I agree with the approach mentioned in final paragraph, but
disagree on incidence of embedding & usefulness of indexing archives.
I'm not sure about the incidence of compressed or archived embedded
dependencies, the results for log4j look like about 5/6 source packages
embed log4j .jar files. For me that was enough that I thought I should
at least start a discussion about this.
$ apt-file search -I dsc -i --regex log4j.*jar
On the topic of embedded copies in general I think you are mistaken
about how common it is to embed copies of dependencies in upstream
tarballs distributed by Debian. For example the Firefox source package
embeds at least 64 different Python modules. The record for embedding
in Debian that I saw was about 5 layers deep of projects embedding
projects embedding projects, IIRC that was in a Qt/KDE fork of
Chromium. There are many many copies known to the Debian security team
and many many copies that they do not know about. I've come across a
lot myself, I don't bother to report them as there are basically too
many to deal with manually. Since the Debian Technical Committee
decision that approved vendored dependencies in Kubernetes, and due to
the popularity of vendoring in some communities like Golang, and due to
the declining popularity of distros amongst application authors I
believe that this trend is only going to increase in the future.
https://wiki.debian.org/EmbeddedCopies
https://lwn.net/Articles/835599/
On the topic of hiding embedded copies from the Debian Code Search
interface, I think that is a great idea. During my use of the service
to find common typos I came across lots of embedded copies (especially
in Firefox/Chromium) that I would like to have been able to hide. At
the same time I think it is important to have an option to show them,
for use-cases where they are important (like security issues).
On the topic of automatically detecting embedded copies, I would love
to have a tool for this, so if you write something for this, please
make it a separate project, with a command-line tool included.
On the topic of how to detect embedded copies, the check-all-the-things
project has heuristics to find them and a TODO item for some other
ideas for detecting them that Debian folks came up with, quoted below.
https://github.com/collab-qa/check-all-the-things
[embed-readme]
flags = embed
files = *README*
comment = Please check if these README files belong to embedded code/data copies.
command = find {cwd} -mindepth 2 -iname '*README*'
[embed-dirs]
flags = embed
comment = Please check if these directories contain embedded code/data copies.
command = find {cwd} -type d -name 'vendor*' -o -iname '*rd*party' -o -iname 3rdp -o -name contrib -o -name imports -o -name node_modules -o -iname external -o -iname externals -o -iname deps -o -name inc -o -name __pypackages__
[embed-auto-tools]
flags = todo
# I've seen configure.ac in legit subdirs, not sure how false-positivy that would be. will add anyway
# you can look for the known-metadata files in subdir
# .gitmodules :D
# setup.py, *.gemspec, package.json etc
# how about detecting mentions of subdir projects in top-level build scripts?
# no idea about details though
# also, cmake has ExternalProject_Add
# another one: different license block than the majority of files in the package
# (should work well for license blocks that name the copyright owner)
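To make the `[embed-dirs]` heuristic above concrete, here is a small Python sketch that applies the same directory-name patterns while walking a source tree. The function names `looks_embedded` and `find_embedded_dirs` are hypothetical, and restricting every pattern to directories is my assumption about the intent of the `find` command. A match is only a hint that a human should look, not proof of an embedded copy.

```python
import fnmatch
import os

# Patterns taken from the find(1) command in [embed-dirs].
# -name tests are case-sensitive, -iname tests are not.
CASE_SENSITIVE = ["vendor*", "contrib", "imports", "node_modules", "inc", "__pypackages__"]
CASE_INSENSITIVE = ["*rd*party", "3rdp", "external", "externals", "deps"]


def looks_embedded(dirname):
    """Return True if a directory name matches one of the heuristics."""
    if any(fnmatch.fnmatchcase(dirname, pat) for pat in CASE_SENSITIVE):
        return True
    lowered = dirname.lower()
    return any(fnmatch.fnmatchcase(lowered, pat) for pat in CASE_INSENSITIVE)


def find_embedded_dirs(root):
    """Return sorted paths (relative to root) of suspicious directories."""
    hits = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if looks_embedded(name):
                hits.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(hits)
```

Calling `find_embedded_dirs(".")` in an unpacked source package would list candidates such as `node_modules` or `src/third_party` for manual review.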
--
bye,
pabs
https://bonedaddy.net/pabs3/
There are thousands of files in Debian source packages that are in formats that contain other files, for example compressed files (*.gz *.bz2 *.xz etc), tarballs (*.tar *.tar.* *.t*z), zip files (*.zip *.jar), filesystems (*.iso) and other archive formats.
For situations like the log4shell security issue in log4j where there may be many embedded code copies in archive file formats, it would be useful if the Debian code search system could recursively (within limits) unpack all the archive files in Debian source packages and index those too.
I see from #80 that there is already some unpacking going on, but I assume that dcs isn't unpacking everything and isn't unpacking recursively.
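To illustrate what "recursively (within limits)" could look like, here is a minimal in-memory Python sketch using only the standard library. It is not how dcs works. The name `walk_archive`, the `!` path separator and the depth limit of 3 are all assumptions made for the example. It handles tar (including gz/bz2/xz-compressed tarballs), plain gzip, and zip/jar, and leaves everything else as an opaque leaf. A real indexer would also need size limits to guard against decompression bombs, which this sketch omits.

```python
import gzip
import io
import tarfile
import zipfile
import zlib

MAX_DEPTH = 3  # hypothetical limit on how many container layers to open


def walk_archive(name, data, depth=0, max_depth=MAX_DEPTH):
    """Yield (path, bytes) for each leaf file reachable from data.

    Containers are opened until max_depth layers have been unpacked;
    anything deeper or unrecognized is yielded unchanged.
    """
    if depth < max_depth:
        # Try tar first: a tar that ends with a zip member can otherwise
        # be mistaken for a zip file with leading junk.
        try:
            with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tf:
                for member in tf:
                    if member.isfile():
                        inner = tf.extractfile(member).read()
                        yield from walk_archive(
                            f"{name}!{member.name}", inner, depth + 1, max_depth)
            return
        except tarfile.TarError:
            pass
        # Plain gzip stream (not a tarball): decompress one layer.
        if data[:2] == b"\x1f\x8b":
            try:
                inner = gzip.decompress(data)
            except (OSError, EOFError, zlib.error):
                pass
            else:
                yield from walk_archive(
                    f"{name}!gunzip", inner, depth + 1, max_depth)
                return
        # Zip archives, which includes .jar files.
        if zipfile.is_zipfile(io.BytesIO(data)):
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                for info in zf.infolist():
                    if not info.is_dir():
                        yield from walk_archive(
                            f"{name}!{info.filename}", zf.read(info),
                            depth + 1, max_depth)
            return
    yield name, data
```

Each yielded path records the chain of containers it came from, so a search hit could still be attributed to the original archive in the source package.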