-
Notifications
You must be signed in to change notification settings - Fork 1k
WIP: HashFromNode and VerifyDepTree #959
Changes from 2 commits
84b3567
ae2c30e
1a76ff6
66fe4eb
5c97180
546d65f
f1d5d85
b6d16e1
9a7209d
581e51a
a298630
f0b532d
911fda4
ecbbf03
0f7b419
443fcd3
8575040
541b3f8
0a01713
8a0c365
cba6ebe
59a1046
c5be3d8
87959d0
29b1709
17e7c9d
0df696c
f02d479
e9ff6b2
fbe5d64
48cd664
845c95e
9a883e8
a252c36
83ad7c6
a9bb8cd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,145 @@ | ||
package fs | ||
|
||
import ( | ||
"crypto/sha256" | ||
"fmt" | ||
"io" | ||
"os" | ||
"path/filepath" | ||
"sort" | ||
"strconv" | ||
|
||
"github.com/pkg/errors" | ||
) | ||
|
||
const ( | ||
pathSeparator = string(filepath.Separator) | ||
skipModes = os.ModeDevice | os.ModeNamedPipe | os.ModeSocket | os.ModeCharDevice | ||
) | ||
|
||
// HashFromNode returns a deterministic hash of the specified file system node, | ||
// performing a breadth-first traversal of directories. While the specified | ||
// prefix is joined with the pathname to walk the file system, the prefix string | ||
// is eliminated from the pathname of the nodes encounted when hashing the | ||
// pathnames. | ||
// | ||
// This function ignores any file system node named `vendor`, `.bzr`, `.git`, | ||
// `.hg`, and `.svn`, as these are typically used as Version Control System | ||
// (VCS) directories. | ||
// | ||
// Other than the `vendor` and VCS directories mentioned above, the calculated | ||
// hash includes the pathname to every discovered file system node, whether it | ||
// is an empty directory, a non-empty directory, empty file, non-empty file, or | ||
// symbolic link. If a symbolic link, the referent name is included. If a | ||
// non-empty file, the file's contents are incuded. If a non-empty directory, | ||
// the contents of the directory are included. | ||
// | ||
// While filepath.Walk could have been used, that standard library function | ||
// skips symbolic links, and for now, we want to hash the referent string of | ||
// symbolic links. | ||
func HashFromNode(prefix, pathname string) (hash string, err error) { | ||
// Create a single hash instance for the entire operation, rather than a new | ||
// hash for each node we encounter. | ||
h := sha256.New() | ||
|
||
// "../../../vendor", "github.com/account/library" | ||
prefix = filepath.Clean(prefix) + pathSeparator | ||
prefixLength := len(prefix) | ||
if prefixLength > 0 { | ||
prefixLength += len(pathSeparator) // if not empty string, include len of path separator | ||
} | ||
joined := filepath.Join(prefix, pathname) | ||
|
||
// Initialize a work queue with the os-agnostic cleaned up pathname. Note | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For this to be os-agnostic, i think we need There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Absolutely correct; thanks for the nudge! |
||
// that we use `filepath.Clean` rather than `filepath.Abs`, because we don't | ||
// want the hash to be based on the absolute pathnames of the specified | ||
// directory and contents. | ||
pathnameQueue := []string{joined} | ||
|
||
for len(pathnameQueue) > 0 { | ||
// NOTE: unshift a pathname from the queue | ||
pathname, pathnameQueue = pathnameQueue[0], pathnameQueue[1:] | ||
|
||
fi, er := os.Lstat(pathname) | ||
if er != nil { | ||
err = errors.Wrap(er, "cannot Lstat") | ||
return | ||
} | ||
|
||
mode := fi.Mode() | ||
|
||
// Skip special files | ||
if mode&skipModes != 0 { | ||
continue | ||
} | ||
|
||
// NOTE: Write pathname to hash, because hash ought to be as much a | ||
// function of the names of the files and directories as their | ||
// contents. Added benefit is that even empty directories and symbolic | ||
// links will effect final hash value. | ||
// | ||
// NOTE: Throughout this function, we ignore return values from writing | ||
// to the hash, because hash write always returns nil error. | ||
_, _ = h.Write([]byte(pathname)[prefixLength:]) | ||
|
||
if mode&os.ModeSymlink != 0 { | ||
referent, er := os.Readlink(pathname) | ||
if er != nil { | ||
err = errors.Wrap(er, "cannot Readlink") | ||
return | ||
} | ||
// Write the referent to the hash and proceed to the next pathname | ||
// in the queue. | ||
_, _ = h.Write([]byte(referent)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We need to differentiate between a normal file containing only the path the symlink points to, and an actual symlink. Maybe that's happening elsewhere...? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please restate your comment. For all non-skipped file system nodes, we always hash the pathname to the node. Then if it's a symlink, we also hash the referent. Else if it's a directory, we enqueue the sorted list of children. Else if it's a regular file, we hash the file size and the file contents. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ah, if we're hashing the file size, that sorta incidentally achieves my stated goal, though maybe not entirely. an example to illustrate - say that we have two nodes, one a file, and one a symlink. the content of the file is is it true that, if the path to either of these nodes is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @sdboyer, okay after reading your example about a dozen times, I think I understand your scenario. Basically we ought to protect ourselves against scenario of accidental hash collision when the stored symbolic link referent points to node, and when a regular file's contents might match the symbolic link referent. Crazy case here, but valid. I think if we write an addition NULL byte to the hash after every entry we write to the hash. Because NULL bytes are not legal file system characters for Windows or Linux, it provides a entry terminator of sorts, kind of like a new line character would. This ought to prevent accidental hash collisions due to symlink referents matching the contents of a regular file. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, latest push writes NULL bytes after writing pathname information to the hash. |
||
continue | ||
} | ||
|
||
fh, er := os.Open(pathname) | ||
if er != nil { | ||
err = errors.Wrap(er, "cannot Open") | ||
return | ||
} | ||
|
||
if fi.IsDir() { | ||
childrenNames, er := fh.Readdirnames(0) // 0: read names of all children | ||
if er != nil { | ||
err = errors.Wrap(er, "cannot Readdirnames") | ||
// NOTE: Even if there was an error reading the names of the | ||
// directory entries, we still must close file handle for the | ||
// open directory before we return. In this case, we simply skip | ||
// sorting and adding entry names to the work queue beforehand. | ||
childrenNames = nil | ||
} | ||
|
||
// NOTE: Sort children names to ensure deterministic ordering of | ||
// contents of each directory, ensuring hash remains same even if | ||
// operating system returns same values in a different order on | ||
// subsequent invocation. | ||
sort.Strings(childrenNames) | ||
|
||
for _, childName := range childrenNames { | ||
switch childName { | ||
case ".", "..", "vendor", ".bzr", ".git", ".hg", ".svn": | ||
// skip | ||
default: | ||
pathnameQueue = append(pathnameQueue, pathname+pathSeparator+childName) | ||
} | ||
} | ||
} else { | ||
_, _ = h.Write([]byte(strconv.FormatInt(fi.Size(), 10))) // format file size as base 10 integer | ||
_, er = io.Copy(h, fh) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. unfortunately, we have to do a little more than just a straight copy here - at minimum, we have to normalize CRLF to LF so that windows machines are working on the same literal data as the rest. i know that git itself does something like this, though i gave up looking after finding my way to here. this would certainly explain why the windows appveyor tests are failing with different digests than both the travis failures and the wanted values. i'm pretty sure this will be a correct operation across all vcs, but that belief needs verification. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In my mind, at least, and correct me if you think differently, the priority is that the hash be consistent across any version of any of the operating systems supported by Go, and using any of the dep-supported VCS. Please tell me if my view is not consistent with your goals in mind. Assuming that is agreed, your point about CRLF vs LF translation might reveal quite the complex problem for us to tackle. We would need to perform CRLF translation on all text files we encounter. Obviously we don't need to mangle the bytes on disk, but just mangle the bytes as we stream them into the hash. (Also, I think we could even get away with in-memory translating every file, including binary ones, with impunity here, but this seems like a bizarre solution.) Your thoughts? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
yep! i've been worried about this one for a while 😄
after poking at this and thinking about it for a while, i think this would actually be fine - a simple scanner that just reads the bytes into memory and mangles them there, effectively treating the (a?) tricky bit is that we could no longer rely on the file size to be an accurate byte count; we could only know afterwards, after having performed the mangling. i believe that would mean we'd either have to write the file length as a suffix, or keep the entire file's contents in a temporary buffer and then pass it to the hasher after we've reached the end and gotten an accurate count. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
oh, and 💯 yes. it's worse than useless if we don't have this property. |
||
err = errors.Wrap(er, "cannot Copy") // errors.Wrap only wraps non-nil, so elide checking here | ||
} | ||
|
||
// NOTE: Close the file handle to the open directory or file. | ||
if er = fh.Close(); err == nil { | ||
err = errors.Wrap(er, "cannot Close") | ||
} | ||
if err != nil { | ||
return // early termination iff error | ||
} | ||
} | ||
|
||
hash = fmt.Sprintf("%x", h.Sum(nil)) | ||
return | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
package fs | ||
|
||
import ( | ||
"os" | ||
"path/filepath" | ||
"testing" | ||
) | ||
|
||
func TestHashFromNodeWithFile(t *testing.T) { | ||
actual, err := HashFromNode("", "./testdata/blob") | ||
if err != nil { | ||
t.Fatal(err) | ||
} | ||
expected := "9ccd71eec554488c99eb9205b6707cb70379e1aa637faad58ac875278786f2ff" | ||
if actual != expected { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: we try to follow the (general) convention used within golang/go itself, which is "want" and "got" instead of "expected" and "actual". that applies to both var names and printed output. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we also go a little further in gps by using a convention of:
which helps a bit with scannability. not that it matters with hash digests, ofc - just, again, consistency. |
||
t.Errorf("Actual:\n\t%#q\nExpected:\n\t%#q", actual, expected) | ||
} | ||
} | ||
|
||
func TestHashFromNodeWithDirectory(t *testing.T) { | ||
actual, err := HashFromNode("../fs", "testdata/recursive") | ||
if err != nil { | ||
t.Fatal(err) | ||
} | ||
expected := "432949ff3f1687e7e09b31fd81ecab3b7d9b68da5f543aad18e1930b5bd22e30" | ||
if actual != expected { | ||
t.Errorf("Actual:\n\t%#q\nExpected:\n\t%#q", actual, expected) | ||
} | ||
} | ||
|
||
var goSource = filepath.Join(os.Getenv("GOPATH"), "src") | ||
|
||
func BenchmarkHashFromNode(b *testing.B) { | ||
for i := 0; i < b.N; i++ { | ||
_, err := HashFromNode("", goSource) | ||
if err != nil { | ||
b.Fatal(err) | ||
} | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
fjdkal;fdjskc | ||
xzc | ||
axc | ||
fdsf | ||
adsf | ||
das | ||
fd |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
fjdkal;fdjskc | ||
xzc | ||
axc | ||
fdsf | ||
adsf | ||
das | ||
fd |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
fjdsakl;fd | ||
vcafcds | ||
vca |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
this file ought to be skipped |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,6 +6,7 @@ package cycle | |
|
||
import ( | ||
"cycle/one" | ||
|
||
"github.com/golang/dep/internal/gps" | ||
) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,6 +6,7 @@ package one | |
|
||
import ( | ||
"cycle/two" | ||
|
||
"github.com/golang/dep/internal/gps" | ||
) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,6 +6,7 @@ package two | |
|
||
import ( | ||
"cycle" | ||
|
||
"github.com/golang/dep/internal/gps" | ||
) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put this cool cat in
internal/gps/pkgtree
, rather thaninternal/fs
. While we break the rule a bit with the fs functions, this is a key part of the domain solution gps provides, and it should be grouped with it.pkgtree
makes sense as a a package to put it in, as this is another operation operating on trees of packages, and it shares some of the same nuanced semantics about what nodes are and aren't visited withpkgtree.ListPackages()
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, let's return
[]byte
instead ofstring
.and, style nit: please use named returns only when necessary for disambiguation of return values (which we don't need here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I buy everything you suggest here, but after working with the code, and keeping in mind the fact that
Gopkg.lock
has string versions of digest sums in them, I have a hard time seeing this as[]byte
rather than astring
.Please provide additional information that I might not be aware of to help me understand this choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's no strong argument for it - it's really just a consistency thing. i wasn't explicit about it above, but the big reason to move this into gps is because we try to treat gps as a library that has no idea that dep exists. the only exceptions to that right now are some calls into
internal/fs
, and then some tests that slipped intogps
that we need to back out.the only other place in gps with hashing is when we hash inputs (what shows up in
Gopkg.lock
asinputs-digest
). that returns a[]byte
; it does so because that's what thesha256
writer returns, and changing it would be expressing an opinion about use - and gps has no opinion about use. maybe it goes to a file, where it must be encoded - or maybe it's being written directly to a database as a bytestream. gps will, at some point in the not-so-distant-future, be moved out ofinternal
, so this isn't just our concern alone.again, i realize this is totally bikesheddable territory, and there's no definitive, clear reason. just, trying to maintain consistency in the exposed API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually do not regularly use named return values. A quick perusal of everything I have on github reveals that it's not my style. I thought I saw something else in this library that used it, or perhaps some other library, and I wanted to maintain style discipline. Thanks for the feedback!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, the named return values was an issue i learned about myself in the course of working on all this. you probably saw some of my detritus 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely understand your logic about
[]byte
andstring
. In fact, I had a follow up question for you were you to actually answer[]byte
, but you answered my follow up question with a link to other use. The question would have been, "Do we return a byte slice of the hexidecimal output, or the actual raw bytes returned by the digest?" I have made the required changes to the source code to return the raw bytes returned by the digest just as the other link does.