
StreamReadConstraints: add maxStringLen #864

Merged
cowtowncoder merged 24 commits into FasterXML:2.15 on Feb 14, 2023

Conversation

@pjfanning (Member) commented Dec 15, 2022

Relates to #863

  • Open for discussion
  • The TextBuffer append methods have been updated to police the string length. This will lead to more timely exceptions than waiting until the contentsAsString method is called (a minimal sketch of the idea follows this list).
  • A new class, ReadConstrainedTextBuffer, has been created. The legacy TextBuffer can still be used in cases where you don't want to apply the read constraints.
  • Needs tests
  • Needs uptake in jackson-databind and jackson-dataformat* libs
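
To make the append-time policing concrete, here is a minimal, self-contained sketch of the idea (hypothetical class, not the PR's actual ReadConstrainedTextBuffer; plain IOException stands in for whatever exception type the PR settles on):

```java
import java.io.IOException;

// Hypothetical sketch: police accumulated length on every append so oversized
// input fails fast, instead of only when the final String is assembled.
final class LengthPolicedBuffer {
    private final StringBuilder buf = new StringBuilder();
    private final int maxStringLength; // e.g. from StreamReadConstraints

    LengthPolicedBuffer(int maxStringLength) {
        this.maxStringLength = maxStringLength;
    }

    void append(char[] chars, int offset, int len) throws IOException {
        if (buf.length() + len > maxStringLength) { // check before buffering more
            throw new IOException("String length exceeds maximum allowed (" + maxStringLength + ")");
        }
        buf.append(chars, offset, len);
    }

    String contentsAsString() throws IOException {
        if (buf.length() > maxStringLength) { // exact backstop check on assembly
            throw new IOException("String length exceeds maximum allowed (" + maxStringLength + ")");
        }
        return buf.toString();
    }
}
```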

@pjfanning pjfanning marked this pull request as draft December 15, 2022 14:09
@pjfanning (Member Author)

@cowtowncoder I think it would be tidier to add a constructor to TextBuffer where a streamReadConstraints can be supplied. Then all the methods that currently take a streamReadConstraints parameter would be changed to no longer take it. Otherwise we could end up with virtually every method on TextBuffer taking streamReadConstraints as input.

For 'write' use cases like SegmentedStringWriter, we would support creating TextBuffer with no streamReadConstraints, and no limits would be applied by TextBuffer.
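
A rough sketch of the two construction paths being proposed (the two-argument constructor is hypothetical here; the merged API may differ):

```java
import com.fasterxml.jackson.core.StreamReadConstraints;
import com.fasterxml.jackson.core.util.BufferRecycler;
import com.fasterxml.jackson.core.util.TextBuffer;

final class TextBufferFactories {
    // Read path: constraints supplied once at construction, enforced internally
    static TextBuffer forReading(BufferRecycler br, StreamReadConstraints constraints) {
        return new TextBuffer(br, constraints); // hypothetical constructor shape
    }

    // Write path (e.g. SegmentedStringWriter): no constraints, so no limits applied
    static TextBuffer forWriting(BufferRecycler br) {
        return new TextBuffer(br);
    }
}
```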

@cowtowncoder (Member)

> @cowtowncoder I think it would be tidier to add a constructor to TextBuffer where a streamReadConstraints can be supplied. Then all the methods that currently take a streamReadConstraints parameter would be changed to no longer take it. Otherwise we could end up with virtually every method on TextBuffer taking streamReadConstraints as input.
>
> For 'write' use cases like SegmentedStringWriter, we would support creating TextBuffer with no streamReadConstraints, and no limits would be applied by TextBuffer.

I really do not like the idea of TextBuffer being tied to StreamReadConstraints tho...

@cowtowncoder (Member)

Ok so yeah, thank you for starting with this! I think the place to check for exceeding limit(s) is when the caller asks for a new segment to use (usually as a side effect of "finishing" the current segment).

But the question of whether to construct TextBuffer with constraints, or always pass them in, is... I can see why this is challenging.

It may be necessary to see what to do with SegmentedStringWriter, which (as you noticed) should not be affected by limits.

@pjfanning (Member Author)

@cowtowncoder I had a look at the jackson-dataformat-csv code. It has its own TextBuffer class. It was hard to debug because it has no 'append' methods; the appending is done by leaking the internal segment state into CsvDecoder. I haven't investigated whether there is equivalent code in jackson-core, but if there is, then checking in the append methods is not enough. I suspect I don't know enough about how the TextBuffer is used.

@cowtowncoder (Member)

@pjfanning I don't remember the details of CsvTextBuffer, but I am sure we can figure that out if we can handle the standard TextBuffer. I'll see if I can find a bit more to remind myself why there is the discrepancy; I thought it had to do with buffer recycling (probably not a great reason to have a custom one).

@pjfanning (Member Author)

@cowtowncoder I did some testing today with the latest code here. The tests are in jackson-databind and exercise very large strings. I found that I had to check the length in a lot of places, because of the different paths you can hit going through the TextBuffer code.

Of all the checks, the append ones seem like they are not useful. The checks in contentsAsString seem to be needed for the edge case where segmentLen + currentLen is just a little over the limit.

@cowtowncoder (Member)

> @cowtowncoder I did some testing today with the latest code here. The tests are in jackson-databind and exercise very large strings. I found that I had to check the length in a lot of places, because of the different paths you can hit going through the TextBuffer code.
>
> Of all the checks, the append ones seem like they are not useful. The checks in contentsAsString seem to be needed for the edge case where segmentLen + currentLen is just a little over the limit.

As I've mentioned, contentsAsString() is not something I want to use. But I can accept that the various append()s are probably not worth worrying about.

@pjfanning (Member Author)

The new contentsAsString() checks could be removed; the impact is that strings a bit bigger than the supposed max can get through. It may be possible to track down somewhere else where this can be remedied.

The impact of the PR change in contentsAsString is one small extra if check (is the length greater than the limit). Is this something to worry about?

@cowtowncoder (Member)

I am not sure I am supportive of this approach: I will actually propose a somewhat different way to go about numeric checks too, trying instead to reduce the work TextBuffer has to do, not increase it.

Looking at ParserBase.resetInt() and .resetFloat(), I think I'd favor making checks from those points (for the most part) and reducing validation done in TextBuffer where possible.
I will also see if methods like TextBuffer.contentsAsDecimal() are actually needed at all.

@pjfanning (Member Author)

@cowtowncoder is this PR/issue worth pursuing for v2.15.0?

@cowtowncoder (Member)

@pjfanning I haven't had time to think more about this, but I would like to keep this open still; maybe this is the way to go. I think my main concern has always been that it'd be better to tackle length constraints before final aggregation -- probably in the spot where the next buffer is being requested (there are a couple of methods).
Do you think that'd be possible? We could also limit the new buffer size there to the maximum allowed wrt the limits, so the check would only fail when requesting the next buffer, the last one being full (see the sketch below).
(LMK if my explanation doesn't make sense, I can try to elaborate)
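
To illustrate the idea above (a hypothetical method, not jackson-core source): cap each newly requested segment by the remaining length budget, so the check only has to run at segment-allocation time, and only fails once a full buffer asks for more.

```java
import java.io.IOException;

final class SegmentAllocSketch {
    // Allocate the next segment, capped so total capacity never exceeds the limit.
    static char[] nextSegment(int totalSoFar, int preferredLen, int maxStringLength)
            throws IOException {
        if (totalSoFar >= maxStringLength) { // existing segments already hold the max
            throw new IOException("String length exceeds maximum allowed (" + maxStringLength + ")");
        }
        return new char[Math.min(preferredLen, maxStringLength - totalSoFar)];
    }
}
```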

I think the split into a length-constrained buffer and a regular one actually makes sense; the only (?) concern is whether TextBuffer recycling could be problematic. I don't think so -- BufferRecycler only handles the underlying char[], not TextBuffer?

@pjfanning (Member Author) commented Feb 3, 2023

@cowtowncoder I'm not sure whether your biggest objection is to having the checks in TextBuffer at all, or specifically in TextBuffer.contentsAsString.

In the latest iteration of this PR, I have removed the check in TextBuffer.contentsAsString. This leads to a failure in FasterXML/jackson-databind#3704 (a jackson-databind PR that relies on this one: build this branch with ./mvnw install, then run jackson-databind with the extra tests in that PR). The reason is that checking only at the buffer expansion point means we miss cases where the total size is a little bigger than the limit. Checking in contentsAsString means we keep the benefit of early failures at the buffer expansion point while also applying the limit exactly in contentsAsString. The test in PR 3704 that fails is TestBigStrings.testBigString, where the string is 1,001,000 chars and the limit is 1,000,000 chars; the test fails because we do not reject the slightly-too-large string.

I'm happy enough if we decide that the max limit is an approximate limit that errs a little (proportionally) on the side of allowing strings sized a little over the limit.

@cowtowncoder (Member)

@pjfanning First of all, apologies once again for being very slow to follow up here.

Having said that, to me an approximate check is perfectly reasonable, so the limit is sort of a minimum -- it will not fail below that level. Also, relative accuracy matters more for bigger Strings; there's no need to worry about super low limits (IMO) like keeping things to 1k or something. Those should, I think, be handled by something else. At this level we worry about service-protection kinds of things.

My only concern about relying on contentsAsString() is if it's the only check and we could, f.ex., read 400k even if the limit was 10k or so. I do not mind also having the check there if that makes sense.

I hope to get back to checking this out soon, sometime this week.

And I will try to keep thinking about the other dimension: nesting. It's even more performance sensitive, as the best place to add limits would be within StreamReadContext / StreamWriteContext. But changes to that interface get... tricky.

@cowtowncoder (Member)

Ok, this is looking pretty good, and I think I am starting to feel better about having a length-limited TextBuffer sub-class.
So I think this might be the way to go!

Couple of questions, thoughts:

  1. CSV module has CsvTextBuffer; I wonder if we could somehow make the CSV module use the standard one(s) instead -- I don't quite remember why I did the "fork" here :-/
  2. Although 1M characters sounds reasonable, I have a nagging feeling someone somewhere might still use slightly bigger sizes. So I wonder if there'd be an alternative bigger limit that makes sense. I think 10 megs is quite common for maximum uploadable file/doc size, fwiw.

@pjfanning (Member Author)

I would ignore CSV for the moment. jackson-databind and the other dataformat modules would need to be updated to add tests; we could circle back to the CSV module later. As part of adding a test, we can see whether it is easier to recreate the new feature in CsvTextBuffer or to change the CSV module to use TextBuffer. I tried the latter before and it didn't look straightforward. At this stage, CsvTextBuffer is something of its own beast.

I can look at changing the 1m default to a 10m default.
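
For reference, the limit discussed here is user-configurable through the builder APIs, roughly along these lines (a sketch against the 2.15-style API):

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.StreamReadConstraints;

public class ConstraintsConfig {
    // Raise the per-string limit from the default to 10M chars for one factory
    static JsonFactory bigStringFactory() {
        return JsonFactory.builder()
                .streamReadConstraints(StreamReadConstraints.builder()
                        .maxStringLength(10_000_000)
                        .build())
                .build();
    }
}
```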

```diff
@@ -128,6 +128,34 @@ public void testLongUnicodeStrings() throws IOException
         _testStrings(f, input, data, 1, 1);
     }

+    public void testLongAsciiStringsSmallLimit() throws IOException
```
@pjfanning (Member Author)

@cowtowncoder the test I added passes when it is run by itself but fails when run with other tests. It seems that running other tests with the default StreamReadConstraints leaves behind state -- maybe in the BufferRecyclers -- that means the validateStringLength checks don't get called.

@cowtowncoder (Member)

Yes, due to buffer recycling it is possible that a recycled buffer has already reached a size of up to 64k chars.

So the test should probably be made much less strict: set the maximum length to, say, 100k, produce a 200k String, and verify that gets caught (sketched after this comment).

Another thing we could consider: making MAX_SEGMENT_LEN no higher than maxStringLength.
Then again... maybe not worth bothering yet.
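
A sketch of the less-strict test shape suggested above (hypothetical test class, assuming the 2.15-style builder API and StreamConstraintsException):

```java
import java.util.Arrays;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.StreamReadConstraints;
import com.fasterxml.jackson.core.exc.StreamConstraintsException;

public class BigStringLimitSketch {
    // Limit (100k) is well above the ~64k recycled-buffer size, so buffer
    // recycling cannot mask the check; the 200k value must get caught.
    public void testStringOverLimit() throws Exception {
        JsonFactory f = JsonFactory.builder()
                .streamReadConstraints(StreamReadConstraints.builder()
                        .maxStringLength(100_000)
                        .build())
                .build();
        char[] chars = new char[200_000];
        Arrays.fill(chars, 'a');
        String doc = "[\"" + new String(chars) + "\"]";
        try (JsonParser p = f.createParser(doc)) {
            p.nextToken(); // START_ARRAY
            p.nextToken(); // VALUE_STRING
            p.getText();   // should trip the length check
            throw new AssertionError("Expected string-length constraint failure");
        } catch (StreamConstraintsException e) {
            // expected: 200k chars exceeds the 100k limit
        }
    }
}
```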

@cowtowncoder (Member)

> I can look at changing the 1m default to a 10m default.

Just realized that I didn't mean to suggest 10 megs as the default necessarily, so much as a data point: if the max document is 10 megs, then perhaps a max value length of 1 meg is reasonable.

@cowtowncoder (Member)

@pjfanning Ok, I am almost ready to merge.

And now I think that I actually agree with you on also checking the length in contentsAsString(): we may get the check to fail earlier on buffer allocation, but if not, this is where we get the exact check that users expect.

Doing this would make the tests pass, I think, since buffer size wouldn't affect the eventual check accuracy; we would temporarily hold longer content but catch it when the String is actually accessed.

@pjfanning (Member Author)

> @pjfanning Ok, I am almost ready to merge.
>
> And now I think that I actually agree with you on also checking the length in contentsAsString(): we may get the check to fail earlier on buffer allocation, but if not, this is where we get the exact check that users expect.
>
> Doing this would make the tests pass, I think, since buffer size wouldn't affect the eventual check accuracy; we would temporarily hold longer content but catch it when the String is actually accessed.

I added back some checks to contentsAsString and contentsAsArray. I think the checks are strategically placed so that we don't validate too often, but it would be worth going through them again to make sure none are superfluous.

@cowtowncoder (Member) left a comment

I think this looks good -- we could possibly avoid some checks (resetWithString()), but nothing that seems like it'd have a measurable performance impact.

If you change this from draft to a regular PR, I think you can merge it.

There's probably a need to start discussing the actual limits to go with, on the mailing lists, but let's start with 1 megachars/bytes and see where we end up.

@pjfanning pjfanning marked this pull request as ready for review February 14, 2023 00:17
@pjfanning (Member Author)

In my testing, the resetWithString check was needed; it shouldn't add much overhead.

I've marked this as ready for review.

@pjfanning pjfanning changed the title WIP: add maxStringLen StreamReadConstraints: add maxStringLen Feb 14, 2023
@cowtowncoder cowtowncoder added this to the 2.15.0 milestone Feb 14, 2023
@cowtowncoder cowtowncoder merged commit b14ec4a into FasterXML:2.15 Feb 14, 2023
@cowtowncoder (Member)

Ok, phew! It finally went in. Thank you again @pjfanning!

Also, I decided I should add you as co-author for jackson-core due to your massive number of contributions; I modified release-notes/CREDITS-2.x to state that.

@pjfanning pjfanning deleted the max-string-size branch February 14, 2023 10:23