Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only keep semantic fields in Java, i.e. skip location fields #1450

Merged
merged 1 commit into from
Sep 19, 2023

Conversation

eregon
Copy link
Member

@eregon eregon commented Sep 12, 2023

cc @enebo

@enebo
Copy link
Collaborator

enebo commented Sep 12, 2023

@eregon We cannot toggle this yet. regular expressions do not unescape properly. I also still use non-escaped string value via location in one place (which perhaps is another problem). I will make an issue for one right now and perhaps a second will be forthcoming.

@enebo
Copy link
Collaborator

enebo commented Sep 12, 2023

I cannot link so this is first issue: #1452

@eregon
Copy link
Member Author

eregon commented Sep 12, 2023

@kddnewton Do you think #1452 could be fixed soon?
Otherwise I could change the logic to have something like YARP::Field#semantic_field? and special-case for the few locations we still need until such issues are fixed.

@kddnewton
Copy link
Collaborator

If you want to take a look, that would be great. Otherwise I'll add it to my list.

@eregon
Copy link
Member Author

eregon commented Sep 13, 2023

If you want to take a look, that would be great. Otherwise I'll add it to my list.

I don't think I'll have time for that one.

I updated the PR so we can mark individual location fields and still serialize those until they have a replacement.

@eregon
Copy link
Member Author

eregon commented Sep 13, 2023

@enebo I preserved the content_loc fields you mentioned in #1452, see the changes to config.yml.

Do you also need StringNode#opening_loc ? I'm not sure what you meant in #1452 (comment), that case has opening_loc: ∅.
In any case it's very easy to add back a given field (one-line change) if it turns out to be necessary for now.

@enebo
Copy link
Collaborator

enebo commented Sep 13, 2023

@eregon if you look at the other String nodes you will see the ones with delimeters have it but ones coming from interpolated strings do not. So it is neccesary until unescaped bytes for regexp is fixed. This is my current logic:

boolean interpolated = node.opening_loc != null && source[node.opening_loc.startOffset] != '\'';

@eregon
Copy link
Member Author

eregon commented Sep 14, 2023

@enebo OK I kept StringNode#opening_loc for now.
I think this could be handled in visitInterpolatedRegularExpressionNode by iterating the parts and not using the generic visitor but handling StringNode there specially.
But it's not a big deal to keep this field for now, the unescaped needs to be fixed anyway.

@eregon
Copy link
Member Author

eregon commented Sep 14, 2023

@kddnewton Could you merge this?

Copy link
Collaborator

@kddnewton kddnewton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has two names for the same thing: semantic_fields and location_fields. Can you call them all semantic_fields?

@eregon
Copy link
Member Author

eregon commented Sep 14, 2023

@kddnewton The problem is how should the boolean be named?
Right now it's include_location_fields, include_semantics_fields would be wrong, maybe only_semantics_fields then?

@kddnewton
Copy link
Collaborator

I would probably just call it semantic.

@eregon
Copy link
Member Author

eregon commented Sep 14, 2023

OK, I'll update to that tomorrow

@eregon
Copy link
Member Author

eregon commented Sep 15, 2023

I think

// Serialize the AST represented by the given node to the given buffer.
YP_EXPORTED_FUNCTION void yp_serialize(yp_parser_t *parser, yp_node_t *node, yp_buffer_t *buffer, bool semantic);

is too unclear and we would need to explain this field in every public API in details like
"the semantic argument means whether to only keep semantic fields or keep all fields (including location fields)".

I think with only_semantics_fields there is no need to explain and it's much clearer what the true/false does, so I'll go with that:

// Serialize the AST represented by the given node to the given buffer.
YP_EXPORTED_FUNCTION void yp_serialize(yp_parser_t *parser, yp_node_t *node, yp_buffer_t *buffer, bool only_semantics_fields);

@eregon
Copy link
Member Author

eregon commented Sep 15, 2023

I have changed the approach, now there is an env var (YARP_SERIALIZE_ONLY_SEMANTICS_FIELDS) which is used at templating time and decides whether to serialize non-semantic/location fields or not.
That way there is no runtime overhead and serialize.c does not have any runtime checks for this.
This is closer to what I had initially in mind, i.e., template-time configuration, which is used (& needed) in #1493 as well.

* Add $YARP_SERIALIZE_ONLY_SEMANTICS_FIELDS to control where to serialize location fields at templating time,
  this way there is no overhead for either case and nothing to check at runtime.
* Add a byte in the header to indicate whether location fields are included as expected.
* Fixes ruby#807
* Simplify the build-java CI job now that the FFI backend is available so JRuby can serialize.
* Support keeping some location fields which are still needed until there is a replacement
@eregon
Copy link
Member Author

eregon commented Sep 16, 2023

See #1532 (comment) for the savings on serialized size with this PR

@@ -14436,6 +14436,7 @@ yp_serialize(yp_parser_t *parser, yp_node_t *node, yp_buffer_t *buffer) {
yp_buffer_append_u8(buffer, YP_VERSION_MAJOR);
yp_buffer_append_u8(buffer, YP_VERSION_MINOR);
yp_buffer_append_u8(buffer, YP_VERSION_PATCH);
yp_buffer_append_u8(buffer, YP_SERIALIZE_ONLY_SEMANTICS_FIELDS ? 1 : 0);
Copy link
Member Author

@eregon eregon Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YP_SERIALIZE_ONLY_SEMANTICS_FIELDS needs to be defined just for this usage.
I think we should move yp_serialize or the part writing the header to serialize.c.erb, but that seemed out of scope of this PR.

@kddnewton
Copy link
Collaborator

@eregon I'm nervous a bit about the implications of having this be an environment variable. It means everything is dependent on how the gem/project was templated, as opposed to the options specified. This now means there could be a mismatch between the serialize.c file serializing the AST and serialize.rb doing the deserializing.

In general I think almost all of this PR is really good, I'm just very nervous about making the runtime behavior tied to a build-time specification, especially considering how many different consumers there are. Would you consider moving this back to a runtime specification?

@eregon
Copy link
Member Author

eregon commented Sep 19, 2023

@eregon I'm nervous a bit about the implications of having this be an environment variable. It means everything is dependent on how the gem/project was templated, as opposed to the options specified. This now means there could be a mismatch between the serialize.c file serializing the AST and serialize.rb doing the deserializing.

Yes the mismatch is possible but it is checked and errors properly in case of the mismatch with the added byte in the serialization header.

In general I think almost all of this PR is really good, I'm just very nervous about making the runtime behavior tied to a build-time specification, especially considering how many different consumers there are. Would you consider moving this back to a runtime specification?

I believe we need build-time specification for #1493 anyway.
So in my view it's necessary at least for Java consumers of YARP.

I'm not very keen to make it a runtime specification because it makes the code more complex and might have some performance overhead, when there is AFAIK no value/use-case to specify this dynamically.

If it turns out to be needed to be a runtime parameter we can always change this fairly easily.
I pushed main...eregon:yarp:semantic-fields-bak which was the approach using a runtime parameter in case we want to get inspiration from it later.

Copy link
Collaborator

@kddnewton kddnewton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. We can go with this for now. I'm still nervous about this change because we've tied runtime behavior to build-time config, but hopefully I'm wrong.

@kddnewton kddnewton merged commit cd9a80c into ruby:main Sep 19, 2023
@eregon
Copy link
Member Author

eregon commented Sep 19, 2023

Thank you for merging.

we've tied runtime behavior to build-time config, but hopefully I'm wrong.

Only when YARP is used as the JRuby/TruffleRuby parser.

When YARP is used as a gem, location fields are always included and hence the CI still passes on JRuby/TruffleRuby.
serialize.rb.erb checks and expects that location fields are included, so that's clearly defined.
Conversely Loader.java.erb checks and expects that location fields are not included.
The only part needing to handle both cases is serialize.c.erb and that's just a couple lines of difference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exclusion of location fields in serialize for execution
3 participants