Add a task to show statistics about serialized size/source size for the top 100 gems #1532

Merged (1 commit) on Sep 22, 2023
6 changes: 6 additions & 0 deletions .github/workflows/main.yml
@@ -161,6 +161,12 @@ jobs:
run: bundle exec rake lex:topgems
- name: Parse Top 100 Gems
run: bundle exec rake parse:topgems
- name: Serialized size stats with all fields
Collaborator

This is an interesting metric, but I don't think we want it to run on every CI run. The topgems check is already one of the longest-running. I'm fine keeping the task around though.

Member Author

This only takes like 2 seconds to execute, so I think it has no observable impact on CI time.

Member Author

To be precise, the 3 commands below respectively take, locally on my machine: 2s, 6s, 2s.
I'd like to keep this check in CI because it's a very convenient way to see whether the serialized size got better, got worse, or stayed the same for a PR.

Member Author

I wonder if rake lex:topgems could be optimized further; it takes 4m22s in that run.

Member Author

A quick profile with stackprof/autorun gives:

stackprof /tmp/stackprof20230919-104519-jjlzxp.dump
==================================
  Mode: wall(1000)
  Samples: 21520 (90.63% miss rate)
  GC: 13603 (63.21%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      8441  (39.2%)        8441  (39.2%)     (marking)
      7520  (34.9%)        7520  (34.9%)     (sweeping)
      5530  (25.7%)        5530  (25.7%)     Thread#join
         3   (0.0%)           3   (0.0%)     Kernel#require
     15963  (74.2%)           2   (0.0%)     (garbage collection)
      5530  (25.7%)           0   (0.0%)     Array#map
      7890  (36.7%)           0   (0.0%)     YARP.parallelize
...

This looks strange, as if non-main threads are not included.
That does seem like a lot of GC though.
Also, ENV.fetch("WORKERS") { 16 } is probably far too high for GitHub Actions. It could use Etc.nprocessors instead, or it might be more efficient not to parallelize at all on CRuby, where extra threads might just increase GVL contention.
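
The worker-count suggestion above could look roughly like the sketch below. This is only an illustration, not code from the PR: `worker_count` is a hypothetical helper name, and the assumption is that the `WORKERS` environment variable should still override the default when it is set.

```ruby
require "etc"

# Hypothetical helper (not part of the PR): use WORKERS when it is set,
# otherwise fall back to the number of processors reported by Etc.
# The env parameter exists only so the helper can be exercised with a Hash.
def worker_count(env = ENV)
  Integer(env.fetch("WORKERS") { Etc.nprocessors })
end

puts "workers: #{worker_count}"
```

On a small CI runner this would yield only a few workers instead of 16. Whether threads help at all on CRuby is a separate question that this sketch does not answer.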

Member Author

@kddnewton I can move this to a separate CI job, then it won't make this one any longer. Would that address your concern?

run: bundle exec rake serialized_size:topgems
- name: Recompile with only semantic fields
run: YARP_SERIALIZE_ONLY_SEMANTICS_FIELDS=1 bundle exec rake clobber compile
- name: Serialized size stats with only semantic fields
run: bundle exec rake serialized_size:topgems

memcheck:
runs-on: ubuntu-latest
1 change: 1 addition & 0 deletions rakelib/check_manifest.rake
@@ -19,6 +19,7 @@ task :check_manifest => [:templates] do
rust
templates
test
top-100-gems
tmp
vendor
]
33 changes: 33 additions & 0 deletions rakelib/lex.rake
@@ -356,3 +356,36 @@ task "lex:topgems": ["download:topgems", :compile] do
exit(1)
end
end

task "serialized_size:topgems": ["download:topgems"] do
  $:.unshift(File.expand_path("../lib", __dir__))
  require "yarp"

  files = Dir["#{TOP_100_GEMS_DIR}/**/*.rb"]
  total_source_size = 0
  total_serialized_size = 0
  ratios = []
  files.each do |file|
    source_size = File.size(file)
    next if source_size == 0
    total_source_size += source_size

    serialized = YARP.dump_file(file)
    serialized_size = serialized.bytesize
    total_serialized_size += serialized_size

    ratios << Rational(serialized_size, source_size)
  end
  f = '%.3f'
  puts "Total sizes for top 100 gems:"
  puts "total source size: #{'%9d' % total_source_size}"
  puts "total serialized size: #{'%9d' % total_serialized_size}"
  puts "total serialized/total source: #{f % (total_serialized_size.to_f / total_source_size)}"
  puts
  puts "Stats of ratio serialized/source per file:"
  puts "average: #{f % (ratios.sum / ratios.size)}"
  puts "median: #{f % ratios.sort[ratios.size/2]}"
  puts "1st quartile: #{f % ratios.sort[ratios.size/4]}"
  puts "3rd quartile: #{f % ratios.sort[ratios.size*3/4]}"
  puts "min - max: #{"#{f} - #{f}" % ratios.minmax}"
end
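
The per-file statistics in the task above can be illustrated in isolation with a minimal sketch. `ratio_stats` is a hypothetical helper, not part of the PR. It uses the same index-based picks as the task (`size/2`, `size/4`, `size*3/4`) but sorts only once, and it returns an empty Hash for an empty input, which is an added assumption.

```ruby
# Hypothetical helper (not part of the PR): summarize serialized/source
# ratios using the same index-based median and quartiles as the rake task.
def ratio_stats(ratios)
  return {} if ratios.empty? # assumption: no stats for an empty list

  sorted = ratios.sort
  n = sorted.size
  {
    average: sorted.sum / n,
    median: sorted[n / 2],   # upper middle element when n is even
    q1: sorted[n / 4],
    q3: sorted[n * 3 / 4],
    min: sorted.first,
    max: sorted.last,
  }
end

# Made-up ratios, for illustration only.
stats = ratio_stats([Rational(1, 2), Rational(3, 2), Rational(1), Rational(2)])
stats.each { |name, value| puts "#{name}: #{'%.3f' % value}" }
```

With an even number of files, the median here is the upper of the two middle elements rather than their mean, which matches what the task prints.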