Exclude punctuation from character count for Japanese texts #22050

Open
wants to merge 7 commits into
base: trunk
from


marinakoleva
Contributor

@marinakoleva marinakoleva commented Feb 17, 2025

Context

  • Unlike all other languages, for Japanese we measure the length of a sentence, paragraph, or text by counting characters rather than words.
  • Since punctuation is not taken into account when counting words, it made sense to exclude punctuation from the character count as well.

Summary

This PR can be summarized in the following changelog entry:

  • Improves accuracy of assessments measuring character count for Japanese texts by removing common punctuation from the count.
  • [shopify-seo] Improves accuracy of assessments measuring character count for Japanese texts by removing common punctuation from the count.

Relevant technical choices:

  • For Japanese, we use 2 helpers to help 😉 us with the character count: countCharacters.js and wordsCharacterCount.js. We use the latter in Keyphrase assessments, which filter out punctuation after the string has been segmented. We also use the helper for the Reading time feature.
  • This PR focuses on the helper countCharacters.js, which is used for the following assessments: Sentence length, Paragraph length, Subheading distribution, Text length.
  • I didn't use the general removePunctuation helper in countCharacters.js because it doesn't work for Japanese: it requires a space before or after a character in order to recognize it, and Japanese mostly doesn't use spaces.
  • The punctuation characters that were excluded represent the most commonly used punctuation in Japanese. I was the one making the choice of what is "most common" :)
  • Resources consulted:
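
The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the code in countCharacters.js or the real removePunctuation helper: `stripEdgePunctuation` is a hypothetical stand-in for any approach that only recognizes punctuation next to whitespace, `countWithoutPunctuation` is a hypothetical per-character filter, and the punctuation list is an assumed sample rather than the list chosen in this PR.

```javascript
// Assumed sample of Japanese and ASCII punctuation; NOT the list used in the PR.
const PUNCTUATION = new Set( Array.from( "、。・「」『』()()!?!?:…,." ) );

// Hypothetical stand-in for a whitespace-based approach: punctuation is only
// stripped at the start or end of a whitespace-separated token.
function stripEdgePunctuation( text ) {
	return text
		.split( /\s+/ )
		.map( ( token ) => {
			const chars = Array.from( token );
			while ( chars.length > 0 && PUNCTUATION.has( chars[ 0 ] ) ) {
				chars.shift();
			}
			while ( chars.length > 0 && PUNCTUATION.has( chars[ chars.length - 1 ] ) ) {
				chars.pop();
			}
			return chars.join( "" );
		} )
		.join( " " );
}

// Hypothetical per-character filter: every character is checked on its own,
// so no surrounding spaces are needed.
function countWithoutPunctuation( text ) {
	return Array.from( text ).filter(
		( char ) => ! PUNCTUATION.has( char ) && ! /\s/u.test( char )
	).length;
}

const english = "Black cats, white cats.";
const japanese = "黒猫、白猫。";

console.log( stripEdgePunctuation( english ) );     // "Black cats white cats"
console.log( stripEdgePunctuation( japanese ) );    // "黒猫、白猫" (the inner 、 survives)
console.log( countWithoutPunctuation( japanese ) ); // 4
```

Because the Japanese string contains no spaces, the whitespace-based stand-in sees it as a single token and never reaches the comma in the middle, while the per-character filter removes it.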

Test instructions

Test instructions for the acceptance test before the PR gets merged

This PR can be acceptance tested by following these steps:

  • Make sure the Free plugin is active.
  • Set your site language to Japanese.
  • Open a new post
  • Test the Sentence length assessment (文の長さ) within the Readability analysis (可読性解析)
    • Paste this sentence: 「黒猫」(くろねこ、Black Cat)は、1843年に発表されたエドガー・アラン・ポーの短編小説。
    • The sentence has 40 characters when all punctuation marks are excluded (the maximum recommended length).
    • Confirm the assessment returns a 🍏 green traffic light, with the feedback 文の長さ: いい感じです !
    • Confirm that adding a random letter or number to the sentence switches the feedback to a red traffic light 🍎 . Remove that letter.
  • Test the Paragraph length assessment (段落の長さ)
    • Add this paragraph to the text:
      現在、全国の約130住宅が参加しており、そのほとんどが個人所有の民家です。日本の文化財建造物は、そのほとんどが『木造建築』で地震、台風、洪水など自然災害や火災の多い中で築後何百年と云う長い歴史を生き残って来たものです。更に戦争や社会構造などの変化で消えてしまった建造物も数多くあったことでしょう。昭和52年(1977)に当『全国重文民家の集い』が誕生して早や半世紀近く経とうとしています。 その間、国指定の重要文化財民家(略称: 重文民家)の所有者が手探りで学んで来た経験を互いに情報交換し、更に地域社会との更なる交流、行政や学識経験者との協力を深めて来ました。 構造物としての家屋の保存だけでなく、地域社会の文化やその家に伝わる伝統・住文化の継承に貢献。
    • The paragraph is 328 characters altogether, but when spaces and punctuation are excluded, it's 300 characters (the maximum recommended length)
    • Confirm the assessment returns a 🍏 green traffic light, with the feedback 段落の長さ: 長過ぎる段落はありません。Good Job!
    • Confirm that adding a random letter or number to the paragraph switches the feedback to an orange traffic light 🍊 . Remove that letter.
  • Test the Subheading distribution assessment (小見出し分布)
    • Add the following text to the post:
      又、近年では英国のH.H.A.(Historic Houses Association ―歴史住宅協会―)との交流を深め、英国を初め欧州の文化財情報や所有者の高齢化に伴う次世代への継承問題についての情報交換を行っている。 こんばんは~!お昼のブログもたくさん見ていただきありがとうございました:Dうーーー、今から90年代に戻れるなら「絶対に抜いたらあかんで!」って言いに行きたい💦 でも上に貼ってるブログ見たら、しみじみ眉毛で顔ってぜんぜん印象違うな…と思う!(NARSのチークでもおすすめ~!)美容つながりもうひとつ・・・。
    • The text in the post now becomes 640 characters altogether, but when spaces and punctuation are excluded, it's 600 characters (the maximum recommended length you can have without having a subheading).
    • Confirm the assessment returns a 🍏 green traffic light, with the feedback 小見出し分布: 小見出しは使用していませんが、テキストは十分に短く、おそらく必要ありません。
  • Test the Text length assessment (テキストの長さ) within the SEO analysis (SEO 解析)
    • Confirm the assessment returns a 🍏 green traffic light, with the feedback テキストの長さ: テキストは600 文字です。いいですね !
    • Remove 1 character that is not a punctuation mark from the text.
    • Confirm the assessment now returns an 🍊 orange traffic light, with the feedback テキストの長さ: テキストは 599 文字です。これは推奨下限値 600 文字を少し下回ります。文章をもう少し加えましょう.
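
While following the steps above, the expected counts can be double-checked in the browser console with a sketch like the one below. `countExcludingPunctuation`, `toCharacters`, and the punctuation list are assumptions made for illustration and may differ from what the plugin excludes, so the plugin's feedback remains the source of truth.

```javascript
// Assumed punctuation sample for illustration; the plugin's own list may differ.
const EXCLUDED = new Set( Array.from( "、。・「」『』()()!?!?:…" ) );

// Splits a string into characters by code point, after NFC normalization so
// that a kana with a combining voicing mark is counted as one character.
function toCharacters( text ) {
	return Array.from( text.normalize( "NFC" ) );
}

// Counts the characters that are neither whitespace nor in EXCLUDED.
function countExcludingPunctuation( text ) {
	return toCharacters( text ).filter(
		( char ) => ! EXCLUDED.has( char ) && ! /\s/u.test( char )
	).length;
}

const sentence = "「黒猫」(くろねこ、Black Cat)は、1843年に発表されたエドガー・アラン・ポーの短編小説。";

console.log( toCharacters( sentence ).length );             // 50 characters in total
console.log( countExcludingPunctuation( sentence ) );       // 40, the recommended maximum
console.log( countExcludingPunctuation( sentence + "a" ) ); // 41, one over the maximum
```

The same function can be applied to the paragraph and to the full post text to compare against the 300 and 600 character figures quoted in the steps.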

Relevant test scenarios

  • Changes should be tested with the browser console open
  • Changes should be tested on different posts/pages/taxonomies/custom post types/custom taxonomies
  • Changes should be tested on different editors (Default Block/Gutenberg/Classic/Elementor/other)
  • Changes should be tested on different browsers
  • Changes should be tested on multisite

Test instructions for QA when the code is in the RC

  • QA should use the same steps as above.

QA can test this PR by following these steps:

Impact check

This PR affects the following parts of the plugin, which may require extra testing:

  • N/a

UI changes

  • This PR changes the UI in the plugin. I have added the 'UI change' label to this PR.

Other environments

  • This PR also affects Shopify. I have added a changelog entry starting with [shopify-seo], added test instructions for Shopify and attached the Shopify label to this PR.

Documentation

  • I have written documentation for this change. For example, comments in the Relevant technical choices, comments in the code, documentation on Confluence / shared Google Drive / Yoast developer portal, or other.

Quality assurance

  • I have tested this code to the best of my abilities.
  • During testing, I had activated all plugins that Yoast SEO provides integrations for.
  • I have added unit tests to verify the code works as intended.
  • If any part of the code is behind a feature flag, my test instructions also cover cases where the feature flag is switched off.
  • I have written this PR in accordance with my team's definition of done.
  • I have checked that the base branch is correctly set.

Innovation

  • No innovation project is applicable for this PR.
  • This PR falls under an innovation project. I have attached the innovation label.
  • I have added my hours to the WBSO document.

Fixes ##523

@marinakoleva marinakoleva added the changelog: enhancement Needs to be included in the 'Enhancements' category in the changelog label Feb 17, 2025
@coveralls

coveralls commented Feb 17, 2025

Pull Request Test Coverage Report for Build 2ab1e1bf0f27bcd8cc7abf73fce32659e2b76c93

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.001%) to 53.22%

Totals Coverage Status
Change from base Build 1e8e6847fc86a736d438283affea1e7dbd3ccb61: 0.001%
Covered Lines: 30284
Relevant Lines: 57693

💛 - Coveralls

@@ -20,13 +20,14 @@ describe( "A TextLengthAssessment for a taxonomy page in Japanese", function() {
expect( assessment._config.veryFarBelowMinimum ).toEqual( assessmentConfigJapanese.taxonomyAssessor.veryFarBelowMinimum );
} );
it( "should return a good result for taxonomy pages in Japanese when the text is 60 characters or more", function() {
- const paper = new Paper( "欧米では、かつては不吉の象徴とする迷信があり、魔女狩りなどによって黒猫が殺されることがあった。たとえばベルギー・ウェス。" );
+ const paper = new Paper( "欧米では、かつては不吉の象徴とする迷信があり、魔女狩りなどによって黒猫が殺されることがあった。その傾向は現在も続いており、" +
Contributor Author

Since punctuation is no longer included in the count, the length dropped below 60 characters, so I added some more text to trigger the same feedback from the assessment.
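
The arithmetic behind this change can be checked with a short sketch. The punctuation set below is an assumed sample that happens to cover the marks in the old test string; it is not necessarily the list used in the PR.

```javascript
// Assumed punctuation sample; the list in the PR may differ.
const PUNCTUATION_SAMPLE = new Set( Array.from( "、。・「」" ) );

// The test string that was used before this change.
const oldText = "欧米では、かつては不吉の象徴とする迷信があり、魔女狩りなどによって黒猫が殺されることがあった。たとえばベルギー・ウェス。";

// Split by code point after NFC normalization, then drop the punctuation.
const characters = Array.from( oldText.normalize( "NFC" ) );
const withoutPunctuation = characters.filter( ( char ) => ! PUNCTUATION_SAMPLE.has( char ) );

console.log( characters.length );         // 60: meets the 60-character threshold when punctuation counts
console.log( withoutPunctuation.length ); // 55: below the threshold once punctuation is excluded
```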

expect( sentences[ 1 ].sentenceLength ).toBe( 7 );
expect( sentences[ 2 ].sentenceLength ).toBe( 5 );
} );
it( "returns sentences with exclamation mark", function() {
Contributor Author

This test was combined with the one above.

@marinakoleva marinakoleva added the Shopify This PR impacts Shopify. label Feb 18, 2025

A merge conflict has been detected for the proposed code changes in this PR. Please resolve the conflict by either rebasing the PR or merging in changes from the base branch.

Labels
changelog: enhancement Needs to be included in the 'Enhancements' category in the changelog Shopify This PR impacts Shopify.