[Self-Host] excludeTags should work with onlyMainContent? #1085

Open
piotrstarzynski opened this issue Jan 23, 2025 · 2 comments
Labels: bug (Something isn't working), self-host

Comments

@piotrstarzynski

Hello,

Issue: Support excludeTags with onlyMainContent=true

Current Behavior

When onlyMainContent is set to true, the excludeTags parameter appears to be ignored. The content extraction works, but I cannot apply additional tag exclusions.

Expected Behavior

excludeTags should work in conjunction with onlyMainContent=true, allowing for:

  • Base content filtering through onlyMainContent
  • Additional fine-tuned control through custom excludeTags
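
The combination described above can be sketched as a merge of two exclusion lists. This is only an illustration of the requested semantics, not Firecrawl's implementation: `DEFAULT_EXCLUDE` is a hypothetical stand-in for the real default list, and `effective_exclude_selectors` is a made-up helper name.

```python
# Hypothetical stand-in for the default list; NOT the real contents of excludeTags.ts.
DEFAULT_EXCLUDE = ["header", "footer", "nav", "aside"]


def effective_exclude_selectors(only_main_content, exclude_tags=None):
    """Return the selectors to strip: base filtering plus the caller's own tags."""
    base = DEFAULT_EXCLUDE if only_main_content else []
    merged = []
    seen = set()
    # Keep order stable and drop duplicates so a custom tag can safely repeat a default.
    for selector in [*base, *(exclude_tags or [])]:
        if selector not in seen:
            seen.add(selector)
            merged.append(selector)
    return merged


print(effective_exclude_selectors(True, ["#comments"]))
print(effective_exclude_selectors(False, ["#comments"]))
```

Under this reading, custom `excludeTags` are applied in both modes, and `onlyMainContent=true` only adds the base filtering on top.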

Problem Details

Currently:

  1. With onlyMainContent=true: excludeTags are ignored
  2. With onlyMainContent=false: passing the default list (https://github.com/mendableai/firecrawl/blob/79e65f31ef1d7a4172870471d81501ee2e8aef22/apps/api/src/scraper/WebScraper/utils/excludeTags.ts) plus custom tags as excludeTags produces longer output than onlyMainContent=true does

This suggests that onlyMainContent uses a more aggressive content filtering algorithm than just tag exclusion.

Proposed Solution

Add support for excludeTags to work even when onlyMainContent=true. This would provide more flexible content filtering options by combining both features.

Use Case

This would be useful when users want to:

  1. Get the main content of a page (onlyMainContent=true)
  2. Additionally exclude specific tags that might still be present in the main content
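
As an illustration of the second step, here is a minimal client-side sketch that strips elements from already-extracted main-content HTML. It uses only the Python standard library, supports just tag names and `#id` selectors, and assumes reasonably well-formed markup. `SelectorStripper` and `strip_selectors` are hypothetical names, not part of Firecrawl.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SelectorStripper(HTMLParser):
    """Re-emit HTML while dropping elements that match a tag name or '#id'."""

    def __init__(self, selectors):
        super().__init__(convert_charrefs=False)
        self.ids = {s[1:] for s in selectors if s.startswith("#")}
        self.tags = {s.lower() for s in selectors if not s.startswith("#")}
        self.out = []
        self.skip_depth = 0  # > 0 while inside an excluded element

    def _matches(self, tag, attrs):
        return tag in self.tags or dict(attrs).get("id") in self.ids

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1
        elif self._matches(tag, attrs):
            if tag not in VOID:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth and not self._matches(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_selectors(html, selectors):
    parser = SelectorStripper(selectors)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


main_html = '<article><p>Body</p><div id="comments"><p>First!</p><br></div></article>'
print(strip_selectors(main_html, ["#comments"]))
```

This is a workaround that runs after extraction; the request in this issue is for the scraper itself to apply the same kind of exclusion when onlyMainContent=true.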

Would it be possible to implement this feature enhancement?

@nickscamara
Member

@piotrstarzynski I believe this was fixed! Can you double check and confirm? Thank you!

@nickscamara nickscamara self-assigned this Jan 25, 2025
@nickscamara nickscamara added the bug Something isn't working label Jan 25, 2025
@piotrstarzynski
Author

piotrstarzynski commented Jan 26, 2025

@nickscamara, I checked again and I believe it still doesn't work. With the payload below, the output still contains the div with id=comments. Could someone else please confirm?

    payload = {
        "url": url,
        "formats": ["markdown"],  # request Markdown output
        "pageOptions": {
            "onlyMainContent": True,
            "replaceAllPathsWithAbsolutePaths": True,
        },
        "excludeTags": ["#comments"],  # expected to drop the div with id=comments
        # "includeTags": ["p"],
        "waitFor": 5000,
        "timeout": 50000,
    }
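
One thing that may be worth ruling out: the payload above nests `onlyMainContent` under `pageOptions` while `formats` and `excludeTags` sit at the top level. A sketch of the same request with every option at the top level follows, on the assumption that the instance is serving the v1 scrape API, which to my knowledge takes `onlyMainContent` directly in the request body. `build_scrape_payload` is a hypothetical helper, and `replaceAllPathsWithAbsolutePaths` is left out because I am not sure the v1 body accepts it.

```python
import json


def build_scrape_payload(url, exclude_tags):
    """Build a scrape request body with all options at the top level (assumed v1 shape)."""
    return {
        "url": url,
        "formats": ["markdown"],
        "onlyMainContent": True,             # top level instead of pageOptions (assumption)
        "excludeTags": list(exclude_tags),   # e.g. ["#comments"]
        "waitFor": 5000,
        "timeout": 50000,
    }


payload = build_scrape_payload("https://example.com/post", ["#comments"])
print(json.dumps(payload, indent=2))
```

If the comments block still shows up in the Markdown with this shape, that would point at the filtering itself and not at the payload layout.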
