Fix url download function #2744

Merged

@hlohaus merged 1 commit into xtekky:main on Feb 22, 2025

Conversation

hlohaus
Collaborator

@hlohaus commented on Feb 22, 2025

No description provided.

@github-actions bot left a comment


Review for Pull Request: Fix url download function

Thank you, H Lohaus, for contributing to the project! Your changes look great and address some important issues.

Summary of Changes

  1. URL Regex Update: The regex for URL matching has been updated to ensure it matches the entire URL string, which improves accuracy.
  2. Max Depth Adjustment: The default for the max_depth parameter in the download_urls function has been changed from 1 to 0, which changes how deeply linked pages are followed during a download.
  3. Encoding Fixes: The stream_chunks function now measures chunk size with chunk.encode() instead of chunk.decode('utf-8'), which is correct when the chunks are strings rather than bytes.

Suggestions

  • It might be helpful to add a brief description to the pull request to provide context for the changes; this helps reviewers understand the motivation behind them.

Overall, the changes look solid and should enhance the functionality of the URL download feature. Great work!

Looking forward to seeing this merged!

@@ -402,7 +402,7 @@ const handle_ask = async (do_ask_gpt = true) => {
     await add_conversation(window.conversation_id);

     // Is message a url?
-    const expression = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;
+    const expression = /^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$/gi;


The change from an unanchored match to a regex anchored with ^ and $ is correct for checking whether the entire string is a URL. However, the gi flags remain. The g flag is unnecessary when validating a single string, and it makes test() stateful: lastIndex persists between calls, so repeated checks can give inconsistent results. Consider removing the g flag.
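
For illustration, a minimal sketch of that statefulness (the test URL is hypothetical):

```javascript
// With g, test() is stateful: lastIndex advances past a match, so the
// same valid URL alternates between matching and not matching.
const sticky = /^https?:\/\/(www\.)?\S+$/gi;
console.log(sticky.test("https://example.com")); // true
console.log(sticky.test("https://example.com")); // false (lastIndex carried over)

// Without g, every call checks the whole string from the start.
const anchored = /^https?:\/\/(www\.)?\S+$/i;
console.log(anchored.test("https://example.com")); // true
console.log(anchored.test("https://example.com")); // true
```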

@@ -416,7 +416,7 @@ def read_links(html: str, base: str) -> set[str]:
 async def download_urls(
     bucket_dir: Path,
     urls: list[str],
-    max_depth: int = 1,
+    max_depth: int = 0,


Please confirm the intended semantics of max_depth=0. If 0 means "do not follow discovered links", this change disables recursive crawling entirely; if 0 is instead treated as "no limit", it could lead to unbounded recursion and excessive resource usage. Is this intentional?
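
For context, a minimal sketch of the usual convention in depth-limited crawlers, under which max_depth=0 means "fetch only the given URLs, follow no links" rather than "no limit" (fetch_page and read_links_stub are hypothetical stand-ins, not the project's implementation):

```python
async def fetch_page(url: str) -> str:
    """Hypothetical stand-in for the real downloader."""
    return ""

def read_links_stub(html: str, base: str) -> set[str]:
    """Hypothetical stand-in for the read_links() shown in this file."""
    return set()

async def download_urls_sketch(urls: list[str], max_depth: int = 0) -> None:
    # Under this convention, max_depth == 0 fetches only the given URLs
    # and never follows discovered links; each recursion level decrements
    # the budget, so the crawl always terminates.
    for url in urls:
        html = await fetch_page(url)
        if max_depth > 0:
            links = read_links_stub(html, base=url)
            await download_urls_sketch(list(links), max_depth=max_depth - 1)
```

Under that reading, the change from 1 to 0 tightens the limit (no link-following) rather than removing it; worth confirming which convention download_urls actually uses.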

@@ -515,7 +515,7 @@ def stream_chunks(bucket_dir: Path, delete_files: bool = False, refine_chunks_wi
     if refine_chunks_with_spacy:
         for chunk in stream_read_parts_and_refine(bucket_dir, delete_files):
             if event_stream:
-                size += len(chunk.decode('utf-8'))
+                size += len(chunk.encode())


Switching from decode('utf-8') to encode() changes the assumption about chunk's type: decode() exists only on bytes, while encode() exists only on str. If stream_read_parts_and_refine yields str chunks, the old call would raise AttributeError, and the new one correctly measures their UTF-8 byte size. Please verify the chunk type here.
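
For reference, a minimal illustration of the difference, assuming chunk is a str at this point (the sample text is made up):

```python
chunk = "héllo"  # assumed: the refine step yields str, not bytes

# New call: length of the UTF-8 encoding, i.e. the byte size of the text.
assert len(chunk.encode()) == 6  # "é" encodes to two bytes
assert len(chunk) == 5           # character count, for comparison

# The old call exists only on bytes; calling .decode() on a str raises
# AttributeError.
assert b"h\xc3\xa9llo".decode("utf-8") == chunk
```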

@hlohaus merged commit 55e2e27 into xtekky:main on Feb 22, 2025
1 check passed