Fix url download function #2744

Merged

@hlohaus merged 1 commit into xtekky:main on Feb 22, 2025

Conversation

hlohaus
Collaborator

@hlohaus commented on Feb 22, 2025

No description provided.

@github-actions bot left a comment


Review for Pull Request: Fix url download function

Thank you, H Lohaus, for contributing to the project! Your changes look great and address some important issues.

Summary of Changes

  1. URL Regex Update: The regex for URL matching has been updated to ensure it matches the entire URL string, which improves accuracy.
  2. Max Depth Adjustment: The default for the max_depth parameter in the download_urls function has been changed from 1 to 0, which changes how deeply linked pages are followed during a download.
  3. Encoding Fixes: The stream_chunks function now measures chunk size with chunk.encode() instead of chunk.decode('utf-8'), which is correct when the chunks are strings rather than bytes.

Suggestions

  • It might be helpful to add a brief description to the pull request to provide context for the changes; this helps reviewers understand the motivation behind them.

Overall, the changes look solid and should enhance the functionality of the URL download feature. Great work!

Looking forward to seeing this merged!

@@ -402,7 +402,7 @@ const handle_ask = async (do_ask_gpt = true) => {
     await add_conversation(window.conversation_id);

     // Is message a url?
-    const expression = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;
+    const expression = /^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$/gi;


The change from an unanchored match to a regex anchored with ^ and $ is correct for checking whether the entire string is a URL. However, the gi flags remain. The g flag is unnecessary when validating a single string, and it makes test() stateful: lastIndex persists between calls, so repeated checks can give inconsistent results. Consider removing the g flag.
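
For illustration, a minimal sketch of that statefulness (the test URL is hypothetical):

```javascript
// With g, test() is stateful: lastIndex advances past a match, so the
// same valid URL alternates between matching and not matching.
const sticky = /^https?:\/\/(www\.)?\S+$/gi;
console.log(sticky.test("https://example.com")); // true
console.log(sticky.test("https://example.com")); // false (lastIndex carried over)

// Without g, every call checks the whole string from the start.
const anchored = /^https?:\/\/(www\.)?\S+$/i;
console.log(anchored.test("https://example.com")); // true
console.log(anchored.test("https://example.com")); // true
```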

@@ -416,7 +416,7 @@ def read_links(html: str, base: str) -> set[str]:
 async def download_urls(
     bucket_dir: Path,
     urls: list[str],
-    max_depth: int = 1,
+    max_depth: int = 0,


Please confirm the intended semantics of max_depth=0. If 0 means "do not follow discovered links", this change disables recursive crawling entirely; if 0 is instead treated as "no limit", it could lead to unbounded recursion and excessive resource usage. Is this intentional?
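
For context, a minimal sketch of the usual convention in depth-limited crawlers, under which max_depth=0 means "fetch only the given URLs, follow no links" rather than "no limit" (fetch_page and read_links_stub are hypothetical stand-ins, not the project's implementation):

```python
async def fetch_page(url: str) -> str:
    """Hypothetical stand-in for the real downloader."""
    return ""

def read_links_stub(html: str, base: str) -> set[str]:
    """Hypothetical stand-in for the read_links() shown in this file."""
    return set()

async def download_urls_sketch(urls: list[str], max_depth: int = 0) -> None:
    # Under this convention, max_depth == 0 fetches only the given URLs
    # and never follows discovered links; each recursion level decrements
    # the budget, so the crawl always terminates.
    for url in urls:
        html = await fetch_page(url)
        if max_depth > 0:
            links = read_links_stub(html, base=url)
            await download_urls_sketch(list(links), max_depth=max_depth - 1)
```

Under that reading, the change from 1 to 0 tightens the limit (no link-following) rather than removing it; worth confirming which convention download_urls actually uses.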

@@ -515,7 +515,7 @@ def stream_chunks(bucket_dir: Path, delete_files: bool = False, refine_chunks_wi
     if refine_chunks_with_spacy:
         for chunk in stream_read_parts_and_refine(bucket_dir, delete_files):
             if event_stream:
-                size += len(chunk.decode('utf-8'))
+                size += len(chunk.encode())


Switching from decode('utf-8') to encode() changes the assumption about chunk's type: decode() exists only on bytes, while encode() exists only on str. If stream_read_parts_and_refine yields str chunks, the old call would raise AttributeError, and the new one correctly measures their UTF-8 byte size. Please verify the chunk type here.
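
For reference, a minimal illustration of the difference, assuming chunk is a str at this point (the sample text is made up):

```python
chunk = "héllo"  # assumed: the refine step yields str, not bytes

# New call: length of the UTF-8 encoding, i.e. the byte size of the text.
assert len(chunk.encode()) == 6  # "é" encodes to two bytes
assert len(chunk) == 5           # character count, for comparison

# The old call exists only on bytes; calling .decode() on a str raises
# AttributeError.
assert b"h\xc3\xa9llo".decode("utf-8") == chunk
```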

@hlohaus merged commit 55e2e27 into xtekky:main on Feb 22, 2025
1 check passed