Fix url download function #2744
Conversation
Review for Pull Request: Fix url download function
Thank you, H Lohaus, for contributing to the project! Your changes look great and address some important issues.
Summary of Changes
- URL Regex Update: The regex for URL matching has been updated to ensure it matches the entire URL string, which improves accuracy.
- Max Depth Adjustment: The `max_depth` parameter in the `download_urls` function has been changed from `1` to `0`, which may affect how URLs are processed.
- Encoding Fixes: The encoding method in the `stream_chunks` function has been updated to use `chunk.encode()` instead of `chunk.decode('utf-8')`, which is a good improvement for handling byte data.
Suggestions
- It might be helpful to add a brief summary in the pull request description to provide context for the changes made. This can help reviewers understand the motivation behind them more clearly.
Overall, the changes look solid and should enhance the functionality of the URL download feature. Great work!
Looking forward to seeing this merged!
@@ -402,7 +402,7 @@ const handle_ask = async (do_ask_gpt = true) => {
         await add_conversation(window.conversation_id);

         // Is message a url?
-        const expression = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;
+        const expression = /^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$/gi;
The change from a global case-insensitive match to a format-enforcing regex using `^` and `$` is correct for checking whether the entire string is a URL. However, the `gi` flags for global and case-insensitive search remain. The `g` flag seems unnecessary, as the intention appears to be validating a single URL string, not matching multiple occurrences. Consider removing the `g` flag to simplify the regex.
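To make the effect of the anchors concrete, here is a small standalone sketch in Python (the production code is JavaScript, but `^`/`$` anchoring behaves the same way; the sample strings are invented for illustration, and the `g`-flag subtlety is JavaScript-specific and has no Python equivalent):

```python
import re

URL_BODY = r"https?://(www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_+.~#?&/=]*)"

# Unanchored pattern: matches a URL embedded anywhere in the string.
unanchored = re.compile(URL_BODY, re.IGNORECASE)
# Anchored pattern: the whole string must be a URL, nothing more.
anchored = re.compile(r"^" + URL_BODY + r"$", re.IGNORECASE)

text = "please fetch https://example.com for me"
url = "https://example.com/path"

assert unanchored.search(text)    # finds the embedded URL
assert not anchored.search(text)  # surrounding words: not a pure URL
assert anchored.search(url)       # the entire string is a URL
```

This is exactly the distinction the PR relies on: the unanchored form answers "does the message contain a URL?", while the anchored form answers "is the message a URL?".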
@@ -416,7 +416,7 @@ def read_links(html: str, base: str) -> set[str]:
 async def download_urls(
     bucket_dir: Path,
     urls: list[str],
-    max_depth: int = 1,
+    max_depth: int = 0,
Setting `max_depth` to `0` means there will be no depth limit for URL downloads. Is this intentional? It could potentially lead to an infinite loop or excessive resource usage.
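Whether `0` means "unlimited" or "no link-following" depends entirely on how the implementation checks the value, which is why the question is worth answering. Under the common convention where `max_depth` is a remaining-levels budget, `0` restricts rather than removes the limit. A minimal sketch of that convention (hypothetical `crawl` and `fetch_links` helpers, not the actual `download_urls` code):

```python
def crawl(url, max_depth, fetch_links):
    """Visit `url`, then follow discovered links while depth budget remains.

    Convention: max_depth counts remaining link-following levels,
    so 0 means "download the given URL only".
    """
    seen = []

    def visit(u, depth):
        seen.append(u)
        if depth <= 0:  # budget exhausted: stop following links
            return
        for link in fetch_links(u):
            visit(link, depth - 1)

    visit(url, max_depth)
    return seen

# Hypothetical link graph standing in for real pages.
links = {"a": ["b"], "b": ["c"], "c": []}
assert crawl("a", 0, links.get) == ["a"]       # seed only
assert crawl("a", 1, links.get) == ["a", "b"]  # one level of links
```

If the real code instead guards with something like `if max_depth:` before enforcing the limit, `0` would indeed disable it, which is the unbounded-crawl risk raised above.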
@@ -515,7 +515,7 @@ def stream_chunks(bucket_dir: Path, delete_files: bool = False, refine_chunks_wi
     if refine_chunks_with_spacy:
         for chunk in stream_read_parts_and_refine(bucket_dir, delete_files):
             if event_stream:
-                size += len(chunk.decode('utf-8'))
+                size += len(chunk.encode())
Using `encode()` instead of `decode('utf-8')` may change the behavior of this function. Please verify that this change does not cause any unintended issues.
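For context on why the two calls are not interchangeable: if the chunks yielded here are `str` objects, they have no `decode` method at all, and their character length differs from their UTF-8 byte length for non-ASCII text. A standalone illustration (not the actual `stream_chunks` code):

```python
chunk = "héllo"  # a str chunk, as a text-yielding generator would produce

# str has no .decode(); the old expression would raise AttributeError on str input.
assert not hasattr(chunk, "decode")

# Character count vs UTF-8 byte count differ for non-ASCII text.
assert len(chunk) == 5
assert len(chunk.encode()) == 6  # 'é' encodes to two bytes in UTF-8

# The old expression only made sense when chunk was bytes:
raw = chunk.encode("utf-8")
assert len(raw.decode("utf-8")) == 5  # decoding counts characters, not bytes
```

So the fix is correct if the loop yields `str`, but it silently changes `size` from a character count to a byte count, which is the behavioral difference worth verifying downstream.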
No description provided.