
Batching on .extract_faces to improve performance and utilize GPU in full #1435

Open · wants to merge 40 commits into master

Conversation

@galthran-wq (Contributor)

Tickets

#1433
#1101
#1434

What has been done

With this PR, .extract_faces can accept a list of images.
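
A minimal usage sketch of the batched call (file names are placeholders; the detector name is taken from the benchmarks below; the list input is what this PR adds):

```python
from deepface import DeepFace

batch = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = DeepFace.extract_faces(img_path=batch, detector_backend="yolov11n")

# On batched input, extract_faces returns a list of lists:
# one inner list of face dicts per input image.
assert len(results) == len(batch)
```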

How to test

make lint && make test

Benchmarking on detecting 50 faces:
[benchmark chart omitted; results summarized below]

For yolov11n, batch size 20 is 59.27% faster than batch size 1.
For yolov11s, batch size 20 is 29.00% faster than batch size 1.
For yolov11m, batch size 20 is 31.73% faster than batch size 1.
For yolov8, batch size 20 is 12.68% faster than batch size 1.
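
Roughly, the comparison above can be reproduced with a timing loop like this (an illustrative sketch, not the author's benchmark script; file names are placeholders):

```python
import time

from deepface import DeepFace

def bench(images, batch_size):
    """Time extract_faces over the same images at a given batch size."""
    start = time.perf_counter()
    for i in range(0, len(images), batch_size):
        chunk = images[i:i + batch_size]
        DeepFace.extract_faces(
            img_path=chunk if batch_size > 1 else chunk[0],
            detector_backend="yolov11n",
            enforce_detection=False,
        )
    return time.perf_counter() - start

images = [f"face_{i}.jpg" for i in range(50)]  # 50 faces, as in the results above
t1, t20 = bench(images, 1), bench(images, 20)
print(f"batch size 20 is {100 * (t1 - t20) / t1:.2f}% faster than batch size 1")
```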

@skyler14

Do you have a branch in your fork that currently combines all the optimizations you've submitted? I'd like to start using them while the approval process is ongoing.

What's been the total speedup you've been able to see?

@galthran-wq (Contributor, Author)

I do. You can check
https://github.com/galthran-wq/deepface/tree/master-enhanced

It combines these two PRs with some other small modifications:

  • .represent uses batched detector inference (in "Batching on .represent to improve performance and utilize GPU in full" #1433, only the embedding step is batched, because batched detection is not yet implemented there)
  • .represent returns a list of lists of dicts if a batch of images is passed. This is necessary to recover which image each resulting face corresponds to. It might be a good idea to include this change in this PR as well; you can check the test in the fork (see the sketch below).
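
An illustrative sketch of that nested return shape, assuming the fork's batched .represent (the model and detector names are just examples):

```python
from deepface import DeepFace

imgs = ["img1.jpg", "img2.jpg"]
results = DeepFace.represent(
    img_path=imgs, model_name="Facenet", detector_backend="yolov11m"
)

# One entry per input image; each entry is a list of dicts, one per face.
for img, faces in zip(imgs, results):
    print(img, "->", len(faces), "faces")
```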

Not all of the detectors (both in this PR and in the fork) currently implement batching. YOLO does, and I've found it to be optimal in terms of performance and inference speed. The only problem is installing both torch and tensorflow with GPU support, but I've managed to do that.

All in all, with the combination of yolov11m and Facenet, both using the GPU, and batch size 100 (the largest I could fit on a 4090), I am seeing around a 15x speed boost, but that is highly dependent on the input images and the GPU (especially its memory size). I've also had a quick peek, and it seems like performance on the CPU is improved as well.

@serengil FYI I would be happy to contribute the aforementioned modifications if we have progress on the PRs.

@serengil (Owner)

I will review this PR this week, I hope.

@serengil (Owner)

Seems this breaks the unit tests. Must be sorted.

@galthran-wq (Contributor, Author)

Should be good now.

@serengil (Owner)

Nope, still failing.

@serengil (Owner)

You implemented OpenCv, Ssd, Yolo, MtCnn and RetinaFace to accept list inputs.

What if I send a list to YuNet, MediaPipe, FastMtCnn, Dlib or CenterFace?

I assume an exception will be thrown, but users should see a meaningful message.
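
Something like the following guard would give that meaningful message (an illustrative sketch, not the PR's actual code; the class and helper names are hypothetical):

```python
import numpy as np

class UnbatchedDetectorMixin:
    """Hypothetical guard for detectors without batch support."""

    def detect_faces(self, img):
        # Fail fast with a clear message instead of a cryptic downstream error.
        if isinstance(img, list) or (isinstance(img, np.ndarray) and img.ndim == 4):
            raise ValueError(
                f"{type(self).__name__} does not support batched inputs yet; "
                "pass a single image or choose a detector with batch support."
            )
        return self._detect_single_image(img)  # hypothetical per-image helper
```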

@serengil (Owner)

@galthran-wq you are pushing too many things; would you please inform me when it is ready?

@galthran-wq (Contributor, Author)

galthran-wq commented Feb 18, 2025

> @galthran-wq you are pushing too many things; would you please inform me when it is ready?

It is ready now; I usually push when it's done.

I've just fixed what you've noted:

  • added pseudo-batching (a for loop) for the other models
  • support for batched numpy array input (dim=4); a sketch follows this list
  • changed extract_faces to return a list of lists on batched input
  • a couple more tests
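
For example, the dim=4 numpy path looks like this (a sketch with arbitrary shapes; random noise contains no faces, hence enforce_detection=False):

```python
import numpy as np

from deepface import DeepFace

batch = np.random.randint(0, 255, size=(4, 224, 224, 3), dtype=np.uint8)  # (N, H, W, C)
results = DeepFace.extract_faces(
    img_path=batch, detector_backend="retinaface", enforce_detection=False
)
assert len(results) == 4  # one list of face dicts per image in the batch
```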

@serengil (Owner)

Is this ready? I can still see my comments not resolved.

@galthran-wq (Contributor, Author)

> Is this ready? I can still see my comments not resolved.

Almost. Tell me if you think the batched numpy array test is okay, and let's also settle on whether a refactoring of the current detect_faces for non-batched detectors is needed (your last comment).

I plan to:

  • add test comments
  • fall back to pseudo-batching on opencv and mtcnn, and then make sure the rtol stays at 0.01%
  • (maybe) refactor the detectors a bit

@galthran-wq (Contributor, Author)

galthran-wq commented Feb 23, 2025

To summarize what's changed:

  • I've added comments and additional checks to the tests.
  • I've made batching on opencv and mtcnn optional (due to the issue above). To enforce batching, a user can set ENABLE_OPENCV_BATCH_DETECTION (or ENABLE_MTCNN_BATCH_DETECTION) to true; see the sketch below.
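
A sketch of that opt-in switch (the environment variable names come from this comment; the file names are illustrative):

```python
import os

# Opt in to true batched detection for opencv (and/or mtcnn) before calling deepface.
os.environ["ENABLE_OPENCV_BATCH_DETECTION"] = "true"
# os.environ["ENABLE_MTCNN_BATCH_DETECTION"] = "true"

from deepface import DeepFace

results = DeepFace.extract_faces(
    img_path=["a.jpg", "b.jpg"], detector_backend="opencv"
)
```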

Unfortunately, this didn't fix the batch extraction case on opencv: the problem is that it occasionally fails (the predictions seem to have some random behaviour, so the results can differ from run to run!). Note that this has nothing to do with batching, because batching is disabled by default. We might add a separate issue and test to reproduce this.

  • I have fixed the special case of a single image in a list input (a batch of size 1). It now indeed returns a list with a single element: the list of faces detected in that image.
  • The detectors that do not implement batching all had repeating logic in detect_faces. I have moved this logic into a default implementation in Detector. Now those detectors only need to implement _process_single_image, and batching is supported by inheritance (see the sketch after this list).
  • If a detector implements batching, it overrides detect_faces with its own logic, just as before.
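
A sketch of that scheme (detect_faces and _process_single_image are named in the comment above; the bodies here are illustrative rather than the PR's exact code):

```python
from abc import ABC, abstractmethod
from typing import List, Union

import numpy as np

class Detector(ABC):
    def detect_faces(
        self, img: Union[np.ndarray, List[np.ndarray]]
    ) -> Union[List, List[List]]:
        # Default pseudo-batching: loop over the batch one image at a time.
        # Detectors with native batch support override this method entirely.
        if isinstance(img, list) or (isinstance(img, np.ndarray) and img.ndim == 4):
            return [self._process_single_image(single_img) for single_img in img]
        return self._process_single_image(img)

    @abstractmethod
    def _process_single_image(self, img: np.ndarray) -> List:
        """Detect faces in a single image; the only method a subclass must provide."""
```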
