Q: To detect faces of at least 40x40 on an 800x600 image, the system actually scans a 240x180 image with the 12-net, so the number of candidate windows is only 2494. And for the other pyramid levels there will be far fewer windows to classify, right?
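Note: the 2494 figure can be reproduced with a short calculation. The sketch below assumes a 12x12 window slid with a stride of 4 pixels and no padding. The stride is not stated in this thread, so treat it as an assumption; the function name `count_windows` is made up for illustration.

```python
def count_windows(width, height, window=12, stride=4):
    """Count sliding-window positions on a width x height image.

    window=12 matches the 12-net input size; stride=4 is an assumption.
    """
    if width < window or height < window:
        return 0
    cols = (width - window) // stride + 1
    rows = (height - window) // stride + 1
    return cols * rows


# A 40x40 face must shrink to 12x12, so the image is scaled by 12/40.
min_face = 40
scaled_w = 800 * 12 // min_face  # 240
scaled_h = 600 * 12 // min_face  # 180

print(scaled_w, scaled_h, count_windows(scaled_w, scaled_h))
```

Under these assumptions the grid is 58 columns by 43 rows, which gives 2494 windows.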
Q: How many pyramid levels did you use?
A: On AFW we use up to 8 pyramid levels with a scaling factor of 1.414. On FDDB, because of the small faces there, we use up to 16 pyramid levels with a scaling factor of 1.18.
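Note: a rough sketch of how such a pyramid could be laid out, using the AFW setting (up to 8 levels, factor 1.414). It assumes the first level maps the minimum face size onto the 12x12 input and that each further level shrinks the image by the scaling factor. The 800x600 image and 40-pixel minimum face are borrowed from the question above, and `pyramid_sizes` is a hypothetical helper, not the authors' code.

```python
def pyramid_sizes(width, height, min_face, factor, max_levels, window=12):
    """Return the (width, height) of each pyramid level, largest first.

    Assumption: level k is scaled by (window / min_face) / factor**k, and
    levels that become smaller than the detection window are dropped.
    """
    base = window / min_face
    sizes = []
    for level in range(max_levels):
        scale = base / (factor ** level)
        w = int(round(width * scale))
        h = int(round(height * scale))
        if w < window or h < window:
            break
        sizes.append((w, h))
    return sizes


levels = pyramid_sizes(800, 600, min_face=40, factor=1.414, max_levels=8)
for w, h in levels:
    print(w, h)
```

Because every level is smaller than the one before it, the first level accounts for most of the candidate windows.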
Q: In my experience, more candidate windows should be classified in the first cascade stage; otherwise some faces will be missed.
A: Yes, we need enough sliding windows in the very first stage for good recall, but a few thousand seem to be enough. We show in Table 1 that with about 5000 candidate windows, the recall on FDDB can be up to 95%. There are many small faces in FDDB, so you may need a more aggressive setting to get a higher recall.
Q: Is any face augmentation used in training?
Q: Is the image pyramid used to handle faces at different scales in testing?
Q: The 24-net gets a relatively low recall (91% on AFW), even when a very low confidence threshold (0.02) is used.
A: My guess is that the negative samples used to train the 24-net are too hard.
Q: When training the 48-net, it is very difficult to collect enough training samples (about 3 faces per image).
A: You can add more images and use more scales in building the image pyramid for negative mining.