Body Detection with Computer Vision
Written by Matthew Ward, Senior Developer
In early 2019, we set out to build a functional high-fidelity prototype for user experience testing, and we wanted it to feel as real as possible. The user flows required the participants to use smartphones in non-traditional ways, and we needed to make sure these unfamiliar behaviors were as easy and friction-free as possible.
One interaction required a user to put their phone on the ground, step away, and have full-body photos taken. The challenge was knowing when they were fully in frame and the right distance from the device. Instead of faking it with a timer, we floated the idea of actually detecting the user before snapping the photos, so we could give users the appropriate prompts and make the experience feel more realistic. Our client loved this, so all we needed to do was figure out how to make it happen.
The project had a quick turnaround, with only four weeks to design and build the entire experience, and it had to work on an iPhone in real time. So, of course, we said, “No problem!” and got to work.
The prototyping mindset
With the short time frame, we didn’t have time for a deep dive into the world of computer vision. This is often the case with prototyping; we have to become proficient enough to get something working by jumping into new, cutting-edge technologies. To accomplish this, we’ve become very good at a few key skills.
First, we have to be able to find quality resources and examples quickly. There are people out there who have dedicated years to specific technologies, so we can leverage their learnings and examples to give us a jump start on solving the problem.
Second, we have to become masters of timeboxing. There’s no time to go down a rabbit hole. Sometimes, if the technology you’re trying to use just isn’t working, you have to scrap it and move on. We like to use timers and say, “If I can’t get this working in N minutes, I need to move on to something else.” It encourages us to keep moving.
Third, we have to break down these huge intimidating problems into small, digestible chunks. Real-time body detection sounds hard, but installing a framework and getting a console log to print is easy. Start small and keep building.
After some discussion and quick research, we split off into two areas of exploration for detecting if the user was fully visible in the camera frame while the designers were hard at work iterating on the look and feel of the experience. One of us decided to explore possible solutions using the (then) new TrueDepth Camera on the Apple iPhone X, and the other started looking into using OpenCV to do body detection.
We made some very cool mini prototypes using the TrueDepth Camera, but since it introduced a hardware limitation (the user has to have one of the latest iPhones for it to work), we decided to go with OpenCV, which required only a camera.
What is OpenCV?
OpenCV is a cross-platform, open-source, real-time computer vision library. It’s a C++ library with Java and Python wrappers and has algorithms that can detect human features, identify objects, classify human actions in videos, track objects, follow eye movements, recognize scenery, and much more. It works in real-time, and as an added bonus it already has built-in trained models to detect people. That saved us loads of time since we didn’t have to train a model ourselves.
There are a lot of different algorithms for doing image detection. Being brand new to any sort of computer vision work, we did what any self-respecting professionals would do and frantically googled things like “OpenCV body detection”. We found our way into some great blogs and Stack Overflow topics to help us get started.
The first algorithm we explored in OpenCV was Haar Cascade. Haar Cascade uses machine learning to identify any object it has been trained on. From the few blog posts we read, it seems to work by being trained on a large set of positive images of whatever you’re trying to detect, along with a large set of negative images. It then derives Haar features that show up consistently in the positive set. The algorithm scans through rectangular chunks of a photo and tries to detect any Haar features that match what it has been trained on. There’s a great video from Adam Harvey to help visualize how it works.
As an added bonus, Haar Cascade could detect upper vs. lower bodies, which could help us determine if the user’s legs or torso weren’t fully in frame and guide them to back up or adjust their phone to the right position.
After many parameter tweaks and a lot of testing in different locations, we decided we needed to try a different approach. Haar Cascade could indeed detect bodies, but not very reliably. It needed good lighting and a very clean background, and since this prototype was supposed to be used in someone’s home, we couldn’t rely on either of those.
If Photobooth can do it, why can’t we?
Before abandoning Haar Cascade entirely, we decided to introduce background subtraction before running Haar Cascade on the images. We were hoping this would help with the busy background issue. Luckily for us, OpenCV had a pretty easy function for background subtraction.
All we had to do was instruct the user to step out of frame so we could take a picture of the background. Then, when they re-entered the frame, OpenCV would identify which pixels were new and give us an image with the user cut out. It seemed like a great and foolproof plan, and we marveled at how brilliant we were to come up with such a genius solution to our problem.
Unfortunately, it wasn’t a solution at all. In the right conditions, background subtraction did indeed seem to help the Haar Cascade algorithm detect the user, but in the wrong conditions it made things even more difficult. If the user was too close to a wall, for instance, they would cast a shadow and noise up the image. We also needed to lock the camera exposure to get the cleanest result, and the phone couldn’t move at all after we captured the background image; otherwise, we would have to restart the flow all over again.
At this point we were frustrated and feeling defeated. We started casting about wildly for other techniques we could use to reliably detect somebody in a picture.
Grasping at straws
Branching off of our background-removal idea, we came across the MoG (Mixture of Gaussians) technique, which is basically a rolling background removal. It compares the last N frames to each other and removes any pixels that didn’t change. It removes backgrounds extremely well since it’s constantly learning and adapting, but unfortunately that also means that if a user comes into frame and stands still, they will disappear into the black void.
We’ll put this technique in our back pocket in case we need to do motion detection in the future (or make a very cool music video), but for the time being we needed to push forward and figure out a solution. The clock was ticking. Tensions were high. We hit a wall and didn’t know if it would be possible to have a good body-detection experience in this prototype, which would mean hours of time wasted, and that’s never a good feeling.
At our lowest point, from the heavens, a beacon of light appeared. That beacon of light was our co-worker Mike Creighton who appeared at our desk saying, “Hey have you heard of the HOG technique?”
HOG stands for Histogram of Oriented Gradients. It works similarly to the Haar technique, but instead of detecting blocks of dark and light, it detects angles of gradients. It was super easy to set up and pretty similar, implementation-wise, to Haar Cascade.
So we gave it a shot. We implemented the code, wildly tweaked some parameters, and it worked! Not only did it work, but it worked in bad lighting conditions, it worked with busy backgrounds, and it even worked if you were doing weird poses. It was incredible. We went into the deepest darkest cave we could find around the office (the storage room behind the front desk, for future reference) and did a test. It had no trouble at all detecting us. It was just what we needed and really helped with the success of our prototype. We got it built into the prototype, and our clients were really pleased to see it actually waited for the user to be fully in frame before it took the photos. It added that extra touch of fidelity and problem solving that our clients have come to expect from us, and it felt great to not let them down.
So what did we learn?
Computers don’t see things the way we see them; they just see data. Luckily, some very smart people have written software that makes implementing computer vision a lot easier than one might think. Granted, it takes a lot of parameter tweaking to get accurate and performant results. It can also be processor-intensive and use a lot of battery (my test device was doubling as a hand-warmer).
We were also reminded that getting stuck is good. Hitting a wall and having to throw away code and start over is a great practice when prototyping since it’s guaranteed to happen often. And last but not least, it reminded us how incredibly satisfying it is to whip something together in a very short amount of time, to write scrappy (and sometimes crappy) code to accomplish what you need, to move quickly and figure things out as you go. As long as you don’t give up, and with the occasional help from a friend, you can figure just about anything out.