
Quasar: Pt. 2 — Combating Bias

Quasar Post-Mortem: Part 2


Below is a summary of the articles in this post-mortem. Although having full context is recommended, the subject matter explored herein covers many disciplines and can be consumed non-linearly. Feel free to jump around to whichever area interests you most:

Executive Summary
Uncanny Valley
Art Direction

Interoperability
Standards
Systems Summary

Respecting User Privacy
Designing for Impatience
Design Artifacts

Shortcomings
Avatar Samples
Personal Contributions

Combating Bias

Racial and gender bias in computer vision applications is fairly well established at this point — just about every large tech company has had at least one incident of building bias into a computer vision implementation.

Examples of past CV biases:

Please note that each of these issues has since been addressed. It is not my intention to single out any one company or product, but rather to demonstrate the pervasiveness of the problem.

  • Amazon Prime Video: Prime Video’s “X-Ray” feature unreliably identified dark-skinned actors’ faces, or failed to detect a dark-skinned face on-screen at all

  • IBM Watson: IBM’s Watson facial recognition model was unreliable at detecting people of color and, to an even greater extent, women of all skin tones

  • Google Vision AI: Google’s image recognition produced inequitable predictions — e.g. a dark-skinned hand holding a monocular device was predicted as a gun, while the exact same image with a light-skinned hand was predicted, accurately, as a monocular device

How does this happen?

First and foremost, I encourage everyone reading this to check out the superb documentary Coded Bias (2020) — Joy Buolamwini incisively illuminates exactly how implicit bias has made its way into our algorithms, the many destructive effects it has, and what we can do to help the situation.

There's a problem with our data.

Each of these CV models was trained on a dataset of some variety — and those datasets are small snapshots of our very real human history. It does not take much consideration to see how our history could impose bias on a dataset.


How did Quasar combat bias?

I’ll be the first to say Quasar is not perfect, but from the very beginning building-for-inclusion was identified as the prime directive by which every decision should, and would, be weighed. This required adopting a thought and development process which constantly asked — is this accessible to everyone? Does this implementation have the same success rate, the same User Experience, regardless of who is using it? In the world of CV models, historically the answer hasn't been encouraging — and we should actively develop solutions to avoid that scenario.

Solutions for avoiding CV bias:

  • Leveraging IR depth data

    • This is the holy grail of unbiased data — depth cameras do not require visible light to ‘see’ the physical world. This was one of the reasons iPhone X+ was chosen as a primary development platform; we could use depth data not only to positively identify an individual’s presence in the frame, but also to improve the overall transference of their likeness onto a 3D mesh (avatar head)

  • Pre-processing input images

    • Input images should have some logic in place for identifying when, and to what extent, to apply image processing techniques in order to remove troublesome lighting information — we called this step the “Delighter” (a rough sketch of this gating logic follows the list below)

    • Shortcoming: I expected we could use the iPhone’s ambient light sensor data to aid in stripping lighting information from the input images — however, it turns out Apple will reject any app which accesses this data! Hopefully this changes in the future.

  • Expanding training set coverage

    • This one is probably obvious; the more samples which feature underserved audiences (historically women and people of color) in the training set, the less difficulty the model will have in parsing the subject matter

  • Diversifying tester groups

    • The most obvious solution of all — ensure the QA/testing population sufficiently represents the full spectrum of humankind

    • Establish processes to ensure that a high frequency of homogeneous testers does not mute the experiences of less frequent (i.e. more diverse) testers
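
To make the “Delighter” pre-processing idea (second bullet above) a little more concrete, here is a minimal sketch of the kind of gating logic described there. It is purely illustrative: the input is assumed to be a float RGB image in [0, 1], and the target luminance, threshold, and gamma-style correction are stand-ins of my own choosing rather than Quasar’s actual implementation.

```python
# Hypothetical sketch of a "Delighter"-style gate: decide whether (and how strongly)
# to normalize lighting in an input image before it reaches the CV model.
# All names and numbers below are illustrative assumptions, not Quasar's code.
import numpy as np

TARGET_MEAN_LUMA = 0.45      # assumed "neutral" exposure for the downstream model
CORRECTION_THRESHOLD = 0.08  # skip correction when the capture is already close

def luminance(rgb: np.ndarray) -> np.ndarray:
    """Per-pixel relative luminance of a float RGB image in [0, 1]."""
    return 0.2126 * rgb[..., 0] + 0.7152 * rgb[..., 1] + 0.0722 * rgb[..., 2]

def delight(rgb: np.ndarray) -> np.ndarray:
    """Apply a gamma correction only when the frame is notably over- or under-exposed."""
    mean_luma = float(np.clip(luminance(rgb).mean(), 1e-6, 0.999))
    if abs(mean_luma - TARGET_MEAN_LUMA) < CORRECTION_THRESHOLD:
        return rgb  # lighting already acceptable; leave the capture untouched
    # Solve for the gamma that maps the observed mean luminance toward the target.
    gamma = np.log(TARGET_MEAN_LUMA) / np.log(mean_luma)
    return np.clip(rgb, 0.0, 1.0) ** gamma
```

The important part is the gate itself, i.e. deciding when and to what extent to intervene; the specific correction could just as easily be histogram equalization or a learned relighting step.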

And most importantly — talk about it.

These subjects can be difficult to broach at first, but I assure you once that seal is broken it becomes a development pillar just like any other. Unearthing our own blind spots is not something we should be afraid of — the true travesty is not making an attempt to find them at all. It is a sign of respect to inquire about the experiences of others with curiosity, and the inverse is just as true.

Empathy Fuel

Imagine what it must feel like to be using a cool new app with your friends — everyone is passing the phone around and laughing at the latest camera filter which spews rainbows and glitter out of your ears — but when it gets to you… it doesn’t work. In fact it works for everyone but you. This could understandably make you feel “less than” your peers. It might even feel like the world doesn’t have a place for you, or that you were completely forgotten and simply don’t matter. These experiences are a reality for a great many people, especially those actively trying to find their place in the world, such as children and young adults, for whom these are the most formative years. An exclusionary experience like this can have potentially life-long impacts on our confidence and character. We shouldn’t turn away from the pain of empathy — empathy is the motivation!


The Regional Template System

The Regional Template System was initially theorized as a means to reduce implicit bias, but it also improved the overall output of the application for users of all colors. Each template (seen above) represents the average appearance of an individual from that particular region of the world. 

In the simplest terms possible, RTS compares the face shape of the user with the average appearance of individuals from around the world to find the best foundation to layer their likeness upon.

Unsurprisingly, one of the best ways to improve the output of a procedural avatar system, or really any system that transfers surface morphology to a separate mesh, is ensuring you have the best ‘starting point’ for the target mesh — i.e. a base mesh which is as close to the target shape (the user’s physical face in this case) as possible. Transferring everyone’s ARKit likeness (face mesh) to the exact same base template model would have naturally imparted some of that template’s shape data upon the generated head model; having a range of templates is a means of mitigating how much bias we impart on generated models.

RTS is a fairly straightforward procedure — essentially it compares the facial point patterns (and, to a lesser extent, skin color*) of the user’s ARKit face mesh with those same feature patterns found in a series of 12 regional templates. As each template was created from the same base mesh (and had consistent vertex IDs), it was then possible to programmatically adjust the influence of each of these template layers until a “best fit” threshold was reached (a simplified sketch of this fitting step follows the footnote below).

* Leveraging skin color is generally less robust; as mentioned previously, the effect of lighting conditions on an RGB capture adds an additional variable to negotiate, or negate really. The facial point pattern comparison values take precedence in the final template auto-selection. However, the RGB data is useful for ascertaining the best starting point — i.e. which template’s influence should we begin adjusting before others to find the best fit? With 12 templates, the ability to combine multiple templates in small increments, and 24 delta measurements per comparison, “good enough” thresholds, picking the best starting point by sampling skin color, and other computation-reducing logic all help keep the search tractable.
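
For the curious, below is a simplified, hypothetical sketch of that fitting loop. It starts from the template suggested by the sampled skin color, then nudges each template’s influence in small increments until the blended mesh is within a “good enough” error of the user’s captured mesh. The array shapes, step size, and threshold are assumptions for illustration, and the real comparison used 24 delta measurements per template rather than the raw per-vertex distances shown here.

```python
# Simplified, hypothetical sketch of the RTS fitting step: blend the 12 regional
# templates (shared topology, consistent vertex IDs) in small weight increments
# until the blended mesh is "close enough" to the user's captured face mesh.
import numpy as np

def fit_template_weights(
    user_vertices: np.ndarray,      # (V, 3) ARKit-style face mesh vertices
    template_vertices: np.ndarray,  # (12, V, 3) regional template vertices
    start_index: int,               # template suggested by the sampled skin color
    step: float = 0.05,             # "small increment" of template influence
    good_enough: float = 1e-3,      # stop once the mean error drops below this
    max_iters: int = 200,
) -> np.ndarray:
    n_templates = template_vertices.shape[0]
    weights = np.zeros(n_templates)
    weights[start_index] = 1.0      # begin from the most promising template

    def error(w: np.ndarray) -> float:
        blended = np.tensordot(w, template_vertices, axes=1)  # (V, 3) weighted blend
        return float(np.mean(np.linalg.norm(blended - user_vertices, axis=1)))

    for _ in range(max_iters):
        current = error(weights)
        if current < good_enough:
            break                                  # "best fit" threshold reached
        # Try nudging each template's influence up or down; keep the best move.
        best_move, best_err = None, current
        for i in range(n_templates):
            for delta in (+step, -step):
                candidate = weights.copy()
                candidate[i] = np.clip(candidate[i] + delta, 0.0, 1.0)
                total = candidate.sum()
                if total > 0:
                    candidate = candidate / total  # keep influences normalized
                err = error(candidate)
                if err < best_err:
                    best_move, best_err = candidate, err
        if best_move is None:
            break                                  # no improving adjustment found
        weights = best_move
    return weights
```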

How did we ascertain the average appearance of each regional template? 

We accomplished this, initially, by collecting standardized images of individuals — each individual was captured from the same angles (front and side), and under the same (or remarkably similar) lighting conditions. These captures were then composited together to create an average appearance for that particular country. With these country-specific composites, we were then able to combine the countries within each of the 6 macroregions (a brief averaging sketch follows the list below):

  • Africa

  • East Asia

  • Europe

  • Middle East

  • South America

  • South Asia

These macroregion composites were then used as the ground truth for carefully and painstakingly sculpting a 3D version of each template, which is what you see above. 
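
The compositing step itself is conceptually simple, as sketched below. The code assumes the captures have already been standardized and aligned upstream, and it weights every country equally within a macroregion, which is one reasonable choice on my part rather than a statement of how the real pipeline weighted its samples.

```python
# Minimal sketch of the compositing step, assuming standardized, pre-aligned
# float RGB captures in [0, 1]. Alignment and cropping happen upstream;
# this only illustrates the per-pixel averaging itself.
import numpy as np

def average_appearance(aligned_images: list[np.ndarray]) -> np.ndarray:
    """Per-pixel mean of equally sized (H, W, 3) images."""
    return np.stack(aligned_images, axis=0).mean(axis=0)

def macroregion_composite(country_image_sets: list[list[np.ndarray]]) -> np.ndarray:
    """Average each country first, then average the country composites so every
    country contributes equally to the macroregion (an assumption on my part)."""
    country_composites = [average_appearance(imgs) for imgs in country_image_sets]
    return average_appearance(country_composites)
```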

RTS Improvements:

  • RTS template meshes are essentially abstractions; ideally they would be data in its purest form, never authored by a sculptor’s hand, as hand-sculpting will always impart some amount of the artist’s discretion

  • Pacific Islander, Native American, Inuit and other indigenous peoples are not covered by the macroregions

    • Note: Collecting samples of these populations was always part of the original plan but was not possible with the resources available at the time

The ideal solution would be to capture 3D scan data of all subjects.

Statistical models can then group morphologically similar individuals, regardless of ancestral home, and average them to generate any number of templates based on ground-truth surface data of real phenotypes. A great example of this type of statistical model (using hands in place of faces) is Dávid Komorowicz’s implementation of the MANO model.
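
As a rough illustration of what that could look like, the sketch below clusters registered scans (shared topology and consistent vertex IDs, just like the RTS templates) and treats each cluster mean as a data-driven template. It uses scikit-learn’s KMeans purely for convenience; the scan shapes and the choice of 12 clusters are assumptions, and a production pipeline would more likely build a proper statistical shape model along the lines of MANO.

```python
# Hedged sketch of the "ideal" template pipeline: cluster registered 3D scans
# and let each cluster mean become a template, with no sculptor's hand involved.
# Shapes and cluster count are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def generate_templates(scans: np.ndarray, n_templates: int = 12) -> np.ndarray:
    """scans: (N, V, 3) registered face scans -> (n_templates, V, 3) templates."""
    n, v, _ = scans.shape
    flat = scans.reshape(n, v * 3)  # one feature vector per subject
    clusters = KMeans(n_clusters=n_templates, n_init=10, random_state=0).fit(flat)
    # Each centroid is the average surface of its cluster: a template derived from data.
    return clusters.cluster_centers_.reshape(n_templates, v, 3)
```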

Ultimately RTS did a solid job of selecting the best template values for transferring the user’s likeness, but refinements to the template data collection process and generation method could have further improved the end result.

In Closing

RTS is a great example of how diversity improves the end result for everyone — the output of Quasar was dramatically improved for users of all shapes, sizes and colors by simply considering what makes us unique and how those unique qualities can potentially affect a user’s experience.

It is my hope that readers of this article feel compelled, and hopefully inspired, to begin thinking about inclusion from the very beginning of a project. Systems like RTS may have never been considered without a building-for-inclusion prime directive — instead, we could have easily found ourselves, and our product, as part of the problem.

 