I’ve been playing around a lot with the Shape Detection API in Chrome and I really like its potential. For example, a very simple QRCode detector I wrote a long time ago ships with a JS polyfill, but uses the new BarcodeDetector() API if it is available.
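The detection logic in that demo is roughly the following sketch (loadQRCodePolyfill is a hypothetical stand-in for whatever JS decoder the polyfill ships, and imageBitmap is any ImageBitmapSource you have to hand):

```javascript
// Prefer the native BarcodeDetector; fall back to a JS decoder.
// loadQRCodePolyfill() is a hypothetical loader for the polyfill.
async function getQRDetector() {
  if ('BarcodeDetector' in window) {
    return new BarcodeDetector({ formats: ['qr_code'] });
  }
  return loadQRCodePolyfill();
}

// Both paths expose the same detect() shape, so call sites don't
// care which implementation they got back.
const detector = await getQRDetector();
const codes = await detector.detect(imageBitmap); // [{ rawValue, boundingBox, ... }]
```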
You can see some of the other demos I’ve built using the other capabilities of the Shape Detection API here: Face Detection, Barcode Detection and Text Detection.
I was pleasantly surprised when I stumbled across Jeeliz at the weekend, and I was incredibly impressed by the performance of their toolkit. Granted, I was using a Pixel 3 XL, but face detection seemed significantly quicker than what is possible with the FaceDetector API.
It got me thinking a lot. This toolkit for Object Detection (and ones like it) uses APIs that are broadly available on the Web, specifically camera access, WebGL and WASM. Unlike Chrome’s Shape Detection API (which is only in Chrome and not consistent across all the platforms Chrome runs on), these APIs can be used to build rich experiences easily and reach billions of users with a consistent experience across all platforms.
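Getting camera frames into a toolkit like that needs nothing beyond those standard APIs. A minimal sketch, assuming a video element is already on the page:

```javascript
// Standard, widely supported camera access; the frames can then be
// fed into a WebGL/WASM detection library such as Jeeliz.
const stream = await navigator.mediaDevices.getUserMedia({
  video: { facingMode: 'user' }
});
const video = document.querySelector('video');
video.srcObject = stream;
await video.play();
// From here a library can sample frames, for example by drawing the
// video into a canvas or uploading it as a WebGL texture each frame.
```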
Augmentation is where it gets interesting (and really what I wanted to show off in this post), and it is where you need the middleware libraries that are now coming to the platform. We can build the fun Snapchat-esque face-filter apps without having users install MASSIVE apps that harvest huge amounts of data from the user’s device (because there is no underlying access to the system).
Outside of the fun demos, it’s possible to solve very advanced use-cases quickly and simply for the user, such as:
- Text Selection directly from the camera or a photo from the user
- Live translation of languages from the camera
- Inline QRCode detection so people don’t have to open WeChat all the time :)
- Auto-extract website URLs or addresses from an image
- Credit card detection and number extraction (get users signing up to your site quicker)
- Visual product search in your store’s web app.
- Barcode lookup for more product details in your store’s web app.
- Quick cropping of profile photos to people’s faces.
- Simple A11Y features to let a user hear the text found in images (see the sketch after this list).
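As a concrete sketch of that last bullet, here is roughly what the A11Y case could look like with the TextDetector (still behind a flag in Chrome; the aria-live element is an assumption about your markup):

```javascript
// Detect text in an image and surface it to screen readers.
// Assumes <div id="announcer" aria-live="polite"></div> is in the page.
async function announceImageText(img) {
  if (!('TextDetector' in window)) return; // no support, bail quietly
  const detector = new TextDetector();
  const blocks = await detector.detect(img);
  document.getElementById('announcer').textContent =
      blocks.map(block => block.rawValue).join(' ');
}
```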
I just spent 5 minutes thinking about these use-cases — I know there are a lot more — but it hit me that we don’t see a lot of sites or web apps utilising the camera, instead we see a lot of sites asking their users to download an app, and I don’t think we need to do that any more.
Update: Thomas Steiner on our team mentioned in our team chat that it sounds like I don’t like the current Shape Detection API. I love the fact that this API gives us access to the native shipping implementations on each of the respective systems; however, as I wrote in The Lumpy Web, web developers crave consistency in the platform, and there are a number of issues with the Shape Detection API that can be summarized as:
- The API is only in Chrome
- The API in Chrome is vastly different on every platform because the underlying implementations are different. Android only has points for landmarks such as the mouth and eyes, whereas macOS has outlines. On Android the TextDetector returns the detected text, whereas on macOS it returns a ‘Text Presence’ indicator… That’s not to mention all the bugs that Surma found. (The sketch below shows where these differences surface.)
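To make that concrete, here is a minimal FaceDetector sketch; the call is identical everywhere, but what comes back in landmarks depends on the OS implementation behind Chrome (image is any CanvasImageSource you have to hand):

```javascript
// The same code on every platform, but the richness of the result
// depends on the native implementation Chrome delegates to.
const faceDetector = new FaceDetector({ fastMode: true });
const faces = await faceDetector.detect(image);
for (const face of faces) {
  // On Android each landmark's locations is a single point; on macOS
  // it can be a full outline of points.
  console.log(face.boundingBox, face.landmarks);
}
```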
The web as a platform for distribution makes so much sense for experiences like these that I think it would be remiss of us not to do it. But the above two groupings of issues lead me to question the long-term need to implement every feature natively in the web platform, when we could implement good solutions in a package that is shipped using the features of the platform today, like WebGL and WASM, and in the future WebGPU.
Anyway, I love the fact that we can do this on the web, and I am looking forward to seeing sites ship with these capabilities.