
Behind the Code – Immersive Storytelling with AR and Google Cloud


The New York Times is digitizing its famed photo archive, nicknamed The Morgue, with the help of Google Cloud technology. Instrument was asked to create a digital experience that showcases this historic collaboration and to help others imagine what might be possible using Google Cloud.

We created a web-based digital experience that took real photos from The Morgue and combined them with data derived from Google Cloud technologies to tell new stories that are not apparent from just looking at the photograph. To pull people into the experience, we created a separate web-based AR application that could identify the photos in special ads printed in the Times and on billboards and posters around New York over the holidays. Once the app successfully identified a photo, we displayed more information about it and directed users to read the stories based on that photo.

AR Experience

AR applications can be roughly sorted into two categories: web-based or native. The difference between the two, simply put, is that native AR applications are more powerful. They are able to directly leverage the capabilities of the specific device they were created for, while web-based applications have more technological and performance limitations given that they are created to work across several devices. 

Since we were asking users to engage with print advertisements in such a casual, short-term way, we chose to accept the limitations of the web-based approach in order to reach the largest audience possible and to avoid requiring users to make the commitment of downloading an app. As a result, all our functionality was implemented in TypeScript, transpiled to JavaScript, and run in the visitor’s mobile browser.

The AR experience begins by asking the user for camera permission, then moves into an ‘analyzing’ state. After opening the video stream from the camera, we begin performing edge detection on each frame and displaying ambient animated SVG dots over the video along selected edges. At the same time, we begin the analysis work to find out whether the user is pointing the camera at one of the ads.
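As a rough sketch (not our production code, and the onFrame callback here is just a hypothetical stand-in for the edge detection and ad recognition described above), opening the camera and feeding frames into analysis might look like this:

async function startCamera(video: HTMLVideoElement): Promise<void> {
  // Ask for the rear camera; this is the permission prompt the user sees.
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { facingMode: 'environment' },
    audio: false
  });
  video.setAttribute('playsinline', ''); // keep playback inline on iOS Safari
  video.srcObject = stream;
  await video.play();
}

function startAnalysisLoop(
  video: HTMLVideoElement,
  onFrame: (frame: ImageData) => void
): void {
  // A small offscreen canvas keeps the per-frame pixel work cheap.
  const canvas = document.createElement('canvas');
  canvas.width = 256;
  canvas.height = Math.round(256 * (video.videoHeight / video.videoWidth));
  const ctx = canvas.getContext('2d')!;

  const tick = () => {
    ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
    onFrame(ctx.getImageData(0, 0, canvas.width, canvas.height));
    requestAnimationFrame(tick);
  };
  requestAnimationFrame(tick);
}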

This analysis task could be broken down into several steps:

  • Recognize that what the camera sees is one of our ads

  • Find the bounds of the photo itself in the frame

  • Compare the photo in those bounds to our set of known photos so that we can direct the user to the right set of stories on the site. 

The ads themselves contain a few “hints” that help the scanning process work more reliably. We explored a number of different options for these hints and landed on a color-coded bar the width of each photo, containing four colors of varying lengths in a specific order. These bars help us identify that we’re looking at an ad, determine where the photo starts and ends, and narrow down which photo it might be.
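To give a flavor of how those hints can be described in code (the interfaces, IDs, and HSV ranges below are illustrative placeholders, not our production values), each known photo gets a small signature: the four bar colors in order, plus which segment is the longest.

// Hypothetical shape of a color-bar "hint". Each color is an inclusive
// HSV range so that real-world lighting shifts can still match.
interface HsvRange {
  h: [number, number]; // hue, degrees 0-360
  s: [number, number]; // saturation, 0-1
  v: [number, number]; // value, 0-1
}

interface BarSignature {
  photoId: string;        // which photo (and set of stories) this bar maps to
  colors: HsvRange[];     // the four bar colors, in left-to-right order
  longestSegment: number; // index of the widest color segment
}

// Placeholder example entry.
const KNOWN_BARS: BarSignature[] = [
  {
    photoId: 'example-photo',
    colors: [
      { h: [0, 15],    s: [0.6, 1], v: [0.4, 1] }, // red
      { h: [200, 230], s: [0.5, 1], v: [0.4, 1] }, // blue
      { h: [40, 60],   s: [0.6, 1], v: [0.5, 1] }, // yellow
      { h: [120, 150], s: [0.4, 1], v: [0.3, 1] }  // green
    ],
    longestSegment: 2
  }
];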

We draw each frame to a 256px-wide offscreen canvas for performance, so there are fewer pixels to analyze. From that point we start at the bottom left and go up and across, pixel by pixel, looking for our set of known HSV (hue, saturation, value) color ranges. Once we have a match, we start recording a run and move across the photo looking for the next color. When we reach all four of our known colors, we check which one was the largest and determine their order. We do this on every video frame, so optimizing performance was key to keeping the video running smoothly.
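A simplified sketch of that scanning pass might look like the following (it reuses the BarSignature and HsvRange shapes from the sketch above and an rgbToHsv helper like the one shown in the Learnings section below; the real implementation has more noise handling):

// Assumed helper: standard RGB-to-HSV conversion, sketched later in this post.
declare function rgbToHsv(r: number, g: number, b: number): [number, number, number];

interface ColorRun {
  rangeIndex: number; // which of the signature's colors this run matched
  length: number;     // run length in pixels
}

function inRange(hsv: [number, number, number], range: HsvRange): boolean {
  const [h, s, v] = hsv;
  return h >= range.h[0] && h <= range.h[1] &&
         s >= range.s[0] && s <= range.s[1] &&
         v >= range.v[0] && v <= range.v[1];
}

function findBar(frame: ImageData, sig: BarSignature): ColorRun[] | null {
  const { width, height, data } = frame;
  // Walk up from the bottom of the frame, one row at a time.
  for (let y = height - 1; y >= 0; y--) {
    const runs: ColorRun[] = [];
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;
      const hsv = rgbToHsv(data[i], data[i + 1], data[i + 2]);
      const match = sig.colors.findIndex(range => inRange(hsv, range));
      if (match === -1) continue; // not one of our bar colors
      const last = runs[runs.length - 1];
      if (last && last.rangeIndex === match) {
        last.length++; // extend the current run
      } else {
        runs.push({ rangeIndex: match, length: 1 });
      }
    }
    // A row with all four colors, in the expected order, is our bar.
    if (runs.length === sig.colors.length &&
        runs.every((run, idx) => run.rangeIndex === idx)) {
      // Confirm the widest segment is the one this signature expects.
      const widest = runs.reduce(
        (best, run, idx) => (run.length > runs[best].length ? idx : best), 0);
      if (widest === sig.longestSegment) return runs;
    }
  }
  return null;
}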

At this point in the process, we’ve narrowed down which photo we might be looking at. To be sure we have a match, we use a library called Pixelmatch. Using the width of the color bar and the aspect ratio of the photo we think we have, we crop the frame, compare it to a known version of the photo, and perform a pixel-based diff. If the number of differing pixels is acceptably low, we have a match! This triggers a success animation, mapped back onto a canvas on top of the camera video so that dots animate and coalesce around the specific points of the image that our stories reference. From there we lead users to the full responsive site, where they can read the specific story they choose and explore all the other photos in the series.
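In rough terms, the Pixelmatch check looks something like this (the cropping is assumed to have happened already, and the threshold and cutoff values here are illustrative rather than our production numbers):

import pixelmatch from 'pixelmatch';

// `croppedFrame` is the region of the camera frame we think contains the
// photo, already resized to the same dimensions as `reference`, our known
// copy of that photo.
function isMatch(
  croppedFrame: ImageData,
  reference: ImageData,
  maxDiffRatio = 0.15
): boolean {
  const { width, height } = reference;
  const diffCount = pixelmatch(
    croppedFrame.data, // RGBA pixels from the camera crop
    reference.data,    // RGBA pixels of the known photo
    null,              // we only need the count, not a diff image
    width,
    height,
    { threshold: 0.3 } // per-pixel color tolerance (0..1)
  );
  return diffCount / (width * height) <= maxDiffRatio;
}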

Website

The website part of the experience was relatively small, but had many custom details in its design and content structure. We decided that a minimalist approach to the use of libraries and frameworks would give us the best flexibility and the lowest overhead in this scenario. Both the site and the AR experience were built using TypeScript, SCSS, and the Flask web framework, hosted on Google App Engine. In order to help optimize bandwidth usage for mobile users, we avoided the use of animation libraries, instead using primarily CSS animations for the transitions and other effects on the site.

Determining the best way to manage the content for this site was an interesting challenge. We wanted the flexibility to structure the content however the project demanded, and we knew that the content would include not just lots of images but also data-based overlays on top of those images. The exact content structure was still evolving during development, and we knew that after launch there would be no further content management needs. These factors led us to choose an approach which prioritized development speed and flexibility over ease of content entry.

We affectionately referred to the solution we landed on as “BDOYF” (ba-doif, or Big Directory of YAML Files). The content of the pages was managed in static YAML files, with the directory structure and naming conventions defining the URL structure of the site. Assets were stored in Google Cloud Storage, and we leveraged the platform’s asset processing features to automatically generate the different image sizes we needed for different target screen resolutions. This approach was not ideal from a content entry standpoint, but it did allow us to get up and running quickly, rapidly make changes to our content model, and let front-end development proceed as if a more elaborate system were in place.
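As a purely hypothetical illustration of the idea (the real directory and file names were different), the layout might look something like this, with each YAML file’s path determining the URL it is served at:

content/
  photos/
    example-photo/
      index.yaml          ->  /photos/example-photo
      stories/
        story-one.yaml    ->  /photos/example-photo/stories/story-one
        story-two.yaml    ->  /photos/example-photo/stories/story-two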

Learnings

Combining the digital and physical worlds is unpredictable and hard! Our image recognition process keyed off of a set of four colors. We knew the digital values that represent these colors, but once you hit the real world, things get a lot less straightforward. The exact color values the camera picks up can be affected by many variables outside our control:

  • luminosity and color temperature of any light source

  • variations in the printing process and ink used

  • the color and texture of the paper or material that was printed on

  • shadows cast on the printed material

  • properties of the specific camera in the user’s phone

  • internal white balance correction

We started out using RGB values, but quickly discovered they are the least reliable way to define a range of colors to judge against. HSL (and HSV) are much better suited to calculating color ranges, since hue (H) and saturation (S) fall on continuous ranges and are largely independent of lightness (L) or value (V). This lets us set a tolerance for each color range and worry less about the lighting situation around the ad.
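For reference, converting a camera pixel from RGB into HSV is the standard textbook formula; a sketch (not lifted from our codebase) looks like this:

// Convert 0-255 RGB to HSV: hue in degrees (0-360), saturation and value
// in 0-1. Matching then only needs a tolerance on hue and saturation.
function rgbToHsv(r: number, g: number, b: number): [number, number, number] {
  const rn = r / 255, gn = g / 255, bn = b / 255;
  const max = Math.max(rn, gn, bn);
  const min = Math.min(rn, gn, bn);
  const delta = max - min;

  let h = 0;
  if (delta !== 0) {
    if (max === rn) h = 60 * (((gn - bn) / delta) % 6);
    else if (max === gn) h = 60 * ((bn - rn) / delta + 2);
    else h = 60 * ((rn - gn) / delta + 4);
  }
  if (h < 0) h += 360;

  const s = max === 0 ? 0 : delta / max;
  return [h, s, max];
}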

 
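The snippet below shows the pre-processing color correction we apply to each frame before any color matching, scaling the channels to compensate for unruly white points: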
/**
 * Apply some pre-processing color correction to the pixels
 */
applyColorCorrection() {
  // Accumulate all R,G,B values, then scale the values
  // to fix weird/unruly white points.
  let accum = [0, 0, 0];
  let bright = [0, 0, 0];
  for (let i = 0; i < this.canvas.width; i++) {
    for (let j = 0; j < this.canvas.height; j++) {
      const index = (j * this.canvas.width + i) * 4;
      const r = this.pixels[index];
      const g = this.pixels[index + 1];
      const b = this.pixels[index + 2];
      accum[0] += r;
      accum[1] += g;
      accum[2] += b;
      bright[0] = Math.max(bright[0], r);
      bright[1] = Math.max(bright[1], g);
      bright[2] = Math.max(bright[2], b);
    }
  }

  const minChannel = Math.min(accum[0], accum[1], accum[2]);

  // Multiply channels to be _darker_ (don't blow out pixel values
  // above 0xff).
  const NO_DIV_BY_ZERO = 1;
  const mults = [
    minChannel / Math.max(accum[0], NO_DIV_BY_ZERO),
    minChannel / Math.max(accum[1], NO_DIV_BY_ZERO),
    minChannel / Math.max(accum[2], NO_DIV_BY_ZERO)
  ];

  // And brighten the whole image as much as possible
  const brightest = Math.max(
    bright[0] * mults[0],
    bright[1] * mults[1],
    bright[2] * mults[2]
  );

  if (brightest < 0xff) {
    mults[0] *= (0xff / brightest);
    mults[1] *= (0xff / brightest);
    mults[2] *= (0xff / brightest);
  }

  // Now apply the color correction to the pixel data:
  for (let i = 0; i < this.canvas.width; i++) {
    for (let j = 0; j < this.canvas.height; j++) {
      const index = (j * this.canvas.width + i) * 4;
      this.pixels[index] *= mults[0];
      this.pixels[index + 1] *= mults[1];
      this.pixels[index + 2] *= mults[2];
    }
  }
}
 

Device-specific camera differences were also a key sticking point. Independent of operating system, the variability in the colors that different phone camera hardware picks up is dramatic. Phones from the same manufacturer can have large variances between versions. This led us to do as much color correction as possible inside the application itself, along with applying a radial blur to the canvas we used for analysis. In some cases, issues with camera quality or device speed were difficult to overcome, so we had to find a way to identify those situations and give the user some helpful information about why we might not be getting a match right away.

Many of these speed bumps can be attributed to the fact that we were creating a web-based AR experience as opposed to an in-app (native) experience. While it is possible to implement AR experiences on the web, there are far more limitations and performance challenges. Things such as depth, object and surface perception, and light recognition are a lot more difficult to do well on the web, and avoiding unacceptably slow performance was a major challenge.

These limitations required workarounds. In our case, the workaround we chose was the unique color bar “hints”, but by the end of the project we had identified a number of other approaches we could have explored. For example, we might have used text recognition of unique text in and around the photos, or we could have trained a deep learning model for more reliable image identification. Each of these approaches comes with its own set of design constraints and technical challenges.

This project gave our team an opportunity to dig into some really interesting technologies, all in the service of creating a dynamic user experience with compelling and quirky content at its core. We made some project-defining technical decisions early on, and then learned how the implications of those decisions played out over the course of the project. Though we’re very happy with the results, it’s also a lot of fun to consider how we might approach a similar challenge in the future knowing what we know now. 

For more about the design, content, and user experience of this project, check out the other case study.
