My Storytime: Voice UI and State Management
Article written by Hannah Cin, Senior Developer
My Storytime is a new Google Experiment web application which allows users to record stories for their family to play back on their Google Assistant devices. The front-end is a React-based web application that presented a few challenges, but what I’d like to highlight is the way the backend interacts with the Google Assistant APIs.
The Google Assistant APIs work like this:
First, a user opens your app by requesting it by name on a Google Assistant device.
The Assistant sends a request containing the name of an event to the HTTP endpoint we built. These events can be user-defined or one of the built-in events. In the case of the first-run experience, the endpoint receives a “Welcome” event asking how we would like to reply to users opening the app.
We reply by returning a single SSML tag. SSML (Speech Synthesis Markup Language) is a markup language for describing speech, which the Assistant then renders back to the user in a spoken voice.
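To make the request and reply step concrete, here is a minimal sketch of an event handler that answers with SSML. The `AssistantEvent` shape and the `handleEvent`, `speak` and `escapeXml` names are assumptions for illustration only; the real Assistant payload is richer, and this is not the project's actual endpoint code. The only part taken from the SSML standard is the `<speak>` root element.

```typescript
// Hypothetical, simplified shape of an incoming event (assumption for this sketch).
interface AssistantEvent {
  name: string;
}

// Escape characters that are special in XML so arbitrary text stays valid SSML.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Wrap plain text in <speak>, the root element of an SSML document.
function speak(text: string): string {
  return `<speak>${escapeXml(text)}</speak>`;
}

// Hypothetical handler: choose a spoken reply based on the event name.
function handleEvent(event: AssistantEvent): string {
  switch (event.name) {
    case "Welcome":
      return speak("Welcome to My Storytime. Which story would you like to hear?");
    default:
      return speak("Sorry, I didn't catch that.");
  }
}
```

Note that the handler sees only one event at a time, which is exactly the limitation discussed next.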
At this point, the connection is closed. The next time the user speaks to the Assistant, that request is compared to a database of events that can be fired if the Assistant recognizes the phrase. That event is sent as before, and we continue the request and response loop until the user is done using the app.
The important thing here is that we do not receive any information about the whole conversation, only individual events. Voice UI is incredibly context-dependent. If the user says “Go back,” we need to remember the whole conversation to understand how to implement that logic. The Google Assistant APIs provide a single way of passing data between invocations called “Context” (JSON data) which will be available to read and write during each request.
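As a rough illustration of that round trip, the sketch below reads a JSON context at the start of a request and writes an updated one into the response. The `AssistantRequest`, `AssistantResponse` and `ConversationContext` shapes are invented for this example and do not match the real Assistant payloads; the point is only that whatever we want to remember has to survive a trip through JSON.

```typescript
// Hypothetical data we want to remember between requests (assumption).
interface ConversationContext {
  events: string[];
}

// Hypothetical, simplified request and response shapes (assumptions).
interface AssistantRequest {
  event: string;
  context?: string; // JSON produced by the previous response, if any
}

interface AssistantResponse {
  ssml: string;
  context: string; // JSON to hand back on the next request
}

// Parse the stored context, falling back to an empty one if it is missing or malformed.
function readContext(raw?: string): ConversationContext {
  if (!raw) return { events: [] };
  try {
    const parsed = JSON.parse(raw);
    return Array.isArray(parsed?.events) ? { events: parsed.events } : { events: [] };
  } catch {
    return { events: [] };
  }
}

// Each request starts from nothing except what the context carries.
function handle(request: AssistantRequest): AssistantResponse {
  const previous = readContext(request.context);
  const next: ConversationContext = { events: [...previous.events, request.event] };
  return {
    ssml: `<speak>I have heard ${next.events.length} event(s) so far.</speak>`,
    context: JSON.stringify(next),
  };
}
```

Passing the `context` string of one response into the next request is the only thing that links the two calls together.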
Managing and reasoning about highly context-dependent state is always hard. The additional constraint that the final state must be a plain JSON object means we need to keep things simple. We could have used something analogous to Redux, which is essentially just an event stream that operates on a JSON object, but in my opinion, Redux is too low-level a solution when working with lots of state.
My go-to tools for these situations are state machines. I can’t help myself. I love them. I wrote this one for jQuery 10 years ago! State machines let you declare that at every point the system is in a known state and can reply to certain events in specific ways depending on that state. Rather than searching a JSON value for clues as to the current state, we can simply say “we are in the ‘Welcome’ state.” I find that naming what the current combination of data means helps me reason through the business logic. It is also very easy to diagram a state machine when communicating how it works.
States are simply functions that accept an ‘action’ which is the event we want to apply to the current state. The action is an object which provides a ‘type’ key to differentiate itself from other action types. It is very similar to a Redux action in format.
States return one or more side-effects (or a Promise of one or more side-effects), which are simply functions that will be called in the order they were generated at the end of the state transition.
States can be ‘enter’-ed by sending the ‘Enter’ action. Here is an example of a simple state which says hello to the user upon entering.
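A minimal sketch of such a state might look like the following. This is not Druyan's actual API: the `Action`, `Effect` and `State` types and the `say` and `run` helpers are hypothetical names made up for this illustration, and the sketch is synchronous, so it leaves out the Promise case mentioned above.

```typescript
// Actions are plain objects distinguished by their `type` key.
interface Action {
  type: string;
}

// A side-effect is a function the runtime calls once the state has decided what to do.
type Effect = () => void;

// A state is a function from an action to one or more side-effects.
type State = (action: Action) => Effect | Effect[];

// Collected SSML replies stand in for the HTTP response body in this sketch.
const replies: string[] = [];

// Hypothetical helper: build a side-effect that queues a spoken reply.
const say = (text: string): Effect => () => {
  replies.push(`<speak>${text}</speak>`);
};

// The Welcome state greets the user when it is entered.
const Welcome: State = (action) => {
  if (action.type === "Enter") {
    return say("Hello! Welcome to My Storytime.");
  }
  return []; // ignore actions this state does not handle
};

// Hypothetical runtime: send an action to a state, then run its side-effects in order.
function run(state: State, action: Action): void {
  const result = state(action);
  const effects = Array.isArray(result) ? result : [result];
  for (const effect of effects) {
    effect();
  }
}

run(Welcome, { type: "Enter" });
```

Because the state only returns descriptions of work, it stays a pure function that is easy to test; the runtime decides when the effects actually run.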
Behind the scenes, we store a history of the states we have entered and the current data associated with each state. In this example, there is no associated data, but we’ll get there.
Next, we will add a new event which will let the user pick a story to listen to.
When the Google Assistant API detects the user asking for a story it knows about, it will send the ‘PickStory’ event. We then respond by playing it (via an SSML command).
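Sketched in code, handling ‘PickStory’ could look like this. As before, the types and helpers (`say`, `playStory`, `run`) are hypothetical and not Druyan's real API, and the `stories` catalogue with its example.com URL is invented for the example. The `<audio>` element is standard SSML; I am assuming here that it is how a recording would be played back.

```typescript
// Actions are a discriminated union keyed on `type`.
type Action =
  | { type: "Enter" }
  | { type: "PickStory"; story: string };

type Effect = () => void;
type State = (action: Action) => Effect | Effect[];

// Collected SSML replies stand in for the HTTP response body.
const replies: string[] = [];

const say = (text: string): Effect => () => {
  replies.push(`<speak>${text}</speak>`);
};

// Hypothetical catalogue mapping story names to recorded audio (made-up URL).
const stories: Record<string, string> = {
  "The Three Bears": "https://example.com/audio/three-bears.mp3",
};

// Side-effect that replies with an SSML <audio> element, or an apology if the story is unknown.
const playStory = (story: string): Effect => () => {
  const url = stories[story];
  replies.push(
    url
      ? `<speak><audio src="${url}">${story}</audio></speak>`
      : `<speak>Sorry, I could not find ${story}.</speak>`
  );
};

const Welcome: State = (action) => {
  switch (action.type) {
    case "Enter":
      return say("Hello! Which story would you like to hear?");
    case "PickStory":
      return playStory(action.story);
    default:
      return []; // ignore anything else
  }
};

// Hypothetical runtime: run the side-effects a state returns, in order.
function run(state: State, action: Action): void {
  const result = state(action);
  const effects = Array.isArray(result) ? result : [result];
  for (const effect of effects) {
    effect();
  }
}

run(Welcome, { type: "Enter" });
run(Welcome, { type: "PickStory", story: "The Three Bears" });
```

Typing the action as a union means the compiler knows `action.story` exists only in the ‘PickStory’ branch.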
Now, let’s say we want to have the user confirm that we heard them correctly. Many stories have similar names, especially series of books. Because the connection is closed after every response, we need a way to wait for a specific response. We handle this by moving into a new state while we wait for the event.
There are a bunch of new things happening here. When we want to move to a new state, we simply ‘return’ calling that function and pass data that the new state will need access to (in this case, the name of the ‘story’ being requested).
This ‘Confirm’ state now takes three different events: ‘Enter’, ‘Yes’ (to confirm that it is the correct story) and ‘No’ to let us know we misheard. ‘Yes’ plays the story as before. ‘No’ uses the history of the conversation to re-enter the previous state: ‘Welcome’.
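The confirm-and-go-back flow described above can be sketched as follows. Again, this is an illustration under my own assumptions rather than Druyan's real API: `defineState`, `goBack`, `dispatch`, `stateHistory` and the rest are hypothetical names. The history is kept as plain data (state name plus the data it was entered with), so it could be serialized into the JSON context between requests.

```typescript
type Action =
  | { type: "Enter" }
  | { type: "PickStory"; story: string }
  | { type: "Yes" }
  | { type: "No" };

type Effect = () => void;
// JSON-friendly record of a state and the data it was entered with.
interface StateEntry { state: string; data: string[] }
interface Back { back: true }
type Result = Effect | StateEntry | Back;
type Handler = (action: Action, ...data: string[]) => Result;

const replies: string[] = [];
const handlers: Record<string, Handler> = {};
const stateHistory: StateEntry[] = [];

const say = (text: string): Effect => () => { replies.push(`<speak>${text}</speak>`); };
const goBack = (): Back => ({ back: true });

// Register a handler under a name; calling the returned function requests a transition.
function defineState(name: string, handler: Handler): (...data: string[]) => StateEntry {
  handlers[name] = handler;
  return (...data) => ({ state: name, data });
}

const Confirm = defineState("Confirm", (action, story) => {
  switch (action.type) {
    case "Enter": return say(`Did you ask for ${story}?`);
    case "Yes": return say(`Playing ${story}.`);
    case "No": return goBack();
    default: return say("Please say yes or no.");
  }
});

const Welcome = defineState("Welcome", (action) => {
  switch (action.type) {
    case "Enter": return say("Hello! Which story would you like to hear?");
    case "PickStory": return Confirm(action.story); // move to Confirm, passing the story
    default: return say("Sorry, I didn't catch that.");
  }
});

// Send an action to the current state, then either run its effect or change state.
function dispatch(action: Action): void {
  const current = stateHistory[stateHistory.length - 1];
  const result = handlers[current.state](action, ...current.data);
  if (typeof result === "function") {
    result();
    return;
  }
  if ("back" in result) {
    if (stateHistory.length > 1) stateHistory.pop(); // drop the current state
  } else {
    stateHistory.push(result);
  }
  dispatch({ type: "Enter" }); // enter whichever state is now on top
}

stateHistory.push(Welcome());
dispatch({ type: "Enter" });
dispatch({ type: "PickStory", story: "The Three Bears" });
dispatch({ type: "No" });
```

After the ‘No’, the history contains only ‘Welcome’ again and the greeting is repeated, which is the “re-enter the previous state” behaviour in miniature.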
And that’s Druyan: a very simple, functional state machine built in TypeScript. Learn more at the project repository on GitHub.
The web front-end is also built using Druyan and the project’s React bindings. Across the front-end and back-end, we have 75 distinct states. Having a way of reasoning about all those states has been invaluable.