How to Support Emojis
Emojis have become so ingrained in popular culture that it’s hard to imagine life without these little text-based communication tools. From sarcasm to joy, they help us express ourselves. We’ll give a brief overview of their origins and show you how you can support them when you’re building apps.
A dev history of emojis
Emojis and other graphical glyphs require the advanced encoding mechanism of Unicode. Unicode emerged as a new standard in 1991 when it became clear that ASCII (used previously for emoticons :-o) was not sufficient to convey the characters of languages such as Japanese and Russian.
愛 - This is the Japanese character for love. If you can see that character, this document (and the website hosting it) supports Unicode encoding.
The simplest form of Unicode encoding is called UCS-2, in which each character requires 16 bits. Original ASCII is a 7-bit code from the teletype era with 128 characters; the common 8-bit extensions of it allow for 256 different characters. With 16 bits, a UCS-2 character set can support 65,536 different characters.
UCS-2 quickly became insufficient for handling the world’s character sets, and the next generation emerged, called UCS-4, which uses 32 bits per character. Unicode does not use all 2^32 (over 4 billion) values of that 32-bit space: the standard limits the code space to U+0000 through U+10FFFF, which is 1,114,112 code points.
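As a rough illustration of why 16 bits are not enough for emojis, the following Python sketch checks whether a character's code point fits in a single 16-bit unit. The helper name `fits_in_ucs2` is made up for this example and is not part of any library.

```python
import sys

UCS2_MAX = 0xFFFF  # largest value a single 16-bit unit can hold


def fits_in_ucs2(ch: str) -> bool:
    """Return True if the character's code point fits in 16 bits."""
    return ord(ch) <= UCS2_MAX


print(fits_in_ucs2("愛"))            # a CJK character inside the 16-bit range
print(fits_in_ucs2("\U0001F600"))    # an emoji (U+1F600) outside the 16-bit range
print(hex(sys.maxunicode))           # highest code point Python (and Unicode) allows
print(len("\U0001F600".encode("utf-32-be")))  # bytes used by the fixed 32-bit form
```

The emoji lands above 0xFFFF, which is why a 16-bit-only encoding cannot represent it as a single unit.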
Once UCS-4 was established and adopted as the new standard the encoding scheme UTF-8 was introduced in order to more efficiently share Unicode encoded data.
UTF-8 is a variable-width encoding: the leading bits of the first byte indicate how many bytes make up a character, so simple Latin (ASCII) characters require only 8 bits, while less common characters take more, up to the 4 bytes needed for a single emoji code point.
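To make the variable width concrete, here is a small Python sketch that encodes a few sample characters and reports how many bytes each one takes in UTF-8. The sample characters and the helper name `utf8_length` are chosen only for illustration.

```python
# Compare how many bytes UTF-8 needs for characters from different ranges.
samples = ["A", "\u00e9", "愛", "\U0001F600"]  # Latin, accented Latin, CJK, emoji


def utf8_length(ch: str) -> int:
    """Return the number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s)")
```

Running this shows the byte count growing from 1 for the plain Latin letter to 4 for the emoji.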
HTML pages that will contain UTF-8 encoded Unicode characters should convey this to the browser through the UTF-8 meta header as follows:
<head>
  <meta charset="UTF-8">
</head>
Web service requests outside the scope of an HTML document (such as a REST request) should set the request and response headers as follows:
Content-type: application/json; charset=utf-8
Accept: application/json; charset=utf-8
This tells the web server to provide the response using JSON with UTF-8 Unicode character support.
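As a sketch of what such a request could look like from client code, the Python example below builds (but does not send) a JSON request with these headers using only the standard library. The URL `https://api.example.com/messages` and the function name `build_request` are hypothetical placeholders, not a real service or API.

```python
import json
import urllib.request

API_URL = "https://api.example.com/messages"  # hypothetical endpoint


def build_request(payload: dict) -> urllib.request.Request:
    """Build a POST request whose JSON body is encoded as UTF-8."""
    # ensure_ascii=False keeps emojis as real characters instead of \u escapes.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "Accept": "application/json; charset=utf-8",
        },
        method="POST",
    )


req = build_request({"text": "I \u2764 emoji \U0001F600"})
print(req.data)  # the raw UTF-8 bytes that would go over the wire
```

Encoding the body explicitly as UTF-8 keeps the bytes on the wire consistent with the charset declared in the header.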
How web servers deliver emojis
Although many web sites can be built from a set of static HTML pages, it’s often necessary to provide dynamic content. Web servers can read content data from a database and change it programmatically in real time.
A typical client server content lifecycle might look like the process depicted in the following diagram:
Emojis can only be rendered properly if text encoding is properly maintained throughout the entire content lifecycle.
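The Python sketch below shows, under simplified assumptions, what goes wrong when one stage of the lifecycle uses a different encoding: bytes written as UTF-8 but read back as Latin-1 turn into garbage, and a Latin-1-only stage cannot store the emoji at all. The variable names and the helper `survives_latin1` are illustrative only.

```python
original = "Party \U0001F389"          # text containing an emoji
wire_bytes = original.encode("utf-8")  # what a UTF-8 stage would send or store

# A stage that agrees on UTF-8 recovers the text exactly.
round_trip = wire_bytes.decode("utf-8")

# A stage that wrongly assumes Latin-1 produces garbled characters instead.
garbled = wire_bytes.decode("latin-1")


def survives_latin1(text: str) -> bool:
    """Return True if the text can be stored in a Latin-1-only stage."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(round_trip == original)     # encoding maintained end to end
print(garbled == original)        # encoding mismatch corrupts the emoji
print(survives_latin1(original))  # a Latin-1 stage cannot hold the emoji
```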
How to support emoji in your apps
Specific configuration is beyond the scope of this document, but it’s important to verify that your web server is capable of supporting UTF-8 encoding. Most servers now do this by default when the UTF-8 charset declaration is found in a page.
The Apache HTTP server can be configured to support UTF-8 by default by adding this to httpd.conf:
This will make sure your web server will properly deliver HTML and JSON in a UTF-8 encoded character set.
MySQL Database Setup
Many relational databases have not historically used UTF-8 encoding by default. Storage is part of the reason: a fixed-width Unicode encoding such as UCS-2 at least doubles the storage needed for simple text, although UTF-8 itself stores ASCII characters in a single byte. MySQL is an example: versions before 8.0 default to the latin1 character set, an 8-bit encoding that is sufficient for many business applications but cannot store emojis.
Oracle has very version-specific requirements for Unicode encoding. More recent versions of their server application support UTF-8 out of the box, but earlier versions (e.g. version 7) require special configuration or don’t support it at all.
MySQL can be configured to have UTF-8 be the default system encoding for all databases.
The MySQL server configuration can be found in the file my.cnf which is usually located in the system /etc folder (depending on OS). To make newly created databases automatically support UTF-8 encoding add this to my.cnf:
[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
It’s important to note the use of utf8mb4. MySQL’s older utf8 character set (also known as utf8mb3) stores at most 3 bytes per character, which covers only the Basic Multilingual Plane. utf8mb4 allows up to 4 bytes per character, which is necessary to store supplementary characters such as emojis.
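One way to see which text actually depends on utf8mb4 is to check whether any character needs more than 3 bytes in UTF-8. The Python sketch below does that; the function names `max_utf8_bytes` and `needs_utf8mb4` are invented for this example and are not part of MySQL or any client library.

```python
def max_utf8_bytes(text: str) -> int:
    """Return the largest UTF-8 byte length of any single character in text."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def needs_utf8mb4(text: str) -> bool:
    """Return True if text contains a character that takes 4 bytes in UTF-8."""
    return max_utf8_bytes(text) > 3


print(needs_utf8mb4("愛"))               # 3-byte character, fits a 3-byte column
print(needs_utf8mb4("hi \U0001F600"))    # emoji takes 4 bytes
```

A check like this could be run against sample application data before deciding whether a column needs to be converted.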
If your server is not configured to support UTF-8 then MySQL can still create databases that support the encoding:
CREATE DATABASE emoji_friendly_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
If you need to migrate an existing database that does not use UTF-8 into one that supports full UTF-8, there are many possible processes you could follow. I have found the following to be the easiest (for MySQL):
- Export your database using mysqldump.
- Drop the database.
- Shut down the database server.
- Configure the server to use utf8mb4.
- Start the database server again.
- Create the database using the CREATE DATABASE statement above.
- Use sed or a text editor to replace latin character set references in the dump file with utf8mb4.
- Import the .sql dump file using the MySQL client’s source command.
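The character set replacement step in the list above can also be sketched in Python instead of sed. This is only a rough sketch: it assumes the dump refers to the latin1 character set, the regular expressions and the function name `convert_dump_charset` are made up for this example, and a blunt text replacement like this would also rewrite matching strings inside your data, so review the result before importing.

```python
import re

# Assumption: the dump uses latin1. Matches "CHARSET=latin1",
# "CHARACTER SET latin1" and "SET NAMES latin1".
CHARSET_RE = re.compile(
    r"(CHARSET\s*=\s*|CHARACTER SET\s+|SET NAMES\s+)latin1\b", re.IGNORECASE
)
# Matches "COLLATE latin1_swedish_ci", "COLLATE=latin1_general_ci", etc.
COLLATE_RE = re.compile(r"(COLLATE\s*=?\s*)latin1_\w+", re.IGNORECASE)


def convert_dump_charset(sql: str) -> str:
    """Rewrite latin1 charset and collation references to utf8mb4."""
    sql = CHARSET_RE.sub(r"\1utf8mb4", sql)
    return COLLATE_RE.sub(r"\1utf8mb4_unicode_ci", sql)


sample = (
    "CREATE TABLE t (name varchar(50) CHARACTER SET latin1 "
    "COLLATE latin1_swedish_ci) ENGINE=InnoDB DEFAULT CHARSET=latin1;"
)
print(convert_dump_charset(sample))
```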
Running SHOW TABLE STATUS will list each table’s collation, which indicates whether your tables use the utf8mb4 character set.
Here are some ways to ensure your web application can properly support emojis:
- Add the UTF-8 charset meta tag to your HTML page headers.
- Web service requests should always include “Accept” and “Content-type” headers including charset: UTF-8. This applies to both XML and JSON requests.
- Make sure your web service system (e.g. Spring or JAX) is configured to handle UTF-8 requests. If the requestor indicates UTF-8 in the request headers this should be transparent on most modern systems.
- Configure your database to support full 4-byte UTF-8 encoding (utf8mb4 in MySQL).
Written by Tim Heider, Senior Developer
Illustrated by Sheri Smith