I hear your cries. What? Why? Who cares?
Let me take the questions in reverse order.
Anyone building a website which accepts user input, or input from other services, or which has users outside the US (that is, everyone!) should care about Unicode. They should care because when it's done right Unicode lets your website speak in many tongues. It enables your webpages to display the weird and wonderful characters that make up the alphabets of all of those non-English languages spoken outside the US and UK.
A good (or indeed bad) example of this came when, recently, I added the new Sigur Ros album to my glisty.com wishlist using glisty's bookmarklet to add the album from its Amazon product page. When I went to glisty to check out my list I noticed that it had mangled the Icelandic characters in the album name (urgh).
I knew that this sort of thing happens when you're not properly set up for Unicode, but I had very little idea how to fix that. I had thought that configuring my MySQL DB to use Unicode would be enough. Obviously not!
So I have spent the week fixing glisty for UTF-8 input and output and along the way learnt a few things I thought I might try to pass on. I would certainly not purport to be an expert, but these are some practical tips on the steps I undertook to make it work.
Getting ready for Unicode affects all of the components of your website, from JavaScript through to MySQL, so I'll take each in turn.
First stop - web pages
This is pretty easy. Make sure your web pages all have metadata to indicate that they are encoded as UTF-8. You can do this by adding the following to the head of each page:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
(If you are writing XHTML and serving it as XML you don't strictly need this, as XML defaults to UTF-8, but what's a bit of belt and braces amongst friends?)
When you do this, it tells the browser explicitly to treat the content of the page and any text submitted from a form on the page as UTF-8.
Next up - JavaScript
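The main area to worry about here is escaping data if you are using AJAX. There are three different escape functions in JavaScript:
escape()
encodeURI()
encodeURIComponent()
Only the last of these is the right tool for escaping a value that may contain Unicode. escape() does not produce UTF-8 at all, and encodeURI() does produce UTF-8 but leaves characters such as & and = unescaped, so it is only suitable for whole URLs. Make sure to use encodeURIComponent() everywhere you want to escape data using JavaScript. (True story: a rogue escape() kept me unproductive for almost an hour last night.)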
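To make the difference between the three functions concrete, here is a small sketch. The sample string and the buildQuery() helper are my own inventions for illustration, not part of glisty; the three escape functions themselves are standard JavaScript.

```javascript
// Made-up sample value containing a non-ASCII character (ó)
// and a reserved URL character (&).
const value = "Sigur Rós & friends";

// escape() writes ó as a single Latin-1 style byte (%F3), not as UTF-8.
const legacy = escape(value);

// encodeURI() writes ó as UTF-8 (%C3%B3) but leaves & untouched,
// which would split the value in two inside a query string.
const wholeUri = encodeURI(value);

// encodeURIComponent() writes ó as UTF-8 and escapes & as well.
const component = encodeURIComponent(value);

// Hypothetical helper: build a query string for an AJAX request,
// escaping every key and value with encodeURIComponent().
function buildQuery(params) {
  return Object.keys(params)
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(params[key]))
    .join("&");
}

console.log(legacy);    // Sigur%20R%F3s%20%26%20friends
console.log(wholeUri);  // Sigur%20R%C3%B3s%20&%20friends
console.log(component); // Sigur%20R%C3%B3s%20%26%20friends
console.log(buildQuery({ title: value })); // title=Sigur%20R%C3%B3s%20%26%20friends
```

The %C3%B3 in the last two results is the two-byte UTF-8 form of ó, which is what a UTF-8 server-side script will expect to receive.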
Then - PHP
If you are displaying user-entered data on your web pages, then it is a good idea to escape HTML in these strings to prevent anyone 'injecting' malicious code into your site. The way you do this in PHP is through the built-in htmlentities() function. It turns out that this function expects Latin-1 (ISO-8859-1) encoded data by default (this was the default before PHP 5.4), so it will mangle any Unicode characters it finds. On top of that, the encoding is the second of the two optional arguments, so you have to fill in the first (the quote style) as well. Basically, you should escape HTML using the following call:
htmlentities($yourString,ENT_COMPAT,"UTF-8");
I'm looking into whether you can set this as a global setting in php.ini, but haven't got an answer yet.
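As for why a Latin-1 default mangles the text: UTF-8 stores a character like ó as two bytes, and a function that assumes Latin-1 reads each byte as a separate character. The sketch below mimics that effect. It is written in JavaScript purely as an illustration and is not what htmlentities() does internally; readUtf8AsLatin1() is a hypothetical helper of my own.

```javascript
// Hypothetical helper: take the UTF-8 bytes of a string and read them back
// one byte per character, the way Latin-1-only code would see them.
function readUtf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return Array.from(bytes, (byte) => String.fromCharCode(byte)).join("");
}

const original = "Rós";
const mangled = readUtf8AsLatin1(original);

console.log(original.length); // 3
console.log(mangled.length);  // 4, because ó became two separate characters
console.log(mangled);         // RÃ³s
```

That "Ã³" pattern is the typical junk you see when UTF-8 data passes through something that assumes Latin-1.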
Finally - your MySQL DB
The character set of your DB should be set to UTF-8 (which MySQL calls utf8). Any text fields should use the utf8_unicode_ci collation and be typed as varchar (not char). This is all achieved easily in phpMyAdmin or via the command line. I'll let you figure out the details.
On top of this, you also need to tell MySQL that you are sending it Unicode data. That means adding a query to the start of all your submission scripts. I achieved this by adding it to the constructor function for my DB class in PHP. It looks something like this:
// Tell MySQL that this connection sends and expects UTF-8 data
mysql_query("SET NAMES 'utf8'");
I'm also looking into whether this can be changed in MySQL settings to avoid the overhead of an additional call.
And that's it. All relatively small things, but if any one of them is off, then you'll end up with junk. Hopefully this guide will help you avoid some of the pitfalls I have enjoyed over the last week...
By the way, as you can see from my list, glisty now handles Unicode characters!


