Thursday, June 26, 2008

This week I have been mostly learning about Unicode...


I hear your cries. What? Why? Who cares?
Let me take the questions in reverse order.
Anyone building a website which accepts user input, or input from other services, or which has users outside the US (that is, everyone!) should care about Unicode. They should care because when it's done right Unicode lets your website speak in many tongues. It enables your webpages to display the weird and wonderful characters that make up the alphabets of all of those non-English languages spoken outside the US and UK.
A good (or indeed bad) example of this came when, recently, I added the new Sigur Ros album to my glisty.com wishlist using glisty's bookmarklet to add the album from its Amazon product page. When I went to glisty to check out my list I noticed that it had mangled the Icelandic characters in the album name (urgh).
I knew that this sort of thing happens when you're not properly set up for Unicode, but I had very little idea how to fix that. I had thought that configuring my MySQL DB to use Unicode would be enough. Obviously not!
So I have spent the week fixing glisty for UTF-8 input and output and along the way learnt a few things I thought I might try to pass on. I would certainly not purport to be an expert, but these are some practical tips on the steps I undertook to make it work.

Getting ready for Unicode impacts all of the components of your website, from Javascript through to MySQL, so I'll take each in turn.

First stop - web pages
This is pretty easy. Make sure your web pages all have metadata to indicate that they contain Unicode. You can do this by adding:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

(if you are writing XHTML and serving it as XML you don't strictly need this, as XML defaults to UTF-8, but what's a bit of belt and braces amongst friends)
When you do this, it tells the browser explicitly to treat the content of the page and any text submitted from a form on the page as UTF-8.
Next up - Javascript
The main area to worry about here is escaping data if you are using AJAX. There are 3 different escape functions in Javascript:
escape()
encodeURI()
encodeURIComponent()
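Only the last of these is safe for escaping individual values: escape() does not produce UTF-8 at all, and encodeURI() leaves characters such as & and = untouched. Make sure to use encodeURIComponent() everywhere you want to escape data using Javascript (true story: a rogue escape() kept me unproductive for almost an hour last night).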
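Here's a quick way to see the difference for yourself, as a sketch you can paste into a browser console or Node.js. The sample string is just something I made up for illustration; the comments show what each built-in returns for it.

```javascript
// Sample text with a non-ASCII character and a reserved URL character.
const sample = "Rós & co";

// escape() is a legacy function: it emits Latin-1 style %XX codes, not UTF-8.
const legacy = escape(sample); // "R%F3s%20%26%20co"

// encodeURI() emits UTF-8 but leaves reserved characters such as & alone.
const wholeUri = encodeURI(sample); // "R%C3%B3s%20&%20co"

// encodeURIComponent() emits UTF-8 and also escapes &, =, ? and friends.
const component = encodeURIComponent(sample); // "R%C3%B3s%20%26%20co"

console.log(legacy, wholeUri, component);
```

The %F3 from escape() is what a server expecting UTF-8 will choke on, while the bare & from encodeURI() would split your value in two inside a query string.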
Then - PHP
If you are displaying user entered data on your webpages, then it is a good idea to escape html from these strings to prevent anyone 'injecting' malicious code into your site. The way you do this in PHP is through the built-in htmlentities() function.
It turns out that this function expects Latin-1 (ISO-8859-1) encoded data by default. It will mangle any Unicode characters it finds. On top of that, the encoding is the third, optional argument, so you have to supply a value for the second (the quote style) as well. Basically you should escape HTML using the following function:
htmlentities($yourString,ENT_COMPAT,"UTF-8");
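If you're wondering what the mangling actually is, here's a rough illustration. It's in Javascript rather than PHP and assumes Node.js (where Buffer is a global), so treat it as a sketch of the underlying effect and not of what htmlentities() does internally: UTF-8 bytes get read as if each byte were its own Latin-1 character.

```javascript
// Assumes Node.js, where Buffer is available as a global.
const original = "Rós";

// Encode the text as UTF-8 bytes: "ó" becomes the two bytes 0xC3 0xB3.
const utf8Bytes = Buffer.from(original, "utf8");

// Misread those bytes as Latin-1: each byte turns into its own character.
const mangled = utf8Bytes.toString("latin1");

// Read them back with the right encoding and the text survives.
const intact = utf8Bytes.toString("utf8");

console.log(mangled); // "RÃ³s"
console.log(intact);  // "Rós"
```

One accented letter turning into two odd-looking characters is the classic sign that something in the chain is treating UTF-8 as Latin-1.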

I'm looking into whether you can set this as a global parameter in PHP.ini but haven't got an answer yet.
Finally - your MySQL DB
The character set of your DB should be set to utf8. Any text fields should use the utf8_unicode_ci collation and be typed as varchar (not char).
This is all achieved easily in PHPMyAdmin or via the command line. I'll let you figure out the details.
On top of this, you also need to tell MySQL that you are sending it Unicode data. That means adding a query to the start of all your submission scripts. I achieved this by adding it to the constructor function for my DB class in PHP. It looks something like this:
mysql_query("SET NAMES 'utf8'");
I'm also looking into whether this can be changed in MySQL settings to avoid the overhead of an additional call.
And that's it. All relatively small things, but if any one of them is off, then you'll end up with junk. Hopefully this guide will help you avoid some of the pitfalls I have enjoyed over the last week...
By the way, as you can see from my list, glisty now handles unicode characters!

Radiohead at Victoria Park

The concert was truly excellent. Radiohead are a band at the height of their powers. 2 and a bit hours of splintered, beautiful music.

Sunday, June 22, 2008

Two beautiful proofs

I'm currently in the middle of my annual summer re-read of GEB. It's such a great book and it teaches me something new every time I read it. One of the things I love about it is the regular forays into Number Theory and the occasional review of a classic proof.

So that got me thinking, what is my favourite Number Theory proof? Well, I couldn't decide, so you're going to get my two favourites.

They're from different ends of the spectrum, but they have one thing in common (like all good proofs). They take a problem that looks forbidding and offer up a solution that is so elegant and simple that it makes you wonder why you were ever so puzzled.

Here's the first one:

Gauss' Sum of numbers from 1 to N

Gauss was an incredible mathematician and came up with many of the methods, proofs and theories that form the basis of modern mathematics. This particular proof is generally held to have been derived whilst Gauss was a mere scamp and still at school.

His teacher (feeling particularly lazy) had asked the boys in the class to add all of the numbers between 1 and 100, then write down the answer and bring it to him at his desk. When Gauss strolled up to the front after just a few minutes with a piece of paper reading 5050, the teacher was astonished (and, I daresay, a little annoyed).

So how did Gauss do it? Well like all good mathematicians, Gauss generalised the problem and found a common solution, before applying his general solution to the initial question. In that lesson Gauss was able to fathom that the formula to tell you the sum, S(N), of the numbers between 1 and N was:

S(N) = (N/2)(N+1)

and that therefore the sum of numbers between 1 and 100 was:

S(100) = (100/2)(100+1) = 5050

The proof?

Take the numbers from 1 to N and pair them up against the numbers from N to 1. You'll notice that the sum of every pair is N+1 and that you have N pairs. So the sum of all the numbers in your two series is N(N+1). However, you doubled up your initial series, 1 to N, by pairing it up with N to 1 so you should divide this sum by two and you will have (N/2)(N+1).

Example:

                                                  Sum
1 to 100:    1    2    3    4    5  ...  100  |   5050
100 to 1:  100   99   98   97   96  ...    1  |   5050
----------------------------------------------------------
Both:      101  101  101  101  101  ...  101  |  10100
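If you don't trust the pairing argument, you can always check it by brute force. Here's a small Javascript sketch (the function names are my own) that adds the numbers up the slow way and compares the result with the formula.

```javascript
// Add 1..n the slow way, one number at a time.
function sumByLoop(n) {
  let total = 0;
  for (let i = 1; i <= n; i++) {
    total += i;
  }
  return total;
}

// Gauss' shortcut: n pairs that each add up to n + 1, then halve.
function sumByFormula(n) {
  return (n * (n + 1)) / 2;
}

console.log(sumByLoop(100), sumByFormula(100)); // 5050 5050
```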

I first saw Gauss' proof of that formula when I was at school and ever since it has remained one of mathematics' little wonders to me. It felt like magic when I first saw it and it still does now.


The second result is a little more arcane, but the proof itself is no less magical than Gauss' above.

Cantor's Diagonal Proof

Cantor was a mathematician in the 19th century and he was really quite interested in infinite sets (to put it mildly...). He spent a lot of time trying to figure out a proper mathematical basis for infinity and in particular whether there was more than one type of infinity.

He was able to definitively prove the existence of more than one infinity by proving that there are fewer Natural Numbers (i.e. the numbers 1, 2, 3, 4, 5, ....) than there are Real Numbers (i.e. all numbers that can be represented by a decimal expansion like 0.1872459, 199.779, pi, e, ...), despite both sets clearly being infinite. His solution was ingenious and alarmingly simple.

First, suppose you could take all of the Real Numbers between 0 and 1 and match each of them to a Natural Number, hence:


R(1) = 0.13459872635...
R(2) = 0.23098175638...
R(3) = 0.58098903284...
R(4) = 0.98761834891...
....


(It is worth noting that the 'Reals' on the right must all be represented as their infinite decimal expansion).

Then he used this table to create a new number. From each of the original numbers he took the digit indexed by the Natural Number he had assigned to it (the 1st digit of R(1), the 2nd digit of R(2), the 3rd digit of R(3) and so on):


T = 0.1306...


He then added 1 to each of these digits to obtain (cycling any 9's back round to 0):


T' = 0.2417...


Now, T' is definitely a Real between 0 and 1, but at the same time it can be seen that it does not correspond to any of the R(N) above, as:


Its 1st digit is not the same as the 1st digit of R(1)
Its 2nd digit is not the same as the 2nd digit of R(2)
Its 3rd digit is not the same as the 3rd digit of R(3)
...


So if we matched all of the Natural Numbers to a Real equivalent between 0 and 1 and found another Real that had not been matched then there must be more Reals between 0 and 1 than there are Natural Numbers. That is, there must be an infinity that is bigger than the one you get by starting at 1 and counting up...
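The diagonal trick is mechanical enough to write down as code. Here's a Javascript sketch that works on a finite chunk of the table above; a real list would go on forever, so this only illustrates the construction and the function name is my own invention.

```javascript
// Build a number that differs from the i-th listed number at its i-th digit.
// Each entry holds the digits after "0." as a string.
function diagonalise(expansions) {
  return expansions
    .map((digits, i) => (Number(digits[i]) + 1) % 10) // add 1, a 9 wraps to 0
    .join("");
}

// The first four rows of the table, R(1) to R(4).
const listed = ["13459872635", "23098175638", "58098903284", "98761834891"];
const tPrime = diagonalise(listed);

console.log("0." + tPrime + "..."); // 0.2417...
```

However many rows you feed in, the number that comes out disagrees with row i at digit i, so it can never be one of the rows.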


I love both of these proofs and I hope you enjoyed my (no doubt flawed) explanation of them. Do you love a different proof? Have a reason why these shouldn't be my top two? Let me know in the comments.

Great customer service

Seen at a recent expedition to our local DIY store...

They have these buttons all over the place, particularly concentrated in areas where customers might be feeling they need a little guidance (tiles, paint, wood cutting). Every time I felt that little surge of stress when I realised I have no clue what I'm doing, I looked up and there was a big orange button.

When you think about it, it seems obvious. Most customers shopping for hardware need help at some point during their visit and there is nothing as painful as trundling round the store trying to find a staff member that isn't serving someone else.

What really impressed me is the way that every part of this solution seems just right:

1. The buttons are bright orange and green; they are placed in highly visible positions
2. The promise is simple - Need help? We'll give it to you in 2 mins or we'll give you a discount
3. The countdown dots help to make you:
(a) comfortable while you are waiting (help is on its way) and
(b) delighted when a member of staff beats your expectations by arriving while it is only a quarter of the way round.

It makes me so happy when I see companies really thinking about what they can do for their customers and then making it as easy as possible for their customers to ask for that. What upsets me is that I see it so rarely. For now, Homebase, you have a new evangelist!

Tuesday, June 17, 2008

What's the date?

I implemented some new features on glisty.com over the weekend (in fact it now has an almost complete set of features against my original spec - yippee).

One of the elements of this feature set was to enable users to set an end date for their gift lists. After that date (but not before!) they'll be able to see which items were bought for them by friends and family.

Dates on the web are a bit tricky as we pesky humans have invented all sorts of different ways to represent them. Month first (US), day first (UK), month as a word, year as two digits or four, dashes or slashes, the list is endless. I often wonder how many websites think my birthday is in October...

I looked into various options for glisty, from allowing users free form input and just doing my best with it at the back end (similar to Remember the Milk's cool interface) to putting in 3 explicit boxes for day, month and year using drop downs. In the end I decided on a simple in between. Pick a format, state that up front and reinforce it during entry, and finally use Javascript to parrot back my interpretation of your entry in a universally understandable form in real time.

Here are some screen grabs showing the flow.

State the format up front.



Reinforce it during entry.



My interpretation of what you input (is it correct?)


It's not perfect (could add support for the other punctuation types), but it's a good way to help users get their data into the system without any nasty surprises.

This sort of instant feedback loop is one of the things that Javascript (plus DOM) is really good at. Judicious use of it helps users and developers alike.
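For the curious, the parrot-back step can be sketched in a few lines of Javascript. This is not glisty's actual code: the DD/MM/YYYY format, the slash separator and the function name describeDate are all assumptions made for the example.

```javascript
const MONTHS = [
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December",
];

// Returns e.g. "3 October 2008" for "03/10/2008", or null if unreadable.
// Assumes day-first input separated by slashes (hypothetical format).
function describeDate(input) {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(input.trim());
  if (!match) return null;
  const [day, month, year] = match.slice(1).map(Number);

  // Let Date do the calendar arithmetic, then reject roll-overs like 31/02.
  const date = new Date(year, month - 1, day);
  const sameDay =
    date.getFullYear() === year &&
    date.getMonth() === month - 1 &&
    date.getDate() === day;

  return sameDay ? `${day} ${MONTHS[month - 1]} ${year}` : null;
}

console.log(describeDate("03/10/2008")); // 3 October 2008
console.log(describeDate("31/02/2008")); // null
```

On the page you would call something like this from the input's keyup handler and write the result next to the box, showing a gentle hint instead whenever it returns null.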

Thursday, June 12, 2008

Big Ideas: Don't get any - 'Nude' by Radiohead


Absolutely brilliant. Really, really great.