06.02.07
WordPress 2.2 and the importance of doing Unicode right, from the start
WordPress 2.2 came out a few weeks ago, and sometime recently the Fantastico service (offered by my hosting provider to perform auto-installs and upgrades to common web applications) alerted me that 2.2 was available for upgrade. Tonight I took the plunge.
Every previous WordPress upgrade (a half-dozen security updates to WordPress 2.0, the 2.0.x to 2.1 upgrade, and a few 2.1.x security updates) went without a hitch. Not so 2.2. After the upgrade, I noticed that the lovely افكار و احلام and every curly quote, ellipsis, and ☃ had been turned into gibberish (and what’s worse, retyping them didn’t fix anything).
It turns out that the database storing all the “information” for this journal thinks that everything is using the old Latin-1 (Western European) character set for determining which 1s and 0s form which letter, while WordPress was writing UTF-8 (the most common way to encode Unicode, a character set that can represent all of the world’s languages at once) data to the database. This was pretty broken, but as long as WordPress and the database were both continuing to do the same thing, everything worked well.
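To make “which 1s and 0s form which letter” concrete, here is a minimal Python sketch. Python is just my choice for illustration; none of this is WordPress or database code, and the sample strings are made up. It shows the same word as Latin-1 bytes and as UTF-8 bytes, that Latin-1 has no byte at all for the snowman, and that UTF-8 bytes survive storage as long as nobody reinterprets them along the way.

```python
# Illustration only: plain Python, not WordPress or MySQL code.
word = "café"
snowman = "☃"

latin1_bytes = word.encode("latin-1")  # one byte per character
utf8_bytes = word.encode("utf-8")      # "é" takes two bytes in UTF-8

print(latin1_bytes)
print(utf8_bytes)

# Latin-1 simply has no byte for the snowman.
try:
    snowman.encode("latin-1")
    snowman_fits_latin1 = True
except UnicodeEncodeError:
    snowman_fits_latin1 = False

# If the UTF-8 bytes are stored and handed back untouched, the label on
# the storage does not matter: decoding them as UTF-8 recovers the text.
stored = (word + snowman).encode("utf-8")
round_tripped = stored.decode("utf-8")
print(round_tripped == word + snowman)
```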
With 2.2, WordPress apparently took a big step toward using Unicode/UTF-8 as its basis (thereby heading off future problems and easing the lives of anyone writing in non-Roman scripts, or in languages using multiple scripts, like English and Arabic or English and Chinese). Unfortunately, it seems that the Fantastico upgrade assumed that all existing WordPress installs were set up with a UTF-8 database instead of a Latin-1 database, and updated some configuration information on that assumption. WordPress and the database were no longer doing the same broken thing; they were, essentially, talking in different languages. So I got gibberish.
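Here is a toy Python model of that mismatch. It is a sketch under assumptions, not how WordPress or the database is actually implemented: the `database_read` function is hypothetical, and it only imitates the idea that a database converts stored text from the character set it believes the column uses into the character set the connection asks for.

```python
# Toy model only: database_read is a hypothetical stand-in, not a real API.
def database_read(stored: bytes, column_charset: str, connection_charset: str) -> bytes:
    """Decode stored bytes per the column's label, re-encode for the connection."""
    return stored.decode(column_charset).encode(connection_charset)

text = "don’t"
stored = text.encode("utf-8")  # UTF-8 bytes sitting in a column labeled Latin-1

# Before the upgrade: both sides say Latin-1, so the bytes pass through
# unchanged and the application (which treats them as UTF-8) sees its text.
before = database_read(stored, "latin-1", "latin-1")
print(before.decode("utf-8"))

# After the upgrade: the connection asks for UTF-8, so each stored byte is
# "converted" as if it were a Latin-1 character. The result is gibberish.
after = database_read(stored, "latin-1", "utf-8")
print(repr(after.decode("utf-8")))
```

In this model nothing in storage changes between the two reads; only the label on the connection does, which would also explain why the data could be rescued by a configuration change alone.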
Fortunately, there seems to be a (fragile?) simple hack to “fix” the problem; the solution was Google’s first hit (now there’s technology working for you!) and the gibberish is banished.
The real solution is to perform some complex magic on the database to tell it, “hey, your data is already UTF-8; now let’s actually claim to be UTF-8 without messing things up,” and then revert the hack. Until someone bundles up that magic as a WordPress plugin, I’ll live with the hack.
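The difference between converting the data and merely relabeling it can be sketched in a few lines of Python. This is an illustration of the idea only, with made-up sample text, not the actual database procedure. As I understand it, the usual MySQL route is to pass each column through a binary type so that the bytes get a new label instead of being converted, but I have not shown or tested that here.

```python
# Illustration only: models the idea, not the real database migration.
text = "☃ …"
stored = text.encode("utf-8")  # correct UTF-8 bytes under a wrong Latin-1 label

# Wrong: "convert Latin-1 to UTF-8". Every byte is treated as a Latin-1
# character and re-encoded, which double-encodes the text.
converted = stored.decode("latin-1").encode("utf-8")

# Right: leave the bytes alone and only change what they are called.
relabeled = stored

print(relabeled.decode("utf-8") == text)   # the relabeled data reads correctly
print(converted.decode("utf-8") == text)   # the converted data does not

# The double-encoding is reversible, because Latin-1 maps every byte value
# to exactly one character: undo the extra step and the original returns.
repaired = converted.decode("utf-8").encode("latin-1")
print(repaired == stored)
```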
As an aside, there’s something to be said for installing everything yourself; there’s a steep learning curve and it’s often time-consuming, but that is often paid back in time saved when something goes wrong. Back in the late 80s my dad was starting his business and running it on Xenix (before SCO became evil), and he installed and reinstalled the OS over and over to get things right. Once he did, he could do it in his sleep when something needed to change or be fixed. The current system is a Solaris box configured and installed by a VAR, and last winter after Sun sent a guy out to replace a soon-to-be-dying hard disk and took the old one after leaving us with a completely blank replacement, it took 12-14 long and painful hours, with several hours of phone and email support from Sun (yay support contracts!), to get Solaris up and running again.
If WordPress (or Fantastico; I’m not exactly sure where the database configuration choice originated) had set the database up using a Unicode encoding like UTF-8 to begin with, all of this would have been avoided. The moral of the story for software developers is this: always assume that users of your product will be entering data in many languages and scripts, so set your character set to Unicode and specify a Unicode encoding like UTF-8 from day 0. Doing so will save you and your users lots of pain over the years to come; dealing with “legacy” data for years afterwards is difficult, time-consuming, and no fun, and communication should be both easy and fun, all around the world.
(Final note to the Fantastico developers: if Fantastico can make an automatic backup of all the data before performing an upgrade, why can’t it also have a user-friendly “auto-revert” function that rolls back the upgrade and restores the backed-up data, instead of requiring the user to have ssh/shell access and providing a vague list of terminal commands to perform those same steps?)
June 2, 2007 at 3:23 pm
glad that you found the “hack” useful. Ya, Fantastico is not recommended to upgrade script installation.
June 24, 2007 at 1:39 pm
Thanks, Smokey - I normally don’t understand what you’re writing about here, but this was a real lifesaver!
(blogsaver?)
May 3, 2008 at 6:44 pm
[...] course, I also had to fix the DB_CHARSET in wp-config.php again after the [...]