There's a lot to know about UTF-8 and Unicode. For my sins, I am maintaining a web site written in ColdFusion that has to support Western and Central European languages now and possibly Asian languages in the future.
If I'm to avoid the hell of multiple character sets (ISO-8859-1 and ISO-8859-2 for starters), there is little alternative to UTF-8. Fortunately, software support has improved a lot over the last couple of years. The days when Netscape Navigator 4.x and Windows NT 4 would ruin your day are gone.
With ColdFusion I ran into a strange issue. I could make the server send out pages as UTF-8:
<cfcontent type="text/html; charset=UTF-8">
Text stored as UTF-8 in the database and extracted through the JDBC connection displayed fine. UTF-8 text in the ColdFusion page itself, however, was being read as if it were ISO-8859-1, so "Página" would get turned into "PÃ¡gina". In other words, ColdFusion was treating the two bytes of one UTF-8 encoded character as two ISO-8859-1 characters and then converting those two characters into their UTF-8 equivalents. Disaster!
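To make the mechanism concrete, here is a minimal Python sketch of the same double encoding. It is not what ColdFusion runs internally, only an illustration of the byte-level effect; the variable names are made up for the example.

```python
# Reproduce the double encoding: UTF-8 bytes misread as ISO-8859-1,
# then encoded to UTF-8 a second time.
original = "Página"

utf8_bytes = original.encode("utf-8")        # á becomes the two bytes C3 A1
misread = utf8_bytes.decode("iso-8859-1")    # each byte is read as one character
double_encoded = misread.encode("utf-8")     # what ends up being sent to the browser

print(misread)                               # PÃ¡gina
print(len(utf8_bytes), len(double_encoded))  # 7 9

# The damage is reversible as long as no bytes were lost along the way.
repaired = double_encoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")
assert repaired == original
```

The round trip in the last step only works because ISO-8859-1 maps every byte value to a character, so nothing is dropped during the misreading.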
I could cheat and use explicit coding all over the place, for example &#382; for ž. That is hard to read and you have to escape the '#' character for ColdFusion. Better to somehow tell ColdFusion that its source file is already UTF-8 and does not need to be converted.
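For illustration only, this is roughly what the "explicit coding" workaround would involve if it were automated. The helper name `to_cfml_entities` is hypothetical, and the sketch assumes the text ends up somewhere ColdFusion evaluates '#' (so every '#' is doubled).

```python
def to_cfml_entities(text: str) -> str:
    """Replace non-ASCII characters with numeric character references,
    then double every '#' so ColdFusion treats it as a literal."""
    ascii_html = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return ascii_html.replace("#", "##")

print(to_cfml_entities("ž"))  # &##382;
```

The result is pure ASCII and therefore immune to encoding mix-ups, but as noted above it is not something you want to read or maintain by hand.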
Turns out the magic word is cfprocessingdirective.
<cfprocessingdirective suppressWhitespace="yes" pageEncoding="utf-8">
Placed at the top of the file, this tells ColdFusion that the source is already UTF-8, so it stops converting text that does not need to be converted. Intriguingly, you can mix and match, so one included file could be in UTF-8 (with the prophylactic cfprocessingdirective) whereas another could be in ISO-8859-1. It sounds messy, but at least in the short term it means I don't need to convert all my files at once.
Some quick notes on that conversion. I've been doing most of my work on Linux, so the following lines in my .profile really help:
LANG=en_US.UTF-8
export LANG
PuTTY has a Translation setting that allows me to say "Received data assumed to be in [...]" UTF-8. Cut & paste works.
Actual conversion is done by recode:
$ recode latin1..utf8 file1
$ recode latin2..utf8 file2
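Where recode is not available, the same in-place conversion can be approximated in a few lines of Python. This is a sketch, not a replacement for recode: the function name `recode_to_utf8` is made up, and the demo runs on a throwaway file rather than on real sources.

```python
import tempfile
from pathlib import Path

def recode_to_utf8(path, source_encoding):
    """Rewrite the file in place as UTF-8, decoding it from source_encoding."""
    p = Path(path)
    # Work on raw bytes so line endings are left untouched.
    text = p.read_bytes().decode(source_encoding)
    p.write_bytes(text.encode("utf-8"))

# Demo: write "ž" as ISO-8859-2, convert it, and read the bytes back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "file2"
    sample.write_bytes("ž".encode("iso-8859-2"))
    recode_to_utf8(sample, "iso-8859-2")
    converted = sample.read_bytes()

print(converted)  # b'\xc5\xbe'
```

Like recode, this must be run exactly once per file. Running it again on a file that is already UTF-8 produces the same double encoding that turned "Página" into "PÃ¡gina".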
For some reason recode compiled fine under CygWin but failed its tests miserably. No reason to investigate too much when Linux is so close at hand.
I am trying to use ColdFusion Verity for Italian. I have the same issue with "PÃ¡gina" that you mentioned in your article on UTF-8 in Cold Fusion.
Verity does not work in ColdFusion with UTF-8 encoded characters. So I need ISO-8859-1 encoding, but again I get the "PÃ¡gina" output. It kills the Verity collection.
Any help?
Sorry, can't help much with that. I don't use Verity myself (plain SQL queries work well enough). As you write, it seems Verity doesn't support UTF-8.
I guess one workaround would be to serve all your Italian pages as ISO-8859-1 and convert all your text to match. Depending on what else you have, this could be a gigantic pain. Sorry I don't have any better suggestions.
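Before committing to that conversion, it may be worth checking that the text actually fits in ISO-8859-1. A rough Python sketch follows; the helper name `unrepresentable` is hypothetical and the sample strings are made up.

```python
def unrepresentable(text, encoding="iso-8859-1"):
    """Return the distinct characters in text that the encoding cannot hold."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad

print(unrepresentable("Perché la città è così"))  # []
print(unrepresentable("Página, žába"))             # ['ž']
```

Italian accented letters all exist in ISO-8859-1, so purely Italian text should come back with an empty list. Anything Central European, such as ž, would be flagged and would need a numeric character reference or a different charset.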