This blog is licensed under a Creative Commons License.

Unicode support on the cheap

I’d been avoiding adding full Unicode support to Ledger for some time, since both times I tried, it ended up in a veritable spaghetti of changes throughout the code, which it seemed would take forever to “prove”. One branch I started used libICU to handle Unicode strings throughout, while an earlier attempt used regular wide-string support in C++. Both were left on the cutting-room floor.

Where the lack of Unicode support hurts is when Ledger tries to output elided columnar data, such as in the register report. The problem is that there is no way to know the length of a string without determining exactly how many code points exist in that UTF8 string; the byte count alone is not enough. And without knowing the length, it’s impossible to get columns to line up, or to know exactly where a string can be cut in two without breaking a multibyte UTF8 character apart.

Anyway, I discovered a cheap solution today which did the job: Convert strings from UTF8 to UTF32 only when individual character lengths matter, and convert them back after that work is done. This took about one hour to implement, but now Ledger is able to justify columns correctly, even when other alphabets are used! It still doesn’t work for right-to-left alphabets, though.
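
To make the idea concrete, here is a minimal sketch of the convert, measure, convert-back approach. It is not Ledger’s actual code: the names `utf8_to_utf32`, `utf32_to_utf8`, and `fit_column` are hypothetical, the `".."` elision marker is an assumption, and the decoder assumes well-formed UTF8 input.

```cpp
#include <cstddef>
#include <string>

// Decode UTF8 into UTF32 code points. Assumes well-formed input; a
// production decoder would also reject overlong and truncated sequences.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        // Fold in the six payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encode UTF32 code points back into UTF8.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Truncate (marking the elision with "..") or right-pad with spaces so the
// result is exactly `width` code points long, then return it as UTF8.
std::string fit_column(const std::string& utf8, std::size_t width) {
    std::u32string s = utf8_to_utf32(utf8);
    if (s.size() > width) {
        if (width >= 2) { s.resize(width - 2); s += U".."; }
        else            { s.resize(width); }
    }
    s.append(width - s.size(), U' ');  // appends nothing if already full
    return utf32_to_utf8(s);
}
```

With this sketch, `fit_column("caf\xC3\xA9", 6)` pads the four code points of “café” with two spaces, even though the input is five bytes long, and a cut never lands in the middle of a multibyte sequence because truncation happens on whole code points.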

2 Comments

Converting to UTF32 is only an approximation. If the text contains combining characters (which can happen even for Western European languages when the text is in certain Unicode normalization forms), then a single letter will occupy several code points. It is wrong to break a string between a character and its combining characters. There are also double-width characters.

You’re very right, Dmitry. I’d forgotten about composed characters (such as occur in vowelized Arabic, left and right). I’m just hoping nobody uses such strings for their Ledger account names. :)
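
The combining-character caveat raised in this exchange can be illustrated with a small sketch. The function names `code_points` and `approx_columns` are hypothetical, and `approx_columns` only skips the Combining Diacritical Marks block (U+0300 to U+036F); it is an assumption-laden approximation that ignores other combining ranges and double-width characters, for which something like POSIX `wcwidth()` or ICU would be needed.

```cpp
#include <cstddef>
#include <string>

// Count code points: every byte that is not a UTF8 continuation byte
// (bit pattern 10xxxxxx) starts a new code point.
std::size_t code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}

// Approximate display columns: like code_points(), but do not count code
// points in U+0300..U+036F, which encode as CC 80..CC BF and CD 80..CD AF.
std::size_t approx_columns(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) == 0x80) continue;  // continuation byte
        unsigned char next =
            i + 1 < s.size() ? static_cast<unsigned char>(s[i + 1]) : 0;
        bool combining = (b == 0xCC && next >= 0x80 && next <= 0xBF) ||
                         (b == 0xCD && next >= 0x80 && next <= 0xAF);
        if (!combining) ++n;
    }
    return n;
}
```

Here the precomposed “é” (`"\xC3\xA9"`) is one code point, while the decomposed form (`"e\xCC\x81"`, an “e” followed by U+0301) is two code points that still occupy a single column, which is exactly where a plain UTF32 length goes wrong.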

About this Entry

This page contains a single entry by John Wiegley published on January 23, 2009 3:57 PM.
