Found this message in my own commit logs for the console rubygem and I figured I’d post it here in case some other poor fool
is stuck writing code with mbrtowc:
Well this is a fun little feature of mbrtowc that I discovered! Give it one
bad input and it will barf on all future inputs for the rest of the program,
unless you pass in your own cleared shift state object.
The mbrtowc manpage is helpfully vague:
“If the multibyte string starting at s contains an invalid multibyte
sequence before the next complete character, mbrtowc() returns (size_t) -1 and
sets errno to EILSEQ. In this case, the effects on *ps are undefined.”
and
“If ps is a NULL pointer, a static anonymous state only known to the
mbrtowc function is used instead. Otherwise, *ps must be a valid mbstate_t
object.”
The reader is left to infer, of course, that the “undefined effects” on the
“static anonymous state only known to the mbrtowc function” amount to
“breaking all successive calls for the rest of the execution of the program”.
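For anyone who wants the workaround spelled out, here is a minimal sketch in C. It assumes you control the call site: keep your own mbstate_t, pass it on every call, and zero it after an EILSEQ so that one bad byte can't poison later conversions. The helper name decode_char is made up for illustration; it is not the code from the gem.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode one multibyte character from s (at most n
 * bytes) into *out, using the caller's own state object instead of the
 * hidden static state that mbrtowc uses when ps is NULL. */
size_t decode_char(wchar_t *out, const char *s, size_t n, mbstate_t *ps)
{
    assert(ps != NULL);   /* never fall back to the hidden static state */

    size_t r = mbrtowc(out, s, n, ps);
    if (r == (size_t)-1) {
        /* Invalid sequence: *ps is now undefined, so put it back into
         * the initial conversion state before it is used again. */
        memset(ps, 0, sizeof *ps);
    }
    return r;
}
```

The caller starts from a zeroed state (declare an mbstate_t and memset it to zero) and decides how far to skip after a failure; skipping a single byte and retrying is one common choice.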
Most programmers are by now familiar with the difference between the number of
bytes in a string and the number of characters. Depending on the string’s
encoding, the relationship between these two measures can be either trivially
computable or complicated and compute-heavy.
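To make the distinction concrete at the C level, here is a rough sketch that counts characters by walking the string with mbrtowc, while strlen gives the byte count. The name char_count is hypothetical, and the result depends on the current LC_CTYPE locale, which the program is assumed to have set with setlocale beforehand.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of characters in s under the current
 * LC_CTYPE locale, or (size_t)-1 if s is not validly encoded. */
size_t char_count(const char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);        /* private state, initial value */

    size_t left = strlen(s);          /* the byte count */
    size_t count = 0;
    while (left > 0) {
        size_t r = mbrtowc(NULL, s, left, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;        /* invalid or truncated input */
        if (r == 0)
            break;                    /* decoded a null character */
        s += r;                       /* advance by the bytes consumed */
        left -= r;
        count++;
    }
    return count;
}
```

For pure ASCII input the two measures agree; in a UTF-8 locale a CJK character typically contributes three bytes but only one character.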
With the advent of Ruby 1.9, the Ruby world at last has this distinction
formally encoded at the language level: String#bytesize is the number of
bytes in the string, and String#length and String#size the number of
characters.
But when you’re writing console applications, there’s a third measure you have
to worry about: the width of the string on the display. ASCII characters take
up one column when displayed on screen, but some non-ASCII characters, such as
Chinese, Japanese and Korean characters, can take up multiple columns. This
display width is not trivially computable from the byte size of the character.
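As a rough illustration of this third measure, the sketch below sums the POSIX wcwidth value of each decoded character. This is an assumption about one way to do it in C, not a description of how the console gem works; display_columns is a hypothetical name, and the widths reported depend on the locale set with setlocale and on the C library's tables.

```c
#define _XOPEN_SOURCE 700   /* for wcwidth; must come before the includes */
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: columns needed to display s, or -1 if s is not
 * validly encoded or contains a nonprintable character. */
int display_columns(const char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);        /* private state, initial value */

    size_t left = strlen(s);
    int cols = 0;
    while (left > 0) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s, left, &st);
        if (r == (size_t)-1 || r == (size_t)-2 || r == 0)
            return -1;                /* cannot decode the next character */
        int w = wcwidth(wc);
        if (w < 0)
            return -1;                /* nonprintable: no defined width */
        cols += w;
        s += r;                       /* advance by the bytes consumed */
        left -= r;
    }
    return cols;
}
```

Bytes, characters and columns are all tracked separately here: r counts bytes consumed, each loop iteration is one character, and w is that character's column count.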
Finding the display width of a string is critical to any kind of console
application that cares about the width of the screen, i.e. is not simply
printing stuff and letting the terminal wrap. Personally, I’ve been needing it
forever:
- Trollop needs it because it tries to format the help screen nicely.
- Sup needs it in a million places because it is a full-fledged console
  application and people use it for reading mail in all sorts of funny
  languages.
The actual mechanics of how to compute string width make for an interesting
lesson in UNIX archaeology, but suffice it to say that I’ve travelled the path
for you, with help from Tanaka Akira of pp fame, and I am happy to announce
the release of the Ruby console gem.
The console gem currently provides these two methods:
- Console.display_width: calculates the display width of a string.
- Console.display_slice: returns a substring according to display offset and
  display width parameters.
There is one horrible caveat outstanding, which is that I haven’t managed to
get it to work on Ruby 1.8. Patches to this effect are most welcome, as are,
of course, comments and suggestions.
Try it out!