Refactor: libcrmcommon: Drop utf8_bytes()

Description

Much of the complexity of utf8_bytes() came from the fact that the C
standard doesn't fix the number of bits in a byte (CHAR_BIT may exceed
8). UTF-8 encodes each character as a sequence of 8-bit code units.
utf8_bytes() determined how many 8-bit units made up a UTF-8 character
and then converted that to a number of C bytes.
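
For illustration only, the first half of that job (counting the 8-bit
units in a character) can be sketched by inspecting the lead byte. This
is not the removed utf8_bytes() code; utf8_seq_len() is a hypothetical
name, and the sketch assumes one code unit per C byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Pacemaker's utf8_bytes(): return the number of
 * 8-bit code units in the UTF-8 sequence whose lead byte is *s, or 0 if
 * that byte cannot start a sequence. Only the lead byte's bit pattern is
 * inspected; the continuation bytes are not validated.
 */
size_t
utf8_seq_len(const char *s)
{
    unsigned char c;

    assert(s != NULL);
    c = (unsigned char) *s;

    if (c < 0x80) {
        return 1;   /* 0xxxxxxx: single-unit (ASCII) character */
    }
    if ((c & 0xE0) == 0xC0) {
        return 2;   /* 110xxxxx: lead byte of a two-unit sequence */
    }
    if ((c & 0xF0) == 0xE0) {
        return 3;   /* 1110xxxx: lead byte of a three-unit sequence */
    }
    if ((c & 0xF8) == 0xF0) {
        return 4;   /* 11110xxx: lead byte of a four-unit sequence */
    }
    return 0;       /* continuation byte or invalid lead byte */
}
```

On a platform with wider bytes, the result would still need converting
from 8-bit units to C bytes, which is the part the 8-bit char
requirement makes unnecessary.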

The previous commit requires an 8-bit char at build time, so one C byte
is now exactly one UTF-8 code unit. We can therefore use
g_utf8_next_char() to get a pointer to the next UTF-8 character in a
string. GLib determines the number of bytes to skip more efficiently (by
indexing into an array), and using it avoids reinventing the wheel and
adding clutter.
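
As a rough sketch of the array-indexing idea, the following plain C
steps through a string with a lookup table. It is a simplified stand-in,
not GLib's implementation: the 16-entry table indexed by the high nibble
and the names next_utf8_char() and count_utf8_chars() are assumptions
made for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

_Static_assert(CHAR_BIT == 8, "this sketch assumes an 8-bit char");

/* Bytes to skip, indexed by the high nibble of the lead byte.
 * Illustrative only; a production table would cover all 256 byte values.
 */
static const unsigned char skip_by_high_nibble[16] = {
    1, 1, 1, 1, 1, 1, 1, 1, /* 0x0_ to 0x7_: ASCII */
    1, 1, 1, 1,             /* 0x8_ to 0xB_: continuation; step one byte */
    2, 2,                   /* 0xC_ to 0xD_: two-byte sequence */
    3,                      /* 0xE_: three-byte sequence */
    4                       /* 0xF_: four-byte sequence */
};

/* Return a pointer to the next character. Like the approach it imitates,
 * this trusts the input: on a truncated sequence it can step past the
 * terminating NUL, so callers must pass well-formed UTF-8.
 */
const char *
next_utf8_char(const char *p)
{
    assert(p != NULL);
    return p + skip_by_high_nibble[(unsigned char) *p >> 4];
}

/* Count the characters (not bytes) in a NUL-terminated UTF-8 string */
size_t
count_utf8_chars(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s != '\0') {
        s = next_utf8_char(s);
        n++;
    }
    return n;
}
```

The lookup replaces a chain of bit tests with a single indexed load per
character.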

Note: the GLib docs recommend calling g_utf8_validate() on the string
before calling g_utf8_next_char(). However, all GLib functions assume
that strings are encoded as UTF-8 [1]. I suspect most of Pacemaker and
parts of libxml2 would fall apart if we received non-UTF-8 strings. Our
escape-XML functions don't seem like the place to start validating the
encoding.

[1] https://docs.gtk.org/glib/programming.html#utf-8-and-string-encoding

Closes T801

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>

Details

Provenance
nrwahl2 authored on Mar 30 2024, 6:15 AM
Parents
rPd5621b61a908: Build: configure: Require 8-bit char
Tasks
T801: Try to replace xml.c:utf8_bytes() with GLib UTF-8 functions