Unicode

struct Unicode {

}

Members

Static functions

ucs4ToUtf16 wchar* ucs4ToUtf16(dchar* str, glong len, glong itemsRead, glong itemsWritten): Convert a string from UCS-4 to UTF-16. A 0 character will be added to the result after the converted text.
ucs4ToUtf8 string ucs4ToUtf8(dchar* str, glong len, glong itemsRead, glong itemsWritten): Convert a string from a 32-bit fixed width representation as UCS-4 to UTF-8. The result will be terminated with a 0 byte.
unicharBreakType GUnicodeBreakType unicharBreakType(dchar c): Determines the break type of c. c should be a Unicode character (to derive a character from UTF-8 encoded text, use g_utf8_get_char()). The break type is used to find word and line breaks ("text boundaries"). Pango implements the Unicode boundary resolution algorithms, and normally you would use a function such as pango_break() instead of caring about break types yourself.
unicharCombiningClass int unicharCombiningClass(dchar uc): Determines the canonical combining class of a Unicode character.
unicharCompose bool unicharCompose(dchar a, dchar b, dchar ch): Performs a single composition step of the Unicode canonical composition algorithm.
unicharDecompose bool unicharDecompose(dchar ch, dchar a, dchar b): Performs a single decomposition step of the Unicode canonical decomposition algorithm.
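As a rough illustration of what a single canonical composition step looks like, the sketch below uses Phobos' std.uni.compose instead of the binding, so it runs without GLib. The helper name tryCompose and its out parameter are hypothetical; they only mirror the bool-returning shape of the entries above.

```d
import std.uni : compose;

/// Hypothetical helper: tries to compose `a` followed by `b` into one
/// precomposed character. Returns true and stores the result on success.
bool tryCompose(dchar a, dchar b, out dchar composed)
{
    // std.uni.compose returns dchar.init when the pair does not compose.
    composed = compose(a, b);
    return composed != dchar.init;
}

void main()
{
    dchar result;
    // 'A' + U+0308 COMBINING DIAERESIS composes to U+00C4.
    assert(tryCompose('A', '\u0308', result) && result == '\u00C4');
    // Two plain letters have no canonical composition.
    assert(!tryCompose('A', 'B', result));
}
```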
unicharDigitValue int unicharDigitValue(dchar c): Determines the numeric value of a character as a decimal digit.
unicharFullyDecompose size_t unicharFullyDecompose(dchar ch, bool compat, dchar result, size_t resultLen): Computes the canonical or compatibility decomposition of a Unicode character. For compatibility decomposition, pass TRUE for compat; for canonical decomposition pass FALSE for compat.
unicharGetMirrorChar bool unicharGetMirrorChar(dchar ch, dchar* mirroredCh): In Unicode, some characters are "mirrored". This means that their images are mirrored horizontally in text that is laid out from right to left. For instance, "(" would become its mirror image, ")", in right-to-left text.
unicharGetScript GUnicodeScript unicharGetScript(dchar ch): Looks up the GUnicodeScript for a particular character (as defined by Unicode Standard Annex #24). No check is made for ch being a valid Unicode character; if you pass in an invalid character, the result is undefined.
unicharIsalnum bool unicharIsalnum(dchar c): Determines whether a character is alphanumeric. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
unicharIsalpha bool unicharIsalpha(dchar c): Determines whether a character is alphabetic (i.e. a letter). Given some UTF-8 text, obtain a character value with g_utf8_get_char().
unicharIscntrl bool unicharIscntrl(dchar c): Determines whether a character is a control character. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
unicharIsdefined bool unicharIsdefined(dchar c): Determines if a given character is assigned in the Unicode standard.
unicharIsdigit bool unicharIsdigit(dchar c): Determines whether a character is numeric (i.e. a digit). This covers ASCII 0-9 and also digits in other languages/scripts. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
unicharIsgraph bool unicharIsgraph(dchar c): Determines whether a character is printable and not a space (returns FALSE for control characters, format characters, and spaces). g_unichar_isprint() is similar, but returns TRUE for spaces. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
unicharIslower bool unicharIslower(dchar c): Determines whether a character is a lowercase letter. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
unicharIsmark bool unicharIsmark(dchar c): Determines whether a character is a mark (non-spacing mark, combining mark, or enclosing mark in Unicode speak). Given some UTF-8 text, obtain a character value with g_utf8_get_char().
unicharIsprint bool unicharIsprint(dchar c): Determines whether a character is printable. Unlike g_unichar_isgraph(), returns TRUE for spaces. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
unicharIspunct bool unicharIspunct(dchar c): Determines whether a character is punctuation or a symbol. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
unicharIsspace bool unicharIsspace(dchar c): Determines whether a character is a space, tab, or line separator (newline, carriage return, etc.). Given some UTF-8 text, obtain a character value with g_utf8_get_char().
unicharIstitle bool unicharIstitle(dchar c): Determines if a character is titlecase. Some characters in Unicode which are composites, such as the DZ digraph, have three case variants instead of just two. The titlecase form is used at the beginning of a word where only the first letter is capitalized. The titlecase form of the DZ digraph is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.
unicharIsupper bool unicharIsupper(dchar c): Determines if a character is uppercase.
unicharIswide bool unicharIswide(dchar c): Determines if a character is typically rendered in a double-width cell.
unicharIswideCjk bool unicharIswideCjk(dchar c): Determines if a character is typically rendered in a double-width cell under legacy East Asian locales. If a character is wide according to g_unichar_iswide(), then it is also reported wide with this function, but the converse is not necessarily true. See the Unicode Standard Annex #11 for details.
unicharIsxdigit bool unicharIsxdigit(dchar c): Determines if a character is a hexadecimal digit.
unicharIszerowidth bool unicharIszerowidth(dchar c): Determines if a given character typically takes zero width when rendered. The return value is TRUE for all non-spacing and enclosing marks (e.g., combining accents), format characters, zero-width space, but not U+00AD SOFT HYPHEN.
unicharToUtf8 int unicharToUtf8(dchar c, char[] outbuf): Converts a single character to UTF-8.
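To make the single-character conversion concrete, here is a minimal sketch that encodes one code point with Phobos' std.utf.encode rather than the binding. The helper name encodeOne is hypothetical. It shows that the encoded length varies from one to four bytes depending on the code point.

```d
import std.utf : encode;

/// Hypothetical helper: returns the UTF-8 encoding of one code point.
string encodeOne(dchar c)
{
    char[4] buf;                     // UTF-8 needs at most 4 bytes
    immutable len = encode(buf, c);  // number of bytes actually written
    return buf[0 .. len].idup;
}

void main()
{
    assert(encodeOne('A').length == 1);          // ASCII
    assert(encodeOne('\u00E9').length == 2);     // e with acute accent
    assert(encodeOne('\u20AC').length == 3);     // euro sign
    assert(encodeOne('\U0001F600').length == 4); // outside the BMP
    assert(encodeOne('\u00E9') == "\u00E9");
}
```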
unicharTolower dchar unicharTolower(dchar c): Converts a character to lower case.
unicharTotitle dchar unicharTotitle(dchar c): Converts a character to titlecase.
unicharToupper dchar unicharToupper(dchar c): Converts a character to uppercase.
unicharType GUnicodeType unicharType(dchar c): Classifies a Unicode character by type.
unicharValidate bool unicharValidate(dchar ch): Checks whether ch is a valid Unicode character. Some possible integer values of ch will not be valid. 0 is considered a valid character, though it's normally a string terminator.
unicharXdigitValue int unicharXdigitValue(dchar c): Determines the numeric value of a character as a hexadecimal digit.
unicodeCanonicalDecomposition dchar* unicodeCanonicalDecomposition(dchar ch, size_t* resultLen): Computes the canonical decomposition of a Unicode character.
unicodeCanonicalOrdering void unicodeCanonicalOrdering(dchar* string_, size_t len): Computes the canonical ordering of a string in-place. This rearranges decomposed characters in the string according to their combining classes. See the Unicode manual for more information.
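The reordering by combining class can be sketched in plain D. The version below is an illustrative reimplementation, not the library's code: it looks up classes with Phobos' std.uni.combiningClass and uses a stable insertion sort so that marks with equal classes keep their relative order. The function name canonicalOrder is hypothetical.

```d
import std.algorithm.mutation : swap;
import std.uni : combiningClass;

/// Hypothetical helper: sorts each run of combining marks in place by
/// ascending combining class. Characters with class 0 never move.
void canonicalOrder(dchar[] text)
{
    foreach (i; 1 .. text.length)
    {
        immutable cc = combiningClass(text[i]);
        if (cc == 0)
            continue; // starters act as barriers
        // Bubble the mark left past marks with a strictly higher class.
        size_t j = i;
        while (j > 0 && combiningClass(text[j - 1]) > cc)
        {
            swap(text[j], text[j - 1]);
            --j;
        }
    }
}

void main()
{
    // U+0301 (class 230) precedes U+0323 (class 220): out of order.
    dchar[] text = "a\u0301\u0323"d.dup;
    canonicalOrder(text);
    assert(text == "a\u0323\u0301"d);
}
```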
unicodeScriptFromIso15924 GUnicodeScript unicodeScriptFromIso15924(uint iso15924): Looks up the Unicode script for iso15924. ISO 15924 assigns four-letter codes to scripts. For example, the code for Arabic is 'Arab'. This function accepts four letter codes encoded as a guint32 in a big-endian fashion. That is, the code expected for Arabic is 0x41726162 (0x41 is ASCII code for 'A', 0x72 is ASCII code for 'r', etc).
unicodeScriptToIso15924 uint unicodeScriptToIso15924(GUnicodeScript script): Looks up the ISO 15924 code for script. ISO 15924 assigns four-letter codes to scripts. For example, the code for Arabic is 'Arab'. The four letter codes are encoded as a guint32 by this function in a big-endian fashion. That is, the code returned for Arabic is 0x41726162 (0x41 is ASCII code for 'A', 0x72 is ASCII code for 'r', etc).
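The big-endian packing described in the two entries above is simple arithmetic. The following sketch builds and decodes such a value in plain D; the helper names packIso15924 and unpackIso15924 are hypothetical and not part of the binding.

```d
/// Hypothetical helper: packs a four-letter code such as "Arab" into a
/// uint, first letter in the most significant byte.
uint packIso15924(string code)
{
    assert(code.length == 4, "ISO 15924 codes have exactly four letters");
    uint packed = 0;
    foreach (char ch; code)
        packed = (packed << 8) | ch;
    return packed;
}

/// Hypothetical helper: the inverse of packIso15924.
string unpackIso15924(uint packed)
{
    char[4] letters;
    foreach (i; 0 .. 4)
        letters[i] = cast(char)((packed >> (24 - 8 * i)) & 0xFF);
    return letters[].idup;
}

void main()
{
    assert(packIso15924("Arab") == 0x41726162);
    assert(unpackIso15924(0x41726162) == "Arab");
}
```

A value built this way has the form that unicodeScriptFromIso15924 expects, and a value returned by unicodeScriptToIso15924 can be decoded the same way.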
utf16ToUcs4 dchar* utf16ToUcs4(wchar* str, glong len, glong itemsRead, glong itemsWritten): Convert a string from UTF-16 to UCS-4. The result will be nul-terminated.
utf16ToUtf8 string utf16ToUtf8(wchar* str, glong len, glong itemsRead, glong itemsWritten): Convert a string from UTF-16 to UTF-8. The result will be terminated with a 0 byte.
utf8Casefold string utf8Casefold(string str, ptrdiff_t len): Converts a string into a form that is independent of case. The result will not correspond to any particular case, but can be compared for equality or ordered with the results of calling g_utf8_casefold() on other strings.
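For comparison, Phobos offers a case-insensitive comparison based on full case folding, std.uni.icmp. The sketch below uses it to show the kind of equality that case folding enables; it does not call the binding, and the helper name equalIgnoringCase is hypothetical.

```d
import std.uni : icmp;

/// Hypothetical helper: true when the two strings are equal under
/// full Unicode case folding.
bool equalIgnoringCase(string a, string b)
{
    return icmp(a, b) == 0;
}

void main()
{
    assert(equalIgnoringCase("HELLO", "hello"));
    // Full case folding maps U+00DF (sharp s) to "ss".
    assert(equalIgnoringCase("Ru\u00DFland", "Russland"));
    assert(!equalIgnoringCase("hello", "help"));
}
```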
utf8Collate int utf8Collate(string str1, string str2): Compares two strings for ordering using the linguistically correct rules for the current locale. When sorting a large number of strings, it will be significantly faster to obtain collation keys with g_utf8_collate_key() and compare the keys with strcmp() when sorting instead of sorting the original strings.
utf8CollateKey string utf8CollateKey(string str, ptrdiff_t len): Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp().
utf8CollateKeyForFilename string utf8CollateKeyForFilename(string str, ptrdiff_t len): Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp().
utf8FindNextChar string utf8FindNextChar(string p, string end): Finds the start of the next UTF-8 character in the string after p.
utf8FindPrevChar string utf8FindPrevChar(string str, string p): Given a position p within a UTF-8 encoded string str, find the start of the previous UTF-8 character starting before p. Returns NULL if no UTF-8 characters are present in str before p.
utf8GetChar dchar utf8GetChar(string p): Converts a sequence of bytes encoded as UTF-8 to a Unicode character.
utf8GetCharValidated dchar utf8GetCharValidated(string p, ptrdiff_t maxLen): Convert a sequence of bytes encoded as UTF-8 to a Unicode character. This function checks for incomplete characters, for invalid characters such as characters that are out of the range of Unicode, and for overlong encodings of valid characters.
utf8MakeValid string utf8MakeValid(string str, ptrdiff_t len): If the provided string is valid UTF-8, return a copy of it. If not, return a copy in which bytes that could not be interpreted as valid Unicode are replaced with the Unicode replacement character (U+FFFD).
utf8Normalize string utf8Normalize(string str, ptrdiff_t len, GNormalizeMode mode): Converts a string into canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character. The string has to be valid UTF-8, otherwise NULL is returned. You should generally call g_utf8_normalize() before comparing two Unicode strings.
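The reason for normalizing before comparison can be shown with Phobos' std.uni.normalize, used here in place of the binding. The helper name canonicallyEqual is hypothetical. A precomposed character and its base-plus-accent spelling differ byte for byte but compare equal after normalization.

```d
import std.uni;

/// Hypothetical helper: compares two strings after bringing both to
/// normalization form NFC.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00E9"; // single code point
    string decomposed = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT

    assert(precomposed != decomposed);                // raw bytes differ
    assert(canonicallyEqual(precomposed, decomposed));
    assert(normalize!NFD(precomposed) == decomposed);
}
```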
utf8OffsetToPointer string utf8OffsetToPointer(string str, glong offset): Converts from an integer character offset to a pointer to a position within the string.
utf8PointerToOffset glong utf8PointerToOffset(string str, string pos): Converts from a pointer to position within a string to an integer character offset.
utf8PrevChar string utf8PrevChar(string p): Finds the previous UTF-8 character in the string before p.
utf8Strchr string utf8Strchr(string p, ptrdiff_t len, dchar c): Finds the leftmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes. If len is -1, allow unbounded search.
utf8Strdown string utf8Strdown(string str, ptrdiff_t len): Converts all Unicode characters in the string that have a case to lowercase. The exact manner that this is done depends on the current locale, and may result in the number of characters in the string changing.
utf8Strlen glong utf8Strlen(string p, ptrdiff_t max): Computes the length of the string in characters, not including the terminating nul character. If the max'th byte falls in the middle of a character, the last (partial) character is not counted.
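The byte-limit rule can be illustrated with a small counting loop. This is a sketch of the described behavior written against Phobos' std.utf.stride, not the library's implementation; the name countChars is hypothetical, and the input is assumed to be valid UTF-8.

```d
import std.utf : stride;

/// Hypothetical helper: counts the characters that fit completely
/// within the first maxBytes bytes of s.
size_t countChars(const(char)[] s, size_t maxBytes)
{
    size_t count = 0;
    size_t i = 0;
    while (i < s.length && i < maxBytes)
    {
        immutable step = stride(s, i); // byte length of the char at i
        if (i + step > maxBytes)
            break;                     // partial character: not counted
        i += step;
        ++count;
    }
    return count;
}

void main()
{
    string s = "h\u00E9llo"; // 5 characters, 6 bytes
    assert(s.length == 6);
    assert(countChars(s, s.length) == 5);
    assert(countChars(s, 2) == 1); // byte 2 falls inside the 2-byte char
    assert(countChars(s, 3) == 2);
}
```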
utf8Strncpy string utf8Strncpy(string dest, string src, size_t n): Like the standard C strncpy() function, but copies a given number of characters instead of a given number of bytes. The src string must be valid UTF-8 encoded text. (Use g_utf8_validate() on all text before trying to use UTF-8 utility functions with it.)
utf8Strrchr string utf8Strrchr(string p, ptrdiff_t len, dchar c): Find the rightmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes. If len is -1, allow unbounded search.
utf8Strreverse string utf8Strreverse(string str, ptrdiff_t len): Reverses a UTF-8 string. str must be valid UTF-8 encoded text. (Use g_utf8_validate() on all text before trying to use UTF-8 utility functions with it.)
utf8Strup string utf8Strup(string str, ptrdiff_t len): Converts all Unicode characters in the string that have a case to uppercase. The exact manner that this is done depends on the current locale, and may result in the number of characters in the string increasing. (For instance, the German ess-zet will be changed to SS.)
utf8Substring string utf8Substring(string str, glong startPos, glong endPos): Copies a substring out of a UTF-8 encoded string. The substring will contain end_pos - start_pos characters.
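Character offsets differ from byte offsets as soon as the text contains non-ASCII characters. The sketch below, which uses Phobos' std.utf.toUTFindex and a hypothetical helper named substringByChars, shows the offset-to-position mapping that a character-based substring relies on. Unlike the entry above, it returns a slice of the original string instead of a copy.

```d
import std.utf : toUTFindex;

/// Hypothetical helper: returns the characters in [startPos, endPos),
/// where both positions are character offsets, not byte offsets.
string substringByChars(string s, size_t startPos, size_t endPos)
{
    immutable from = toUTFindex(s, startPos); // byte index of startPos
    immutable to = toUTFindex(s, endPos);     // byte index of endPos
    return s[from .. to];
}

void main()
{
    string s = "h\u00E9llo w\u00F6rld";
    // Characters 1 to 4: four characters, five bytes.
    assert(substringByChars(s, 1, 5) == "\u00E9llo");
    assert(substringByChars(s, 0, 1) == "h");
}
```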
utf8ToUcs4 dchar* utf8ToUcs4(string str, glong len, glong itemsRead, glong itemsWritten): Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4. A trailing 0 character will be added to the string after the converted text.
utf8ToUcs4Fast dchar* utf8ToUcs4Fast(string str, glong len, glong itemsWritten): Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4, assuming valid UTF-8 input. This function is roughly twice as fast as g_utf8_to_ucs4() but does no error checking on the input. A trailing 0 character will be added to the string after the converted text.
utf8ToUtf16 wchar* utf8ToUtf16(string str, glong len, glong itemsRead, glong itemsWritten): Convert a string from UTF-8 to UTF-16. A 0 character will be added to the result after the converted text.
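The three encodings handled by the conversion functions differ in how many code units one character needs. A short sketch with Phobos' std.utf converters (not the binding) makes this visible for a character outside the Basic Multilingual Plane.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

void main()
{
    string utf8 = "\U0001D11E";      // MUSICAL SYMBOL G CLEF
    wstring utf16 = toUTF16(utf8);
    dstring utf32 = toUTF32(utf8);

    assert(utf8.length == 4);  // four bytes in UTF-8
    assert(utf16.length == 2); // a surrogate pair in UTF-16
    assert(utf32.length == 1); // one unit in UTF-32 / UCS-4

    // Converting back yields the original text.
    assert(toUTF8(utf16) == utf8);
    assert(toUTF8(utf32) == utf8);
}
```

Note that D strings carry their length, so these Phobos results are not terminated by a 0 character, whereas the pointer-returning functions above are documented as appending one.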
utf8Validate bool utf8Validate(string str, string end): Validates UTF-8 encoded text. str is the text to validate; if str is nul-terminated, then max_len can be -1, otherwise max_len should be the number of bytes to validate. If end is non-NULL, then the end of the valid range will be stored there (i.e. the start of the first invalid character if some bytes were invalid, or the end of the text being validated otherwise).
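A comparable check in pure D, useful for experimenting without the binding, wraps Phobos' std.utf.validate, which throws on malformed input. The helper name isValidUtf8 is hypothetical, and unlike the function above this sketch does not report where the valid range ends.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true when bytes form well-formed UTF-8.
bool isValidUtf8(const(char)[] bytes)
{
    try
    {
        validate(bytes); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    assert(isValidUtf8("h\u00E9llo"));

    // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
    immutable(ubyte)[] raw = [0xC3, 0x28];
    assert(!isValidUtf8(cast(const(char)[]) raw));
}
```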
utf8ValidateLen bool utf8ValidateLen(string str, string end): Validates UTF-8 encoded text.
