Convert a string from UCS-4 to UTF-16. A 0 character will be added to the result after the converted text.
Convert a string from a 32-bit fixed width representation as UCS-4. to UTF-8. The result will be terminated with a 0 byte.
Determines the break type of c. c should be a Unicode character (to derive a character from UTF-8 encoded text, use g_utf8_get_char()). The break type is used to find word and line breaks ("text boundaries"), Pango implements the Unicode boundary resolution algorithms and normally you would use a function such as pango_break() instead of caring about break types yourself.
Determines the canonical combining class of a Unicode character.
Performs a single composition step of the Unicode canonical composition algorithm.
Performs a single decomposition step of the Unicode canonical decomposition algorithm.
Determines the numeric value of a character as a decimal digit.
Computes the canonical or compatibility decomposition of a Unicode character. For compatibility decomposition, pass TRUE for compat; for canonical decomposition pass FALSE for compat.
In Unicode, some characters are "mirrored". This means that their images are mirrored horizontally in text that is laid out from right to left. For instance, "(" would become its mirror image, ")", in right-to-left text.
Looks up the GUnicodeScript for a particular character (as defined by Unicode Standard Annex \[24|24]). No check is made for ch being a valid Unicode character; if you pass in invalid character, the result is undefined.
Determines whether a character is alphanumeric. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
Determines whether a character is alphabetic (i.e. a letter). Given some UTF-8 text, obtain a character value with g_utf8_get_char().
Determines whether a character is a control character. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
Determines if a given character is assigned in the Unicode standard.
Determines whether a character is numeric (i.e. a digit). This covers ASCII 0-9 and also digits in other languages/scripts. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
Determines whether a character is printable and not a space (returns FALSE for control characters, format characters, and spaces). g_unichar_isprint() is similar, but returns TRUE for spaces. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
Determines whether a character is a lowercase letter. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
Determines whether a character is a mark (non-spacing mark, combining mark, or enclosing mark in Unicode speak). Given some UTF-8 text, obtain a character value with g_utf8_get_char().
Determines whether a character is printable. Unlike g_unichar_isgraph(), returns TRUE for spaces. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
Determines whether a character is punctuation or a symbol. Given some UTF-8 text, obtain a character value with g_utf8_get_char().
Determines whether a character is a space, tab, or line separator (newline, carriage return, etc.). Given some UTF-8 text, obtain a character value with g_utf8_get_char().
Determines if a character is titlecase. Some characters in Unicode which are composites, such as the DZ digraph have three case variants instead of just two. The titlecase form is used at the beginning of a word where only the first letter is capitalized. The titlecase form of the DZ digraph is U+01F2 LATIN CAPITAL LETTTER D WITH SMALL LETTER Z.
Determines if a character is uppercase.
Determines if a character is typically rendered in a double-width cell.
Determines if a character is typically rendered in a double-width cell under legacy East Asian locales. If a character is wide according to g_unichar_iswide(), then it is also reported wide with this function, but the converse is not necessarily true. See the Unicode Standard Annex [11|11]
for details.
Determines if a character is a hexidecimal digit.
Determines if a given character typically takes zero width when rendered. The return value is TRUE for all non-spacing and enclosing marks (e.g., combining accents), format characters, zero-width space, but not U+00AD SOFT HYPHEN.
Converts a single character to UTF-8.
Converts a character to lower case.
Converts a character to the titlecase.
Converts a character to uppercase.
Classifies a Unicode character by type.
Checks whether ch is a valid Unicode character. Some possible integer values of ch will not be valid. 0 is considered a valid character, though it's normally a string terminator.
Determines the numeric value of a character as a hexidecimal digit.
Computes the canonical decomposition of a Unicode character.
Computes the canonical ordering of a string in-place. This rearranges decomposed characters in the string according to their combining classes. See the Unicode manual for more information.
Looks up the Unicode script for iso15924. ISO 15924 assigns four-letter codes to scripts. For example, the code for Arabic is 'Arab'. This function accepts four letter codes encoded as a guint32 in a big-endian fashion. That is, the code expected for Arabic is 0x41726162 (0x41 is ASCII code for 'A', 0x72 is ASCII code for 'r', etc).
Looks up the ISO 15924 code for script. ISO 15924 assigns four-letter codes to scripts. For example, the code for Arabic is 'Arab'. The four letter codes are encoded as a guint32 by this function in a big-endian fashion. That is, the code returned for Arabic is 0x41726162 (0x41 is ASCII code for 'A', 0x72 is ASCII code for 'r', etc).
Convert a string from UTF-16 to UCS-4. The result will be nul-terminated.
Convert a string from UTF-16 to UTF-8. The result will be terminated with a 0 byte.
Converts a string into a form that is independent of case. The result will not correspond to any particular case, but can be compared for equality or ordered with the results of calling g_utf8_casefold() on other strings.
Compares two strings for ordering using the linguistically correct rules for the [current locale]setlocale. When sorting a large number of strings, it will be significantly faster to obtain collation keys with g_utf8_collate_key() and compare the keys with strcmp() when sorting instead of sorting the original strings.
Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp().
Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp().
Finds the start of the next UTF-8 character in the string after p.
Given a position p with a UTF-8 encoded string str, find the start of the previous UTF-8 character starting before p. Returns NULL if no UTF-8 characters are present in str before p.
Converts a sequence of bytes encoded as UTF-8 to a Unicode character.
Convert a sequence of bytes encoded as UTF-8 to a Unicode character. This function checks for incomplete characters, for invalid characters such as characters that are out of the range of Unicode, and for overlong encodings of valid characters.
If the provided string is valid UTF-8, return a copy of it. If not, return a copy in which bytes that could not be interpreted as valid Unicode are replaced with the Unicode replacement character (U+FFFD).
Converts a string into canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character. The string has to be valid UTF-8, otherwise NULL is returned. You should generally call g_utf8_normalize() before comparing two Unicode strings.
Converts from an integer character offset to a pointer to a position within the string.
Converts from a pointer to position within a string to an integer character offset.
Finds the previous UTF-8 character in the string before p.
Finds the leftmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes. If len is -1, allow unbounded search.
Converts all Unicode characters in the string that have a case to lowercase. The exact manner that this is done depends on the current locale, and may result in the number of characters in the string changing.
Computes the length of the string in characters, not including the terminating nul character. If the max'th byte falls in the middle of a character, the last (partial) character is not counted.
Like the standard C strncpy() function, but copies a given number of characters instead of a given number of bytes. The src string must be valid UTF-8 encoded text. (Use g_utf8_validate() on all text before trying to use UTF-8 utility functions with it.)
Find the rightmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes. If len is -1, allow unbounded search.
Reverses a UTF-8 string. str must be valid UTF-8 encoded text. (Use g_utf8_validate() on all text before trying to use UTF-8 utility functions with it.)
Converts all Unicode characters in the string that have a case to uppercase. The exact manner that this is done depends on the current locale, and may result in the number of characters in the string increasing. (For instance, the German ess-zet will be changed to SS.)
Copies a substring out of a UTF-8 encoded string. The substring will contain end_pos - start_pos characters.
Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4. A trailing 0 character will be added to the string after the converted text.
Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4, assuming valid UTF-8 input. This function is roughly twice as fast as g_utf8_to_ucs4() but does no error checking on the input. A trailing 0 character will be added to the string after the converted text.
Convert a string from UTF-8 to UTF-16. A 0 character will be added to the result after the converted text.
Validates UTF-8 encoded text. str is the text to validate; if str is nul-terminated, then max_len can be -1, otherwise max_len should be the number of bytes to validate. If end is non-NULL, then the end of the valid range will be stored there (i.e. the start of the first invalid character if some bytes were invalid, or the end of the text being validated otherwise).
Validates UTF-8 encoded text.