Strings

A string is a sequence of characters such as "Hello, 🌍!" or "Simplify(👨‍🚀 + ⚡️) → 👨‍🎤".

In the Compute Engine, strings are composed of encoding-independent Unicode characters and provide access to those characters through a variety of Unicode representations.

In the Compute Engine, strings are not treated as collections. This is because the concept of a "character" is inherently ambiguous: a single user-perceived character (a grapheme cluster) may consist of multiple Unicode scalars, and those scalars may in turn be represented differently in various encodings. To avoid confusion and ensure consistent behavior, strings must be explicitly converted to a sequence of grapheme clusters or Unicode scalars when individual elements need to be accessed.

String(any*) -> string

A string created by joining its arguments. The arguments are converted to their default string representation.

["String", "Hello", ", ", "🌍", "!"]
// ➔ "Hello, 🌍!"

["String", 42, " is the answer"]
// ➔ "42 is the answer"

StringFrom(any, format:string?) -> string

Convert the argument to a string, using the specified format.

| format | Description |
| --- | --- |
| utf-8 | The argument is a list of UTF-8 code units (bytes) |
| utf-16 | The argument is a list of UTF-16 code units |
| unicode-scalars | The argument is a list of Unicode scalars (same as UTF-32) |

For example:

["StringFrom", [240, 159, 148, 159], "utf-8"]
// ➔ "🔟"

["StringFrom", [55357, 56607], "utf-16"]
// ➔ "🔟"

["StringFrom", [128287], "unicode-scalars"]
// ➔ "🔟"

["StringFrom", [127467, 127479], "unicode-scalars"]
// ➔ "🇫🇷"
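The three formats can be illustrated outside the Compute Engine with a minimal Python sketch. The helper name `string_from` is hypothetical and only mirrors the format names in the table; it is not the library's implementation.

```python
def string_from(values, fmt):
    """Decode a list of integers into a str (hypothetical helper)."""
    if fmt == "utf-8":
        # Each value is one byte (0..255).
        return bytes(values).decode("utf-8")
    if fmt == "utf-16":
        # Each value is one 16-bit code unit; the decoder combines
        # surrogate pairs into a single scalar.
        raw = b"".join(v.to_bytes(2, "big") for v in values)
        return raw.decode("utf-16-be")
    if fmt == "unicode-scalars":
        # Each value is one Unicode scalar value.
        return "".join(chr(v) for v in values)
    raise ValueError(f"unknown format: {fmt}")


keycap_ten = "\U0001F51F"  # the keycap-ten emoji, U+1F51F

# All three representations decode to the same string.
print(string_from([240, 159, 148, 159], "utf-8") == keycap_ten)      # True
print(string_from([55357, 56607], "utf-16") == keycap_ten)           # True
print(string_from([128287], "unicode-scalars") == keycap_ten)        # True
```

The same scalar needs four UTF-8 bytes, two UTF-16 code units, or one scalar value, which is why the format must be stated explicitly.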

Utf8(string) -> list<integer>

Return a list of UTF-8 code units (bytes) for the given string.

Note: The values returned are UTF-8 bytes, not Unicode scalar values.

["Utf8", "Hello"]
// ➔ [72, 101, 108, 108, 111]

["Utf8", "👩‍🎓"]
// ➔ [240, 159, 145, 169, 226, 128, 141, 240, 159, 142, 147]
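For comparison, the same byte values can be obtained in Python with `str.encode`. This is only a sketch to show where the numbers come from; `utf8_units` is a hypothetical name, not part of the Compute Engine.

```python
def utf8_units(s):
    """Return the UTF-8 bytes of s as a list of integers (hypothetical helper)."""
    return list(s.encode("utf-8"))


# Woman (U+1F469) + zero-width joiner (U+200D) + graduation cap (U+1F393)
woman_student = "\U0001F469\u200D\U0001F393"

print(utf8_units("Hello"))
# [72, 101, 108, 108, 111]
print(utf8_units(woman_student))
# [240, 159, 145, 169, 226, 128, 141, 240, 159, 142, 147]
```

The emoji takes 4 + 3 + 4 bytes: four for each of the two emoji scalars and three for the joiner.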

Utf16(string) -> list<integer>

Return a list of UTF-16 code units for the given string.

Note: The values returned are UTF-16 code units, not Unicode scalar values.

["Utf16", "Hello"]
// ➔ [72, 101, 108, 108, 111]

["Utf16", "👩‍🎓"]
// ➔ [55357, 56425, 8205, 55356, 57235]
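Scalars above U+FFFF do not fit in one 16-bit unit, so UTF-16 splits them into a surrogate pair. The arithmetic can be sketched in Python as follows; `utf16_units` is a hypothetical helper written for illustration, not the Compute Engine's code.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers (hypothetical helper)."""
    units = []
    for ch in s:
        cp = ord(ch)
        if cp < 0x10000:
            # Scalars in the Basic Multilingual Plane are a single code unit.
            units.append(cp)
        else:
            # Other scalars are split into a high and a low surrogate,
            # each carrying 10 bits of (cp - 0x10000).
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))
            units.append(0xDC00 + (offset & 0x3FF))
    return units


print(utf16_units("Hello"))        # [72, 101, 108, 108, 111]
print(utf16_units("\U0001F51F"))   # [55357, 56607]
```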

UnicodeScalars(string) -> list<integer>

A Unicode scalar is any valid Unicode code point, represented as a number between U+0000 and U+10FFFF, excluding the surrogate range (U+D800 to U+DFFF). In other words, Unicode scalars correspond exactly to UTF-32 code units.

This function returns the sequence of Unicode scalars (code points) that make up the string. Note that some characters perceived as a single visual unit (grapheme clusters) may consist of multiple scalars. For example, the emoji 👩‍🚀 is a single grapheme but is composed of several scalars.

["UnicodeScalars", "Hello"]
// ➔ [72, 101, 108, 108, 111]

["UnicodeScalars", "👩‍🎓"]
// ➔ [128105, 8205, 127891]
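Python strings are themselves sequences of code points, so a short sketch can show the scalars next to the sizes of the other representations. The helper name `unicode_scalars` is hypothetical.

```python
def unicode_scalars(s):
    """Return the code points of s as a list of integers (hypothetical helper)."""
    return [ord(ch) for ch in s]


# Woman (U+1F469) + zero-width joiner (U+200D) + graduation cap (U+1F393)
woman_student = "\U0001F469\u200D\U0001F393"

print(unicode_scalars(woman_student))             # [128105, 8205, 127891]
print(len(woman_student.encode("utf-8")))         # 11 UTF-8 bytes
print(len(woman_student.encode("utf-16-le")) // 2)  # 5 UTF-16 code units
print(len(unicode_scalars(woman_student)))        # 3 scalars
```

One user-perceived character therefore has a length of 11, 5, or 3 depending on the representation chosen.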

GraphemeClusters(string) -> list<string>

A grapheme cluster is the smallest unit of text that a reader perceives as a single character. It may consist of one or more Unicode scalars (code points).

For example, the character é can be a single scalar (U+00E9) or a sequence of scalars (e U+0065 + combining acute U+0301), but both form a single grapheme cluster.

The precomposed form is called NFC (Normalization Form C); the decomposed form, which uses combining marks, is called NFD (Normalization Form D).

Similarly, complex emojis (👩‍🚀, 🇫🇷) are grapheme clusters composed of multiple scalars.

The exact definition of grapheme clusters is determined by the Unicode Standard (UAX #29) and may evolve over time as new characters, scripts, or emoji sequences are introduced. In contrast, Unicode scalars and their UTF-8, UTF-16, or UTF-32 encodings are fixed and stable across Unicode versions.

The table below illustrates the difference between grapheme clusters and Unicode scalars:

| String | Grapheme Clusters | Unicode Scalars (Code Points) |
| --- | --- | --- |
| é (NFC) | ["é"] | [233] |
| é (NFD) | ["é"] | [101, 769] |
| 👩‍🎓 | ["👩‍🎓"] | [128105, 8205, 127891] |

In contrast, a Unicode scalar is a single code point in the Unicode standard, corresponding to a UTF-32 value. Grapheme clusters are built from one or more scalars.
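The NFC and NFD scalar sequences for é can be reproduced with Python's standard `unicodedata` module. This sketch only illustrates normalization; it does not show how the Compute Engine segments grapheme clusters.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single scalar (NFC form)
decomposed = "e\u0301"   # e followed by combining acute accent (NFD form)

# The two strings render identically but are different scalar sequences.
print(precomposed == decomposed)  # False

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print([ord(ch) for ch in nfc])  # [233]
print([ord(ch) for ch in nfd])  # [101, 769]
```

Comparing strings scalar by scalar can give surprising results unless both sides are normalized to the same form first.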

This function splits a string into grapheme clusters, not scalars.

["GraphemeClusters", "Hello"]
// ➔ ["H", "e", "l", "l", "o"]

["GraphemeClusters", "👩‍🎓"]
// ➔ ["👩‍🎓"]

["UnicodeScalars", "👩‍🎓"]
// ➔ [128105, 8205, 127891]

For more details on how grapheme cluster boundaries are determined, see Unicode® Standard Annex #29.

BaseForm(value:integer) -> string

BaseForm(value:integer, base:integer) -> string

Format an integer in a specific base, such as hexadecimal or binary.

If no base is specified, use base-10.

The sign of the integer is ignored.

  • value should be an integer.
  • base should be an integer from 2 to 36.

["Latex", ["BaseForm", 42, 16]]
// ➔ (\text{2a})_{16}

Latex(BaseForm(42, 16))
// ➔ (\text{2a})_{16}

String(BaseForm(42, 16))
// ➔ "'0x2a'"
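The digit conversion itself can be sketched in Python. The helper `base_form` below is hypothetical: it produces only the digit string (for example "2a"), ignores the sign as described above, and does not reproduce the LaTeX or string decoration shown in the examples.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def base_form(value, base=10):
    """Return the digits of |value| in the given base (hypothetical helper)."""
    if not 2 <= base <= 36:
        raise ValueError("base must be an integer from 2 to 36")
    n = abs(value)  # the sign is ignored
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        # Peel off the least significant digit on each iteration.
        n, remainder = divmod(n, base)
        digits.append(DIGITS[remainder])
    return "".join(reversed(digits))


print(base_form(42, 16))  # 2a
print(base_form(42, 2))   # 101010
print(base_form(-42))     # 42
```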