Check against Unicode allowlist
This expression takes a string and a Unicode allowlist, and returns the first Unicode code point that violates the allowlist, if any.
It takes two string parameters:
- The text to convert to UTF-8 and test against the passed allowlist.
- The Unicode allowlist.
When the tested string matches the allowlist, the expression returns an empty string ("").
Otherwise, the returned text follows this format:
Code point at index X does not match allowed list. Code point U+XXXX, decimal XX; valid = XX, Unicode category = XX.
For example, testing the string "Foobar1" against the allowlist "L*" (all letter categories) returns:
Code point at index 6 does not match allowed list. Code point U+0031, decimal 49; valid = yes, Unicode category = Nd.
Breaking down the return text:
- Index 6 is the code point index of the failing character, counting from 0. It is not the byte index.
- Code point U+0031 is the hexadecimal Unicode code point of the character (not its UTF-8 byte encoding), written as it appears on most Unicode reference websites. In many programming languages, the character can be written with an escape such as "\u0031".
- Decimal 49 is the decimal value of the Unicode code point.
- Valid = yes indicates that the non-matching code point is still a valid Unicode code point.
- Unicode category = Nd indicates that the non-matching code point belongs to the general category Nd (Number, Decimal Digit).
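The fields above can be reproduced outside the expression, which may help when checking a result by hand. The following Python sketch is only an illustration: the helper name describe_code_point is hypothetical, and the byte index is computed to show how it can differ from the code point index, not because the expression reports it.

```python
import unicodedata


def describe_code_point(text: str, index: int) -> dict:
    """Return the values shown in the message for the code point at `index`.

    Hypothetical helper for illustration; not part of the expression itself.
    """
    ch = text[index]  # Python strings are indexed by code point, not by byte
    return {
        "code_point": f"U+{ord(ch):04X}",  # hexadecimal code point
        "decimal": ord(ch),                # same value in decimal
        "category": unicodedata.category(ch),  # Unicode general category
        # Number of UTF-8 bytes that precede this code point
        "byte_index": len(text[:index].encode("utf-8")),
    }


print(describe_code_point("Foobar1", 6))
# "\u00e8" (e with grave accent) takes two bytes in UTF-8, so the two indexes diverge
print(describe_code_point("Cr\u00e8me1", 5))
```

For "Foobar1" every character is a single byte in UTF-8, so the code point index and the byte index are both 6. In the second string the digit is at code point index 5 but byte index 6.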
Only the first non-matching code point will be returned.
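As a rough illustration of the behavior described in this topic, the check could be sketched in Python as follows. This is not the expression's actual implementation. The function name check_allowlist is hypothetical, and the sketch assumes that the allowlist is a comma-separated list of general category patterns in which "*" is a wildcard, and that only surrogate code points (category Cs) are reported as not valid. The real allowlist syntax and validity rules are described in the Unicode notes topic and may differ.

```python
import unicodedata
from fnmatch import fnmatchcase


def check_allowlist(text: str, allowlist: str) -> str:
    """Return "" if every code point is allowed, else a message for the first failure.

    Assumption: `allowlist` is a comma-separated list of general category
    patterns such as "L*" or "Nd". The real syntax may differ.
    """
    patterns = [p.strip() for p in allowlist.split(",") if p.strip()]
    for index, ch in enumerate(text):  # iterates by code point, not by byte
        category = unicodedata.category(ch)
        if any(fnmatchcase(category, p) for p in patterns):
            continue
        # Assumption: only surrogates (Cs) are treated as not valid.
        valid = "no" if category == "Cs" else "yes"
        return (
            f"Code point at index {index} does not match allowed list. "
            f"Code point U+{ord(ch):04X}, decimal {ord(ch)}; "
            f"valid = {valid}, Unicode category = {category}."
        )
    return ""  # every code point matched the allowlist


print(check_allowlist("Foobar1", "L*"))
print(repr(check_allowlist("Foobar", "L*")))
```

The loop returns as soon as it meets a non-matching code point, so later violations in the same string are never reported.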
For more details on Unicode and UTF-8 terms, read the Unicode notes topic. Allowlists are also explained there.