This is a big subject! It is possible to write a long book about it, and several people have done so (search Amazon for "unicode book" to see some examples).
Handling characters in computers was much simpler in earlier decades, when programmers assumed that English was the only important language. So: 26 letters, upper and lower case, 10 digits, several punctuation marks, plus a code (0x07) to ring a bell, and it all fitted into 7 bits: the ASCII character set.
Naturally, people started asking what about à, ä and Ł, then other people started asking about ऄ, ஹ and ญ, and young people wanted emojis 😱. What to do?
To cut a long story short, many smart and patient people had to serve on committees for years, working out the details of the Unicode character set, and of encodings such as UTF-8, and lots of software needed a very complicated rewrite. Also, lots of new bugs were introduced.
To prevent everything breaking, the Unicode/UTF-8 design ensures that the first 128 codes (0 to 127) are identical to ASCII (even the bell).
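As a minimal sketch of that compatibility guarantee, the snippet below checks the ASCII range from within Julia. The variable names are chosen for illustration only.

```julia
# Code points 0 to 127 match ASCII, and each one occupies
# a single byte when encoded as UTF-8.
ascii_chars = Char.(0:127)

all_ascii   = all(isascii, ascii_chars)                  # true
single_byte = all(c -> ncodeunits(c) == 1, ascii_chars)  # true

# The bell control code is still code point 7.
bell = Char(0x07)
Int(bell) == 7            # true

# The next code point (128) is no longer ASCII and needs two bytes.
isascii(Char(128))        # false
ncodeunits(Char(128))     # 2
```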
Languages designed after about 2005 have the huge advantage that a reasonably stable Unicode standard already existed.
Julia (first released 2012) was able to assume that everything would be Unicode: characters, strings, variable and function names, mathematical operators...
To quote the manual, "Julia makes dealing with plain ASCII text simple and efficient, and handling Unicode is as simple and efficient as possible." Note the "as possible": it is an important part of the statement.
Character literals are written in single-quotes, and are distinct from strings written in double quotes.
This is obvious to people from the C/C++ world, but potentially confusing to Python and JavaScript programmers.
julia> a = 'a' # Roman alphabet
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
julia> typeof(a)
Char
julia> jha = 'झ' # Devanagari alphabet
'झ': Unicode U+091D (category Lo: Letter, other)
julia> typeof(jha)
Char
julia> '❤' # heart emoji
'❤': Unicode U+2764 (category So: Symbol, other)
We can see from the examples that the type is Char, and Julia has further information about the category of character.
Looking closer, it appears that these characters can be represented by 4 hexadecimal digits. The full character set needs up to 6 hex digits.
These numbers are called "code points", and currently range from U+0000 to U+10FFFF.
They display in the REPL, but within code use codepoint() to obtain them.
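A short sketch of reading a code point programmatically, using only Base functions. The variable names are illustrative.

```julia
c  = 'झ'
cp = codepoint(c)            # a UInt32, shown in hexadecimal: 0x0000091d

cp == 0x091d                 # true
hex = string(cp, base = 16)  # "91d", the hex digits without the U+ prefix
Char(cp) == c                # true: the conversion round-trips
```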
Converting between Char and Int is simple:
julia> Int('a')
97
julia> Char(97)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
The compiler allows some forms of integer arithmetic on Chars:
julia> 'b' - 'a' # interval, in alphabetic order
1
julia> 'b' + 'a'
ERROR: MethodError: no method matching +(::Char, ::Char)
julia> 'a' + 5
'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)
julia> 'f' + ('A' - 'a') # same as `uppercase('f')`
'F': ASCII/Unicode U+0046 (category Lu: Letter, uppercase)
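The sketch below shows two common uses of this arithmetic, plus a caveat. The helper names alphabet_position and digit_value are hypothetical, introduced here for illustration. The fixed-offset trick for changing case relies on the ASCII layout, so uppercase() is the safer choice for general Unicode text.

```julia
# Hypothetical helpers, not part of Julia.

# Position of a lowercase ASCII letter in the alphabet (a = 1, ..., z = 26).
alphabet_position(c::Char) = c - 'a' + 1

# Numeric value of an ASCII digit character.
digit_value(c::Char) = c - '0'

alphabet_position('e')    # 5
digit_value('7')          # 7

# Char ranges are built on the same arithmetic.
collect('a':'e')          # ['a', 'b', 'c', 'd', 'e']

# Caveat: subtracting 32 from 'ÿ' (U+00FF) lands on 'ß' (U+00DF),
# whereas uppercase() gives 'Ÿ' (U+0178).
shifted = '\u00ff' + ('A' - 'a')
proper  = uppercase('\u00ff')
shifted == proper         # false
```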
Many string-handling functions can also work on Char input.
uppercase() and lowercase().
isuppercase() and islowercase().
isletter() covers many alphabets.
isdigit(), tests for strictly 0:9.
isnumeric(), broader than isdigit, so true for ¾ and various non-European scripts.
isxdigit(), hexadecimal digits.
isascii(), pre-Unicode character.
ispunct(), punctuation.
isspace(), any whitespace character.
isprint(), printable characters (opposite is iscntrl()).
islowercase('A') # false
uppercase('γ') # 'Γ': Unicode U+0393 (category Lu: Letter, uppercase)
ispunct('@') # true
isdigit('A') # false
isxdigit('A') # true
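To see how these predicates combine, here is a minimal sketch of a classifier. The function name describe is hypothetical, and the order of the checks is one possible choice rather than a standard.

```julia
# `describe` is a hypothetical helper for illustration, not part of Julia.
function describe(c::Char)
    if isuppercase(c)
        "uppercase letter"
    elseif islowercase(c)
        "lowercase letter"
    elseif isletter(c)
        "letter without case"
    elseif isdigit(c)
        "digit"
    elseif isnumeric(c)
        "other numeric character"
    elseif isspace(c)
        "whitespace"
    elseif ispunct(c)
        "punctuation"
    else
        "something else"
    end
end

describe('A')    # "uppercase letter"
describe('झ')    # "letter without case"
describe('7')    # "digit"
describe('¾')    # "other numeric character"
describe('_')    # "punctuation"
describe('😱')   # "something else"
```

Testing isdigit before isnumeric matters here, because every character accepted by isdigit is also accepted by isnumeric.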
For String to Char Vector, we can use collect().
For Char Vector to String, there is the String() constructor.
julia> s = "abcde"
"abcde"
julia> cv = collect(s)
5-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
julia> String(cv)
"abcde"
This works with any characters, not just ASCII.
julia> collect("❤,😱")
3-element Vector{Char}:
'❤': Unicode U+2764 (category So: Symbol, other)
',': ASCII/Unicode U+002C (category Po: Punctuation, other)
'😱': Unicode U+1F631 (category So: Symbol, other)
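A brief sketch of the round trip with non-ASCII text. It also uses join() from Base, which is an alternative way (not mentioned above) to concatenate Chars into a String.

```julia
chars = collect("❤,😱")            # Vector{Char} with 3 elements

roundtrip = String(chars)          # back to "❤,😱"
joined    = join(chars)            # join() gives the same String
flipped   = join(reverse(chars))   # "😱,❤"

length(chars)                      # 3
```

Working on the Vector{Char} makes operations such as reverse() act on whole characters rather than on individual bytes.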
Note that the String() constructor operates on a Vector.
To cast a single Char to a 1-character string, the function is string(), with lowercase s.
julia> string('a')
"a"
Everything so far in the document seems relatively simple, so is there really not much to worry about?
Unfortunately, this is too optimistic!
One complication comes from the need for "up to" 6 hex digits per code point. This means that different characters need different amounts of space in memory when UTF-8 encoded.
A byte can only store (unsigned) numbers up to 255, two hex digits, so UTF-8 uses a variable number of bytes (1 to 4) to store a Char.
These bytes are called "code units", and the ncodeunits() function will return the number needed for a given character.
julia> codepoint(jha) # jha 'झ' is defined in an earlier example
0x0000091d
julia> ncodeunits(jha)
3
julia> ncodeunits('a') # ASCII character
1
julia> ncodeunits('😱') # emoji
4
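To look at the code units themselves, the following sketch converts a Char to a one-character string and reads its bytes with codeunits(). The helper name utf8_bytes is hypothetical.

```julia
# `utf8_bytes` is a hypothetical helper for illustration.
# It returns the UTF-8 bytes (code units) that encode a single character.
utf8_bytes(c::Char) = collect(codeunits(string(c)))

utf8_bytes('a')     # UInt8[0x61]
utf8_bytes('झ')     # UInt8[0xe0, 0xa4, 0x9d]

# The byte count always agrees with ncodeunits().
sizes = [length(utf8_bytes(c)) for c in ['a', 'झ', '😱']]   # [1, 3, 4]
```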
Also, not everything that can be displayed on screen has its own unique code point. Some visually-distinct characters are considered to be derived from others, so Unicode treats them as a parent character plus a modifier.
This issue affects Strings, where it presents challenges for indexing.
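As an illustrative sketch, an accented "e" can be stored either as one precomposed code point or as a plain "e" followed by a combining accent. The example assumes the Unicode standard library, which provides normalize() and graphemes().

```julia
using Unicode

precomposed = "\u00e9"     # "é" as a single code point, U+00E9
decomposed  = "e\u0301"    # 'e' followed by U+0301, a combining acute accent

length(precomposed)        # 1
length(decomposed)         # 2
precomposed == decomposed  # false, although both display as "é"

# Normalizing to the composed form (NFC) makes the two comparable.
normalized = Unicode.normalize(decomposed, :NFC)
normalized == precomposed  # true

# graphemes() groups code points into user-perceived characters.
grapheme_count = length(collect(graphemes(decomposed)))   # 1
```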
In this exercise you will implement a partial set of utility routines to help a developer clean up identifier names.
In the 6 tasks you will gradually build up the functions transform to convert single characters and clean to convert strings.
A valid identifier comprises zero or more letters, underscores, hyphens, question marks and emojis.
If an empty string is passed to the clean function, an empty string should be returned.
Implement the transform function to replace any hyphens with underscores.
julia> transform('-')
"_"
Remove all whitespace characters. This will include leading and trailing whitespace.
julia> transform(' ')
""
Modify the transform function to convert camelCase to kebab-case.
julia> transform('D')
"-d"
Modify the transform function to omit any characters that are numeric.
julia> transform('7')
""
Modify the transform function to replace any Greek letters in the range 'α' to 'ω' with a question mark.
julia> transform('β')
"?"
Implement the clean function to apply these operations to an entire string.
Characters which fall outside the rules should pass through unchanged.
julia> clean(" a2b Cd-ω😀 ")
"ab-cd_?😀"