W1033 Characters

Within these castle walls be forged Mavens of Computer Science ...
— Merlin, The Coder
(Figure: USASCII code chart)

Introduction

A character in computer science means the same thing it does in English and most other languages: a single letter or other symbol. The difference is that humans typically write characters on a piece of paper, while computer programs need another way to deal with them. Remember that all the CPU cares about is executing the instructions it gets on the data it gets; at a low level there is no real notion of types, so everything eventually has to be stored in binary. This means the CPU couldn't care less whether '01000001' refers to the integer 65 or the character 'A'. So there needs to be a way to store characters in binary.
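For example, here is a minimal Swift sketch of that idea: the same byte can be read as the integer 65 or as the character 'A', depending entirely on how we choose to interpret it.

let bits: UInt8 = 0b01000001           // the byte 01000001
print(bits)                            // prints 65 when treated as an integer
print(Character(UnicodeScalar(bits)))  // prints A when treated as a character code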

Topic Headers

Key Concepts

To store characters in binary, there needs to be some way of encoding a character into binary and decoding that binary back into a character.

ASCII

One of the earliest encoding schemes, and one you've probably heard of, is ASCII (there were encoding schemes before ASCII, but it makes a good starting point). ASCII assigns each character a number ranging from 0 to 127, which can then be stored however you please, be it in binary or on a piece of paper in decimal form.

In C, a popular lower-level language, characters are stored using the char data type, which, depending on the implementation of C being used, is either an 8-bit signed or unsigned integer. Because ASCII only uses the numbers 0 to 127, either one works: an unsigned 8-bit integer can store values from 0 to 255, while a signed 8-bit integer can store values from -128 to 127, and both ranges include every possible ASCII value.
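Swift exposes the same idea: an ASCII character's code fits comfortably in an 8-bit integer. A minimal sketch, using only the standard library:

let letter: Character = "A"
if let code = letter.asciiValue {   // asciiValue is a UInt8?, nil for non-ASCII characters
    print(code)                     // prints 65
}
print(UInt8.min, UInt8.max)         // 0 255
print(Int8.min, Int8.max)           // -128 127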

Strings

Although an individual character can be encoded as a number, that only works for a single character at a time. The CPU can deal with characters because they're just numbers, but there isn't really a way to encode an entire sentence into a single number. This is why, if you've ever programmed in C, you've probably noticed the lack of a native string type. If you want to use strings in C, you have to work with arrays of characters yourself. This gets even more fun once you need dynamically sized strings (whose memory you have to allocate yourself) or arrays of strings (a string is already an array, so an array of sentences is an array of arrays of characters). Higher-level languages such as Swift and Java provide a native string type that handles the array of characters behind the scenes.
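As a rough Swift illustration of that last point, a String can be built straight from an array of Characters; the standard library manages the underlying storage for you:

let letters: [Character] = ["H", "i", "!"]   // an explicit array of characters
let greeting = String(letters)               // String wraps the array for us
print(greeting)                              // prints Hi!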

You can see this behavior in many languages that let you access the characters of a string like an array. In Swift, you can see it by iterating through a String as if it were an array: the type of each element is (or is compatible with) Character:

let x: String = "Hello, World!"
for c: Character in x {
    print(c)
}

Which results in:

H
e
l
l
o
,
 
W
o
r
l
d
!

If you try with a type other than Character (or String.Element), you'll get an error that looks like this:

error: cannot convert sequence element type 'String.Element' (aka 'Character') to expected type 'String'
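For instance, annotating the loop variable as String instead of Character, as in the sketch below, will not compile and produces an error along those lines:

let x: String = "Hello, World!"
for c: String in x {   // a String's elements are Characters, not Strings
    print(c)
}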

Unicode

Although ASCII worked just fine, and still does to this day, it has some big problems in the current computing landscape. ASCII only defines English letters, Arabic numerals (i.e. the digit system you're probably accustomed to), and a small set of punctuation and control characters. That is inadequate in today's age, when computers have to deal with hundreds of languages, not to mention emoji and other symbols. Unicode currently defines nearly 150,000 characters compared to ASCII's 128, and includes more than 150 scripts.

Because of Unicode's much larger size, a character can no longer always fit in a single 8-bit integer. UTF-8, the most common Unicode encoding for web pages, uses between one and four bytes for each character. It also provides backwards compatibility with ASCII by having the first 128 values correspond to their ASCII values. This means that any ASCII text can be decoded as UTF-8, but not necessarily the other way around.
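A quick Swift sketch of this variable-width behavior (the sample characters here are just illustrative): counting the UTF-8 bytes of a few characters shows ASCII characters taking one byte while others take more.

let samples: [Character] = ["A", "é", "😀"]
for ch in samples {
    // String(ch).utf8 is the sequence of UTF-8 bytes encoding that character
    print(ch, String(ch).utf8.count)   // A 1, é 2, 😀 4
}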

Exercises

References

Unicode (Wikipedia)
ASCII (Wikipedia)