Programming Ruby 3.3 (5th Edition, p 122): Unicode Character Definition Incorrect

On page 122, in the discussion of “Strings and Encodings”, it is stated: “A Unicode character is two bytes long, which makes a Unicode string twice as long internally as a Latin-1 string.” This is incorrect. Unicode defines a 21-bit code space (code points up to U+10FFFF), so two bytes, which can distinguish only 65,536 values, are not enough to cover it.
The Unicode Standard says, "Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The 8-bit, byte-oriented form, UTF-8, has been designed for ease of use with existing ASCII-based systems."
I’d further argue that this is not just pedantry: the idea that Unicode fits completely in two bytes is one of the more dangerous lingering misconceptions. It leads people to assume that UTF-16, for instance, always uses a single 2-byte unit per character, when in fact UTF-16, like UTF-8, is a variable-length encoding.
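
You can see this directly from Ruby itself. A quick illustration (my own snippet, not from the book; output shown is what I get on a current Ruby):

```ruby
# A character outside the Basic Multilingual Plane needs more than
# two bytes in every Unicode encoding form.
char = "\u{1F600}"                      # U+1F600, an emoji above U+FFFF

puts char.encode("UTF-8").bytesize      # => 4  (four UTF-8 bytes)
puts char.encode("UTF-16BE").bytesize   # => 4  (a surrogate pair: two 16-bit units)
puts char.encode("UTF-32BE").bytesize   # => 4  (one fixed-width 32-bit unit)

# Even a common character like "é" (U+00E9) already takes two bytes in UTF-8:
puts "é".encode("UTF-8").bytesize       # => 2
```

So the only fixed-width Unicode encoding form is UTF-32, and it is four bytes per character, not two.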