
stmpjmpr
Programming Ruby 3.3 (5th Edition, p 122): Unicode Character Definition Incorrect
On page 122, in the discussion of “Strings and Encodings”, it is stated: “A Unicode character is two bytes long, which makes a Unicode string twice as long internally as a Latin-1 string.” This is incorrect. Unicode defines a 21-bit code space (code points up to U+10FFFF), so two bytes are not enough to cover it.
The Unicode Standard says, "Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The 8-bit, byte-oriented form, UTF-8, has been designed for ease of use with existing ASCII-based systems."
I’d further argue that this is not just pedantry: the idea that Unicode fits entirely in two bytes is one of the more dangerous lingering misconceptions. It leads people to assume that UTF-16, for instance, always uses a single 2-byte unit per character, when it is, like UTF-8, a variable-length encoding.
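For anyone who wants to see this concretely, here's a small sketch of my own (not from the book) that encodes a few characters and prints their byte sizes. Note that the treble clef (U+1D11E) needs four bytes even in UTF-16:

```ruby
# Each of these strings is a single Unicode code point; its size in
# bytes depends on the encoding form, not on a fixed "two bytes".
["A", "é", "€", "𝄞"].each do |ch|
  utf8  = ch.encode("UTF-8").bytesize
  utf16 = ch.encode("UTF-16BE").bytesize
  puts "#{ch}  U+#{format('%04X', ch.ord)}  UTF-8: #{utf8} B  UTF-16: #{utf16} B"
end
# A  U+0041  UTF-8: 1 B  UTF-16: 2 B
# é  U+00E9  UTF-8: 2 B  UTF-16: 2 B
# €  U+20AC  UTF-8: 3 B  UTF-16: 2 B
# 𝄞  U+1D11E  UTF-8: 4 B  UTF-16: 4 B   <- encoded as a surrogate pair in UTF-16
```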