UTF-8

UTF-8 (Unicode Transformation Format, 8-bit) is a character encoding standard. It is used almost everywhere and can represent every Unicode code point, which covers practically all of the world's writing systems.

  • Variable Width, 1~4 bytes.
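  • Byte order does not matter. Unlike UTF-16, it does not need a BOM (one is allowed, but it carries no byte-order information).
    • Each byte's high bits show its role in a character: a lead byte announces how long the sequence is, and every continuation byte starts with 10.
    • The code unit is a single byte, so endianness does not apply.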
    • The character € (U+20AC) in UTF-8 is: E2 82 AC
      • First byte: 11100010 (3-byte sequence)
      • Second byte: 10000010 (continuation)
      • Third byte: 10101100 (continuation)
    • Bytes are always read in stream order, first to last
  • Backwards compatible with 7-bit ASCII (values 0~127). Bytes 128~255 from "extended ASCII" code pages are not valid on their own in UTF-8.
    • The ASCII character ‘A’ is 0x41, and in UTF-8 it’s also just 0x41
    • This is why UTF-8 was easy to adopt on systems that were already using ASCII (like Unix/Linux).
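
The bit layout described above can be sketched in Python. This is a minimal illustration, not a replacement for the built-in str.encode; the names utf8_encode and byte_role are hypothetical helpers made up for this note.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx (same as ASCII)
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def byte_role(b: int) -> str:
    """Classify a byte by its high bits only (does not check full validity)."""
    if b < 0x80:
        return "single byte (ASCII)"
    if b < 0xC0:
        return "continuation"
    if b < 0xE0:
        return "lead of 2-byte sequence"
    if b < 0xF0:
        return "lead of 3-byte sequence"
    if b < 0xF8:
        return "lead of 4-byte sequence"
    return "invalid"


euro = utf8_encode(0x20AC)
print(euro.hex(" "))                    # e2 82 ac
for b in euro:
    print(f"{b:08b}", byte_role(b))
print(utf8_encode(0x41) == b"A")        # True: ASCII bytes are unchanged
```

Because each byte identifies itself as a lead or a continuation, a decoder that lands in the middle of a stream can skip forward to the next non-continuation byte and resynchronize.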