UTF-8 is perhaps the best hack, the best single thing that’s used that can be written down on the back of a napkin, and that’s how it was put together. The first draft of UTF-8 was written on the back of a napkin in a diner, and it’s just such an elegant hack that solved so many problems and I
absolutely love it. Back in the 1960s, we had teleprinters, we had simple
devices where you type a key and it sends some numbers and the same letter comes out on the other side, but there needs to be a standard so in
the mid-1960s America, at least, settled on ASCII, which is the American Standard Code for Information Interchange, and it’s a 7-bit binary system, so each letter you type in gets converted into 7 binary digits and sent over the wire. Now that means you can have numbers from 0 to 127. They reserved the first 32 for control codes and less important stuff for writing, things like “go down a line” or backspace. And then they made the rest characters. They added some numbers, some punctuation marks. They did a really clever thing, which is that they made ‘A’ 65, which, in binary—the column values being 1, 2, 4, 8, 16, 32, 64—is 1000001. That means that ‘B’ is 66, so you’ve got 2 in binary just here. C, 67, 3 in binary. So you can look at a 7-bit binary character and just knock off the first two digits and know what its position in the alphabet is. Even cleverer than that, they started lowercase 32 later, which means that lowercase ‘a’ is 97—1100001. Anything that doesn’t fit into that pattern is probably a space—code 32, 0100000—or some kind of punctuation mark. Brilliant, clever, wonderful, great way of doing things, and that became the standard, at least in the English-speaking world. As for the rest of the world, a few of them did versions of that, but you start getting into other alphabets, into languages that don’t really use alphabets at all. They all came up with their own encodings, which is fine. And then along come computers, and, over time, things change. We move to 8-bit computers, so we now have a whole extra bit at the start just to confuse matters, which means we can go to 256! We can have twice as many characters! And, of course, everyone settled on the same standard for this, because that would make perfect s— No. None of them did. All the Nordic countries start putting Norwegian characters and Finnish characters in there. Japan just doesn’t use ASCII at all.
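The alphabet-position and case tricks described above can be sketched in a few lines of Python. This is a toy illustration of the bit patterns only — the function names here are my own, not part of any standard library:

```python
# 'A' is 65 = 1000001 and 'a' is 97 = 1100001: the low five bits give
# the letter's position in the alphabet, and one single bit (value 32)
# is all that separates uppercase from lowercase.
def alphabet_position(ch):
    """Knock off the top bits of an ASCII letter to get 1..26."""
    return ord(ch) & 0b11111

def toggle_case(ch):
    """Flip the one bit that separates 'A' (65) from 'a' (97)."""
    return chr(ord(ch) ^ 0b100000)

print(alphabet_position("A"))  # 1
print(alphabet_position("b"))  # 2
print(toggle_case("C"))        # c
```

The same single-bit trick is why Caps Lock was cheap to implement in early hardware: flipping case is one XOR, no lookup table needed.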
Japan goes and creates its own multibyte encoding with more letters and more characters and more binary numbers going to each individual character. All of these things are massively incompatible. Japan actually has three or four different encodings, all of which are completely incompatible with each other. So you send a document from one old-school Japanese computer to another, it will come out so garbled that there is even a word in Japanese for “garbled characters,” which is—I’m probably mispronouncing this—but it’s “mojibake.” It’s a bit of a nightmare, but it’s not bad, because how often does someone in London have to send a document to a completely incompatible and unknown computer at another company in Japan? In those days, it’s rare. You printed it off and you faxed it. And then the World Wide Web hit, and we have a problem, because suddenly documents are being sent from all around the world all the time. So a thing is set up called the Unicode Consortium. In what I can only describe as a miracle, over the last couple of decades, they have hammered out a standard. Unicode now have a list of more than a hundred thousand characters that covers everything you could possibly want to write in any language— English alphabet, Cyrillic alphabet, Arabic alphabet, Japanese, Chinese, and Korean characters. What you have at the end is the Unicode Consortium assigning 100,000+ characters to 100,000 numbers. They have not chosen binary digits. They have not chosen what they should be represented as. All they have said is that THAT Arabic character there, that is number 5,700-something, and this linguistic symbol here, that’s 10,000-something. I have to simplify massively here because there are about, of course, five or six incompatible ways to do this, but what the web has more or less settled on is something called “UTF-8.” There are a couple of problems with doing the obvious thing, which is saying, “OK. We’re going to 100,000. 
That’s gonna need, what… to be safe, that’s gonna need 32 binary digits to encode it.” They encoded the English alphabet in exactly the same way as ASCII did. ‘A’ is still 65. So if you have just a string of English text, and you’re encoding it at 32 bits per character, you’re gonna have about 25 zeroes and then a few ones for every single character. That is incredibly wasteful. Suddenly every English-language text file takes four times the space on disk. So problem 1: you have to get rid of all the zeroes in the English text. Problem 2: there are lots of old computer systems that interpret 8 zeroes in a row, a NULL, as “this is the end of the string of characters,” so if you ever send 8 zeroes in a row, they just stop listening. They assume the string has ended there, and it gets cut off, so you can’t have 8 zeroes in a row anywhere. ’K. Problem number 3: it has to be backwards-compatible. You have to be able to take this Unicode text and chuck it into something that only understands basic ASCII, and have it more or less work for English text. UTF-8 solves all of these problems and it’s just a wonderful hack. It starts by just taking ASCII. If you have something under 128, that can just be expressed as 7 digits, you put down a zero, and then you put the same numbers that you would otherwise, so let’s have that ‘A’ again—there we go! That’s still ‘A.’ That’s still 65. That’s still UTF-8-valid, and that’s still ASCII-valid. Brilliant. OK. Now let’s say we’re going above that. Now you need something that’s gonna work more or less for ASCII, or at least not break things, but still be understood. So what you do is you start by writing down “110.” This means this is the start of a new character, and this character is going to be 2 bytes long. Two ones, two bytes, a byte being 8 bits.
And you say on this one, we’re gonna start it with “10,” which means this is a continuation, and at all these blank spaces, of which you have 5 here and 6 here, you fill in the other numbers, and then when you decode it, you just take off those headers and read the remaining bits as whatever number that turns out to be. That’s probably somewhere in the hundreds. That’ll do you for the first 2,048. What about above that? Well, above that you go “1110,” meaning there are three bytes in this—three ones, three bytes—with two continuation bytes. So now you have 4 + 6 + 6 = 16 spaces. You want to go above that? You can. The original specification goes all the way to “1111110x” with five continuation bytes after it (the modern standard caps UTF-8 at four bytes per character). It’s a neat hack that you can explain on the back of a napkin or a bit of paper. It’s backwards-compatible. It avoids waste. At no point will it ever, ever, ever send 8 zeroes in a row, and, really, really crucially, the one that made it win over every other system is that you can move backwards and forwards really easily. You do not have to have an index of where the character starts. If you are halfway through a string and you wanna go back one character, you just look for the previous header. And that’s it, and that works, and, as of a few years ago, UTF-8 beat out ASCII and everything else as, for the first time, the dominant character encoding on the web. We don’t have that mojibake that Japanese has. We have something that nearly works, and that is why it’s the most beautiful hack that I can think of that is used around the world every second of every day. (BRADY HARAN)
-We’d like to thank Audible.com for their support of this Computerphile video, and, if you register with Audible and go to audible.com/computerphile, you can download a free audiobook. They’ve got a huge range of books at Audible. I’d like to recommend “The Last Man On the Moon,” which is by Eugene Cernan, who is the eleventh of twelve men to step onto the Moon, but he was the last man to step off the Moon, so I’m not sure whether or not he is “the last man on the Moon.” Sort of depends how you define it. But his book is really good, and what I really like about it is that it’s read by Cernan himself, which I think is pretty cool. Again, thanks to Audible. Go to audible.com/computerphile and get a free audiobook. (TOM SCOTT)
-“… an old system that hasn’t been programmed well will take those nice curly quotes that Microsoft Word has put into Unicode, and it will look at that and say, ‘That is three separate characters…’ ”
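The header-and-continuation scheme walked through in the transcript, and the backwards-scanning property that made UTF-8 win, can both be sketched in a few lines of Python. This is an illustration only, with hypothetical function names — in practice the built-in str.encode("utf-8") does the encoding, and this sketch uses the modern four-byte cap rather than the original six-byte scheme:

```python
def utf8_encode(codepoint):
    """Hand-roll the UTF-8 bytes for one Unicode code point."""
    if codepoint < 0x80:       # 7 bits: 0xxxxxxx -- plain ASCII, unchanged
        return bytes([codepoint])
    if codepoint < 0x800:      # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (codepoint >> 6),
                      0b10000000 | (codepoint & 0b111111)])
    if codepoint < 0x10000:    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (codepoint >> 12),
                      0b10000000 | ((codepoint >> 6) & 0b111111),
                      0b10000000 | (codepoint & 0b111111)])
    # 21 bits: 11110xxx plus three continuation bytes
    return bytes([0b11110000 | (codepoint >> 18),
                  0b10000000 | ((codepoint >> 12) & 0b111111),
                  0b10000000 | ((codepoint >> 6) & 0b111111),
                  0b10000000 | (codepoint & 0b111111)])

def previous_char_start(data, i):
    """Step backwards over continuation bytes (10xxxxxx) to a header."""
    i -= 1
    while i > 0 and (data[i] & 0b11000000) == 0b10000000:
        i -= 1
    return i

print(utf8_encode(65))            # b'A' -- still valid ASCII
print(utf8_encode(0xE9).hex())    # c3a9 -- 'é' as two bytes
s = "a中b".encode("utf-8")        # 1 byte + 3 bytes + 1 byte
print(previous_char_start(s, 4))  # 1 -- skips back over two continuations
```

Note that no non-zero code point ever produces a zero byte: ASCII bytes start with 0 but carry a non-zero value, and every header or continuation byte starts with a 1. That is exactly why the NULL problem from the transcript never arises.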

Characters, Symbols and the Unicode Miracle – Computerphile

100 thoughts on “Characters, Symbols and the Unicode Miracle – Computerphile”

  • August 30, 2015 at 1:05 pm
    Permalink

    All 0s in ASCII is Nul. 32 (01 00000) is Space.

    Reply
  • September 5, 2015 at 7:28 pm
    Permalink

    cameraman, please take a seat

    Reply
  • September 16, 2015 at 3:07 pm
    Permalink

    is not 0000 in ascii null not space ?

    Reply
  • September 23, 2015 at 2:32 pm
    Permalink

    ⌠▓▒░cool░▒▓⌡

    Reply
  • September 25, 2015 at 1:41 pm
    Permalink

    UTF-8 should be the standard everywhere but it isn't. 🙁

    Reply
  • October 1, 2015 at 5:04 pm
    Permalink

    This was an excellent presentation. Thank you for making it so understandable!

    I do have a very minor quibble. At 7:18, there's an error; in a 2 byte Unicode character, having 11 bits available (5 from the header, and 6 from the continuation) will only allow you to get values up to 2048, not 4096.

    Reply
  • October 12, 2015 at 10:31 pm
    Permalink

    Mojibake translates to 'character ghost'

    Reply
  • October 27, 2015 at 8:50 am
    Permalink

    interestingly all the functions that come with php standard mash everything into ASCII making it the worst computer language in existence if you ask me.

    there are frameworks for php that can use other encodings buy if you accidentally use a single one of the regular ones you get fucked up text.

    PHP needs to quietly die or be updated (but the idea of updating PHP makes everyone's heads explode because they have to have backwards compatibility)

    kinda scary to think that Wikipedia runs on that pile of shit…..

    i guess i have to give it point for that at least if it's already built and working it runs fine….

    But you have to be mentally ill to want to develop anything new on it unless you're really, really focused on making sure it's damn cheap to host…

    Reply
  • November 4, 2015 at 2:08 pm
    Permalink

    Holy shit, this guy is freaking enthusiastic about it. But he has a point…. I only recently learned the way UTF-8 works and I gotta say, this is some freaking genius hack.

    Reply
  • November 30, 2015 at 8:13 am
    Permalink

    A pity that Javascript doesn't properly support UTF-8, it only supports UTF-16. Which is compatible in some cases but not all. This situation causes problems for pasting Word documents into Web forms, among other things. Basically, there is a need to deprecate all of the multiplicity of variants of Unicode, and standardise on one. UTF-8 would seem the sensible choice.

    Reply
  • December 22, 2015 at 1:18 pm
    Permalink

    UTF-8 master race

    Reply
  • December 26, 2015 at 6:21 pm
    Permalink

    is 11111110 + 7 spaces possible?

    Reply
  • January 5, 2016 at 10:51 pm
    Permalink

    The end-video links don't work! missing "/watch?v="

    Reply
  • January 21, 2016 at 11:34 am
    Permalink

    Same problem existed with Russian encodings. The name for the mess which appeared if wrong encoding was used is "krakoz'abry" («кракозябры»). Probably, there was a name for it in each non-latin scripted language…

    Reply
  • January 23, 2016 at 12:53 pm
    Permalink

    “UTF-8 was invented by Ken Thompson and Rob Pike, two of the creators of Go”

    Excerpt From: Brian W. Kernighan. “The Go Programming Language (Addison-Wesley Professional Computing Series).”

    Reply
  • January 29, 2016 at 2:23 am
    Permalink

    Was this filmed in the St Pancras Hotel?

    Reply
  • February 5, 2016 at 5:36 pm
    Permalink

    Wait isn't 0000000 null and 0100000 a space?

    Reply
  • February 8, 2016 at 8:35 pm
    Permalink

    A bit too fast…

    Reply
  • March 14, 2016 at 3:21 am
    Permalink

    Another quibble: he accidentally damns this with faint praise when he says "you have something that nearly works" right at the very end. He meant it nearly works perfectly, but it does work excellently.

    Reply
  • May 7, 2016 at 5:42 pm
    Permalink

    Well my phone's email app doesn't use it because I get "J"s and stuff instead of smiley faces…

    Reply
  • May 25, 2016 at 9:51 pm
    Permalink

    good stuff

    Reply
  • June 7, 2016 at 4:08 pm
    Permalink

    I still hate Unicode, and limit it very strictly in my own projects. There are so many "characters" encoded in it that can break a website (at least its design). I honestly think mankind should settle on ONE language and ONE script. Use the metric system, UTC time and date, and the English language is the only solution that makes sense in the long run.

    Reply
  • August 21, 2016 at 4:30 am
    Permalink

    can any one plz teach me how to do the things he said in real life?

    Reply
  • September 2, 2016 at 1:55 pm
    Permalink

    I watched the video about reading in binary

    Reply
  • September 18, 2016 at 12:48 pm
    Permalink

    homework is to watch this

    Reply
  • September 19, 2016 at 2:36 am
    Permalink

    Were you filming after a liquid lunch?

    Reply
  • September 21, 2016 at 4:02 am
    Permalink

    3 years later, still quality.

    Well, give-or-take a few leap seconds

    Reply
  • October 19, 2016 at 3:54 am
    Permalink

    I learn more here than my software lessons

    Reply
  • October 27, 2016 at 11:15 am
    Permalink

    There's a saying that UTF-8 was successful because USA did not need to understand it. (Explanation: they could just keep using ASCII and magically they are compatible with UTF-8).

    Reply
  • November 17, 2016 at 12:56 am
    Permalink

    Love Tom's enthusiasm and passion when he talks!

    Reply
  • November 21, 2016 at 8:38 pm
    Permalink

    You forgot to mention that the great hacker behind the great hack is Ken Thompson, the genius behind unix

    Reply
  • December 2, 2016 at 11:12 am
    Permalink

    subscribed

    Reply
  • February 9, 2017 at 7:15 am
    Permalink

    why did i even take computer science this year lmao??? don't know how i'm gonna survive this pray for me

    Reply
  • March 22, 2017 at 9:19 pm
    Permalink

    I am YODA. Luke, forget EVERYTHING I taught you about THE FORCE. There is something even more INCREDIBLE….UTF-8!!!!

    Reply
  • April 19, 2017 at 2:23 am
    Permalink

    "[…] we don't have mojibake, […] we have something that nearly works" – Tom Scott, 2013.
    I absolutely adore this "nearly" thing.

    Reply
  • July 5, 2017 at 3:41 am
    Permalink

    What are you looking at Tom?

    Reply
  • July 23, 2017 at 11:22 am
    Permalink

    Interestingly ASCII's pronunciation in Bulgarian is АСКИ and such is our language that it can be said to mean Айде Сърбай Крема Идиот which translated mean Go on, drink the cream idiot.

    Reply
  • September 28, 2017 at 9:40 pm
    Permalink

    why is binary written and read right to left?

    Reply
  • October 5, 2017 at 11:29 pm
    Permalink

    I liked this one – Tom seemed genuinely hyped all the way through, and didn't once resort to that "frustrated sigh" schtick, which can quickly get quite… frustrating 😛

    Reply
  • October 7, 2017 at 4:17 am
    Permalink

    mojibake = emoji bukakke

    Reply
  • November 3, 2017 at 9:15 pm
    Permalink

    This explains why a fellow student kept receiving unreadable emails from one of our japanese teachers and an exchange student. I ended up copying the gibberish (mojibake 文字化け) into editor and changing the encoding to make them kind of readable lol

    Reply
  • November 6, 2017 at 12:20 pm
    Permalink

    gotta love the passion of this guy 🙂

    Reply
  • November 15, 2017 at 1:28 pm
    Permalink

    𖤓𖥂𖣘𖣐᳄₪᳇₪᳄𖣐𖣘𖥂𖤓

    Reply
  • November 22, 2017 at 3:36 pm
    Permalink

    I have NEVER been able to make unicode work. You type the code and nothing happens. I don't get it?

    Reply
  • November 28, 2017 at 9:01 am
    Permalink

    Today I learn about mojibake, thank you.

    Reply
  • December 3, 2017 at 11:36 pm
    Permalink

    Thank you !!

    Reply
  • December 13, 2017 at 7:24 am
    Permalink

    Unicode contains like 138,xxx codepoints currently, but can contain up to 1.1 million.

    Reply
  • December 13, 2017 at 3:28 pm
    Permalink

    Why its filmed in public place ??

    Reply
  • December 19, 2017 at 1:45 am
    Permalink

    This is literally the first video I have seen with Tom Scott in and I absolutely love his passion. I think there should be a standard for a lot more things too. What side of the road we all drive on for a start. Power sockets and the actual powers supply itself. Phone chargers etc.

    Reply
  • January 3, 2018 at 3:05 pm
    Permalink

    This guy is good. UTF-8 is easy. Microsoft is junk. UTF-16 is horrible

    Reply
  • January 5, 2018 at 1:37 am
    Permalink

    5:14

    Reply
  • January 16, 2018 at 7:52 pm
    Permalink

    Brilliantly explained!

    Reply
  • January 21, 2018 at 8:05 am
    Permalink

    i still don't get why some games uses fake system ui

    Reply
  • January 21, 2018 at 2:45 pm
    Permalink

    I'm still a little confused, now. Does the existence of the header bits mean that no character data can contain those exact pattern of bits? That seems really obtuse. And if not, then the computer has to be very careful to count 8 bits at a time forward or backward. Well then what's the point in having such long headers? It could be as simple as 0= start of new character, and 1 = continuation.

    Reply
  • February 24, 2018 at 1:26 pm
    Permalink

    A note from someone studying Japanese: as far as I can tell, from my limited knowledge of Japanese shortenings, "mojibake" means "character monster." Rather prefer the one on Sesame Street, myself.

    Reply
  • March 28, 2018 at 8:49 pm
    Permalink

    UTF-8 A great universal standard.

    Reply
  • April 21, 2018 at 1:52 pm
    Permalink

    I don't know … why some people dislike this video .

    Reply
  • June 21, 2018 at 3:46 am
    Permalink

    very interesting…

    Reply
  • July 5, 2018 at 11:01 am
    Permalink

    thanx Computerphile for explaining utf8 , user tried to understand from wiki but could not do it, u make everything simple

    Reply
  • July 25, 2018 at 11:50 am
    Permalink

    … aaaaaand Microsoft chose UTF-16 as their default C# encoding. facepalm

    Reply
  • September 11, 2018 at 10:09 pm
    Permalink

    Thanks for the history lesson. It is always interesting to remember how we got to where we are today.

    Reply
  • September 13, 2018 at 12:59 am
    Permalink

    So all the weird symbols like NUL and REF in a picture when you open it in a text editor are the 64 characters before A? And it’s weird because it’s being read incorrectly. Cool!

    Reply
  • September 17, 2018 at 10:53 am
    Permalink

    8:28 "We have something that nearly works"

    Reply
  • September 26, 2018 at 1:56 pm
    Permalink

    This guy knows everything!

    Reply
  • October 12, 2018 at 7:32 am
    Permalink

    Awesome explanation

    Reply
  • October 12, 2018 at 4:07 pm
    Permalink

    At time 6:46, the number is 49 not 65. Super interesting video, very informative. I was directed here from electroboom, and am excited to find another great educational YouTuber!

    Reply
  • October 15, 2018 at 7:53 am
    Permalink

    Well this sounds very nice and space-saving, but still Czech text encoded in UTF-8 turns into complete gibberish upon transferring from one computer to another, or worse, from one app to another

    Reply
  • October 20, 2018 at 9:13 pm
    Permalink

    If you ever had to follow along after a hacker and analyze why their code is broken, you would be much less enthralled with hacked software. I have, it ain't pretty.

    Reply
  • October 30, 2018 at 8:54 am
    Permalink

    Does the Computerphile channel have some giant stash of green bar paper they carry around and hand to who ever is speaking to illustrate? Seriously, where do they get an endless supply of greenbar these days?

    Reply
  • November 1, 2018 at 12:36 pm
    Permalink

    I watched this video like 5 times over a long period now. Keep coming back to it, I so love the explanation and the storytelling!

    Reply
  • November 5, 2018 at 7:55 am
    Permalink

    Utf-8 is the obvious solution. Why any other option was ever considered is the real question.

    Reply
  • November 5, 2018 at 10:01 am
    Permalink

    Teletypes used 5 bit Baudot. ASCII was set up specifically for computers from the start.

    Reply
  • November 13, 2018 at 9:45 pm
    Permalink

    Isn't UTF-8 restricted to 21 bits, 4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)?

    Reply
  • November 24, 2018 at 3:00 am
    Permalink

    moji baka
    amirite

    Reply
  • November 26, 2018 at 8:24 pm
    Permalink

    THREE BYTES.
    THREE ONES
    THREE BYTES

    Reply
  • November 27, 2018 at 3:50 am
    Permalink

    I thought there was a Klein bottle on the left side behind Tom. I got excited, and then I got sad when I realized it wasn't…. :'(

    Reply
  • December 8, 2018 at 5:31 pm
    Permalink

    So why isn't the header for everything that doesn't fit into two bytes (e.g. 110xxxxx 10xxxxxx) not just 110 aswell? Or why does the header need to specify how many more bytes there are? It could also just say: "there are more bytes to come" and the program reading it would just look for the next header (or the end of the data) and "use" all the bytes in between… Or am I missing somethin?

    Reply
  • December 11, 2018 at 11:00 pm
    Permalink

    A Miracle would have given one unified Encoding. Not the mess we have now! Video is misleading UTF-8 is NOT de facto standard, not even for the Internet.

    Reply
  • December 12, 2018 at 8:28 pm
    Permalink

    Was this filmed in Dorchester? 🙂

    Reply
  • December 17, 2018 at 12:25 am
    Permalink

    Additionally, UTF-8 does not have a "byte order". The "always store 32 bits for each character" encoding (a.k.a. UTF-32) has the problem that when a little-endian computer and a big-endian computer exchange data in this format, they have to add a prefix which tells the other computer "I'm sending the bytes of each character in ascending order" or "… in descending order". Then software needs logic to understand this prefix, to eliminate this prefix, to guess what to do when this prefix is missing, and so on. The UTF-16 encoding, which is used by Microsoft Windows internally, has the same problem. Whereas UTF-8 just gets away without it. Simple and beautiful!

    Reply
  • January 6, 2019 at 4:12 am
    Permalink

    All zeroes is NOT a space. It is the null character, while 32 in decimal (20 in hex, 100000 in binary) is a space.

    Reply
  • January 6, 2019 at 2:42 pm
    Permalink

    I am confused by a phrase at 2:13, "languages that don't use alphabets at all." If there is no written system, what are they typing?

    Reply
  • January 13, 2019 at 8:09 pm
    Permalink

    Very interesting as always Tom, but I couldn't watch this video – I listened to it, but the camera work made it unwatchable. The constant side to side swaying me seriously nauseous and the random rapid zooms just jarred. Please, tell your cameraman to get a tripod and to use it and to stop playing with the zoom lever.

    Reply
  • January 16, 2019 at 7:04 pm
    Permalink

    Utf 8 is not so great if you are a web dev working on non-english sites, it’s a mess

    Reply
  • January 22, 2019 at 10:50 am
    Permalink

    best explanation on the internet..

    Reply
  • February 8, 2019 at 6:48 pm
    Permalink

    The Japanese just did a great job in imitating their native writing system via computer.

    Reply
  • March 25, 2019 at 3:10 am
    Permalink

    You pronounced mojibake pretty well! For anyone who wants google translate or a Japanese English dictionary: もじばけ

    Reply
  • March 26, 2019 at 1:31 am
    Permalink

    The 6:29 is not 65 but 97 which is lower case a in ASCII

    Reply
  • April 8, 2019 at 10:54 am
    Permalink

    For the people wanting to know where this vid was taken: it's in a cafe called the Booking Office in St Pancras station. I know because I have been there once; it's pretty popular

    Reply
  • June 7, 2019 at 7:27 am
    Permalink

    How does it save space? I am not sure I followed that

    Reply
  • June 7, 2019 at 7:56 am
    Permalink

    Video's showing its age – UTF-8 is now redefined to be maximum 4 bytes per code point.

    Reply
  • June 25, 2019 at 4:51 am
    Permalink

    Why is it considered a hack though??

    Reply
  • July 14, 2019 at 1:01 am
    Permalink

    Where in London was this filmed?

    Reply
  • July 15, 2019 at 7:19 pm
    Permalink

    In depth explanation. He also shares a cool way to remember what A's and a's codepoints are.

    Reply
  • July 19, 2019 at 2:57 pm
    Permalink

    Such an incredible enthusiasm just for UTF-8! I’d like to hear you speaking about quantum entanglement 🥴

    Reply
  • August 1, 2019 at 11:28 pm
    Permalink

    Incompatible == too lazy to write a program

    Reply
  • August 22, 2019 at 2:45 am
    Permalink

    There are Unicode standards for Egyptian Hieroglyphs. Who would want to use those?

    Reply
  • August 25, 2019 at 9:27 am
    Permalink

    Lovely job. Thanks for the video!

    Reply
  • September 1, 2019 at 9:11 pm
    Permalink

    So does that mean non-English text takes more space to store? If I translate a document from English to .. say … Arabic, wouldn't that double or triple its file size? That sounds like a pretty big problem to me.

    Reply
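One of the comments above points out that UTF-16 and UTF-32 need a byte-order mark (BOM) while UTF-8 does not. A quick way to see that in practice, using only Python's built-in codecs:

```python
# UTF-16 stores each code unit as two bytes, so the receiver must be
# told which byte comes first; the BOM (fffe or feff) carries that
# signal. UTF-8 is a byte stream, so there is nothing to reorder.
text = "A"
print(text.encode("utf-16").hex())     # BOM first, then the character
print(text.encode("utf-16-le").hex())  # 4100 -- low byte first
print(text.encode("utf-16-be").hex())  # 0041 -- high byte first
print(text.encode("utf-8").hex())      # 41   -- one byte, no BOM needed
```

The plain "utf-16" codec prepends the BOM for the machine's native byte order, which is why software reading UTF-16 has to detect it, strip it, and guess when it's absent — the complexity the comment describes.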
