When you put English text into a computer, it saves every individual character as eight bits of data. Eight ones and zeros. I mean, they're not actually ones and zeros, they're particular positions of atoms on a disk or a few electrons going over a wire, but as far as we're concerned in the computer science world, they are ones and zeros. Bits. And yes, a modern phone or computer might store a few quadrillion of them! But every time you've had to wait a long time for a download, or your phone's complained that its storage is full, you're running up against the same question that computer scientists have been working on since before these things here were state of the art: can we please not use so many bits?

So, let's run through how computers compress text. Images and video are different: that's lossy compression, where it doesn't matter if you lose a little bit of detail. But text has to be losslessly compressed. You can't just lose a bit of detail, otherwise you end up sending the wrong worms.

The first thing we need to know is how text is stored on disk before it's compressed. To make this easier, I'm just going to talk about English text here: I know only too well that it gets more complicated than this, but hey, the series is called "the basics", so deal with it.

On a modern computer, each English character takes up exactly eight ones and zeros on disk: eight bits, or one byte. There are 256 possible combinations of those ones and zeros, so you can have 256 possible characters. That's enough for the English alphabet, numbers, some punctuation, and… well, then it gets complicated depending on which country you're in, and oh my god, that's a whole separate video we do not want to get into right now.

Why eight bits? Because it was big enough to store all the characters that American computers usually needed, but no bigger, so the text doesn't take up any more space than it absolutely has to. Eight is also a power of two, which makes it easy to deal with at the really low levels of programming.

It's also helpful to have a fixed number of bits per character, because it makes searching through text really fast. If you want to go to the 100,000th character in some text, you know exactly where its ones and zeros are going to be, without having to count the bits in every single character before it. If you want to know how long a string of text is, you just count the bits and divide by eight. Computers don't have the luxury of spaces, remember: it's just one long string of 1s and 0s.

But let's say we're not worried about speed right now; we just want to fit as much text as possible into as few 1s and 0s as possible. Compression. A good plan would be to assign the most common characters to the smallest arrangements of bits. So those researchers, way, way back, could have said: well, the space bar is generally used most often. Just give that the code "0". Then it's the lowercase e.
Give that the code "1". Then it's lowercase t: give that two zeros… and immediately, we've hit a problem. Because the computer running through that text later has no way of knowing whether "00" is a t, or the space bar twice. And if we keep on assigning letters like that, the problems keep getting worse: is three zeros a lowercase n? Or a t followed by a space? Or three spaces? Remember, there are no gaps here, no actual space separators; all the computer can see is a constant stream of 1s and 0s. There's no way to know which is meant.

Except. In 1952, a very clever mathematician called David Huffman invented Huffman coding. You invent something like that, your name gets put on it. Yes, there are much better, more modern, more complicated mathematical ways to do this, but Huffman coding is the basic foundation of modern text compression. And here is how it works.

Let's say we want to compress… let's go with the lyrics to
Will Smith's "Wild Wild West". Uncompressed, that's 3,684 characters taking up nearly 30,000 bits.

First up: you count how many times each character is used, and you put that in a list, in order. That list will be different for each block of text you're compressing: this one has way more Ws than usual.

Now take the two least used characters. Those two are going to be the bottom branches on your "Huffman tree". Write them down, with how often they're used next to them; that's called their frequency. Then connect them together, one level up, with the sum of their frequencies. Now add that new sum back into your list, wherever it sits, higher up. And repeat: take the bottom two off your list, connect them up, add the result back into your list. Keep that going, and when one of the sums reaches the bottom two of the list, you connect it up in just the same way. Eventually, you're going to be down to one thing in your list, right at the top.

You now have a Huffman tree. It looks like this, and it tells you how to convert your text into ones and zeros. Our first letter is the first uppercase W
in "wiki-wiki-wild-wild-West". Uppercase W is here; there are 141 of them. So go to the top and follow the path down. Each time you take the left-hand side, write a 0; each time you take the right-hand side, write a 1. So W is this: only 5 bits instead of 8. Next, lowercase i. Follow the code down: only four bits this time, because it's more common. Then the k, which is less common; that actually takes up seven bits. And you keep going. Some of the letters will take up more than 8 bits, but that's fine, because they're not used very often.

Now, you do also have to store this tree, to provide a translation table between your new codes and the uncompressed characters, so this is not efficient for short bits of text. If you're ever asked to do this in a computer science exam, they'll ignore all that and just ask you to encode one easy word from a tree by hand. But for Will Smith's magnum opus,
we have compressed it down to just over 20,000 bits: about a 30% saving.

To uncompress the resulting stream of bits, it works the other way. Just read across, taking the left fork every time you see a 0 and the right fork every time you see a 1. When you reach a letter, that's it: you know that's the end, and you know there is no other path you could possibly have taken. Then you start again with the next bit. In practice, computers might sometimes do this working backwards, from the bottom up, but this is a pretty good way to at least understand what's going on.

And here's the really clever part: Huffman proved that this is the most efficient way to assign 0s and 1s to single characters. It is mathematically impossible to beat it, unless, as you might already have figured out, you start working on blocks bigger than one character. Clever tricks like that are basically how zip files work… but then, this is just the basics.

Thank you to all my proofreaders who helped with that script, thank you to my graphics team, and also thank you to the Centre for Computing History in Cambridge for letting me film with their old kit.
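
If you want to play with this yourself, the whole procedure sketches out in a few lines of Python. This is my own illustration, not anything from the video: count the frequencies, keep merging the two least frequent entries until one tree remains, read the codes off the tree (left adds a 0, right adds a 1), and decode by walking back down the tree bit by bit. A real compressor would also have to store the tree alongside the bits, which is exactly the overhead that makes this inefficient for short text.

```python
import heapq
from collections import Counter
from itertools import count

def build_tree(text):
    """Repeatedly merge the two least frequent nodes until one tree remains."""
    tiebreak = count()  # unique numbers so the heap never compares tree nodes
    heap = [(freq, next(tiebreak), char) for char, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # least frequent
        f2, _, right = heapq.heappop(heap)  # second least frequent
        # connect them one level up, with the sum of their frequencies
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    return heap[0][2]

def make_codes(node, prefix="", table=None):
    """Walk the tree: left forks add '0', right forks add '1'; leaves are characters."""
    if table is None:
        table = {}
    if isinstance(node, tuple):  # internal node: (left subtree, right subtree)
        make_codes(node[0], prefix + "0", table)
        make_codes(node[1], prefix + "1", table)
    else:                        # leaf: a single character
        table[node] = prefix
    return table

def decode(bits, tree):
    """Read across, taking the left fork on 0 and the right fork on 1."""
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if not isinstance(node, tuple):  # reached a letter: start again at the top
            out.append(node)
            node = tree
    return "".join(out)

text = "the wild wild west"
tree = build_tree(text)
codes = make_codes(tree)
encoded = "".join(codes[c] for c in text)
assert decode(encoded, tree) == text  # lossless: we get the same text back
print(len(text) * 8, "bits uncompressed ->", len(encoded), "bits compressed")
```

How much it saves depends entirely on how skewed the character frequencies are; because every character sits at a leaf, no code is a prefix of another, which is why the decoder never faces the "is 00 a t or two spaces?" ambiguity.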

How Computers Compress Text: Huffman Coding and Huffman Trees