How bits work
You’ve probably heard that computers store things in 1s and 0s. These fundamental units of information are called bits. When a bit is “on,” it corresponds to a 1; when it is “off,” it corresponds to a 0. Each bit, in other words, can store only two pieces of information.
But once you put them together, the amount of information you can encode grows exponentially. Two bits can represent four pieces of information, because there are 2 ^ 2 combinations: 00, 01, 10 and 11. Four bits can represent 2 ^ 4 or 16 pieces of information. Eight bits can represent 2 ^ 8 or 256. And so on.
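The counting above can be checked with a few lines of Python. This is only an illustrative sketch; the helper name bit_patterns is made up for this example.

```python
from itertools import product

def bit_patterns(n_bits):
    """Return every distinct string of n_bits zeros and ones."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

for n in (1, 2, 4, 8):
    # Each extra bit doubles the number of patterns, giving 2 ** n in total.
    print(n, "bits ->", len(bit_patterns(n)), "patterns")

print(bit_patterns(2))  # ['00', '01', '10', '11']
```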
The right combination of bits can represent types of data like numbers, letters, and colors, or types of operations like addition, subtraction, and comparison. Most laptops these days are 32- or 64-bit computers. That doesn’t mean the computer can only encode 2 ^ 32 or 2 ^ 64 pieces of information in total. (That would be a very weak computer.) It means that it can use that many bits of complexity to encode each piece of data or individual operation.
4-bit deep learning
So what does 4-bit training mean? Well, to start, we have a 4-bit computer, and thus 4 bits of complexity. One way to think about this: every single number we use during the training process has to be one of the 16 whole numbers from -8 to 7, because these are the only numbers our computer can represent. That goes for the data points we feed into the neural network, the numbers we use to represent the neural network, and the intermediate numbers we need to store during training.
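To see where the range -8 to 7 comes from, here is a small sketch that decodes all 16 four-bit patterns as signed integers. It assumes the common two's-complement convention, which the article does not name; the function to_signed_4bit is hypothetical.

```python
def to_signed_4bit(pattern):
    """Interpret a 4-character bit string as a two's-complement integer."""
    value = int(pattern, 2)  # unsigned reading, 0..15
    # Patterns whose leading bit is 1 stand for the negative numbers.
    return value - 16 if value >= 8 else value

values = sorted(to_signed_4bit(format(i, "04b")) for i in range(16))
print(values)  # the 16 integers from -8 to 7
```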
So how do we do this? Let’s first think about the training data. Imagine it’s a whole bunch of black-and-white images. Step one: we need to convert those images into numbers so the computer can understand them. We do this by representing each pixel by its grayscale value: 0 for black, 1 for white, and the decimals in between for the shades of gray. Our image is now a list of numbers ranging from 0 to 1. But in 4-bit land, we need it to range from -8 to 7. The trick here is to scale our list of numbers linearly, so 0 becomes -8, 1 becomes 7, and the decimals map to the integers in the middle. In other words, a grayscale value g becomes 15g - 8.
This process isn’t perfect. If you start with the number 0.3, for example, you end up with the scaled number -3.5. But our four bits can only represent whole numbers, so you have to round -3.5 to -4. You end up losing some of the shades of gray, or so-called precision, in your image. You can see what that looks like in the image below.
This trick isn’t too shabby for the training data. But when we apply it again to the neural network itself, things get a bit more complicated.
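The scaling and rounding steps can be sketched in Python as follows. The function names are invented for illustration, and the rounding relies on Python's built-in round, which sends ties to the nearest even integer (so -3.5 becomes -4, matching the example above).

```python
def quantize_pixel(gray):
    """Map a grayscale value in [0, 1] to an integer in [-8, 7]."""
    scaled = gray * 15 - 8  # 0 -> -8 and 1 -> 7
    return round(scaled)    # 4 bits can only hold whole numbers

def dequantize_pixel(code):
    """Map a 4-bit code back to the grayscale value it stands for."""
    return (code + 8) / 15

code = quantize_pixel(0.3)
print(code)                    # -4
print(dequantize_pixel(code))  # close to 0.27 rather than the original 0.3
```

The gap between the original 0.3 and the recovered value is the lost precision the text describes.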
We often see neural networks drawn as something with nodes and connections, like the image above. But to a computer, these also turn into a series of numbers. Each node has a so-called activation value, which usually ranges from 0 to 1, and each connection has a weight, which usually ranges from -1 to 1.
We could scale these in the same way we did with our pixels, but activations and weights also change with every round of training. For example, the activations might range from 0.2 to 0.9 in one round and from 0.1 to 0.7 in another. So the IBM group developed a new trick back in 2018: rescale those ranges to stretch between -8 and 7 in every round (as shown below), which effectively avoids losing too much precision.
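One simple way to picture per-round rescaling is a min-max stretch of whatever range shows up in that round. The sketch below is an assumption made for illustration, not the IBM group's actual method, and the name rescale_to_4bit is hypothetical.

```python
def rescale_to_4bit(values):
    """Stretch the observed range of `values` across the integers -8..7."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate range: every value gets the same (arbitrary) code.
        return [0 for _ in values]
    step = (hi - lo) / 15  # 16 codes leave 15 gaps between them
    return [round((v - lo) / step) - 8 for v in values]

round_one = [0.2, 0.35, 0.6, 0.9]  # activations spanning 0.2 to 0.9
round_two = [0.1, 0.25, 0.5, 0.7]  # activations spanning 0.1 to 0.7

for activations in (round_one, round_two):
    codes = rescale_to_4bit(activations)
    # In both rounds the smallest value lands on -8 and the largest on 7.
    print(min(codes), max(codes))
```

Because the range is measured again each round, all 16 codes stay in use even as the activations drift.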
But then we’re left with one final piece: how to represent in four bits the intermediate values that crop up during training. What’s challenging is that these values can span several orders of magnitude, unlike the numbers we were handling for our images, weights, and activations. They can be tiny, like 0.001, or huge, like 1,000. Trying to scale this range linearly to between -8 and 7 loses all the granularity at the tiny end of the scale.
After two years of research, the researchers finally cracked the puzzle: borrowing an existing idea from others, they scale these intermediate numbers logarithmically. To see what I mean, here is a logarithmic scale you might recognize, with a so-called “base” of 10, using only four bits of complexity. (The researchers instead use a base of 4, because trial and error showed that this worked best.) You can see how it lets you encode both tiny and large numbers within the bit constraints.
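A rough sketch of logarithmic scaling: snap each number to the nearest power of the base and store only the sign and the exponent. The split into one sign bit and three exponent bits, the exponent range, and the name log_quantize are all assumptions made for illustration; the article specifies only the four bits and the base.

```python
import math

def log_quantize(x, base=4, min_exp=-4, max_exp=3):
    """Snap x to the nearest signed power of `base` (nearest in log terms).

    Assumed layout: 1 sign bit plus 3 bits selecting one of the eight
    exponents min_exp..max_exp. Zero is passed through for simplicity,
    although a real format would need to give it a code of its own.
    """
    if x == 0:
        return 0.0
    exp = round(math.log(abs(x), base))
    exp = max(min_exp, min(max_exp, exp))  # clamp to the representable range
    return math.copysign(float(base) ** exp, x)

# With base 10, both tiny and huge magnitudes keep their order of magnitude.
for x in (0.001, 0.02, 1.0, 30.0, 1000.0):
    print(x, "->", log_quantize(x, base=10))

# With base 4, as the researchers chose, -5 snaps to -4.
print(log_quantize(-5.0))
```

Linear scaling would send 0.001 and 0.02 to the same code, whereas here they stay distinguishable.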
With all these pieces in place, this latest paper shows how they come together. The IBM researchers ran several experiments in which they simulated 4-bit training for a variety of deep-learning models in computer vision, speech, and natural-language processing. The results show a limited loss of accuracy in the models’ overall performance compared with 16-bit deep learning. The process is also more than seven times faster and seven times more energy efficient.