4

Allow me to preface this by saying that I am not a computer specialist. More than anything, I am simply curious about how this works.

In a conversation with a computer science specialist, I was told that a string of decimal digits, such as 73829182093, could be stored on a hard drive using only half the bytes otherwise needed by utilizing a hexadecimal system. According to the specialist, a string of six decimal digits could be stored as 3 bytes, because each digit could be represented by a hex digit, which is only 4 bits in size. Is this correct with regard to storage on a hard drive? Note that I am asking about storage on a hard drive, not the memory needed for display.

My previous understanding was that all information is stored in binary form (0s and 1s) on hard drives, in blocks of 8 bits on modern computers, and that hexadecimal is used to make the information easier to display, so humans aren't required to read through long strings of bits.

If this is true, does that mean that, under such "hexadecimal storage", a block of 8 bits on a hard drive would encode two half-bytes of data instead of one full 8-bit character, like the letter "M"? Or is the half-byte actually stored on the hard drive using the full 8 bits and just shortened when displayed?

Thank you.

J. J.
  • 39

3 Answers

5

My previous understanding was that all information is stored in binary form (0s and 1s) on hard drives, in blocks of 8 bits on modern computers, and that hexadecimal is used to make the information easier to display, so humans aren't required to read through long strings of bits.

That's 100% correct. Hexadecimal is merely a representation of data; there's nothing special about the nature of hexadecimal compared to other formats. It doesn't enable data compression or anything like that.

I think what your friend was referring to is the difference between representing numbers as character strings versus representing numbers as numbers.

For unsigned integers -- a representation of whole numbers in bits (zeroes and ones), from 0 up to a certain fixed maximum -- the largest number that can be represented by N bits is 2^N - 1, assuming you start at 0.

So, if you have 8 bits (a.k.a. 1 byte), you can represent every number from 0 to 255 without losing information; by setting each of those eight bits to 0 or 1, you can unambiguously represent every number from 0 to 255, inclusive. Or from 1 to 256, if you prefer. It doesn't matter. Computers tend to start from 0, though.

If you have 16 bits (2 bytes), you can represent every number from 0 to 65535 (that's 2^16 - 1). With 32 bits, every number from 0 to 4294967295. With 64 bits, every number from 0 to 18446744073709551615 (roughly 1.8 followed by nineteen zeroes).
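If you want to check those maximums yourself, here's a quick Python sketch (nothing special about Python here, it's just a convenient way to evaluate 2^N - 1):

    # The largest value an N-bit unsigned integer can hold is 2**N - 1.
    for n_bits in (8, 16, 32, 64):
        print(f"{n_bits:>2} bits: 0 .. {2**n_bits - 1}")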

You might know from algebra that 2^N is an exponential function. That means that, even though 64 bits is only eight times as many bits as 8 bits, its maximum is not just eight times bigger -- 8 times 255 is only 2040! -- it's exponentially bigger. 2040 is a very small number compared to approximately 18000000000000000000, and 64 bits can store EVERY number from 0 all the way up to that maximum.

One interesting implication of integers stored this way is that the programmer must decide in advance how big the storage needs to be, which, in turn, determines the maximum number that a given integer can represent. If you try to store a number bigger than the storage can handle, you get something called overflow. This happens, for example, if you have an 8-bit integer that's set to 255 and you ask the computer to add 1 to it. Well, you can't represent 256 within an integer whose range is 0 to 255! What usually happens is that it "wraps around" back to the start, and goes back to 0.
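Python's own integers never overflow, so a small sketch can only emulate what an 8-bit counter would do; the modulo below is standing in for the "wraps around" behavior just described:

    # Emulate an 8-bit unsigned counter sitting at its maximum value.
    counter = 255
    counter = (counter + 1) % 2**8   # adding 1 wraps around past 255
    print(counter)                   # 0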

There are programs that perform math in a mode called "arbitrary precision", automatically resizing their storage as the numbers being handled grow. For example, if you multiplied 255 by 100000, the answer would have to grow beyond 8 bits, and beyond 16 bits, but would fit within a 32-bit integer. If you input a number, or performed a math operation that produced a number, larger than the maximum for a 64-bit integer, such a program would allocate even more space for it.
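Python happens to be one of those arbitrary-precision systems, so a rough sketch of that growth looks like this (bit_length() simply reports how many bits the value actually needs):

    product = 255 * 100000
    print(product, product.bit_length())   # 25500000 needs 25 bits: past 16, fits in 32

    huge = 3 * 2**64
    print(huge.bit_length())               # 66 bits: too big for a 64-bit integer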


HOWEVER -- if you represent numbers as character strings, then each digit takes up as much space as a letter in written prose. "ASDF" and "1234" take up exactly the same space. "OneTwoThreeFourFive" (19 characters) takes up the same space as "1234567890123456789". The amount of space required grows linearly with the number of digits (or letters, or characters generally) you have. That's because each character position can represent any of the myriad characters within the character set, and digits are just characters in a character set. One specific sequence of zeroes and ones produces the character "3", a different sequence produces "4", and so on.

Typically characters are stored taking up either 8 or 16 bits, but some character encodings take up a variable number of bytes depending on the character (like UTF-8), or always take up a larger, fixed number of bits (like UTF-32).
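As a small illustration of the variable-width case (assuming you have a Python interpreter handy), UTF-8 spends a different number of bytes per character:

    # UTF-8 uses 1 byte for ASCII characters and more for others.
    for ch in ("M", "é", "€"):
        encoded = ch.encode("utf-8")
        print(ch, len(encoded), "byte(s):", encoded.hex())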

If each character takes 8 bits, "OneTwoThreeFourFive" and "1234567890123456789" both take up 152 bits. But "1234567890123456789" can fit within a 64-bit unsigned integer, which... only consumes 64 bits. That's a savings of 88 bits! And we didn't even use any "data compression" tricks like Zip, 7-Zip, RAR, etc.
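Here's the same comparison done concretely in Python; the numbers are the ones from the paragraph above, and to_bytes is just one way of asking for the 64-bit (8-byte) integer form:

    as_text = "1234567890123456789".encode("ascii")     # one byte per character
    as_int  = (1234567890123456789).to_bytes(8, "big")  # fits in a 64-bit unsigned integer

    print(len(as_text) * 8)   # 152 bits
    print(len(as_int) * 8)    # 64 bits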

allquixotic
  • 34,882
4

My previous understanding was that all information is stored in binary form (0s and 1s) on hard drives, in blocks of 8 bits on modern computers, and that hexadecimal is used to make the information easier to display, so humans aren't required to read through long strings of bits.

Your previous understanding is exactly correct, and I have a feeling you already understand the rest of this answer, but I want to explain a few ideas people often conflate anyway. I'm going to try to be as brief as possible, but it will be tough.

Bytes, Storage

Data is typically stored on a hard drive (or in memory) in 8-bit blocks called bytes. A single bit has two possible values, which, by convention, we refer to as 0 and 1. A single byte therefore has 2^8 = 256 possible values.

I don't actually know why 8-bit blocks are the typical unit. I'm not familiar enough with the history of computer development to know that, but I can at least tell you that we continue to use 8-bit bytes on common systems because we're kind of locked into it at this point and there's no reason to change.

Also, because I know this will come up, in reality data isn't necessarily stored on a drive in one-byte blocks or one byte at a time. Typical hard drives often use larger blocks, etc. However, for the scope of your question, none of this matters. All that matters is that it appears to us that hard drives operate on individual bytes. The actual implementation is an interesting topic but doesn't affect us here: traditionally, humans generally discuss storage in terms of individual bytes, and we probably are human.

Binary, Hexadecimal

The reason we often use binary notation when discussing values of bit-related things like bytes is simply because it makes the most sense. Since a bit has two possible values, this naturally translates to a binary representation of numbers (binary meaning each digit has two possible values, as opposed to the decimal system we typically use every day, where each digit has ten possible values).

The reason we programmers also like to use hexadecimal notation (each digit has sixteen possible values) is because it's really convenient. It just so happens that the range representable by a single hexadecimal digit corresponds exactly to the range representable by four binary digits. And this fits nicely into our 8-bit bytes: two hex digits can represent every value of a byte. It's also a manageable system for our brains; it's really easy to relate hex to binary once you get used to it.

We could have used a base-256 system in writing, but that would be inconvenient, because it's hard to come up with 256 easily typeable, speakable, and memorizable characters. We could have used a base-17 system but that doesn't correspond as neatly to 8-digit binary numbers. So we use hexadecimal, because it makes a ton of sense for us.
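To make the "one hex digit = four bits" correspondence concrete, here's a tiny Python sketch (the byte value 0x61 is just an arbitrary example):

    value = 0b01100001                  # one byte written out in binary
    print(hex(value))                   # 0x61 -- two hex digits for one byte
    print(f"{0x6:04b}", f"{0x1:04b}")   # 0110 0001 -- each hex digit is exactly 4 bits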

Text

We use text a lot, so it's to our benefit to come up with standard ways of representing the characters we use every day as series of bytes. This mapping of characters to bytes is called a "character encoding" or "character set". Of course, we suck at actually agreeing on things, and also many different such mappings were developed independently for many different needs, so we have many character sets, like ASCII, or ISO-8859-1, or JIS.

As an aside, Unicode was invented as an attempt to define a standard that made everybody happy, unifying all of our various character encodings, hence the name "Unicode".

But the point is: text is represented by series of bytes; exactly what each series of bytes means is determined by a character encoding; and the fact that the bytes represent text at all relies on the assumption that the program reading them understands that they're supposed to represent text. ASCII is a convenient one to talk about because each character maps to exactly one byte, and because it's really old, really simple, and was really widely used; despite being grossly inadequate for the global community, it's still very popular and easy to discuss.
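A minimal sketch of that last point, using Python's built-in ASCII codec (the string "Ma" is arbitrary):

    text = "Ma"
    data = text.encode("ascii")   # one byte per character in ASCII
    print(list(data))             # [77, 97] -- the byte values
    print(data.decode("ascii"))   # 'Ma' -- meaningful only because we read it back as ASCII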

Semantics

This is, I'm convinced, the most confusing point to many people.

Bytes are just bytes. They have essentially arbitrary values. What those values actually mean is determined only by context and by what a program reading them actually does with them.

For example, recalling that a byte can take on 256 values, the value 97 (binary 01100001, hexadecimal 61), at the end of the day, can mean many different things:

  • If the byte is treated as an integer value, it's the number 97.
  • If the byte is treated as an ASCII character, it's the letter a.
  • If the byte is treated as a machine instruction for Intel x86 compatible processors, it's the POPA or POPAD instruction (doesn't matter if you don't know what these are, that's not the point).
  • If the byte represents a pixel in a grayscale image, it's a particular shade of gray (a bit darker than middle gray).
  • If the byte is part of some map data for some game, maybe it's a tree or a fence or something.
  • Etc.
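The first two bullets are easy to demonstrate; here's a rough Python sketch that hands the very same byte to two different interpretations:

    raw = bytes([97])                   # the same single byte, value 97 / hex 61

    print(int.from_bytes(raw, "big"))   # read as an integer: 97
    print(raw.decode("ascii"))          # read as ASCII text: 'a'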

Even for numeric values the bit patterns can take on different meanings, for example:

  • Sometimes we're satisfied with the values 0-255. Other times we want to handle negative numbers, so we shift the range of semantic values to -128 through 127 and use the first bit to indicate whether it's negative. Or whatever. The sky is the limit (although, like character encodings, there is a generally agreed-upon standard set of rules for integer values as well).
  • Sometimes, due to various circumstances, we even encode integer values in other ways, e.g. BCD.
  • Sometimes we need to represent larger integers, so we use many bytes. Even this has options; see "endianness".
  • Sometimes we need to represent numbers with fractional parts. Many options here as well; see floating-point and fixed-point.
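To illustrate a couple of those bullets (signedness and endianness), here's a sketch using Python's struct module; the four byte values are arbitrary, chosen only so that the interpretations differ visibly:

    import struct

    raw = bytes([0xFF, 0xFF, 0x7F, 0x7F])   # the same four bytes every time

    print(struct.unpack("<I", raw)[0])   # unsigned, little-endian: 2139095039
    print(struct.unpack(">I", raw)[0])   # unsigned, big-endian:    4294934399
    print(struct.unpack(">i", raw)[0])   # signed,   big-endian:    -32897
    print(struct.unpack("<f", raw)[0])   # IEEE 754 float, little-endian: ~3.4e38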

The point of all this is that a byte is just a byte; it means nothing until you have context. If a program writes some bytes with some intended meaning, only a program that reads them and interprets them as having that same meaning will be able to make proper sense of them.

Putting This All Together

So now, relating all of this back to your question, this should actually be really simple:

  • Your friend is referring to the idea of storing a number as a textual representation of its value in hexadecimal. For example, the value 97 in hex is 61. That's a two-digit number: the character "6" followed by "1". Encoded as ASCII, that would be two bytes: the value 54 followed by the value 49 (decimal). But that only has meaning if, when you read those bytes back, you understand them to be two ASCII-encoded hexadecimal digits.
  • You could also just store the value 97 directly. That's only one byte -- half the size of the previous option. But of course, that only has meaning if, when you read that byte back, you understand it as corresponding directly to an integer value.

Typically, we programmers would probably choose the second option, but it really, really depends on context. For example, in an HTML document, which is designed to be human-readable text, we'd still store an attribute like width="97". Sure, it might take up less space to use some tighter representation there, but then it'd be a pain to write HTML. So it really depends on the context and use case.
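Here's a rough sketch of those two options in Python, sticking with the value 97 from the example (format(value, "x") just produces the hex digits as text):

    value = 97

    # Option 1: the hex digits "61" stored as ASCII text -- two bytes.
    as_text = format(value, "x").encode("ascii")
    print(list(as_text))                # [54, 49]

    # Option 2: the integer value itself -- one byte.
    as_byte = value.to_bytes(1, "big")
    print(list(as_byte))                # [97]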

I hope at least some of this makes sense.

Jason C
  • 11,385
1

a string of six decimal digits could be stored as 3 bytes

That sounds like BCD (binary-coded decimal) representation, as opposed to numeric ASCII characters (a full byte per digit). Four bits are used to represent the values 0 through 9; the other six possible 4-bit values are undefined/invalid.
BCD values can be unpacked (one BCD digit per byte) or packed (two BCD digits per byte).
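A small sketch of packed BCD in Python, using the first six digits of the question's example string (the packing here is the usual two-digits-per-byte convention, not any particular standard):

    digits = "738291"   # six decimal digits

    # Packed BCD: one decimal digit per 4-bit nibble, two per byte.
    packed = bytes((int(hi) << 4) | int(lo)
                   for hi, lo in zip(digits[0::2], digits[1::2]))

    print(packed.hex())   # '738291' -- each nibble holds one decimal digit
    print(len(packed))    # 3 bytes instead of 6 ASCII bytes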

The advantages of using BCD versus binary are convenience for human display (i.e. trivial conversion) and no loss of accuracy for decimal fractions (e.g. one tenth is an infinitely repeating fraction in binary).
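Python's decimal module isn't literally BCD, but it makes the decimal-fraction point visible: one tenth has no exact binary representation, while a decimal representation keeps it exact:

    from decimal import Decimal

    print(0.1 + 0.2)                         # 0.30000000000000004 (binary floating point)
    print(Decimal("0.1") + Decimal("0.2"))   # 0.3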

Calculators typically use BCD representation instead of binary. The long string of digits on a credit card or a security/access card is typically encoded as a BCD string on the magnetic stripe or in the transmitted RF packet.
Digital computers typically use binary representation for calculations and storage, although a CPU might have instructions to perform BCD arithmetic.

sawdust
  • 18,591