Computers understand binary. All our instructions and data is stored in binary format. We have abstractions that take data structures like strings, and lists, and instructions and store them as binary. Today we are going to explore how this conversion happens, and what your data structures look like under the hood. Understanding the memory layout is crucial to building better software where we understand our tradeoffs and don't just memorize time complexity of operations.
Before we dive into how data is represented we should clarify our abstraction of a computers memory. For todays discussion we can divide a computers memory into just 2 categories RAM (Random Access Memory) and Disk Storage (Hard disks and Solid State Disks). The RAM is a form of volatile memory (data in RAM is lost when you switch you computer off). The RAM provides fast access to data and instructions being used by the processor. Disk Storage is a non-volatile, data is persisted even after the power is turned off. Access to data on disk is slower than access to RAM (we will explore memory hierarchy in further articles in more detail).
Each process/program that runs on your computer is assigned their own separate virtual memory space by the Operating System. This provides memory protection and isolation between processes. Each virtual memory space is divided into different segments: code segment,, data segment, stack segment, and the heap segment. Each segment is a contiguous range of virtual memory addresses for a specific purpose. The data structures we make are placed in either the stack or the heap segments. The stack memory allocation and deallocation are handled by the compiler generating code that grows and shrinks the stack as functions are called and local variables are made. Memory allocation and deallocation on the heap is handled by requesting the OS (if heap is memory currently assigned is not enough) which is why allocation and deallocation on stack is faster than on the heap.
To a process/program the memory layout according to virtual memory is a continuous list of byte (8 bit) chunks that is composed of physical memory from the RAM and Disk. Using a technique called paging, and caching the Operating System (OS) provides an abstraction that lets us use memory in the form of a list of addresses that we can place data in. This is known as Virtual Memory.
Wait what's a byte? Is it the same as a bit? Nope! A byte is basically a group of 8 bits. 8 bits means each byte can represent 2^8 (256) unique values. I tried finding out why we settled on using 8 bits as the standard of smallest addressable memory and not say like 4. Turns out it was chosen as a tradeoff between how many values could be represented while keeping a balance of complex addressing schemes. It then eventually became a standard, some systems are built to work with even smaller addressable units like 'nibble' but byte is the standard.
Character Representation
A 'char' (character) which is the smallest data type you can make is of 1 byte.
char smallest = 'c'; printf("A character which is the smallest data type you can make is of %zu bytes \n", sizeof(smallest));
c is represented as: 01100011. 01100011 can also be represented in hexadecimal as => 0x63. How is an alphabet being written as a binary number that is 99 in decimal? Well, what we are basically doing is using binary numbers in different contexts to convey different meanings. The use of representing 'c' as 99 decimal / 01100011 binary is known as encoding: Ascii encoding is the most widely used, and it works by assigning numeric values to individual characters.
// we can get the virtual address of the variable smallest like this char* addressOfSmallest = &smallest;
The data for the variable 'smallest' in the above example has a size of 1 byte and is stored at the following virtual memory address: 0x7ff7b3b64567. That weird address you see above is called hexadecimal and it is just a more human readable form of binary. The address that it represents is a block of memory in the virtual memory address.
Numbers
Unsigned integers are represented as the standard binary, but the standard understanding of binary encoding ignores the representation of negative numbers. Computers can represent more than just postiive integers that fit in 4 byte(s). Well for that we have to expand our encoding of binary numbers and how we read them. We need a way to show number that are negative, numbers that are decimals, and number that need to be larger and smaller that 4 byte(s).
unsigned int tres = 3;
3 is of size 4 byte(s) represented as: 00000000 00000000 00000000 00000011
int negTres = -3;
The negative number: -3 is of size 4 byte(s) represented as: 11111111 11111111 11111111 11111101
Do you notice something unusual? We flipped the sign from 3 but the representation does not even look close to what the representation of 3 looked like. This is known as complement of 2 encoding of binary numbers and can be used to form negative numbers! In 2's complement encoding the first bit is used to represent the sign, negative is 1 and positive is 0, the rest of the binary encoding can be expanded as regular binary to decimal conversion.
float cuteFloaty = 5.3;
A floating point number is one of the ways we can store umbers that have a fractional part such as 5.30. The floating point number for 5.3 is represented as: 01000000 10101001 10011001 10011010. The most common way to encode these numbers with fractional parts is the IEEE floating-point format. The IEEE format is a topic of its own but the core concept is to divide different bits into different responsibilites. You can read more about it here: https://mathcenter.oxford.emory.edu/site/cs170/ieee754/
Now certain data types that we need require using more bytes, these data types include things like 'double' and 'long', which depending on the system will use more an assigned number of bytes to store the data type. Here is a list of C primitive data types and how much space they take on different architectures: https://www.ccoderun.ca/programming/2017-03-10_sizeof/ .
Endianness (No not the same thing as Indian)
Before we move to learning about the more fun data structures we need to address Endianness. Endianness refers to which direction the binary number is laid out in memory. Big Endian the most significant bit is stored first followed by the lesser significant bits in virtual memory. In Little endian the layout is opposite. Endianness is important to address because when dealing with data being exchanged across systems or interacting with external data, we need to know endianness to interpret it correctly. Most machines today support the little endian architecture by default on X86 and X86_64 architecture.
Strings
// what does a string look like const char* text = "this is a string";
Strings are represented as 'char*' in C that terminate with the null character '\0'.
The char* means that a string is a pointer to a character in memory. A string is hence 'composed' of characters that point to each other and finish with a pointer to a special character called the null character '\0'.
A pointer in C stores the memory address of a variable, we can address the value at that address by either 'dereferencing' a pointer by calling *pointer' or by using array style indexing. Array style indexing means if we have a string 'char* mystr = 'lalala' I can access the value at the first character of mystr via mystr[0] and get the character 'l', the next one via mystr[1] and so on.
Now if I try to dereference something out of the bound, then I run into undefined behavior, the program may crash, there may be segmentation fault, or maybe nothing.
The undefined behavior is not defined by C standard since we are accessing memory outside what was assigned for the string
We can iterate over the string we wrote above via memory address and dereferencing with array indexing.
for (size_t i = 0; i < strlen(text); i++) { printf("%c\n", text[i]); }
Output:
t h i s i s a s t r i n g
Notice how in the string we just know that the next character is present in the next byte in virtual memory! This is because memory allocated for a string is continuous chunk. This means that we can get to any part of our string in constant time as long as we know the offset adn the starting virtual address. This moving from the virtual address to a different byte in memory is known as pointer arithmetic, and we use it to iterate over the characters of our string, or reach different parts of our string.
Now we have concluded that strings are collections of characters in one continuous chunk of virtual memory, and we can reach any part of a string in constant time as long as we know the starting address and the offset we want to get to. We can now graduate to Arrays! Arrays are exactly like string except that instead of characters an array may be comprised of any type.
Arrays
const int myarr[] = {10, 15, 4500}; for (int i = 0; i < 3; i++) { printf("%d\n", *(myarr+i)); }
Output:
10 15 4500
Arrays are composed of two things, a type, and the capacity of the array. The type and capacity of the array are used to figure out how much contiguous virtual memory space needs to be allocated. For an array of 3 integers, we would need memory = size of an int * 3 = 4*3 = 12 bytes. If we want to go from one element to another in the array we need to hop to the address 4 bytes at a time! Luckily the C compiler scales our step appropriately based on the type which lets us iterate over and access items in our array without explicitly jumping by N bytes. Given the structure of an array in the virtual memory, things like random access can be done in constant time.
With the primitives covered, and arrays covered, there's only one thing we need to know before we can start looking into the data structures we build like linked lists, hashmaps, sets, and trees. We need to cover what a Struct / Structure looks like in memory!
Structs
A struct is a user defined data type that is composed of multiple primitives, or other user defined data types. Using a struct we can create instances that follow the struct's schema/structure. We could make a struct that holds an int along with a list of 3 characters.
If we wanted to create a linked list we could create a node with an int that signifies a nodes value, and have a field that holds a pointer to another node in memory! We can literally build our own data types, and write methods that interact with these types to create a higher layer of abstraction, how exciting!
struct Node { int value; struct Node* next; };
struct Node linkedList = {12, NULL}; printf("size of node: %zu bytes \n", sizeof(linkedlist));
Output:
size of node: 16 bytes
Let's take a look into how these structs are stored in the memory first though. We have a linked list node struct with an int and a pointer. It takes up: 16 bytes in memory. This is made up of the 4 bytes from the int field, and the 8 bytes of the pointer to the next node field along with some padding for what is called memory alignment.
Similar to arrays, structs are also stored in contiguous sections of memory. Their type is only a concept that exists at compile time! Once your code is compiled into machine code, types do not exist. The compiler uses the types to generate correct machine code but nothing further. When you access the struct's field in your code the compiler uses the knowledge of the sizes of the structs fields, along with pointer arithmetic for getting to the desired field. By understanding how structs are compiled and laid out in memory, we can be mindful of things like arranging writing structs that are suitable for memory alignment and save bytes in memory constrained environment. The fact that the type system is such a beautiful abstraction still absolutely blows my mind!
Now that we know structs and how they look in memory we are ready to tackle harder data structures that use pointers and structs along with primitives to build beautiful abstractions. In the next part we will use the primitive types, arrays, structs, and functions to build and dive deep into linked lists, dynamically sized arrays, and Hashmaps.
Well done!!