Content Addressable Storage

If you’re working on a modern JavaScript application and use a build tool like Vite to build and deploy your code, you can open the dist folder and find something like this:

JavaScript
src/main.js
src/components/Header.vue
src/assets/logo.png

// After build 
dist/assets/main.a1b2c3d4.js      
dist/assets/Header.8e7f9a0b.js    
dist/assets/logo.5c4d3e2f.png     

A few questions come to mind: why is each file name appended with a seemingly random hash, how are these hash values generated, and why do we need them at all? By the end of this post, we will have a solid answer to each of these questions.

First off, let’s presume that our generated build files keep exactly the same names as the source files, so the built main.js is still called main.js. There are a few problems with this static file name approach:

When a browser visits your website, it downloads and caches this main.js file. Later, even if you update main.js and deploy the code change, the browser thinks it already has the file and never bothers to download it again:

JavaScript
<script src="/assets/main.js"></script>

// Old main.js (cached in user browser)
function calculateTotal(price) {
  return price * 1.05; // 5% tax
} 

// new main.js (in your server)
function calculateTotal(price) {
  return price * 1.08; // 8% tax rate
}

So our new function with the updated tax rate never runs in the user’s browser!

Another problem arises if we instead change the file name on every deployment: does the browser really need to fetch the content again when two versions have identical content and only the file names differ? How can we de-duplicate files in this situation?

Content-addressable storage (CAS) is a technique that solves these problems. Instead of identifying files by name, we run a hash function over the file content and use the result as the file’s address or reference. This gives us some useful guarantees:

  • Two files with identical content will generate the same hash value.
  • A tiny change in the input creates a dramatic change in the output (the so-called “avalanche effect”).

Coming back to the file names generated by the Vite build tool: they are a perfect use of CAS in web development. Once a file with a given hash in its name has been downloaded, the browser will not bother to fetch it from our server again as long as the file name stays the same. If the file name does change, indicating that the content has changed, the browser fetches the new file as expected.

CAS is also useful for content deduplication. For example, in a public file storage system, if users upload 50 files with identical content, then instead of storing 50 copies in 50 different locations, the system can store a single file under a single hash address, saving a lot of space!

Git also stores everything (files, directories, commits) using content hashes (SHA-1, or SHA-256 in newer repositories); these hashes are then used as addresses to retrieve the content later.
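
As a rough illustration of that idea, here is a sketch of how a Git-style blob ID can be computed: Git hashes a small header ("blob", the content length, and a NUL byte) followed by the raw content. The sha1 crate used here is just an assumption for this standalone example; it is not part of the storage system we build later.

Rust
use sha1::{Digest, Sha1};

/// Sketch: compute a Git-style blob ID for some content.
/// Git hashes the header "blob <len>\0" followed by the raw bytes.
fn git_blob_id(content: &[u8]) -> String {
    let mut hasher = Sha1::new();
    hasher.update(format!("blob {}\0", content.len()).as_bytes());
    hasher.update(content);
    hex::encode(hasher.finalize())
}

fn main() {
    // The same content always yields the same blob ID on every machine,
    // which is exactly what makes it usable as an address.
    println!("{}", git_blob_id(b"hello\n"));
}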

Now that we have a basic idea about CAS and its characteristics, let’s get some hands-on experience by building a content-addressable storage system from scratch. We will start simple and then advance to more sophisticated features. In the first phase, let’s implement the basic concept, in which we address content by its cryptographic hash:

Rust
use sha2::{Sha256, Digest};
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct ContentHash(String);

impl ContentHash {
    pub fn from_bytes(data: &[u8]) -> Self {
        let mut hasher = Sha256::new();
        hasher.update(data);
        let result = hasher.finalize();
        ContentHash(hex::encode(result))
    }

    pub fn from_str(data: &str) -> Self {
        let bytes = data.as_bytes();
        Self::from_bytes(bytes)
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

We define a struct named ContentHash, which wraps the hex-encoded hash string. To calculate the content hash, we use SHA-256, a deterministic one-way hash function that maps input data of arbitrary size to a fixed-size output of 256 bits (32 bytes).

Deterministic means that the same input always yields the same output, no matter how many times we compute it: [math]H(x) = H(x)[/math] for any input [math]x[/math]. It’s called a one-way hash function because it’s computationally easy to go from the input to the hash, but computationally infeasible to go from the hash back to the original input.

In our from_bytes function, we take the byte input and run SHA-256 over it, producing the 32-byte (256-bit) hash as a byte array. At the end, we use hex::encode to convert the 32-byte binary array into a 64-character hex string and return an instance of ContentHash.
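
To see the deterministic and avalanche properties in action, here is a small usage sketch of ContentHash (the example strings are arbitrary):

Rust
let a = ContentHash::from_str("hello world");
let b = ContentHash::from_str("hello world");
let c = ContentHash::from_str("hello world!"); // one extra character

// Deterministic: the same input always yields the same address
assert_eq!(a, b);

// Avalanche effect: a tiny change produces a completely different hash
assert_ne!(a, c);

println!("{}", a.as_str()); // a 64-character hex string
println!("{}", c.as_str()); // shares no obvious pattern with the one above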

Simple In-memory Storage

Now, let’s create a SimpleStore that will let us add and retrieve the content:

Rust
use std::collections::HashMap;

pub struct SimpleStore {
    objects: HashMap<ContentHash, Vec<u8>>,
}

impl SimpleStore {
    pub fn new() -> Self {
        Self {
            objects: HashMap::new()
        }
    }

    pub fn put(&mut self, data: Vec<u8>) -> ContentHash {
        let hash = ContentHash::from_bytes(&data);
        self.objects.insert(hash.clone(), data);
        hash
    }


    pub fn get(&self, hash: &ContentHash) -> Option<&Vec<u8>> {
        self.objects.get(hash)
    }

    pub fn contains(&self, hash: &ContentHash) -> bool {
        self.objects.contains_key(hash)
    }
}

This basic implementation demonstrates the fundamental principles: identical content always produces the same hash, different content produces (for all practical purposes) different hashes, and the store automatically deduplicates content without any explicit logic.
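
A quick usage sketch shows the deduplication in practice (the byte strings are arbitrary):

Rust
let mut store = SimpleStore::new();

let hash1 = store.put(b"same bytes".to_vec());
let hash2 = store.put(b"same bytes".to_vec()); // re-inserts under the same key

// Both puts return the same address, and the map holds a single copy
assert_eq!(hash1, hash2);
assert!(store.contains(&hash1));
assert_eq!(store.get(&hash1).map(|v| v.as_slice()), Some(&b"same bytes"[..]));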

We can now see some problems with in-memory storage. Space is limited: a typical computer has around 16-32GB of RAM, while a real-world storage service needs far more than that. And all data is lost when the power goes off.

Another problem with the current in-memory storage is that if we have 2 instances of SimpleStore running in 2 different processes, storing some content in one process doesn’t prevent the same content from being stored again in the other. In a modern operating system, each process has its own protected memory space, which prevents one program from accessing or corrupting another program’s memory:

Rust
// Process 1 : Web Server
fn main() {
   let mut store1 = SimpleStore::new();
   
   // this hashmap lives in Process 1's memory space
   // Address space: 0x7F8A2C000000 - 0x7F8A2CFFFFFF
   
   let video_data = load_file("vacation.mp4"); // 500MB file
   let hash = store1.put(video_data); // Stored in Process 1's RAM
} 

// Process 2: Background Worker
fn main() {
   let mut store2 = SimpleStore::new();
   
   // This HashMap lives in Process 2's memory space  
   // Address space: 0x7F8B1D000000 - 0x7F8B1DFFFFFF
   // Completely separate from Process 1!
   
   let same_video_data = load_file("vacation.mp4"); // Same 500MB file
   let hash = store2.put(same_video_data); // Store AGAIN in Process 2's RAM
}

We can extend our initial SimpleStore to use more reliable storage. Let’s take advantage of the file system: we get far more space than the in-memory solution, data is no longer wiped out on restart, and, last but not least, 2 processes storing the same content will now point to the same location instead of 2 different ones.

Persistent Storage

Let’s create the PersistentStore where we implement persistent storage with a file system layout optimized for performance:

Rust
use std::path::{Path, PathBuf};
use std::io;

pub struct PersistentStore {
    root: PathBuf,
}

impl PersistentStore {
    pub fn new<P: AsRef<Path>>(root: P) -> io::Result<Self> {
        let root = root.as_ref().to_path_buf();
        std::fs::create_dir_all(&root)?;
        Ok(Self { root })
    } 
    
    // Convert hash to filesystem path with optimized directory structure
    // e.g., "a1b2c3d4..." becomes "objects/a1/b2/c3d4..."

    fn hash_to_path(&self, hash: &ContentHash) -> PathBuf {
        let hash_str = hash.as_str();
        let mut path = self.root.clone();
        path.push("objects");

        // Use first 4 characters for nested directory structure
        // This gives us 16^4 = 65,536 possible directories so that the files can be spread out evenly 
        if hash_str.len() >= 4 {
            path.push(&hash_str[0..2]);
            path.push(&hash_str[2..4]);
            path.push(&hash_str[4..]);
        } else {
            path.push(hash_str);
        }
        path
    }
}

We start the improvement by saving the data to the file system. Even modern file systems such as NTFS or APFS slow down when a single directory contains thousands or millions of files: lookup, traversal, and metadata operations become slower, and tools like ls, rm, and find take longer.

That’s why the hash_to_path function exists in our implementation: instead of putting all the files into a single directory, which would hurt performance, we break the hash into nested folders. For example, given a hash like a1b2c3d4e5f6789, hash_to_path returns the path /objects/a1/b2/c3d4e5f6789.

Why is this better? There are a few reasons. First, SHA-256 output is uniformly distributed over the 256-bit space, so every value is equally likely and the hash looks random; files will therefore spread evenly across the directories.

Secondly, the output of our ContentHash is a fixed-length 64-character hex string, so each character encodes 4 bits and 2 characters encode one byte. The first two directory levels are therefore derived from the first 2 bytes of the hash, which gives us a total of [math]256 * 256 = 65,536[/math] different directories. 65,536 directories, each containing ~1,000 files, perform significantly better than a single directory with 65 million files. As long as the files are evenly distributed, this keeps lookups fast and directory sizes small.
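
As a quick sanity check of that fan-out, here is a small standalone sketch (reusing the ContentHash type from earlier, with arbitrary synthetic inputs) that counts how many hashes land in each two-character top-level prefix; with SHA-256’s uniform output, the bucket sizes stay close to each other:

Rust
use std::collections::HashMap;

fn main() {
    let mut buckets: HashMap<String, u32> = HashMap::new();

    // Hash 100,000 synthetic inputs and bucket them by the first directory level
    for i in 0..100_000u32 {
        let hash = ContentHash::from_str(&format!("file-{i}"));
        let prefix = hash.as_str()[0..2].to_string();
        *buckets.entry(prefix).or_insert(0) += 1;
    }

    // 256 possible first-level prefixes, roughly 390 entries each if uniform
    let min = buckets.values().min().unwrap();
    let max = buckets.values().max().unwrap();
    println!("buckets: {}, min: {}, max: {}", buckets.len(), min, max);
}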

Next, we write the functions for adding and retrieving data from the PersistentStore:

Rust
// assumption: async_fs is an async file-system API, e.g. `use tokio::fs as async_fs;`
impl PersistentStore {
    ...

    pub async fn put(&self, data: Vec<u8>) -> io::Result<ContentHash> {
        let hash = ContentHash::from_bytes(&data);
        let file_path = self.hash_to_path(&hash);

        // check if file already exists (deduplication)
        if file_path.exists() {
            return Ok(hash);
        }

        // create parent directories
        if let Some(parent) = file_path.parent() {
            async_fs::create_dir_all(parent).await?;
        }


        // write data atomically using temporary file 
        let temp_path = file_path.with_extension("tmp");
        async_fs::write(&temp_path, &data).await?;
        async_fs::rename(&temp_path, &file_path).await?;

        Ok(hash)
    }


    pub async fn get(&self, hash: &ContentHash) -> io::Result<Option<Vec<u8>>> {
        let file_path = self.hash_to_path(hash);

        match async_fs::read(file_path).await {
            Ok(data) => {
                // verify content integrity 
                let computed_hash = ContentHash::from_bytes(&data);
                if computed_hash == *hash {
                    Ok(Some(data))
                } else {
                    // corruption detected
                    Err(io::Error::new(io::ErrorKind::InvalidData, "corrupted data"))
                }
            }
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }


    pub async fn contains(&self, hash: &ContentHash) -> bool {
        self.hash_to_path(hash).exists()
    }

}

We’re storing content-addressed data under its hash. If either of the things below happens during a write, we’ll be in trouble:

  • Power loss or a crash during the write -> a partial file
  • Concurrent writes (e.g., two processes writing the same hash) -> a race condition

A later reader would then see half-written or corrupted data. That’s why in our put implementation, instead of computing the hash and writing straight to the final file, we create a temporary file, write to it, and then rename it to the final destination.

Writing to a temporary file and then renaming it are 2 different operations, but only the latter needs to be atomic: even if the first step fails, the destination is left untouched. Rust’s fs::rename(temp, final) is atomic on a POSIX-compliant file system, which means that at any point the destination file either:

  • doesn’t exist yet
  • or exists in its complete form

This technique is sometimes called the “write-then-commit” protocol, a standard approach in systems programming to achieve effective atomicity on top of non-atomic operations.
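
For completeness, here is a minimal usage sketch of PersistentStore, assuming a Tokio runtime and that async_fs is an alias for tokio::fs:

Rust
#[tokio::main]
async fn main() -> std::io::Result<()> {
    let store = PersistentStore::new("./cas-data")?;

    // Store some content and get back its address
    let hash = store.put(b"hello, content addressing".to_vec()).await?;
    println!("stored as {}", hash.as_str());

    // Retrieving by address also verifies integrity against the hash
    let data = store.get(&hash).await?;
    assert_eq!(data.as_deref(), Some(&b"hello, content addressing"[..]));

    Ok(())
}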

The storage system that we’ve been building hashes the entire file content at once. This raises some questions: does a large file have to be loaded and hashed in full every time, and does a small change require re-storing the whole file? The answer to both questions is yes, which becomes clearer if we look at an example:

Rust
impl PersistentStore {
...
    pub async fn put(&self, data: Vec<u8>) -> io::Result<ContentHash> {
      // problem: entire file must be in memory at once
      let hash = ContentHash::from_bytes(&data); // hash entire file
      let file_path = self.hash_to_path(&hash);

      if file_path.exists() {
        return Ok(hash);
      }
      
      async_fs::write(file_path, &data).await?; // write entire file at once
      Ok(hash)
    }
}

We read and write the entire file at once. Here is a usage pattern that will break:

Rust
let video_file = tokio::fs::read("movie.mp4").await?; // 8GB file!
let hash = store.put(video_file).await?; // needs 8GB+ RAM!

Indeed, this becomes a real problem once users start uploading large files, and multiple concurrent uploads could easily exhaust the server’s memory. It’s not only about memory: hashing also takes longer and longer as the file size grows. And if an upload to our CAS system is interrupted halfway through, nothing at all is saved on our server.

Chunk-based Storage

Now it’s natural to look for a better solution that removes or at least mitigates these problems. The answer at this point is “chunking”. Instead of treating files as atomic units, we treat them as a collection of smaller, independently addressable chunks.

First, we define a struct that describes a chunked file (a file split into multiple smaller parts):

Rust
// Assumed chunk size for this sketch; the example below uses 64KB chunks
const CHUNK_SIZE: usize = 64 * 1024;

#[derive(Serialize, Deserialize)]
struct ChunkedFile {
    total_size: u64,
    chunks: Vec<ContentHash>,
}

pub struct ChunkedStore {
    root: PathBuf,
}

Next, we define the most important function in this phase, store_file, which takes a vector of bytes as input. If the file is smaller than or equal to the configured chunk size, we store it as a single chunk. Otherwise, we split the file into multiple fixed-size chunks before storing them:

Rust
impl ChunkedStore {
    pub async fn store_file(&self, data: Vec<u8>) -> io::Result<ContentHash> {
        if data.len() <= CHUNK_SIZE {
            // small file: store directly
            return self.store_chunk(data).await;
        }

        // large file: split into chunks
        let mut chunk_hashes = Vec::new();

        for chunk_data in data.chunks(CHUNK_SIZE) {
            let chunk_hash = self.store_chunk(chunk_data.to_vec()).await?;
            chunk_hashes.push(chunk_hash);
        }

        // create metadata pointing to all chunks
        let chunked_file = ChunkedFile {
            chunks: chunk_hashes,
            total_size: data.len() as u64,
        };

        // store metadata
        let metadata = bincode::serialize(&chunked_file).unwrap();
        self.store_chunk(metadata).await
    }

    async fn store_chunk(&self, data: Vec<u8>) -> io::Result<ContentHash> {
        let hash = ContentHash::from_bytes(&data);
        let file_path = self.chunk_path(&hash);

        if file_path.exists() {
            return Ok(hash);
        }

        if let Some(parent) = file_path.parent() {
            async_fs::create_dir_all(parent).await?;
        }

        async_fs::write(file_path, data).await?;
        Ok(hash)
    }

    fn chunk_path(&self, hash: &ContentHash) -> PathBuf {
        let hash_str = hash.as_str();
        let mut path = self.root.clone();
        path.push("chunks");
        path.push(&hash_str[0..2]);
        path.push(&hash_str[2..]);
        path
    }
}

One interesting detail here is that we need to calculate the hashes of all the chunks before we can write the metadata. This metadata contains the hashes of every chunk in the file and acts like a recipe for reassembling it. The metadata is itself hashed and stored at the end; later, we can use the hash of the metadata to reference all the chunks again.

Let’s assume we have a large file that store_file needs to break into 4 chunks; the stored chunks could look like this:

Rust
/storage/chunks/ab/c123...  <- Just 64KB of raw data (bytes 0-65535)
/storage/chunks/de/f456...  <- Just 64KB of raw data (bytes 65536-131071)  
/storage/chunks/78/9xyz...  <- Just 64KB of raw data (bytes 131072-196607)
/storage/chunks/11/1aaa...  <- Just 8KB of raw data (bytes 196608-204799)

After the last chunk is saved (in this case, the chunk whose hash ends with 1aaa), the metadata is:

Rust
{
   chunks: ["abc123", "def456", "789xyz", "111aaa"],
   total_size: 200000
}

This metadata is serialized and then stored as a chunk of its own, at a location derived from its hash:

Rust
/storage/chunks/ma/in999...  <- ChunkedFile metadata:

Later, if we want to get the file content back, we use the hash of the metadata, deserialize it, read the chunks array to find all the chunk hashes, and then retrieve each chunk as usual.
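
Retrieval isn’t implemented in this post, but a minimal sketch could look like the following, assuming the caller already knows that the hash refers to a chunked file’s metadata (the name load_chunked_file is hypothetical):

Rust
impl ChunkedStore {
    /// Hypothetical sketch: reassemble a large file from its metadata hash.
    pub async fn load_chunked_file(&self, metadata_hash: &ContentHash) -> io::Result<Vec<u8>> {
        // read and deserialize the "recipe"
        let metadata_bytes = async_fs::read(self.chunk_path(metadata_hash)).await?;
        let chunked_file: ChunkedFile = bincode::deserialize(&metadata_bytes)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e.to_string()))?;

        // fetch every chunk in order and concatenate them back into the original file
        let mut data = Vec::with_capacity(chunked_file.total_size as usize);
        for chunk_hash in &chunked_file.chunks {
            let chunk = async_fs::read(self.chunk_path(chunk_hash)).await?;
            data.extend_from_slice(&chunk);
        }
        Ok(data)
    }
}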

The chunks are also automatically deduplicated. If 2 different files share a common chunk, that chunk is stored only ONCE but referenced by BOTH files:

Rust
// File A: [chunk1, chunk2, chunk3]
let file_a = vec![/* some data */];
let hash_a = store.store_file(file_a).await?;

// File B: [chunk1, chunk4, chunk3]  // Shares chunk1 and chunk3 with File A!
let file_b = vec![/* different data that happens to share some chunks */];
let hash_b = store.store_file(file_b).await?;

// Chunks (no file metadata):
//   chunk1: just raw data
//   chunk2: just raw data  
//   chunk3: just raw data
//   chunk4: just raw data

// File metadata:
//   file_a_metadata: { chunks: [chunk1, chunk2, chunk3], size: X }
//   file_b_metadata: { chunks: [chunk1, chunk4, chunk3], size: Y }

// Note: chunk1 and chunk3 are stored only ONCE but used by TWO files!

Conclusion

Content-addressable storage represents a fundamental shift from location-based to content-based addressing. Neither approach is strictly better than the other; which one to pick depends on the task at hand. CAS enables automatic deduplication, since identical content is stored only once, and it also helps verify data integrity, since any change to the content produces a different hash. CAS isn’t just an academic concept; it’s the backbone of systems you use daily: Git addresses commits and objects with SHA-1 (or SHA-256) hashes, Docker images are stored as content-addressable layers, and modern build tools like Vite leverage CAS for efficient browser caching, among many other examples.
