Conversation

@ZacWalk (Member) commented Dec 9, 2025

This PR introduces thread_aware_cache, a high-performance in-memory cache designed to maximize read throughput and minimize latency on multi-socket NUMA architectures.

  • Thread-Affinity Sharding
    Unlike traditional caches that shard by key hash, this implementation shards by thread affinity. This ensures data remains physically local to the executing thread, significantly reducing cross-socket interconnect traffic.

  • "Read-Through" Replication
    This implementation uses a replication strategy that automatically promotes hot data to the local shard upon access. This is supported by a lock-free Bloom filter that efficiently short-circuits negative lookups across shards.

  • SIEVE Eviction Algorithm
    Capacity management is handled by the SIEVE algorithm, which provides scan-resistance and high concurrency by eliminating the need for heavy write locks during cache reads; a minimal sketch of the eviction loop follows below.
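
For context, here is a minimal single-threaded sketch of SIEVE's eviction loop (illustrative only; the crate's sieve.rs implements the real, concurrent version, and all names here are hypothetical):

```rust
/// Single-threaded SIEVE sketch: a queue of entries with a "visited" bit
/// and a hand that sweeps from oldest to newest.
struct Node<K> {
    key: K,
    visited: bool,
}

struct Sieve<K> {
    queue: Vec<Node<K>>, // index 0 = oldest, back = newest
    hand: usize,         // persists across evictions
}

impl<K: PartialEq> Sieve<K> {
    fn insert(&mut self, key: K) {
        // New entries start unvisited at the newest end.
        self.queue.push(Node { key, visited: false });
    }

    fn on_hit(&mut self, key: &K) {
        // A hit only flips a bit: no list reordering, which is why reads
        // don't need a heavy write lock. (A real implementation finds the
        // node via a hash map rather than a linear scan.)
        if let Some(n) = self.queue.iter_mut().find(|n| &n.key == key) {
            n.visited = true;
        }
    }

    fn evict(&mut self) -> Option<K> {
        while !self.queue.is_empty() {
            if self.hand >= self.queue.len() {
                self.hand = 0; // wrap back to the oldest entry
            }
            if self.queue[self.hand].visited {
                // Recently used: give it a second chance and keep scanning.
                self.queue[self.hand].visited = false;
                self.hand += 1;
            } else {
                // Untouched since the last sweep: evict it.
                return Some(self.queue.remove(self.hand).key);
            }
        }
        None
    }
}
```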

@ralfbiedert (Collaborator) commented:

What is the relation between this and the cache Schuyler is working on? Do they serve different use cases? Also, I don't quite understand how the cache does thread migration; it doesn't seem to implement the ThreadAware trait.

codecov bot commented Dec 9, 2025

Codecov Report

❌ Patch coverage is 94.37052% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.3%. Comparing base (32e413b) to head (68e98e6).
⚠️ Report is 43 commits behind head on main.

Files with missing lines                  Patch %   Lines
crates/thread_aware_cache/src/cache.rs    87.5%     30 Missing ⚠️
crates/thread_aware_cache/src/sieve.rs    94.8%     11 Missing ⚠️
crates/thread_aware_cache/src/bloom.rs    94.3%      7 Missing ⚠️
crates/thread_aware_cache/src/shard.rs    94.6%      6 Missing ⚠️
crates/thread_aware_cache/src/tests.rs    99.6%      1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #115     +/-   ##
========================================
- Coverage   100.0%   99.3%   -0.7%     
========================================
  Files          33      72     +39     
  Lines        3291    7973   +4682     
========================================
+ Hits         3291    7918   +4627     
- Misses          0      55     +55     

☔ View full report in Codecov by Sentry.

@geeknoid (Member) commented Dec 9, 2025

Please see the top-level README.md file for the steps needed to add a new crate. You need to register it in the top-level README and the top-level CHANGELOG, you need to add the requisite logo and favicon, and you need the voodoo lines in the crate-level documentation to pull the logo and icon into docs.rs. See the script scripts\add-crate.ps1 for everything.

@geeknoid (Member) left a comment:

It's cool to see this kind of thing coming online.

@@ -0,0 +1,14 @@
# Changelog
Member commented:

We auto-generate the changelog, so you can leave this blank for starters and it'll be filled in when releasing the crate.

// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

//! Benchmarks for the NUMA cache.
Member commented:

Can you add comparative benchmarks against existing cache crates?
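
For example, something along these lines (a hypothetical criterion harness; moka is just one possible comparison target, and bench names and setup are illustrative):

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Measure read throughput of an existing cache crate alongside the
// thread_aware_cache benches to give the numbers a baseline.
fn bench_moka_get(c: &mut Criterion) {
    let cache: moka::sync::Cache<&str, u64> = moka::sync::Cache::new(10_000);
    cache.insert("key", 1);
    c.bench_function("moka/get_hit", |b| b.iter(|| cache.get("key")));
}

criterion_group!(benches, bench_moka_get);
criterion_main!(benches);
```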

//!
//! The Bloom filter is sized for ~1% false positive rate and uses atomic operations for
//! lock-free concurrent access. Note that removals don't clear bits from the filter, so
//! stale positives may occur (safe, just slower), but false negatives never occur.
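
As a reference point, the atomic set/test pattern the doc comment describes boils down to something like the following (a sketch assuming a fixed array of AtomicU64 words; this is not the crate's actual bloom.rs):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct AtomicBits {
    words: Vec<AtomicU64>,
}

impl AtomicBits {
    fn new(bits: usize) -> Self {
        Self { words: (0..(bits + 63) / 64).map(|_| AtomicU64::new(0)).collect() }
    }

    fn set(&self, bit: usize) {
        // fetch_or is idempotent, so concurrent writers race harmlessly;
        // this is why inserts need no lock.
        self.words[bit / 64].fetch_or(1u64 << (bit % 64), Ordering::Relaxed);
    }

    fn test(&self, bit: usize) -> bool {
        self.words[bit / 64].load(Ordering::Relaxed) & (1u64 << (bit % 64)) != 0
    }
}
```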
Member commented:

Does the bloom filter ever get regenerated over time?

@ZacWalk (Member, Author) replied:

Handling removals in the Bloom filter is tricky because items are independently removed from the shards; think of it as SIEVE eviction per NUMA node. To handle this well we would need a bit per affinity. My thinking was to have this decided at compile time:

  • A bit-per-affinity Bloom filter, for caches that will see a large number of deletes (see the sketch after this list).
  • A single-bit filter that does not track removals and may become saturated; this would be right for mostly read-only caches.
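
A hypothetical sketch of the bit-per-affinity idea (all names illustrative, not the crate's code): each Bloom slot holds one bit per shard, so a lookup learns which shards may contain a key.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Bit `a` of a slot means "the shard with affinity `a` may contain a key
// hashing to this slot"; one word supports up to 64 affinities.
struct PerAffinityBloom {
    slots: Vec<AtomicU64>,
}

impl PerAffinityBloom {
    fn insert(&self, slot: usize, affinity: u32) {
        self.slots[slot].fetch_or(1u64 << affinity, Ordering::Relaxed);
    }

    fn may_contain(&self, slot: usize, affinity: u32) -> bool {
        self.slots[slot].load(Ordering::Relaxed) & (1u64 << affinity) != 0
    }

    // Clearing on removal is still approximate: another key resident in
    // the same shard can hash to the same slot, so a truly safe delete
    // needs per-bit counters or a periodic rebuild from shard contents.
    fn remove(&self, slot: usize, affinity: u32) {
        self.slots[slot].fetch_and(!(1u64 << affinity), Ordering::Relaxed);
    }
}
```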

Member replied:

I guess I was thinking that you could keep track of the number of false positives over time. As the ratio gets over a certain threshold, you'd throw the filter away and rebuild it. I don't know if rebuilding it is possible, though.

I don't know if there is such a thing as a mostly read-only cache. A cache with TTL will eventually recycle every item, so the bloom filter will become ineffective.

// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

//! Lock-free Bloom filter for fast negative lookups.
Member commented:

Unless this code has some magic in it, can we use an existing crate for this like fastbloom?

If this code does have some magic, then perhaps it should be in a separate crate?


pub use cache::{NumaCache, NumaCacheBuilder};

// Re-export thread_aware types for convenience
Member commented:

@ralfbiedert Do we have a policy on reexports?

@@ -0,0 +1,136 @@
// Copyright (c) Microsoft Corporation.
@geeknoid (Member) commented Dec 9, 2025:

Couple high-level comments:

  • You use the word 'shard' to describe the different partitions. Given this is explicitly a NUMA cache, when would you expect the shards to be different than the NUMA nodes? If they're also going to be 1:1, then can we just use the words 'NUMA node' instead of shard in order to eliminate one concept?

  • You call this thread_aware_cache, but it's really a NUMA-aware cache, right? There is no general affordance for thread-specific shards. You can imagine having per-thread L0 caches to completely eliminate contention for common lookups. So should we call this crate numa_aware_cache?

  • Would it be possible to support the Borrow pattern like HashMap does? This enables mix-and-match between &str and String, which is mighty handy (see the sketch after this list).

  • Caches should report some stats about their efficacy, hit and miss rates in particular. Can we add this? Otherwise, it's hard for an app to tune the cache size.

  • I worry that this won't scale well within a shard given the very coarse lock that's taken.

  • I also worry that a contended lock can have catastrophic effects on overall systemic perf. If N threads are in the middle of reading from a shard and a writer comes along, that writer will block, and in a thread-per-core world the thread's core will go completely idle. Should we be using async mutexes and make cache access async methods? Is the overhead of async warranted, or would that potentially still suffer from contention inside the async mutex logic?

  • I didn't dig enough in the code, but what's the lock behavior when a cache miss occurs and probing happens in other shards? If the shard's lock is held across the whole operation, this could lead to substantial contention. If the lock is not held, then we have a racy situation where a probe could be occurring while a local thread puts the sought-after data into the shard "behind the scenes". Or 10 threads could be probing for the same state at the same time.
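
For the Borrow point above, a sketch of the HashMap-style signature being requested (illustrative wrapper, not the crate's current API): a cache keyed by String can be queried with a plain &str, with no allocation at the call site.

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::Hash;

struct Cache<K, V> {
    map: HashMap<K, V>,
}

impl<K: Eq + Hash, V: Clone> Cache<K, V> {
    fn get<Q>(&self, key: &Q) -> Option<V>
    where
        K: Borrow<Q>,
        Q: Hash + Eq + ?Sized,
    {
        // HashMap::get already takes &Q under the same bounds, so the
        // wrapper only forwards them. The Borrow contract guarantees the
        // borrowed form hashes and compares identically to K itself.
        self.map.get(key).cloned()
    }
}

fn main() {
    let mut c = Cache { map: HashMap::new() };
    c.map.insert("alpha".to_string(), 1);
    assert_eq!(c.get("alpha"), Some(1)); // &str lookup against String keys
}
```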

@ZacWalk (Member, Author) replied:

Good comments. My intention was to start the ball rolling again with previous work. Setting to draft while we mature the ideas here and have some discussions.

/// thread affinity to minimize cross-NUMA traffic. Each shard is explicitly
/// associated with a [`thread_aware::PinnedAffinity`].
///
/// A shared Bloom filter optimizes cross-shard lookups by quickly identifying
Member commented:

Seems like an implementation detail that's not needed in the user docs.

///
/// * `K` - The key type, must implement `Eq + Hash + Clone`.
/// * `V` - The value type, must implement `Clone`.
/// * `S` - The hash builder type, defaults to `DefaultHashBuilder`.
Member commented:

The default hasher is very slow. We might want to default to something else, like RapidHash.
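
For illustration, the generic plumbing that makes the default swappable (a sketch only; RandomState stands in for the current default, and any faster hasher would just replace it in the type parameter's default):

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash};

// The cache's `S` parameter mirrors HashMap's: callers can supply any
// BuildHasher, and changing the crate-wide default is a one-line edit
// to the `S = ...` default below.
struct Cache<K, V, S = RandomState> {
    map: HashMap<K, V, S>,
}

impl<K: Eq + Hash, V, S: BuildHasher + Default> Cache<K, V, S> {
    fn new() -> Self {
        Self { map: HashMap::with_hasher(S::default()) }
    }
}
```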

use crate::sieve::{NodeIndex, SieveList};

/// Cache line size for alignment to prevent false sharing.
const CACHE_LINE_SIZE: usize = 64;
Member commented:

Nobody is making a 128-byte cache line? There's no system-level constant for this we could leverage?
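
There is no std constant for cache line size, so crates typically pick a value per target. A hedged sketch of that approach:

```rust
// 128 covers aarch64 parts such as Apple Silicon; x86_64's adjacent-line
// prefetcher is often treated as doubling the effective granularity, so
// some crates use 128 there as well. The actual cutoffs are a judgment call.
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
const CACHE_LINE_SIZE: usize = 128;
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
const CACHE_LINE_SIZE: usize = 64;
```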

/// A single cache shard containing data and eviction metadata.
///
/// Aligned to the CPU cache line (64 bytes) to prevent cache-line bouncing between locks.
#[repr(align(64))]
Member commented:

Can you use CACHE_LINE_SIZE here, or are you not allowed to use constants in these attribute expressions?
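
On the question above: repr(align(...)) only accepts an integer literal, so the constant can't be used there directly. One common workaround is a padding wrapper that owns the alignment decision, e.g. crossbeam-utils (sketch, assuming that dependency is acceptable):

```rust
use crossbeam_utils::CachePadded;

// Hypothetical shard payload; CachePadded supplies the alignment that
// #[repr(align(N))] can't take from a constant.
struct ShardInner {
    // data + eviction metadata would live here
}

// CachePadded picks the alignment per target (e.g. 128 bytes on modern
// x86_64 and aarch64), which also addresses the 128-byte-line question.
type Shard = CachePadded<ShardInner>;

fn main() {
    let shard: Shard = CachePadded::new(ShardInner {});
    assert!(core::mem::align_of_val(&shard) >= 64);
}
```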

@ZacWalk (Member, Author) commented Dec 10, 2025

> What is the relation between this and the cache Schuyler is working on? Do they serve different use cases? Also, I don't quite understand how the cache does thread migration; it doesn't seem to implement the ThreadAware trait.

Schuyler told me they did not have an in-memory part to their cache yet. I wanted to contribute this one, which we wrote last year but which was parked. The code has since evolved to use thread_aware. Let me set this to draft while we discuss whether that is a good idea.

@ZacWalk ZacWalk marked this pull request as draft December 10, 2025 10:12
@ZacWalk ZacWalk changed the title from "Thread aware cache" to "feat: Thread aware cache" Dec 10, 2025