feat: Thread aware cache #115
base: main
Conversation
What is the relation between this and the cache Schuyler is working on? Do they serve different use cases? Also, I don't quite understand how the cache does thread migration; it doesn't seem to implement the …
Codecov Report

@@            Coverage Diff            @@
##              main    #115     +/-   ##
==========================================
- Coverage    100.0%   99.3%    -0.7%
==========================================
  Files           33      72      +39
  Lines         3291    7973    +4682
==========================================
+ Hits          3291    7918    +4627
- Misses           0      55      +55
==========================================

View full report in Codecov by Sentry.
Please see the top-level README.md file for the steps needed to add a new crate. You need to register it in the top-level README and the top-level CHANGELOG, you need to add the requisite logo and favicon, and you need the voodoo lines in the crate-level documentation to pull the logo and icon into docs.rs. See the script scripts\add-crate.ps1 for everything.
geeknoid left a comment
It's cool to see this kind of thing coming online.
@@ -0,0 +1,14 @@
# Changelog
We auto-generate the changelog, so you can leave this blank for starters and it'll be filled in when releasing the crate.
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

//! Benchmarks for the NUMA cache.
Can you add comparative benchmarks against the existing cache crates out there?
//!
//! The Bloom filter is sized for ~1% false positive rate and uses atomic operations for
//! lock-free concurrent access. Note that removals don't clear bits from the filter, so
//! stale positives may occur (safe, just slower), but false negatives never occur.
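For context on the ~1% figure, this is the standard Bloom filter sizing math; a sketch, not code from this PR:

```rust
/// Standard Bloom filter sizing: for `n` expected items and a target
/// false-positive rate `p`, the bit count `m` and hash count `k` are
/// m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
fn bloom_params(n: usize, p: f64) -> (usize, u32) {
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as usize;
    let k = ((m as f64 / n as f64) * ln2).round().max(1.0) as u32;
    (m, k)
}
```

For n = 1,000,000 items at p = 0.01 this gives m ≈ 9.6M bits (about 1.2 MB) and k = 7 hash functions.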
Does the Bloom filter ever get regenerated over time?
Handling removals in the Bloom filter is tricky because items are removed from the shards independently; think of it like SIEVE eviction per NUMA node. To handle this well we would need a bit per affinity. My thinking was to have this decided at compile time:
- A bit-per-affinity Bloom filter, for caches that will see a large number of deletes.
- A single-bit filter that does not track removals and may become saturated. This would be right for mostly read-only caches.
I guess I was thinking that you could keep track of the number of false positives over time. As the ratio gets over a certain threshold, you'd throw the filter away and rebuild it. I don't know if rebuilding it is possible, though.
I don't know if there is such a thing as a mostly read-only cache. A cache with a TTL will eventually recycle every item, so the Bloom filter will become ineffective.
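A minimal sketch of that bookkeeping, with hypothetical names (none of this is in the PR): count lookups where the filter said "maybe" but every shard missed, and signal a rebuild once the observed ratio crosses a threshold.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: tracks lookups where the Bloom filter reported a
/// possible hit but every shard then missed, i.e. observed false positives.
struct FilterStats {
    lookups: AtomicU64,
    false_positives: AtomicU64,
}

impl FilterStats {
    fn record_lookup(&self) {
        self.lookups.fetch_add(1, Ordering::Relaxed);
    }

    fn record_false_positive(&self) {
        self.false_positives.fetch_add(1, Ordering::Relaxed);
    }

    /// True once the observed false-positive ratio exceeds `threshold`
    /// (with a minimum sample size), signaling a rebuild from live keys.
    fn should_rebuild(&self, threshold: f64) -> bool {
        let lookups = self.lookups.load(Ordering::Relaxed);
        let fps = self.false_positives.load(Ordering::Relaxed);
        lookups > 10_000 && (fps as f64 / lookups as f64) > threshold
    }
}
```

Rebuilding seems possible in principle, since the shards still hold every live key: inserts could go to a fresh filter while a background pass over the shards repopulates it.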
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

//! Lock-free Bloom filter for fast negative lookups.
Unless this code has some magic in it, can we use an existing crate for this, like fastbloom?
If this code does have some magic, then perhaps it should be in a separate crate?
pub use cache::{NumaCache, NumaCacheBuilder};

// Re-export thread_aware types for convenience
@ralfbiedert Do we have a policy on reexports?
@@ -0,0 +1,136 @@
// Copyright (c) Microsoft Corporation.
A couple of high-level comments:

- You use the word 'shard' to describe the different partitions. Given this is explicitly a NUMA cache, when would you expect the shards to differ from the NUMA nodes? If they're always going to be 1:1, can we just use the words 'NUMA node' instead of 'shard' in order to eliminate one concept?
- You call this thread_aware_cache, but it's really a NUMA-aware cache, right? There is no general affordance for thread-specific shards. You can imagine having per-thread L0 caches to completely eliminate contention for common lookups. So should we call this crate numa_aware_cache?
- Would it be possible to support the Borrow pattern like HashMap does? This enables mix-and-match between &str and String, which is mighty handy. (See the sketch after this list.)
- Caches should report some stats about their efficacy, hit and miss rates in particular. Can we add this? Otherwise it's hard for an app to tune the cache size. (Also covered by the sketch after this list.)
- I worry that this won't scale well within a shard, given the very coarse lock that's taken.
- I also worry that a contended lock can have catastrophic effects on overall systemic perf. If N threads are in the middle of reading from a shard and a writer comes along, that writer will block, and in a thread-per-core world the thread's core will go completely idle. Should we be using async mutexes and make cache access be async methods? Is the overhead of async warranted, or would that potentially still suffer from contention inside the async mutex logic?
- I didn't dig enough into the code, but what's the lock behavior when a cache miss occurs and probing occurs in other shards? If the shard's lock is held across the whole operation, this could lead to substantial contention. If the lock is not held, then we have a racy situation where a probe could be occurring while a local thread puts the sought-after data into the shard "behind the scenes". Or 10 threads could potentially be probing for the same state at the same time.
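To make the Borrow and stats points concrete, a toy sketch around std's HashMap (stand-in types, not this PR's API):

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Toy stand-in for one shard, only to show the Borrow-style signature
/// and hit/miss counters; not the crate's real, NUMA-pinned shard type.
struct Shard<K, V> {
    map: Mutex<HashMap<K, V>>,
    hits: AtomicU64,
    misses: AtomicU64,
}

impl<K: Eq + Hash, V: Clone> Shard<K, V> {
    /// With `Shard<String, V>`, `get("key")` works without allocating a
    /// String, exactly like `HashMap::get`.
    fn get<Q>(&self, key: &Q) -> Option<V>
    where
        K: Borrow<Q>,
        Q: Hash + Eq + ?Sized,
    {
        let found = self.map.lock().unwrap().get(key).cloned();
        match &found {
            Some(_) => self.hits.fetch_add(1, Ordering::Relaxed),
            None => self.misses.fetch_add(1, Ordering::Relaxed),
        };
        found
    }

    /// Hit ratio an app can watch to tune the cache size.
    fn hit_ratio(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let misses = self.misses.load(Ordering::Relaxed) as f64;
        if hits + misses == 0.0 { 0.0 } else { hits / (hits + misses) }
    }
}
```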
Good comments. My intention was to start the ball rolling again with previous work. Setting to draft while we mature the ideas here and have some discussions.
/// thread affinity to minimize cross-NUMA traffic. Each shard is explicitly
/// associated with a [`thread_aware::PinnedAffinity`].
///
/// A shared Bloom filter optimizes cross-shard lookups by quickly identifying
Seems like an implementation detail that's not needed in the user docs.
///
/// * `K` - The key type, must implement `Eq + Hash + Clone`.
/// * `V` - The value type, must implement `Clone`.
/// * `S` - The hash builder type, defaults to `DefaultHashBuilder`.
The default hasher is very slow. We might want to default to something else, like RapidHash.
use crate::sieve::{NodeIndex, SieveList};

/// Cache line size for alignment to prevent false sharing.
const CACHE_LINE_SIZE: usize = 64;
Nobody is making 128-byte cache lines? Is there no system-level constant for this we could leverage?
/// A single cache shard containing data and eviction metadata.
///
/// Aligned to the CPU cache line (64 bytes) to prevent cache-line bouncing between locks.
#[repr(align(64))]
Can you use CACHE_LINE_SIZE here, or are you not allowed to use constants in these attribute expressions?
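For reference, `#[repr(align(...))]` only accepts integer literals, so the constant can't be referenced there. One alternative that also addresses the 128-byte question above is `crossbeam_utils::CachePadded`, which picks the alignment per target architecture (e.g. 128 bytes on x86-64 and aarch64, 64 on many others); a sketch assuming the crossbeam-utils crate, which this PR does not currently use:

```rust
use crossbeam_utils::CachePadded;
use std::collections::HashMap;
use std::sync::Mutex;

struct PaddedShard {
    // Each shard's lock and map occupy their own cache line(s),
    // preventing false sharing between adjacent shards in a Vec.
    inner: CachePadded<Mutex<HashMap<u64, u64>>>,
}

fn main() {
    let shard = PaddedShard {
        inner: CachePadded::new(Mutex::new(HashMap::new())),
    };
    // CachePadded derefs to the inner value, so use is transparent.
    shard.inner.lock().unwrap().insert(1, 2);
}
```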
Schuyler told me they did not have an in-memory part to their cache yet. I wanted to contribute this one, which we wrote last year but was parked. The code has been evolved to use thread_aware. Let me set this to draft while we discuss whether that is a good idea or not.
This PR introduces thread_aware_cache, a high-performance in-memory cache designed to maximize read throughput and minimize latency on multi-socket NUMA architectures.

Thread-Affinity Sharding
Unlike traditional caches that shard by key hash, this implementation shards by thread affinity. This ensures data remains physically local to the executing thread, significantly reducing cross-socket interconnect traffic.
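In sketch form, the difference is just where the shard index comes from (hypothetical helpers, not this crate's API):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Traditional sharding: the key's hash picks the shard, so a hot key
/// lands on one fixed shard regardless of which socket reads it.
fn shard_by_key<K: Hash>(key: &K, num_shards: usize, s: &RandomState) -> usize {
    let mut h = s.build_hasher();
    key.hash(&mut h);
    (h.finish() as usize) % num_shards
}

/// Affinity sharding: the reader's pinned affinity (e.g. its NUMA node)
/// picks the shard, so reads touch node-local memory. The affinity value
/// stands in for a real query via thread_aware.
fn shard_by_affinity(current_affinity: usize, num_shards: usize) -> usize {
    current_affinity % num_shards
}
```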
"Read-Through" Replication
This implementation uses a replication strategy that automatically promotes hot data into the local shard upon access. This is supported by a lock-free Bloom filter that efficiently short-circuits negative lookups across shards.
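As a toy sketch of the read path described above (Mutex, HashMap, and HashSet stand-ins; the PR's filter is lock-free and all names here are illustrative):

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

struct Toy {
    shards: Vec<Mutex<HashMap<String, String>>>,
    filter: Mutex<HashSet<String>>, // pretend Bloom filter: no false negatives
}

impl Toy {
    fn get(&self, local: usize, key: &str) -> Option<String> {
        // 1. Node-local lookup: the common, cheap case.
        if let Some(v) = self.shards[local].lock().unwrap().get(key) {
            return Some(v.clone());
        }
        // 2. Filter check: "definitely absent" skips probing every shard.
        if !self.filter.lock().unwrap().contains(key) {
            return None;
        }
        // 3. Probe remote shards; on a hit, promote the value into the
        //    local shard so later reads stay node-local.
        for (i, shard) in self.shards.iter().enumerate() {
            if i == local {
                continue;
            }
            if let Some(v) = shard.lock().unwrap().get(key).cloned() {
                self.shards[local].lock().unwrap().insert(key.to_owned(), v.clone());
                return Some(v);
            }
        }
        None // filter false positive (e.g. stale bits after removals)
    }
}
```

Step 3 is the promotion that keeps later reads node-local; step 2 is what keeps a true miss from paying N shard probes.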
SIEVE Eviction Algorithm
Capacity management is handled by the SIEVE algorithm. This provides scan-resistance and high concurrency by eliminating the need for heavy write locks during cache reads.
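For readers unfamiliar with SIEVE, a minimal single-threaded sketch (the crate's real version is concurrent and avoids the linear scan in `touch`):

```rust
use std::collections::VecDeque;

/// Minimal SIEVE sketch: a FIFO of (key, visited) plus a "hand" that
/// scans from the oldest entry toward the newest, sparing each visited
/// entry once and evicting the first unvisited one.
struct Sieve {
    queue: VecDeque<(u64, bool)>, // front = newest, back = oldest
    hand: Option<usize>,          // index into `queue`, retained across evictions
    capacity: usize,
}

impl Sieve {
    /// On a cache hit, only the visited bit is set; the queue is never
    /// reordered, which is why reads need no heavy write lock.
    fn touch(&mut self, key: u64) {
        if let Some(e) = self.queue.iter_mut().find(|e| e.0 == key) {
            e.1 = true;
        }
    }

    /// Walk the hand from old to new, clearing visited bits, and remove
    /// the first entry that was not visited since the last pass.
    fn evict(&mut self) -> Option<u64> {
        let mut i = match self.hand {
            Some(i) => i,
            None => self.queue.len().checked_sub(1)?, // start at the oldest
        };
        loop {
            if self.queue[i].1 {
                self.queue[i].1 = false; // spare it once
                i = if i == 0 { self.queue.len() - 1 } else { i - 1 };
            } else {
                let (key, _) = self.queue.remove(i)?;
                self.hand = i.checked_sub(1); // hand stays put for next time
                return Some(key);
            }
        }
    }

    fn insert(&mut self, key: u64) {
        if self.queue.len() >= self.capacity {
            let _ = self.evict();
        }
        self.queue.push_front((key, false));
    }
}
```

The key property is that a hit only flips a bit, so the read path never reorders the queue the way an LRU list update would.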