{"id":587692,"date":"2026-04-07T09:16:08","date_gmt":"2026-04-07T09:16:08","guid":{"rendered":"https:\/\/www.newsbeep.com\/ca\/587692\/"},"modified":"2026-04-07T09:16:08","modified_gmt":"2026-04-07T09:16:08","slug":"bloom-filters-theory-engineering-trade-offs-and-implementation-in-go","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ca\/587692\/","title":{"rendered":"Bloom Filters: Theory, Engineering Trade\u2011offs, and Implementation in Go"},"content":{"rendered":"<p>\t\t\t\t\t\t\t\t\tKey Takeaways<br \/>\n\t\t\t\t\t\t\t\t\t&#13;<br \/>\n\tBloom filters provide efficient probabilistic membership testing with no false negatives and controlled false-positive rates.&#13;<br \/>\n\tBloom filters can reduce unnecessary, expensive lookups in storage systems by acting as fast pre-filters.&#13;<br \/>\n\tPractical parameter selection (filter size and hash count) is essential for balancing memory and accuracy.&#13;<br \/>\n\tGo\u2019s low-level control makes implementing and reasoning about Bloom filters straightforward.&#13;<br \/>\n\tEngineers should understand when Bloom filters are the right fit and when non-probabilistic data structures are a better choice.&#13;<\/p>\n<p>\t\t\t\t\t\t\t\tIntroduction<\/p>\n<p>In one of our recommendation pipelines, we had a simple requirement: don\u2019t show users articles they had already viewed. At its peak, the feed service handled around 18,000 requests per second, with about 120 candidates evaluated per request. This meant roughly 2.16 million membership checks per second. However, the workload was heavily skewed, with around 97-98% of checks being negatives.<\/p>\n<p>Our initial design used exact lookups (cache plus backing store) for every candidate. This worked functionally, but it was inefficient for the dominant case of items that didn\u2019t exist: each miss still incurred network and storage costs, increasing I\/O. 
During peak traffic, this showed up as increased p95 latency (from about 85ms to 140ms), backend read spikes, and steadily rising infrastructure cost.<\/p>\n<p>To address this, we introduced a Bloom filter in front of the exact lookup path. The filter rejects definite negatives in memory and submits only likely positives for expensive verification. This change let us avoid unnecessary work for items that were definitely not present, reducing both latency and backend load. By filtering out obvious misses early, we could focus resources on the cases that actually needed verification.<\/p>\n<p>This article walks you through that implementation end to end: the architectural problem, Bloom filter mechanics, Go integration, parameter tuning with math (\\(m\\) and \\(k\\)), and the practical lessons learned from making it work under production constraints.<\/p>\n<p>Naive Solution: Exact Lookup for Every Candidate<\/p>\n<p>As mentioned, before introducing Bloom filters, our recommendation service used a baseline &#8220;exact-first&#8221; architecture:<\/p>\n<p>&#13;<br \/>\n\tIn the candidate generation step, a ranked list of candidate article IDs is produced.&#13;<br \/>\n\tIn the history-check stage, each candidate is validated against the user\u2019s seen set.&#13;<br \/>\n\tThe history-check stage used cache-first logic, relying on backing storage for misses.&#13;<br \/>\n\tOnly candidates confirmed as unseen in the history-check proceeded to final ranking and response assembly.&#13;<\/p>\n<p>From a correctness viewpoint, this was ideal: duplicate suppression was deterministic and easy to reason about. From a systems perspective, however, this stage sat directly on the serving critical path and performed one remote membership check per candidate.<\/p>\n<p>Why Exact Lookup Alone Was Not Good Enough<\/p>\n<p>The workload characteristics made the exact approach expensive by design. 
Around 97-98% of checks were negatives, so most lookups existed only to return &#8220;unseen&#8221; and move on. In other words, we were paying storage\/network costs primarily for negative answers.<\/p>\n<p>Three issues were dominant under peak traffic:<\/p>\n<p>&#13;<br \/>\n\tLatency amplification: each request contained many candidate checks; p95 response latency grew from roughly 85ms to 140ms.&#13;<br \/>\n\tRead amplification: backend and cache read volume scaled with the number of candidates checked per request (&#8220;candidate fanout&#8221;), not just request count.&#13;<br \/>\n\tCost pressure: infrastructure spend rose with traffic because exact checks dominated the serving path.&#13;<\/p>\n<p>At that point, we needed a design that preserved correctness guarantees where it mattered but removed most of the unnecessarily expensive lookups from the negative-heavy path.<\/p>\n<p>The Solution: Bloom Filters<\/p>\n<p>The change was to introduce a Bloom filter as a quick, in-memory check (&#8220;membership gate&#8221;) before performing the more expensive history lookup. At a high level, a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Bloom_filter\" rel=\"nofollow noopener\" target=\"_blank\">Bloom filter<\/a> is a compact probabilistic data structure used for membership checks. It stores information in a bit array and uses multiple hash functions per key. 
A query has two possible outcomes:<\/p>\n<p>&#13;<br \/>\n\tDefinitely not present (guaranteed correct)&#13;<br \/>\n\tPossibly present (may include false positives)&#13;<\/p>\n<p>It never produces false negatives, which makes it well-suited for quickly rejecting items that are certainly unseen.<\/p>\n<p>With the introduction of Bloom filters, the request path changes slightly:<\/p>\n<p>&#13;<br \/>\n\tGenerate candidate article IDs.&#13;<br \/>\n\tQuery the Bloom filter with (user_id, article_id) membership keys.&#13;<br \/>\n\tIf the filter returns definitely not present, treat the candidate as unseen and keep it in the ranking pipeline.&#13;<br \/>\n\tIf the filter returns possibly present, go on with exact history verification.&#13;<\/p>\n<p>This design targets exactly the pain point from the previous section: the dominant negative path. Most candidates are unseen, so most checks can be resolved in-memory without remote I\/O.<\/p>\n<p>Why a Probabilistic Approach Fits This Workload<\/p>\n<p>The approach described here has several interesting properties:<\/p>\n<p>&#13;<br \/>\n\tNegative-heavy query distribution: with ~97-98% negatives, fast negative rejection has a high impact.&#13;<br \/>\n\tStrict correctness where needed: positives can still be verified against exact storage.&#13;<br \/>\n\tPredictable memory footprint: Bloom filters compactly represent large membership sets.&#13;<br \/>\n\tTunable trade-off: as we will see later, we can control false-positive rate via m (bit array size) and k (hash count).&#13;<\/p>\n<p>The next sections explain how Bloom filters work, how we implemented them in Go, and how we used the math to tune parameters for this recommendation workload.<\/p>\n<p>Bloom Filters in Practice<\/p>\n<p>In our recommendation workload, the goal of a Bloom filter is to quickly identify likely unseen candidates and avoid unnecessary expensive history lookups. 
Because most candidate checks are negative, the filter helps remove a large amount of avoidable storage and network work from the serving path.<\/p>\n<p>Core Mechanics<\/p>\n<p>A Bloom filter encodes membership information for a set of elements. Its core components are simple but powerful:<\/p>\n<p>&#13;<br \/>\n\tA bit array of size m: each position stores a 0 or 1, representing whether it has been set by one or more elements.&#13;<br \/>\n\t\u00a0k hash functions: each element is mapped to k positions in the bit array. These hash functions should be independent and uniformly distribute elements across the array. Finding good hash functions is crucial for minimizing collisions and controlling the false-positive rate, making it an important engineering consideration. We\u2019ll discuss practical hash function choices in the implementation section.&#13;<\/p>\n<p>Bloom filters do not store the elements themselves, only the presence information is encoded in the bits.<\/p>\n<p>Insertion<\/p>\n<p>To add an element E to the Bloom filter:<\/p>\n<p>&#13;<br \/>\n\tApply the \\(k\\) hash functions to E:<br \/>&#13;<br \/>\n\t\\(h_1(E), h_2(E), \\dots, h_k(E)\\) each produces a numeric hash.&#13;<br \/>\n\tMap the hashes into the bit array using modulo \\(m\\):<br \/>&#13;<br \/>\n\t\\(index_i = h_i(E) \\mod m\\)<br \/>&#13;<br \/>\n\tThis gives \\(k\\) positions in the array (it\u2019s possible for some positions to be the same due to collisions).&#13;<br \/>\n\tSet the bits at these positions to 1:<br \/>&#13;<br \/>\n\t\\(\\text{bit_array}[\\text{index_i}] = 1\\)&#13;<\/p>\n<p>Multiple elements may set the same bit. Bits are only ever set, never cleared. 
This is why standard Bloom filters cannot support deletions.<\/p>\n<p>Membership Queries<\/p>\n<p>To check if an element is present:<\/p>\n<p>&#13;<br \/>\n\tCompute the same k hash values for the element.&#13;<br \/>\n\tCheck the corresponding bits in the array.&#13;<\/p>\n<p>If any bit is 0, the element is definitely not in the set, because it would have been set to 1 during insertion.<\/p>\n<p>If all bits are 1, the element is possibly in the set. This is where false positives can occur: different elements may hash to the same positions, causing bits to be set even if the queried element was never added.<\/p>\n<p><img decoding=\"async\" alt=\"\" class=\"zoom-image\" src=\"https:\/\/www.infoq.com\/articles\/bloom-filters-practice-go-recommender\/articles\/bloom-filters-practice-go-recommender\/en\/resources\/198figure-1-1774950090560.jpg\" style=\"width: 722px; height: 424px;\" rel=\"share\"\/><\/p>\n<p style=\"text-align:center\">Figure 1 &#8211; Bloom filter insertions and membership test<\/p>\n<p>In the diagram above, we insert three elements (element1, element2, element3) into the Bloom filter with 2 hash functions (h1 and h2). Each element sets two bits in the array. When we query for element4, we find that not all bits are set, so we can confidently say it is not present. 
(Note that in the diagram we have hash collisions: for example, element1 and element3 both set the bit at index 6, which contributes to the possibility of false positives.)<\/p>\n<p>Key Properties<\/p>\n<p>Bloom filters display several interesting properties:<\/p>\n<p>&#13;<br \/>\n\tNo false negatives: a &#8220;not present&#8221; result is always correct.&#13;<br \/>\n\tFalse positives possible: an element may appear to exist even if it hasn\u2019t been added.&#13;<br \/>\n\tDeterministic: the same element always maps to the same bits.&#13;<br \/>\n\tEfficient in memory and speed: the bit array and simple hash computations allow fast insertions and queries.&#13;<br \/>\n\tOnly stores membership information: it cannot retrieve the original elements.&#13;<br \/>\n\tNo deletions: once bits are set, they cannot be cleared without affecting other elements. This is a fundamental limitation of standard Bloom filters, and while there are variants that support deletions (like counting Bloom filters), they come with additional complexity and memory overhead.&#13;<\/p>\n<p>While the mechanics described above explain how a Bloom filter operates, this understanding alone is no guarantee \u00a0for a practical, ready-to-use filter. Without careful choices of the bit array size (\\(m\\)), the number of hash functions (\\(k\\)), and appropriate hash functions, a Bloom filter could be inefficient or produce too many false positives. In the next section, we will demonstrate how to implement a Bloom filter in Go, translating the mechanics into working code. The discussion of how to choose and tune these parameters will follow in the Practical Considerations section.<\/p>\n<p>Implementing a Bloom Filter in Go<\/p>\n<p>Go is an ideal language for implementing a Bloom filter because it provides low-level control over memory, efficient slices and arrays, and fast, predictable execution. 
These characteristics make it easy to reason about the bit array, hash computations, and overall performance of the filter. These are all critical for production systems that need both speed and memory efficiency.<\/p>\n<p>Translating the mechanics of a Bloom filter into Go is straightforward. The implementation uses a bit array and multiple hash functions, mirroring the step-by-step behavior we described in the Core Mechanics section. At this stage, we focus on the structure and basic operations; parameter tuning will be addressed in the Practical Considerations section.<\/p>\n<p>Defining the Bloom Filter Structure<\/p>\n<p>The Bloom filter structure (struct) in Go consists of a packed bit array, the number of addressable bits, and the configured hash functions. Storing hash functions in the struct avoids per-call API mistakes and keeps usage ergonomic. Using a packed representation (64 bits per word) improves memory efficiency and cache behavior compared to storing one boolean per bit:<\/p>\n<p>&#13;<br \/>\ntype BloomFilter struct {&#13;<br \/>\n\u00a0 bits \u00a0 []uint64 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \/\/ packed bit array (64 bits per word)&#13;<br \/>\n\u00a0 m \u00a0 \u00a0 \u00a0uint \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \/\/ number of addressable bits&#13;<br \/>\n\u00a0 hashes []func([]byte) uint \u00a0\/\/ configured hash functions&#13;<br \/>\n}<\/p>\n<p>Creating a New Bloom Filter<\/p>\n<p>The NewBloomFilter function initializes a new Bloom filter with the specified size and hash functions:<\/p>\n<p>&#13;<br \/>\n\/\/ NewBloomFilter creates a new Bloom filter with m bits and configured hash functions.&#13;<br \/>\nfunc NewBloomFilter(m uint, hashes []func([]byte) uint) *BloomFilter {&#13;<br \/>\n\u00a0 if m == 0 {&#13;<br \/>\n\u00a0 \u00a0 panic(\"bloom filter size m must be &gt; 0\")&#13;<br \/>\n\u00a0 }&#13;<br \/>\n\u00a0 if len(hashes) == 0 {&#13;<br \/>\n\u00a0 \u00a0 panic(\"at least one hash 
function is required\")&#13;<br \/>\n\u00a0 }&#13;<br \/>\n&#13;<br \/>\n\u00a0 words := (m + 63) \/ 64 \/\/ ceil(m\/64)&#13;<br \/>\n\u00a0 return &amp;BloomFilter{&#13;<br \/>\n\u00a0 \u00a0 bits: \u00a0 make([]uint64, words),&#13;<br \/>\n\u00a0 \u00a0 m: \u00a0 \u00a0 \u00a0m,&#13;<br \/>\n\u00a0 \u00a0 hashes: hashes,&#13;<br \/>\n\u00a0 }&#13;<br \/>\n}<\/p>\n<p>To operate on the packed bit array, we use helper methods for setting and reading individual bits:<\/p>\n<p>&#13;<br \/>\nfunc (bf *BloomFilter) setBit(i uint) {&#13;<br \/>\n\u00a0 word := i &gt;&gt; 6 \u00a0 \/\/ i \/ 64&#13;<br \/>\n\u00a0 offset := i &amp; 63 \/\/ i % 64&#13;<br \/>\n\u00a0 bf.bits[word] |= uint64(1) &lt;&lt; offset&#13;<br \/>\n}&#13;<br \/>\nfunc (bf *BloomFilter) hasBit(i uint) bool {&#13;<br \/>\n\u00a0 word := i &gt;&gt; 6&#13;<br \/>\n\u00a0 offset := i &amp; 63&#13;<br \/>\n\u00a0 return (bf.bits[word] &amp; (uint64(1) &lt;&lt; offset)) != 0&#13;<br \/>\n}<\/p>\n<p>Adding an Element<\/p>\n<p>The Add method takes a byte slice (the data to be added), computes configured hash values, maps them to indices in the bit array, and sets the corresponding packed bits:<\/p>\n<p>&#13;<br \/>\nfunc (bf *BloomFilter) Add(data []byte) {&#13;<br \/>\n\u00a0 for _, hash := range bf.hashes {&#13;<br \/>\n\u00a0 \u00a0 idx := hash(data) % bf.m&#13;<br \/>\n\u00a0 \u00a0 bf.setBit(idx)&#13;<br \/>\n\u00a0 }&#13;<br \/>\n}<\/p>\n<p>Bits are only ever set; insertion mirrors the Bloom filter core mechanics exactly.<\/p>\n<p>Querying an Element<\/p>\n<p>The Contains method checks if an element is possibly in the Bloom filter by verifying the bits corresponding to the hash values:<\/p>\n<p>&#13;<br \/>\nfunc (bf *BloomFilter) Contains(data []byte) bool {&#13;<br \/>\n\u00a0 for _, hash := range bf.hashes {&#13;<br \/>\n\u00a0 \u00a0 idx := hash(data) % bf.m&#13;<br \/>\n\u00a0 \u00a0 if !bf.hasBit(idx) {&#13;<br \/>\n\u00a0 \u00a0 \u00a0 return false \/\/ definitely not present&#13;<br 
\/>\n\u00a0 \u00a0 }&#13;<br \/>\n\u00a0 }&#13;<br \/>\n\u00a0 return true \/\/ possibly present&#13;<br \/>\n}<\/p>\n<p>This method returns false if any bit is not set, ensuring there are no false negatives. If all bits are set, it returns true, indicating a possible membership (with the possibility of false positives).<\/p>\n<p>This implementation directly mirrors the core mechanics: multiple independent hashes, bit array updates, and membership checks. Let\u2019s see a running example of how to use this Bloom filter in practice, including how to define hash functions and test the filter with some data:<\/p>\n<p>&#13;<br \/>\npackage main&#13;<br \/>\nimport (&#13;<br \/>\n\u00a0 \"fmt\"&#13;<br \/>\n\u00a0 \"hash\/fnv\"&#13;<br \/>\n)&#13;<br \/>\n&#13;<br \/>\n\/\/ Simple hash functions&#13;<br \/>\nfunc hash1(data []byte) uint {&#13;<br \/>\n\u00a0 h := fnv.New32a()&#13;<br \/>\n\u00a0 h.Write(data)&#13;<br \/>\n\u00a0 return uint(h.Sum32())&#13;<br \/>\n}&#13;<br \/>\n&#13;<br \/>\nfunc hash2(data []byte) uint {&#13;<br \/>\n\u00a0 h := fnv.New32()&#13;<br \/>\n\u00a0 h.Write(data)&#13;<br \/>\n\u00a0 return uint(h.Sum32())&#13;<br \/>\n}&#13;<br \/>\n&#13;<br \/>\nfunc main() {&#13;<br \/>\n\u00a0 \/\/ Define hash functions&#13;<br \/>\n\u00a0 hashes := []func([]byte) uint{hash1, hash2}&#13;<br \/>\n&#13;<br \/>\n\u00a0 \/\/ Create a Bloom filter: size 20 bits&#13;<br \/>\n\u00a0 bf := NewBloomFilter(20, hashes)&#13;<br \/>\n&#13;<br \/>\n\u00a0 \/\/ Add elements&#13;<br \/>\n\u00a0 bf.Add([]byte(\"apple\"))&#13;<br \/>\n\u00a0 bf.Add([]byte(\"banana\"))&#13;<br \/>\n\u00a0 bf.Add([]byte(\"cherry\"))&#13;<br \/>\n&#13;<br \/>\n\u00a0 \/\/ Query elements&#13;<br \/>\n\u00a0 tests := []string{\"apple\", \"banana\", \"cherry\", \"date\", \"fig\"}&#13;<br \/>\n&#13;<br \/>\n\u00a0 for _, t := range tests {&#13;<br \/>\n\u00a0 \u00a0 if bf.Contains([]byte(t)) 
{&#13;<br \/>\n\u00a0 \u00a0 \u00a0 fmt.Printf(\"%s: possibly present\\n\", t)&#13;<br \/>\n\u00a0 \u00a0 } else {&#13;<br \/>\n\u00a0 \u00a0 \u00a0 fmt.Printf(\"%s: definitely not present\\n\", t)&#13;<br \/>\n\u00a0 \u00a0 }&#13;<br \/>\n\u00a0 }&#13;<br \/>\n}<\/p>\n<p>In this example, we define two simple hash functions using the FNV hash algorithm. This is sufficient for demonstration, but production systems typically prefer higher-quality non-cryptographic hashes (for example <a href=\"https:\/\/en.wikipedia.org\/wiki\/MurmurHash\" rel=\"nofollow noopener\" target=\"_blank\">Murmur3<\/a>, <a href=\"https:\/\/xxhash.com\/\" rel=\"nofollow noopener\" target=\"_blank\">xxHash<\/a>, <a href=\"https:\/\/www.jandrewrogers.com\/2015\/05\/27\/metrohash\/\" rel=\"nofollow noopener\" target=\"_blank\">MetroHash<\/a>, or <a href=\"https:\/\/github.com\/google\/highwayhash\" rel=\"nofollow noopener\" target=\"_blank\">HighwayHash<\/a>) and validate distribution behavior under real keysets. After defining the two hash functions, we create a Bloom filter with a size of 20 bits and 2 hash functions. We add three fruits to the filter and then test for their presence, along with two additional fruits that were not added. The output will indicate which fruits are possibly present (with potential false positives) and which are definitely not present:<\/p>\n<p>&#13;<br \/>\napple: possibly present&#13;<br \/>\nbanana: possibly present&#13;<br \/>\ncherry: possibly present&#13;<br \/>\ndate: definitely not present&#13;<br \/>\nfig: definitely not present<\/p>\n<p>The Math Behind Bloom Filters<\/p>\n<p>Bloom filters are easy to implement, but not so easy to use effectively: you need to understand their probabilistic behavior to get the most out of them. This allows engineers to predict false-positive rates and make informed choices about memory usage and the number of hash functions required to achieve the desired false-positive rate. 
The math behind Bloom filters is essential for tuning their parameters (\\(m\\), \\(k\\), and the hash functions) to achieve the desired balance between memory efficiency and accuracy.<\/p>\n<p>Below, we quickly summarize the key formulas you\u2019ll need in practice. For a deeper dive into the derivations and underlying theory, see the Appendix: The Math Behind Bloom Filters.<br \/>&#13;<br \/>\nThe false positive rate (\\(p\\)) is approximately:<\/p>\n<p>\\(p = \\left( 1 - e^{-\\frac{kn}{m}} \\right)^k\\)<\/p>\n<p>The optimal number of hash functions (\\(k\\)):<\/p>\n<p>\\(k = \\frac{m}{n} \\ln 2\\)<\/p>\n<p>The required bit array size (\\(m\\)) for a target false positive rate:<\/p>\n<p>\\(m = -\\frac{n \\ln p}{(\\ln 2)^2}\\)<\/p>\n<p>Results Snapshot<\/p>\n<p>After introducing Bloom-filter gating in our recommendation path and tuning\u00a0\\(m\\) and \\(k\\) using the formulae above, we observed three consistent outcomes in peak-window traffic:<\/p>\n<p>&#13;<br \/>\n\tLower tail latency: p95 feed latency improved from ~140ms to ~96ms (about 31% reduction).&#13;<br \/>\n\tFewer expensive checks: exact history lookups dropped from ~120 per request to ~24 on average (about 80% reduction).&#13;<br \/>\n\tLower backend pressure: read traffic to the history store dropped by ~65-70%, while measured Bloom false-positive over-filtering stayed under ~0.5%.&#13;<\/p>\n<p>These numbers are workload-specific, but the pattern is general: when lookups are mostly negative and expensive, a well-tuned Bloom filter can remove a large portion of avoidable backend work.<\/p>\n<p>Practical Considerations &amp; Best Practices<\/p>\n<p>With the mechanics implemented and the math to choose \\(k\\) and \\(m\\), we can now translate theory into engineering decisions.<\/p>\n<p>Start with Product Constraints, Not with Bloom Filter Parameters<\/p>\n<p>For our recommendation path, the product-level question was not &#8220;what \\(m\\) and \\(k\\) should we use?&#8221;, but 
rather:<\/p>\n<p>&#13;<br \/>\n\thow many duplicate recommendations are acceptable&#13;<br \/>\n\thow much memory can we spend in the serving tier&#13;<br \/>\n\twhat latency budget remains for per-candidate filtering.&#13;<\/p>\n<p>That gave us a concrete tuning target:<\/p>\n<p>&#13;<br \/>\n\tkeep false positives low enough to avoid visible over-filtering of unseen items&#13;<br \/>\n\twhile reducing expensive exact-history lookups on the negative-heavy path.&#13;<\/p>\n<p>This is a general rule: start from service SLOs and user-impact tolerance, then calculate Bloom filter parameters.<\/p>\n<p>How We Chose \\(n\\), \\(m\\), and \\(k\\) in Our Case<\/p>\n<p>In our implementation, we modeled\u00a0\\(n\\) as the expected number of viewed items represented in one filter over its lifecycle window (for example, per user over a rolling period).<br \/>&#13;<br \/>\nWe used the standard equations:<\/p>\n<p>\\(m \\approx -\\frac{n \\ln P}{(\\ln 2)^2}, \\quad k \\approx \\frac{m}{n} \\ln 2\\)<\/p>\n<p>Then we made three practical adjustments:<\/p>\n<p>&#13;<br \/>\n\tHeadroom on \\(n\\): we sized for growth, not current average, to avoid early saturation.&#13;<br \/>\n\tRounded m for implementation efficiency: we rounded to word boundaries for packed\u00a0\\(\\texttt{[]uint64}\\) storage.&#13;<br \/>\n\tClamped\u00a0\\(k\\) for CPU cost: we treated the computed value as a starting point, then chose a value that kept per-request hash work within budget (hashing may be expensive).&#13;<\/p>\n<p>In practice, this meant accepting slightly higher memory to protect false-positive behavior under growth, and avoiding an overly large k that would hurt hot-path latency.<\/p>\n<p>In this specific recommendation use case, we also made an important serving decision: when the false-positive rate stayed within our product tolerance, we did not always route Bloom positives to exact lookup. 
A small false-positive rate means occasionally suppressing an unseen item, which was acceptable for feed quality, while skipping exact verification removed additional backend cost and latency.<\/p>\n<p>This approach worked for our use case because the occasional omission was acceptable, but in many systems, stricter guarantees are necessary. In general, exact verification of Bloom positives is required when false positives are expensive or correctness-critical, but optional when product behavior can tolerate rare over-filtering.<\/p>\n<p>Hash Function Choice: Correctness First, Then Throughput<\/p>\n<p>In this article, hash values are derived deterministically and mapped with modulo \\(m\\). The important production lesson was that hash strategy is not a cosmetic detail:<\/p>\n<p>&#13;<br \/>\n\tpoor distribution inflates collisions&#13;<br \/>\n\tcollisions inflate false positives&#13;<br \/>\n\tfalse positives raise exact-lookup pass-through&#13;<br \/>\n\tpass-through erodes the benefit of the Bloom filter.&#13;<\/p>\n<p>A general practical tradeoff applies across Bloom-filter deployments: fully independent hash families are rarely used in serving systems, because they increase CPU cost. 
A common approach is double <a href=\"https:\/\/en.wikipedia.org\/wiki\/Double_hashing\" rel=\"nofollow noopener\" target=\"_blank\">hashing<\/a> (derive\u00a0\\(k\\) indices from two base hashes), which usually preserves good distribution while keeping hash computation cheaper.<\/p>\n<p>What worked for us:<\/p>\n<p>&#13;<br \/>\n\tdeterministic, stable hashing across instances&#13;<br \/>\n\tfast non-cryptographic hashes for serving path performance&#13;<br \/>\n\tempirical validation of observed false-positive behavior under representative keys.&#13;<\/p>\n<p>As a general guidance, treat hash quality as a measurable property in your workload, not as an assumption.<\/p>\n<p>Measure the Right Operational Signals<\/p>\n<p>A Bloom filter can look correct while still being operationally wrong. We tracked:<\/p>\n<p>&#13;<br \/>\n\tpass-through rate to exact history checks&#13;<br \/>\n\teffective false-positive proxy (exact misses after Bloom &#8220;possibly present&#8221;)&#13;<br \/>\n\tlatency impact in feed-serving stages&#13;<br \/>\n\tmemory footprint per filter scope&#13;<br \/>\n\tsaturation drift over time.&#13;<\/p>\n<p>These metrics are generally useful in any production Bloom-filter deployment: they tell you when the filter is still saving work versus when it is decaying into overhead.<\/p>\n<p>Lifecycle Strategy Matters as Much as Initial Tuning<\/p>\n<p>Even with good initial parameters, filters degrade if cardinality grows beyond assumptions. In our case, defining lifecycle policy early was critical to know:<\/p>\n<p>&#13;<br \/>\n\twhen to rebuild&#13;<br \/>\n\twhen to rotate&#13;<br \/>\n\thow to recover if pass-through spikes.&#13;<\/p>\n<p>Generalizing beyond recommendations: if your data is dynamic and grows continuously, you can\u2019t rely on a one-time filter setup. Instead, you need a clear lifecycle policy: deciding when to rebuild or rotate the filter, and how to handle unexpected growth or spikes. 
Without this, filter accuracy and efficiency will degrade over time.<\/p>\n<p>Practical Checklist<\/p>\n<p>Before shipping a Bloom filter in a high-throughput system:<\/p>\n<p>&#13;<br \/>\n\tdefine acceptable false-positive impact in product terms&#13;<br \/>\n\testimate \\(n\\) with growth headroom&#13;<br \/>\n\tcompute \\(m\\) and \\(k\\), then tune against latency and memory budgets&#13;<br \/>\n\tvalidate hash behavior with real key distributions&#13;<br \/>\n\tinstrument pass-through and saturation metrics&#13;<br \/>\n\tpredefine rebuild\/rotation policy&#13;<\/p>\n<p>Applied this way, Bloom filters remain a high-leverage optimization rather than a fragile micro-optimization.<\/p>\n<p>Conclusion<\/p>\n<p>Bloom filters solved the problem we actually had: too many expensive &#8220;is this seen?&#8221; checks on a negative-heavy path. By moving membership filtering into memory, we removed avoidable I\/O from the feed-serving critical path and regained latency headroom without exploding memory usage.<\/p>\n<p>The key lesson is not &#8220;always use Bloom filters&#8221;.\u00a0It is to treat them as a tunable systems component: set \\(m\\) and\u00a0\\(k\\) from product tolerance and traffic shape, choose hash functions for both distribution and throughput, and monitor saturation before it silently erodes value.<\/p>\n<p>In workloads like recommendation filtering, where rare false positives are acceptable, Bloom filters can be more than a textbook gimmick: they can be a practical, production-grade lever for performance and cost.<\/p>\n<p>Appendix: The Math Behind Bloom Filters<\/p>\n<p>If you\u2019re interested in the detailed math behind these results, read on for a full derivation and explanation.<\/p>\n<p>Why False Positives Happen<\/p>\n<p>To understand when false positives are possible, recall that:<\/p>\n<p>&#13;<br \/>\n\tEach element is inserted by setting \\(k\\) bits in a bit array of size \\(m\\).&#13;<br \/>\n\tBits are never cleared, and different elements may set 
overlapping bits.&#13;<\/p>\n<p>A false positive occurs when a query element happens to have all its \\(k\\) hash bits already set by other elements, even though it was never inserted.<\/p>\n<p>Probability a Single Bit Is Still 0<\/p>\n<p>As explained, a Bloom filter\u2019s behavior is determined by three parameters:<\/p>\n<p>&#13;<br \/>\n\t\\(m\\) = number of bits in the array&#13;<br \/>\n\t\\(k\\) = number of hash functions&#13;<br \/>\n\t\\(n\\) = number of inserted elements&#13;<\/p>\n<p>When inserting a single element, each hash sets a bit. The probability that a particular bit is left at 0 by a single hash application is:<\/p>\n<p>\\(p_0 = 1 - \\frac{1}{m}\\)<\/p>\n<p>After inserting \\(n\\) elements, with \\(k\\) hashes each, the probability a bit is still 0 is:<\/p>\n<p>\\(p_0 = \\left( 1 - \\frac{1}{m} \\right)^{kn}\\)<\/p>\n<p>Approximation: for large \\(m\\), \\((1 - 1\/m)^{kn} \\approx e^{-kn\/m}\\). This gives a simpler formula for reasoning about false positives.<\/p>\n<p>Probability of a False Positive<\/p>\n<p>A query element is a false positive if all \\(k\\) of its bits are 1. 
Using the previous step:<\/p>\n<p>\\(P_{fp} = (1 - p_0)^k \\approx (1 - e^{-kn\/m})^k\\)<\/p>\n<p>&#13;<br \/>\n\tMore inserted elements \u2192 higher probability that bits are already set \u2192 higher false-positive rate.&#13;<br \/>\n\tLarger bit array \\(m\\) \u2192 reduces collisions \u2192 lower false-positive rate.&#13;<br \/>\n\tNumber of hash functions \\(k\\) controls the balance: too few \u2192 high false positives; too many \u2192 higher CPU cost without much gain.&#13;<\/p>\n<p>Choosing \\(m\\) and \\(k\\)<\/p>\n<p>These formulas allow engineers to compute practical parameters:<\/p>\n<p>&#13;<br \/>\n\tFor a desired false-positive rate \\(P\\), given expected number of elements \\(n\\):<br \/>&#13;<br \/>\n\t\\(m \\approx -\\frac{n \\ln P}{(\\ln 2)^2}, \\quad k \\approx \\frac{m}{n} \\ln 2\\)&#13;<br \/>\n\t\\(m\\) controls memory usage and saturation of the bit array.&#13;<br \/>\n\t\\(k\\) controls the number of hash operations per element.&#13;<br \/>\n\tThese formulas provide a starting point, which can later be refined based on the actual hash functions and workload.&#13;<\/p>\n<p>As saturation increases (the fraction of bits set approaches 1), the false-positive rate approaches 1 as well, which is why sizing and lifecycle policy are both critical.<br \/>&#13;<br \/>\nWithout this analysis, a Bloom filter may either produce too many false positives or waste memory (we may not even know which). 
Understanding the math is crucial for configuring Bloom filters for real-world applications, ensuring they provide the intended performance benefits while managing trade-offs effectively.<\/p>\n","protected":false},"excerpt":{"rendered":"Key Takeaways &#13; Bloom filters provide efficient probabilistic membership testing with no false negatives and controlled false-positive rates.&#13;&hellip;\n","protected":false},"author":2,"featured_media":587693,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[224216,49,48,4113,207381,12520,13692,61],"class_list":{"0":"post-587692","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-technology","8":"tag-bloom-filters-practice-go-recommender","9":"tag-ca","10":"tag-canada","11":"tag-development","12":"tag-go-language","13":"tag-optimization","14":"tag-performance-scalability","15":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/587692","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/comments?post=587692"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/587692\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media\/587693"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media?parent=587692"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/categories?post=587692"},{"taxonomy":"post_tag","embeddable":true,"href":"http
s:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/tags?post=587692"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}