{"id":222032,"date":"2026-01-05T18:09:09","date_gmt":"2026-01-05T18:09:09","guid":{"rendered":"https:\/\/www.newsbeep.com\/il\/222032\/"},"modified":"2026-01-05T18:09:09","modified_gmt":"2026-01-05T18:09:09","slug":"engineering-speed-at-scale-architectural-lessons-from-sub-100-ms-apis","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/il\/222032\/","title":{"rendered":"Engineering Speed at Scale \u2014 Architectural Lessons from Sub-100-ms APIs"},"content":{"rendered":"<p>\t\t\t\t\t\t\t\t\tKey Takeaways<br \/>\n\t\t\t\t\t\t\t\t\t&#13;<br \/>\n\tTreat latency as a first-class product concern \u2014 designed with the same discipline as security and reliability.&#13;<br \/>\n\tUse a latency budget to turn &#8220;sub-100ms&#8221; into enforceable constraints across every hop in the request path.&#13;<br \/>\n\tExpect speed to regress unless you actively guard it as the system, traffic, and dependencies evolve.&#13;<br \/>\n\tKeep performance ownership broad by baking it into reviews, dashboards, and release practices \u2014 not a single &#8220;performance team.&#8221;&#13;<br \/>\n\tLet architecture create the fast path, and let culture (measurement + accountability) keep it fast over time.<br \/>&#13;<br \/>\n\t\u00a0&#13;<\/p>\n<p>\t\t\t\t\t\t\t\tThe Cost of a Millisecond: Why Latency Shapes Experience<\/p>\n<p>When we talk about API performance, it\u2019s tempting to think in neat technical terms &#8211; response times, CPU cycles, connection pools, and the occasional flame graph. But in real-world systems, especially global commerce and payments platforms, latency has a very human cost. A delay of just 50 or 100 milliseconds rarely registers in isolation, but at scale it can nudge a customer away from completing a purchase, disrupt a payment flow, or simply chip away at the trust users place in your product.<\/p>\n<p>Speed shapes perception long before it shapes metrics. Users don\u2019t measure latency with stopwatches &#8211; they feel it. 
The difference between a 120 ms checkout step and an 80 ms one is invisible to the naked eye, yet emotionally it becomes the difference between &#8220;smooth&#8221; and &#8220;slightly annoying&#8221;. On a small scale, that\u2019s forgettable. Across millions of sessions, it becomes the friction that compounds into lower conversion rates, abandoned carts, and reduced revenue. And the irony? The engineering effort needed to recover from that friction &#8211; new features, experiments, retention strategies &#8211; often dwarfs the work needed to prevent it in the first place.<\/p>\n<p><img decoding=\"async\" alt=\"\" class=\"zoom-image\" src=\"https:\/\/www.infoq.com\/articles\/engineering-speed-scale\/articles\/engineering-speed-scale\/en\/resources\/170figure-1-1767008190654.jpg\" style=\"width: 3212px; height: 1490px;\" rel=\"share\"\/><\/p>\n<p>In high-throughput platforms, latency amplifies. If a service adds 30 ms in normal conditions, it might add 60 ms during peak load, then 120 ms when a downstream dependency wobbles. Latency doesn\u2019t degrade gracefully; it compounds. And once your tail latency (p95, p99) drifts, it silently &#8220;taxes&#8221; every upstream service that depends on you. Each service adds its own jitter, serialization overhead, and network hop. What starts as a tiny bump in one API becomes a cascading slowdown across dozens of interconnected services.<\/p>\n<p>This is why high-performing architecture teams treat speed as a product feature, not a pleasant side effect. They design for latency the same way they design for security and reliability: intentionally, with clear budgets, well-defined expectations, and patterns that protect the user experience under stress.<\/p>\n<p>A helpful way to see this is through a &#8220;latency budget&#8221;. 
Instead of thinking about performance as a single number &#8211; say, &#8220;API must respond in under 100 ms&#8221; &#8211; modern teams break it down across the entire request path:<\/p>\n<p>&#13;<br \/>\n\t10 ms at the edge&#13;<br \/>\n\t5 ms for routing&#13;<br \/>\n\t30 ms for application logic&#13;<br \/>\n\t40 ms for data access&#13;<br \/>\n\t10\u201315 ms for network hops and jitter&#13;<\/p>\n<p>Each layer is allocated a slice of the total budget. This transforms latency from an abstract target into a concrete architectural constraint. Suddenly, trade-offs become clearer: &#8220;If we add feature X in the service layer, what do we remove or optimize so we don\u2019t blow the budget?&#8221; These conversations &#8211; technical, cultural, and organizational &#8211; are where fast systems are born.<\/p>\n<p>The heart of this article is simple: low latency isn\u2019t an optimization &#8211; it\u2019s a design outcome. It emerges from the choices we make about data locality, async vs. sync flows, cache boundaries, error isolation, and observability. Achieving sub-100 ms is possible for many systems, but sustaining it under load takes alignment across engineering, product, and operations.<\/p>\n<p>In the sections that follow, we\u2019ll break down how real systems are structured, how engineering teams make trade-offs when milliseconds matter, and how organizations sustain performance long after the first release ships. Fast systems don\u2019t happen accidentally &#8211; they\u2019re engineered with intent.<\/p>\n<p>Inside the Fast Lane: How Low-Latency Systems Are Structured<\/p>\n<p>Before we talk about optimizing performance, we need to zoom out and understand what a low-latency system actually looks like. Sub-100 ms responses don\u2019t come from a single clever trick; they emerge from a carefully orchestrated pipeline of components that work together with minimal friction. 
Think of it less as &#8220;making one thing fast&#8221; and more as &#8220;removing unnecessary steps from an entire journey&#8221;.<\/p>\n<p>Most modern systems &#8211; especially in commerce and payments &#8211; follow a layered architecture that looks deceptively simple from the outside: a client makes a request, it hits an API gateway, flows through a service layer, talks to a database, and returns. But behind that simple flow is an intricate chain where every hop, every serialization, every cache hit or miss shapes the user\u2019s experience.<\/p>\n<p>Let\u2019s walk through the anatomy of a fast system and where milliseconds typically hide.<\/p>\n<p>The Request Journey: Where Latency Sneaks In<\/p>\n<p>A typical sub-100 ms request flow might look like this:<\/p>\n<p>&#13;<br \/>\n\tClient \u2192 CDN or Edge Network<br \/>&#13;<br \/>\n\tThe closest node absorbs the request and routes it smartly.<br \/>&#13;<br \/>\n\tLatency target: 5\u201315 ms&#13;<br \/>\n\tEdge \u2192 API Gateway<br \/>&#13;<br \/>\n\tAuthentication, routing, throttling.<br \/>&#13;<br \/>\n\tLatency target: 5 ms&#13;<br \/>\n\tGateway \u2192 Service Layer<br \/>&#13;<br \/>\n\tBusiness logic, orchestration, fan-out.<br \/>&#13;<br \/>\n\tLatency target: 20\u201330 ms&#13;<br \/>\n\tService Layer \u2192 Data Layer<br \/>&#13;<br \/>\n\tReads from databases, caches, or search systems.<br \/>&#13;<br \/>\n\tLatency target: 25\u201340 ms&#13;<br \/>\n\tService \u2192 Gateway \u2192 Client<br \/>&#13;<br \/>\n\tSerialization and network hop back.<br \/>&#13;<br \/>\n\tLatency target: 5\u201310 ms&#13;<\/p>\n<p>When done right, the entire hop chain stays predictable &#8211; even during peak load. But if any single hop drifts, the whole chain inherits the slowdown. 
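<\/p>\n<p>To make the hop budget concrete, here is a minimal, illustrative sketch &#8211; the numbers mirror the flow above, and nothing in it is a real API:<\/p>

```java
// Illustrative latency budget check; hop values mirror the example flow above.
public class LatencyBudget {
    public static void main(String[] args) {
        int edge = 15;     // Client -> CDN / edge network
        int gateway = 5;   // auth, routing, throttling
        int service = 30;  // business logic, orchestration, fan-out
        int data = 40;     // database / cache reads
        int back = 10;     // serialization + network hop back

        int total = edge + gateway + service + data + back;
        System.out.println("total = " + total + " ms");     // 100 ms: exactly on budget

        // If a single hop drifts -- say the data layer doubles under load --
        // the whole chain inherits the overrun:
        int drifted = total - data + 80;
        System.out.println("drifted = " + drifted + " ms"); // 140 ms: budget blown
    }
}
```

<p>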
This is why fast systems start with understanding the complete journey, not just the part you own.<\/p>\n<p><img decoding=\"async\" alt=\"\" class=\"zoom-image\" src=\"https:\/\/www.infoq.com\/articles\/engineering-speed-scale\/articles\/engineering-speed-scale\/en\/resources\/131figure-2-1767008190654.jpg\" style=\"width: 3160px; height: 798px;\" rel=\"share\"\/><\/p>\n<p>Where Latency Really Comes From (Not Where You Expect)<\/p>\n<p>Latency is rarely caused by &#8220;slow code&#8221;. In production systems, it usually comes from:<\/p>\n<p>1. Network Hops<\/p>\n<p>Each hop adds cost:<\/p>\n<p>&#13;<br \/>\n\tTLS handshakes&#13;<br \/>\n\tConnection pool waits&#13;<br \/>\n\tDNS lookups&#13;<br \/>\n\tRegion-to-region travel&#13;<\/p>\n<p>Shaving one hop often saves more latency than rewriting 100 lines of Java.<\/p>\n<p>2. Serialization &amp; Payload Size<\/p>\n<p>JSON serialization\/deserialization is more expensive than people realize. Every unnecessary field is extra work. Binary formats (e.g., Protobuf) can help, but they add operational overhead.<\/p>\n<p>3. Cold Caches<\/p>\n<p>A cache miss at the wrong time can double or triple service latency. This is why &#8220;warming strategies&#8221; matter when deploying new versions.<\/p>\n<p>4. Database Query Shape<\/p>\n<p>Database latency is often an access-pattern problem: query shape, indexes, and cardinality matter. A poorly indexed read can turn a 10 ms query into a 120 ms spike. Multiply that across thousands of requests per second, and tail latency explodes.<\/p>\n<p>5. Dependent Services<\/p>\n<p>This is where latency becomes unpredictable: if your service calls three downstreams, your response time is often gated by the slowest one.<\/p>\n<p>This is why async fan-out, caching, and circuit breakers become critical. 
(We\u2019ll go deeper soon.)<\/p>\n<p>The Latency Budget: Your Most Important Architectural Tool<\/p>\n<p>High-performing engineering teams don\u2019t just &#8220;measure latency&#8221;; they budget it. A latency budget is like a financial budget: everyone gets a fixed allowance, and nobody is allowed to overspend.<\/p>\n<p>A typical 100 ms budget might look like:<\/p>\n<table>\n<thead>\n<tr><th>Layer<\/th><th>Budget (ms)<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Edge\/CDN<\/td><td>10<\/td><\/tr>\n<tr><td>Gateway<\/td><td>5<\/td><\/tr>\n<tr><td>Service Logic<\/td><td>30<\/td><\/tr>\n<tr><td>Database \/ Cache<\/td><td>40<\/td><\/tr>\n<tr><td>Network Jitter<\/td><td>10<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>Budgets make performance manageable and negotiable. Engineers can now ask:<\/p>\n<p>&#8220;If we add feature X, which layer gives up its milliseconds?&#8221;<\/p>\n<p>Without budgets, performance conversations become vague and subjective.<\/p>\n<p>Why Understanding the System Structure Matters<\/p>\n<p>Everything we cover in later sections &#8211; async fan-out, caching hierarchies, circuit breakers, fallback strategies &#8211; only makes sense when you understand the system anatomy. 
Optimizing a single service without understanding the ecosystem is like upgrading a car engine but ignoring the wheels, brakes, and fuel system.<\/p>\n<p>Fast systems share these traits:<\/p>\n<p>&#13;<br \/>\n\tFewer hops&#13;<br \/>\n\tAggressive local caching&#13;<br \/>\n\tPredictable data-access paths&#13;<br \/>\n\tParallelism over serial execution&#13;<br \/>\n\tIsolation of slow components&#13;<br \/>\n\tStable tail latency under load&#13;<\/p>\n<p>With the system anatomy clear, we can now move into the engineering playbook &#8211; how to actually make these systems fly.<\/p>\n<p>The Engineering Playbook: Trade-Offs That Keep APIs Lightning-Fast<\/p>\n<p>Engineering for low latency is really engineering for predictability. Fast systems aren\u2019t built through micro-optimizations &#8211; they\u2019re built through a series of deliberate, layered decisions that minimize uncertainty and keep tail latency under control. This section breaks down the actual patterns, trade-offs, and guardrails used in high-throughput systems.<\/p>\n<p>Async Fan-Out: Parallelism Without Pain<\/p>\n<p>Slow APIs often boil down to one root cause: serial dependencies.<\/p>\n<p>If your system performs three downstream calls at 40 ms each, you\u2019ve already lost 120 ms without doing any real business work.<\/p>\n<p>Fan out in parallel<\/p>\n<p>Java\u2019s CompletableFuture is a natural fit, especially when paired with a custom executor tuned for downstream concurrency:<\/p>\n<p>&#13;<br \/>\nExecutorService pool = new ThreadPoolExecutor(&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 20, 40, 60, TimeUnit.SECONDS,&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 new LinkedBlockingQueue&lt;&gt;(500),&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 new ThreadPoolExecutor.CallerRunsPolicy()&#13;<br \/>\n);&#13;<br \/>\n&#13;<br \/>\nCompletableFuture&lt;Profile&gt; profileFuture =&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 CompletableFuture.supplyAsync(() -&gt; profileClient.getProfile(userId), pool);&#13;<br \/>\n&#13;<br \/>\nCompletableFuture&lt;List&lt;Recommendation&gt;&gt; recsFuture =&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 CompletableFuture.supplyAsync(() -&gt; recClient.getRecs(userId), pool);&#13;<br \/>\n&#13;<br \/>\nCompletableFuture&lt;List&lt;Order&gt;&gt; orderFuture =&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 CompletableFuture.supplyAsync(() -&gt; orderClient.getOrders(userId), pool);&#13;<br \/>\n&#13;<br \/>\nreturn CompletableFuture.allOf(profileFuture, recsFuture, orderFuture)&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 .thenApply(v -&gt; new HomeResponse(&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 profileFuture.join(),&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 recsFuture.join(),&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 orderFuture.join()&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 ));<\/p>\n<p>But here\u2019s the caution most articles never mention:<\/p>\n<p>Async code doesn\u2019t eliminate blocking &#8211; it just hides it inside a thread pool.<\/p>\n<p>If your executor is misconfigured, you can trigger:<\/p>\n<p>&#13;<br \/>\n\tCPU thrashing&#13;<br \/>\n\tThread contention&#13;<br \/>\n\tQueue buildup&#13;<br \/>\n\tOut-of-memory errors&#13;<br \/>\n\tCascading slowdowns across the entire fleet&#13;<\/p>\n<p>Thread Pool Rule of Thumb:<\/p>\n<p>For downstream IO-bound calls, size your pool to:<br \/>&#13;<br \/>\n2\u00d7CPU cores \u00d7 expected parallel downstream calls per request<br \/>&#13;<br \/>\n(adjust using p95\/p99 load testing)<\/p>\n<p><img decoding=\"async\" alt=\"\" class=\"zoom-image\" src=\"https:\/\/www.infoq.com\/articles\/engineering-speed-scale\/articles\/engineering-speed-scale\/en\/resources\/111figure-3-1767010410236.jpg\" rel=\"share\"\/><\/p>\n<p>Multi-Level Caching: The Art of Fast Paths<\/p>\n<p>Fast systems don\u2019t eliminate work &#8211; they avoid doing the same expensive work repeatedly.<\/p>\n<p>A typical hierarchy:<\/p>\n<p>&#13;<br \/>\n\tLocal cache (Caffeine) &#8211; sub-millisecond&#13;<br 
\/>\n\tRedis cache &#8211; 3\u20135 ms&#13;<br \/>\n\tDatabase &#8211; 20\u201360+ ms&#13;<\/p>\n<p>Use the dual-level caching pattern. In this example, Redis uses a 10-minute TTL, and the local in-memory cache should also be time-bounded (usually shorter), otherwise it can quietly become a &#8220;forever cache&#8221; and serve stale data across instances.<\/p>\n<p>&#13;<br \/>\npublic ProductService(RedisClient redis, ProductDb db) {&#13;<br \/>\n\u00a0 \u00a0 this.redis = redis;&#13;<br \/>\n\u00a0 \u00a0 this.db = db;&#13;<br \/>\n\u00a0 \u00a0 this.localCache = Caffeine.newBuilder()&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 .maximumSize(50_000)&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 .expireAfterWrite(Duration.ofMinutes(1)) \/\/ shorter than Redis&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 .build();&#13;<br \/>\n\u00a0 }&#13;<br \/>\n&#13;<br \/>\npublic ProductInfo getProductInfo(String productId) {&#13;<br \/>\n\u00a0 \u00a0 ProductInfo local = localCache.getIfPresent(productId);&#13;<br \/>\n\u00a0 \u00a0 if (local != null) return local;&#13;<br \/>\n&#13;<br \/>\n\u00a0 \u00a0 ProductInfo redisValue = redis.get(productId);&#13;<br \/>\n\u00a0 \u00a0 if (redisValue != null) {&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 localCache.put(productId, redisValue);&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 return redisValue;&#13;<br \/>\n\u00a0 \u00a0 }&#13;<br \/>\n&#13;<br \/>\n\u00a0 \u00a0 ProductInfo dbValue = db.fetch(productId);&#13;<br \/>\n&#13;<br \/>\n\u00a0 \u00a0 redis.set(productId, dbValue, Duration.ofMinutes(10));&#13;<br \/>\n\u00a0 \u00a0 \/\/ localCache is configured with expireAfterWrite(1, MINUTES)&#13;<br \/>\n\u00a0 \u00a0 localCache.put(productId, dbValue);&#13;<br \/>\n\u00a0 \u00a0 return dbValue;&#13;<br \/>\n}&#13;<br \/>\n&#13;<\/p>\n<p>This drives most requests into the fast path and reserves slow work for the cold path.<\/p>\n<p><img decoding=\"async\" alt=\"\" class=\"zoom-image\" 
src=\"https:\/\/www.infoq.com\/articles\/engineering-speed-scale\/articles\/engineering-speed-scale\/en\/resources\/76figure-4-1767010410236.jpg\" style=\"width: 3140px; height: 998px;\" rel=\"share\"\/><\/p>\n<p>Cache Invalidation: The Hardest Problem in Computer Science (Still True)<\/p>\n<p>Low-latency systems are heavily cache-driven, but caching without a clear invalidation strategy is a time bomb.<\/p>\n<p>There are three kinds of invalidation:<\/p>\n<table>\n<thead>\n<tr><th>Invalidation style<\/th><th>How it works<\/th><th>Trade-off<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Time-based (TTL)<\/td><td>Entries expire after a fixed lifetime<\/td><td>Simple, safe, and widely used, but stale-data risk grows with longer TTLs<\/td><\/tr>\n<tr><td>Event-based<\/td><td>Producer sends &#8220;invalidate&#8221; events to downstream caches whenever data changes<\/td><td>Keeps data fresh, but requires strong data ownership<\/td><\/tr>\n<tr><td>Version-based<\/td><td>Cache keys include a version (product:v2:12345); bumping the version makes old data unreachable<\/td><td>Safe rollouts, but superseded entries linger until evicted<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>There is no universally &#8220;best&#8221; invalidation strategy. The right choice depends on how often data changes and how costly staleness is &#8211; which is exactly why classification matters.<\/p>\n<p>Data Classification: Not Everything Belongs in Cache<\/p>\n<p>This is the part almost every caching article ignores, but real systems cannot.<\/p>\n<p>You cannot treat all data as equal. 
Before caching anything, classify the data:<\/p>\n<table>\n<thead>\n<tr><th>Classification<\/th><th>What it means<\/th><th>Caching guidance<\/th><th>Examples<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Public<\/td><td>Safe to cache anywhere (CDN, Redis, local memory)<\/td><td>Cache freely; TTL-based is usually fine<\/td><td>Product titles, images, metadata<\/td><\/tr>\n<tr><td>Internal<\/td><td>Cacheable with restrictions<\/td><td>Cache with guardrails (scope, TTL, access control)<\/td><td>Internal IDs, flags<\/td><\/tr>\n<tr><td>Confidential (PII)<\/td><td>Sensitive user data<\/td><td>Cache only if encrypted, and with a strict TTL<\/td><td>Email, phone, full user info<\/td><\/tr>\n<tr><td>Restricted (PCI)<\/td><td>Highly regulated payment data<\/td><td>Never cached<\/td><td>Raw card numbers, CVV, unmasked PAN<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>When to be strict vs. loose?<\/p>\n<p>The caching strategy depends on the type of data. For example:<\/p>\n<p>&#13;<br \/>\n\tProduct catalog \u2192 loose TTL is fine (staleness OK)&#13;<br \/>\n\tPricing, offers \u2192 tighter TTL or event-based&#13;<br \/>\n\tPayments, balances \u2192 never cached, or only tokenized\/aggregated versions&#13;<\/p>\n<p>A simple classification check can protect engineering teams from accidental compliance violations.<\/p>\n<p>&#13;<br \/>\nif (data.isRestricted()) {&#13;<br \/>\n\u00a0 \u00a0 throw new UnsupportedOperationException("Cannot cache PCI\/PII data");&#13;<br \/>\n}<\/p>\n<p>Circuit Breakers: Don\u2019t Let Slow Dependencies Infect Your Tail Latency<\/p>\n<p>Downstream slowness is one of the biggest drivers of p99 spikes. 
A dependency doesn\u2019t need to be fully down to cause trouble &#8211; sustained latency is enough. If every request waits on a degrading downstream call, you start consuming threads, building queues, and turning a local slowdown into a broader tail-latency problem.<\/p>\n<p>A circuit breaker helps by acting as a boundary between your service and an unstable dependency. When errors or timeouts cross a threshold, the breaker opens and temporarily stops sending traffic there. That shifts the system from &#8220;wait and accumulate&#8221; to a predictable outcome: fail fast and fall back, keeping your own API responsive.<\/p>\n<p>Resilience4j gives lightweight protection:<\/p>\n<p>&#13;<br \/>\nCircuitBreakerConfig config = CircuitBreakerConfig.custom()&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 .failureRateThreshold(50)&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 .slidingWindowSize(20)&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 .waitDurationInOpenState(Duration.ofSeconds(5))&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 .build();&#13;<br \/>\n&#13;<br \/>\nCircuitBreaker cb = CircuitBreaker.of("recs", config);&#13;<br \/>\n&#13;<br \/>\nSupplier&lt;List&lt;Recommendation&gt;&gt; supplier =&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 CircuitBreaker.decorateSupplier(cb, () -&gt; recClient.getRecs(userId));&#13;<br \/>\n&#13;<br \/>\ntry {&#13;<br \/>\n\u00a0 \u00a0 return supplier.get();&#13;<br \/>\n} catch (Exception ex) {&#13;<br \/>\n\u00a0 \u00a0 return Collections.emptyList(); \u00a0\/\/ fast fallback&#13;<br \/>\n}&#13;<br \/>\n&#13;<\/p>\n<p>When the breaker opens:<\/p>\n<p>&#13;<br \/>\n\tCalls fail fast (&lt;1 ms)&#13;<br \/>\n\tNo threads are blocked&#13;<br \/>\n\tYour API stays stable&#13;<\/p>\n<p>Fallbacks: When &#8220;Fast and Partial&#8221; Beats &#8220;Slow and Perfect&#8221;<\/p>\n<p>Fallbacks keep your fast path intact when a dependency is slow or unavailable. 
The point isn\u2019t to pretend nothing happened &#8211; it\u2019s to stop downstream slowness from consuming your latency budget. In many user flows, a slightly degraded response delivered quickly is better than a perfect response delivered late.<\/p>\n<p>Fallbacks should:<\/p>\n<p>&#13;<br \/>\n\tProvide something useful&#13;<br \/>\n\tBe predictably fast&#13;<br \/>\n\tNot cause additional load&#13;<br \/>\n\tBe easy to reason about&#13;<\/p>\n<p>Timeouts are part of the design. If a downstream timeout is &#8220;a few seconds&#8221;, it can quietly destroy a sub-100ms target. Timeouts need to align with the latency budget you set earlier and the dependency\u2019s p95\/p99 behavior &#8211; especially in fan-out paths where one slow call can dominate tail latency.<\/p>\n<p>Here\u2019s an example that returns a cached snapshot if the full page can\u2019t be assembled quickly. This only works because it builds on the caching approach discussed earlier &#8211; another reminder that low latency is holistic (budgets, caching, timeouts, and resilience patterns working together):<\/p>\n<p>&#13;<br \/>\npublic ProductPageResponse getPage(String productId) {&#13;<br \/>\n\u00a0 \u00a0 try {&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 return fetchFullPage(productId);&#13;<br \/>\n\u00a0 \u00a0 } catch (TimeoutException e) {&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 return fetchCachedSnapshot(productId); \u00a0\/\/ warm, minimal, safe&#13;<br \/>\n\u00a0 \u00a0 }&#13;<br \/>\n}<\/p>\n<p>Fallbacks don\u2019t eliminate failures &#8211; they bound the user impact when things get slow.<\/p>\n<p>Data Partitioning: Reducing Hotspots and Tail Spikes<\/p>\n<p>Partitioning reduces lock contention, narrows index scans, and improves locality.<\/p>\n<p>Here is a simple example where data is partitioned by region:<\/p>\n<p>&#13;<br \/>\nCREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('US');&#13;<br \/>\nCREATE TABLE orders_eu PARTITION OF orders FOR VALUES IN ('EU');<\/p>\n<p>The application layer needs corresponding updates to use the partition effectively:<\/p>\n<p>&#13;<br \/>\nString table = region.equals("US") ? "orders_us" : "orders_eu";&#13;<br \/>\nreturn jdbc.query("SELECT * FROM " + table + " WHERE user_id=?", userId);<\/p>\n<p>Partitioning is essential for read-heavy API systems.<\/p>\n<p>Observability: Making Speed Measurable<\/p>\n<p>Fast systems are not just the result of good architecture &#8211; they\u2019re the result of relentless observability. Latency budgets, circuit breakers, caching layers, thread pools\u2026 none of them matter if you don\u2019t know when and where your system drifts under real traffic.<\/p>\n<p>The biggest myth about low latency is that once you achieve it, you\u2019re done. The truth is the opposite:<\/p>\n<p>Speed decays unless you actively guard it.<\/p>\n<p>This is why high-performing engineering teams treat observability as a first-class citizen &#8211; not a debugging tool, but a continuous performance governance mechanism.<\/p>\n<p>Measure What Matters: p50, p95, p99, and Beyond<\/p>\n<p>Most dashboards proudly show average latency, which is almost useless in distributed systems. 
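<\/p>\n<p>A toy example of why the average misleads &#8211; the latency numbers are made up, and the nearest-rank percentile math is deliberately naive:<\/p>

```java
import java.util.Arrays;

// Made-up sample: 98 requests at 40 ms plus two slow outliers.
public class TailVsAverage {
    public static void main(String[] args) {
        long[] ms = new long[100];
        Arrays.fill(ms, 40);
        ms[98] = 400;
        ms[99] = 500;

        double avg = Arrays.stream(ms).average().orElse(0);

        long[] sorted = ms.clone();
        Arrays.sort(sorted);
        long p50 = sorted[49];   // naive nearest-rank percentiles
        long p99 = sorted[98];

        // avg = 48.2 ms looks healthy; p99 = 400 ms is what the unlucky user feels
        System.out.printf("avg=%.1f ms, p50=%d ms, p99=%d ms%n", avg, p50, p99);
    }
}
```

<p>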
What users actually feel is tail latency:<\/p>\n<p>&#13;<br \/>\n\tp50 \u2192 &#8220;typical user&#8221;&#13;<br \/>\n\tp95 \u2192 &#8220;slightly unlucky user&#8221;&#13;<br \/>\n\tp99 \u2192 &#8220;customer who will abandon your product if this happens too often&#8221;&#13;<\/p>\n<p>If your p50 is 45 ms but your p99 is 320 ms, your system isn\u2019t fast &#8211; it\u2019s merely nice sometimes.<\/p>\n<p>Fast systems tune for predictability, not just averages.<\/p>\n<p>Instrumentation with Micrometer<\/p>\n<p><a href=\"https:\/\/micrometer.io\/\" rel=\"nofollow noopener\" target=\"_blank\">Micrometer<\/a> is the de facto standard for metrics in modern Java systems, and it makes latency instrumentation almost trivial.<\/p>\n<p>Here\u2019s a Micrometer timer for an API endpoint:<\/p>\n<p>&#13;<br \/>\n@Autowired&#13;<br \/>\nprivate MeterRegistry registry;&#13;<br \/>\n&#13;<br \/>\npublic ProductInfo fetchProduct(String id) {&#13;<br \/>\n\u00a0 \u00a0 return registry.timer("api.product.latency")&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 .record(() -&gt; productService.getProductInfo(id));&#13;<br \/>\n}<\/p>\n<p>With percentile histograms enabled on the registry, this single timer produces:<\/p>\n<p>&#13;<br \/>\n\tp50, p90, p95, p99 histograms&#13;<br \/>\n\tthroughput (requests\/sec)&#13;<br \/>\n\tmax observed latency&#13;<br \/>\n\ttime-series for dashboards&#13;<br \/>\n\tSLO burn-rate signals&#13;<\/p>\n<p>Custom tags can be added for deeper insight.<\/p>\n<p>&#13;<br \/>\nregistry.timer("api.product.latency",&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 "region", userRegion,&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 "cacheHit", cacheHit ? "true" : "false"&#13;<br \/>\n);<\/p>\n<p>A rule we use internally:<\/p>\n<p>Tag everything that might affect latency.<br \/>&#13;<br \/>\nRegion, device type, API version, cache hit\/miss, fallback triggered, etc.<\/p>\n<p>This creates semantic observability &#8211; the opposite of blind metrics.<\/p>\n<p>Distributed Tracing: The Truth Serum of Low-Latency Systems<\/p>\n<p>Metrics tell you how long something took. Tracing tells you why.<\/p>\n<p>Using OpenTelemetry + Jaeger, you can map an entire request journey:<\/p>\n<p>&#13;<br \/>\nSpan span = tracer.spanBuilder("fetchProduct")&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 .setSpanKind(SpanKind.SERVER)&#13;<br \/>\n\u00a0 \u00a0 \u00a0 \u00a0 .startSpan();&#13;<br \/>\n&#13;<br \/>\ntry (Scope scope = span.makeCurrent()) {&#13;<br \/>\n\u00a0 \u00a0 return productService.getProduct(id);&#13;<br \/>\n} finally {&#13;<br \/>\n\u00a0 \u00a0 span.end();&#13;<br \/>\n}<\/p>\n<p>When visualized in Jaeger, you\u2019ll see:<\/p>\n<p>&#13;<br \/>\n\tGateway time&#13;<br \/>\n\tService logic time&#13;<br \/>\n\tParallel calls&#13;<br \/>\n\tCache vs DB path&#13;<br \/>\n\tDownstream delays&#13;<br \/>\n\tSerialization time&#13;<\/p>\n<p>This is how teams discover issues like:<\/p>\n<p>&#13;<br \/>\n\t&#8220;DB is fine, but Redis has a spike every hour&#8221;.&#13;<br \/>\n\t&#8220;API gateway is spending 10 ms doing header parsing&#8221;.&#13;<br \/>\n\t&#8220;Thread pool starvation during peak traffic&#8221;.&#13;<\/p>\n<p>Tracing pinpoints latency leaks no dashboard could reveal.<\/p>\n<p>SLOs and Latency Budgets: The Guardrails That Keep Teams Honest<\/p>\n<p>Latency budgets, as discussed earlier, only work when teams measure and enforce them.<\/p>\n<p>A typical SLO (Service Level Objective):<\/p>\n<p>&#13;<br \/>\n\tTarget: p95 &lt; 120 ms&#13;<br \/>\n\tPeriod: Rolling 30 days&#13;<br \/>\n\tError budget: 5% of requests may exceed threshold&#13;<\/p>\n<p>SLO burn rate is simply how 
fast you\u2019re spending that error budget compared to the &#8220;expected&#8221; pace. A burn rate of 1 means you\u2019re consuming budget at the rate that would use it up exactly by the end of the SLO window; anything above 1 means you\u2019re burning it faster than planned. When the burn rate spikes, teams slow down feature releases and prioritize performance fixes (rollback, reduce load, tune hot paths, fix a slow dependency, etc.). This is one of the most practical ways to keep &#8220;sub-100ms&#8221; from becoming a quarterly goal that slowly drifts.<\/p>\n<p>A very useful burn-rate alert rule:<\/p>\n<p>Alert if burn-rate &gt; 14.4 over 10 minutes<br \/>&#13;<br \/>\nTranslation: 14.4 is a commonly used &#8220;fast-burn&#8221; threshold &#8211; if that pace were sustained, you\u2019d consume a 30-day error budget in roughly two days (\u224850 hours), which is why it\u2019s treated as urgent.<\/p>\n<p>How this prevents issues from reaching customers: Burn-rate alerts are designed to fire early, while the regression is still small (or still limited to a subset of traffic). That gives you time to pause or reverse a rollout and fix the underlying cause before the slowdown becomes widespread and sustained. Teams often pair this with progressive delivery (canaries) and synthetic checks, but the key is that burn-rate is an SLO-native early warning tied directly to user-facing latency.<\/p>\n<p>Thread Pool Observability: The Hidden Latency Killer<\/p>\n<p>Thread pools are one of the easiest ways to accidentally break a latency budget. They look like a performance win (&#8220;parallelize downstream calls&#8221;), but under load they can become a bottleneck: threads saturate, queues grow, requests start waiting, and what used to be &#8220;async fan-out&#8221; quietly turns into backpressure and tail-latency spikes. 
The tricky part is that this doesn\u2019t always show up as high CPU &#8211; it often shows up as waiting.<\/p>\n<p>That\u2019s why observability matters here. Without visibility into pool saturation and queue growth, you only notice the problem after p99 has already exploded. Instrument your pool:<\/p>\n<p>&#13;<br \/>\nThreadPoolExecutor executor = (ThreadPoolExecutor) pool;&#13;<br \/>\n&#13;<br \/>\nregistry.gauge("threadpool.active", executor, ThreadPoolExecutor::getActiveCount);&#13;<br \/>\nregistry.gauge("threadpool.queue.size", executor, e -&gt; e.getQueue().size());&#13;<br \/>\nregistry.gauge("threadpool.completed", executor, e -&gt; e.getCompletedTaskCount());&#13;<br \/>\nregistry.gauge("threadpool.pool.size", executor, ThreadPoolExecutor::getPoolSize);<\/p>\n<p>If you see:<\/p>\n<p>&#13;<br \/>\n\tActive threads == max size&#13;<br \/>\n\tQueue constantly growing&#13;<br \/>\n\tRejection count increasing&#13;<\/p>\n<p>\u2026 then your async fan-out is turning into async pile-up, which leads to:<\/p>\n<p>&#13;<br \/>\n\tRetries&#13;<br \/>\n\tTimeouts&#13;<br \/>\n\tCascaded slowness&#13;<br \/>\n\tp99 explosions&#13;<\/p>\n<p>Thread pool monitoring is non-negotiable in low-latency environments.<\/p>\n<p>Observability Isn\u2019t a Dashboard &#8211; It\u2019s a Culture<\/p>\n<p>The most important insight is cultural:<\/p>\n<p>&#13;<br \/>\n\tTeams own their latency&#13;<br \/>\n\tDashboards are reviewed weekly&#13;<br \/>\n\tSLOs drive engineering priorities&#13;<br \/>\n\tPerformance regressions trigger postmortems&#13;<br \/>\n\tCache hit rates are tracked like uptime&#13;<br \/>\n\tEvery change has a performance blast radius&#13;<\/p>\n<p>Fast systems stay fast only because teams keep them honest.<\/p>\n<p>Beyond Architecture: How Organizations Keep APIs Fast &#8211; and Where the Future Is Heading<\/p>\n<p>Engineering a sub-100 ms API is challenging; keeping it consistently fast as the system grows is 
even harder. Over time, feature creep, new dependencies, shifting traffic patterns, and organizational changes all conspire to slow systems down. Architecture provides the foundation, but long-term performance comes from habits, ownership, and a culture that treats latency as a first-class concern.<\/p>\n<p>The most reliable lesson from real-world systems is simple:<\/p>\n<p>Fast systems stay fast only when teams behave like performance is everyone\u2019s job.<\/p>\n<p>Culture Keeps Performance Alive<\/p>\n<p>High-performing organizations treat performance as a shared responsibility rather than a backend problem, and that cultural mindset is ultimately what keeps APIs fast over the long term. Teams own the latency of the services they build, planning features with an explicit understanding of how many milliseconds each change will cost and holding themselves accountable when performance drifts. Engineers routinely ask performance-conscious questions during design reviews &#8211; &#8220;How many extra hops does this add?&#8221;, &#8220;Is this cacheable?&#8221;, &#8220;What\u2019s the worst-case p99 impact?&#8221; &#8211; ensuring that latency remains part of everyday decision-making. And when things do go wrong, these organizations practice blameless learning: instead of pointing fingers, they analyze tail latency, refine patterns, adjust SLOs, and strengthen guardrails. In these cultures, performance isn\u2019t a special project &#8211; it\u2019s simply how teams work.<\/p>\n<p>Hard Lessons from Real Low-Latency Systems<\/p>\n<p>Patterns that repeatedly surface in production:<\/p>\n<p>&#13;<br \/>\n\tThread Pools Can Quietly Break Everything &#8211; Undersized pools cause starvation; oversized pools cause CPU thrashing. Misconfigured pooled async work is a top contributor to p99 explosions.&#13;<br \/>\n\tCache Invalidation Is More Critical Than Cache Hits &#8211; Cache hits are only a win when the data is correct. 
If you can\u2019t invalidate safely, it\u2019s better to be slower than to serve stale results. Event-based invalidation helps teams stay fast and correct.&#13;<br \/>\n\tVariance Hurts More Than Speed &#8211; A dependency that is always 50 ms is far safer than one that fluctuates between 10 and 300 ms. Predictability beats raw throughput.&#13;<br \/>\n\tProximity Beats Optimization &#8211; Cross-region calls consistently appear at the root of high latency. Keeping reads close to users matters more than indexing tricks.&#13;<\/p>\n<p>These lessons form the &#8220;engineering muscle memory&#8221; that separates teams who can sustain speed from those who only achieve it once.<\/p>\n<p>Anti-Patterns to Avoid<\/p>\n<p>Even mature systems fall into predictable traps:<\/p>\n<p>&#13;<br \/>\n\tTreating staging latency as meaningful&#13;<br \/>\n\tOverusing reactive patterns without isolation&#13;<br \/>\n\tLogging synchronously on the hot path&#13;<br \/>\n\tPutting too much logic into the API gateway&#13;<br \/>\n\tUsing one massive cache instead of layered caching&#13;<\/p>\n<p>These anti-patterns create slow drift &#8211; small regressions that accumulate unnoticed until p99 collapses.<\/p>\n<p>The Next Frontier of Low-Latency Systems<\/p>\n<p>Fast systems of the next decade will be defined not just by new frameworks, but by intelligent, self-adjusting behaviors:<\/p>\n<p>&#13;<br \/>\n\tAdaptive Routing Based on Real-Time Latency &#8211; Requests will route to the region, shard, or instance with the lowest real-time tail latency.&#13;<br \/>\n\tAI-Assisted Prediction &#8211; Models will predict cache misses, traffic spikes, and dependency degradation &#8211; enabling preemptive optimizations.&#13;<br \/>\n\tPredictive Cache Warming &#8211; Systems will use access patterns to warm caches minutes or seconds before high-traffic bursts begin.&#13;<br \/>\n\tEdge-Native Execution &#8211; Critical logic and pre-computed views will continuously shift closer to users, making 
&#8220;global &lt;50 ms&#8221; more achievable.&#13;<\/p>\n<p>These shifts push systems from reactive performance tuning toward proactive performance orchestration.<\/p>\n<p>The Real Insight: Architecture Is the Blueprint, Culture Is the Engine<\/p>\n<p>The final and most important takeaway:<\/p>\n<p>Architecture can make your system fast.<br \/>&#13;<br \/>\nCulture is what keeps it fast.<\/p>\n<p>Teams that monitor p99 as closely as correctness, that design with latency budgets, and that learn from regressions are the ones who consistently deliver instant-feeling experiences at scale.<\/p>\n<p>Sustained low latency is not luck &#8211; it\u2019s the outcome of small, disciplined decisions made across time, teams, and technology.<\/p>\n","protected":false}}