EventStream Cache
Real-time, in-memory EventStream cache for analysis by human operators
As the business of the customer, a Cloud Services Provider (CSP), grew, the ability of engineers to support its many users directly became increasingly critical.
Existing fleet-status query tools were limited to longer-term analytics storage as their data source because of scalability concerns in the underlying production systems. As a result, phone interactions were slowing down, and out-of-date data often caused confusion.
The problem
We were tasked with developing a tool that gives support engineers access to up-to-date fleet state. Minimal impact on existing systems was critical; as a bonus, we could share this data with other consumers to build heavyweight downstream services on top of it.
Core requirements:
- Give fresh (real-time) data access to the support engineers.
- Highly available and scalable storage.
- Low overhead on the cluster management system.
- Public, standardized interface to share cached data with the downstream services.
We identified EventStream as the only data source fresh enough for our needs. EventStream is a message bus, similar to pub/sub, that emits an event on every change to a cloud resource. Unfortunately, without a persistent cache, there is no way to query it.
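The core idea can be sketched in a few lines: consume change events and keep only the newest version of each resource, so the ephemeral stream becomes a queryable snapshot. This is a minimal illustration, not the production design; the event shape, field names, and versioning scheme here are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical event shape: each EventStream message carries the full
# latest state of one cloud resource, keyed by resource ID.
@dataclass
class ResourceEvent:
    resource_id: str
    version: int
    state: dict = field(default_factory=dict)

class InMemoryCache:
    """Keeps only the newest version of each resource, turning the
    event stream into a point-in-time queryable view."""

    def __init__(self) -> None:
        self._latest: Dict[str, ResourceEvent] = {}

    def apply(self, event: ResourceEvent) -> None:
        current = self._latest.get(event.resource_id)
        # Events may arrive out of order; keep the highest version.
        if current is None or event.version > current.version:
            self._latest[event.resource_id] = event

    def get(self, resource_id: str) -> Optional[ResourceEvent]:
        return self._latest.get(resource_id)

cache = InMemoryCache()
cache.apply(ResourceEvent("vm-1", 1, {"status": "starting"}))
cache.apply(ResourceEvent("vm-1", 2, {"status": "running"}))
cache.apply(ResourceEvent("vm-1", 1, {"status": "starting"}))  # stale, ignored
```

Keeping per-resource versions is what makes the cache safe against reordered or duplicated deliveries, which pub/sub systems generally do not rule out.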
The outcome
The EventStream in-memory cache was designed, developed and rolled out to production in under a quarter by a team of 3. The support engineer data search tools were modified to take advantage of the new, converged data store. The freshness requirements were met and as a side effect of in-memory storage, query latency was also dramatically reduced.
The effectiveness of the support engineers almost doubled: they no longer needed to keep a customer on the phone for several minutes waiting for data to show up. It also reduced the confusion and customer irritation caused by stale state in the query tool.
The solution
The CSP infrastructure control plane uses an internal pub/sub system to publish state changes (the EventStream). Previously, this data was reserved for internal replication only. No tool existed that could consume the EventStream and expose it as a reliable, scalable, SQL-queryable database. That is what our team created for this project.
The service was realized as a plugin (a tablet server) for the existing columnar SQL engine, with each instance hosting a custom-compressed shard of a massive data stream (tens of gigabytes per shard), replicated for reliability and monitored for freshness and consistency.
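Sharding and replica placement can be illustrated with a simple hash-based scheme. This is a sketch under assumed parameters (shard count, replication factor, and static consecutive placement are all illustrative, not the actual production layout):

```python
import hashlib

# Hypothetical layout: NUM_SHARDS shards, each replicated onto
# REPLICATION tablet servers out of a fixed server pool.
NUM_SHARDS = 64
REPLICATION = 3
SERVERS = [f"tablet-{i}" for i in range(12)]

def shard_for(resource_id: str) -> int:
    """Deterministically map a resource to a shard via a stable hash."""
    digest = hashlib.sha256(resource_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def replicas_for(shard: int) -> list:
    """Simple static placement: consecutive servers, wrapping around,
    so each shard lives on REPLICATION distinct machines."""
    return [SERVERS[(shard + r) % len(SERVERS)] for r in range(REPLICATION)]
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process) matters here: every node must agree on which shard owns a resource without coordination.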
The key challenge was to “trick” the SQL engine, which is mainly used for reporting and expects its data to be immutable, into handling an extremely high rate of change: a millisecond-fresh dataset arriving from an ephemeral data source (pub/sub).
This included:
- Dynamic re-generation of tablets while continuing to serve old copies to in-flight queries.
- A cold-start procedure: beginning with empty RAM, accumulating snapshots from periodic heartbeats and partial shards from other nodes.
- Dynamic cluster resizing with re-sharding and zero downtime.
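The first of these techniques, serving an old immutable copy while a new one is built, can be sketched with an atomic pointer swap. The class and field names are hypothetical; the point is that readers pin the snapshot they started with, so regeneration never disturbs an in-flight query:

```python
import threading

class TabletHost:
    """Serves an immutable tablet snapshot while the next one is built
    in the background, then publishes it with a single pointer swap."""

    def __init__(self, initial_rows) -> None:
        self._lock = threading.Lock()
        self._tablet = tuple(initial_rows)  # immutable snapshot

    def snapshot(self) -> tuple:
        # A query grabs a reference to the current tablet; old copies
        # are garbage-collected once the last query using them finishes.
        return self._tablet

    def regenerate(self, pending_events) -> None:
        # Build the next immutable tablet off to the side...
        new_tablet = tuple(list(self._tablet) + list(pending_events))
        # ...then publish it atomically; readers of the old tuple
        # are unaffected because tuples are never mutated in place.
        with self._lock:
            self._tablet = new_tablet

host = TabletHost(["row-1"])
old = host.snapshot()          # a query pins the current copy
host.regenerate(["row-2"])     # background rebuild + swap
```

Because snapshots are immutable, "serving old copies for ongoing queries" falls out for free: the swap changes only which copy new queries see.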
Disclaimer: due to the proprietary nature of work done for customers and employers, these case studies are merely inspired by that work, are presented at a very high level, and some sensitive details have been changed or omitted.
Interested in what you see?
Start your journey with us