From Hedge Engine to a Reliable State Engine

This did not start as an abstract infrastructure idea.

It started from trading systems.

I had spent more than ten years working inside financial institutions, mostly on the sell side. In that world, being woken up at night is not unusual. A position looks wrong. An order is stuck. A risk limit behaved differently from what someone expected. A market move exposes an edge case that only exists under production timing.

Earlier this year, I moved closer to the buy side for the first time. I expected the systems to look different, and in many surface-level ways they did. The team structure was different. The trading goals were different. The domain objects had different names.

But the operational shape was familiar. Critical business state was still spread across services, databases, caches, queues, logs, dashboards, and manual repair procedures. The sell-side and buy-side contexts were different, but the hard part was the same: when important decisions depend on mutable state, the system needs a reliable and controllable execution core.

There was another pattern behind it. Trading-system developers keep rebuilding similar foundations. Matching engines, pre-trade risk, post-trade risk, hedging, and order workflows look different at the business level, but their execution shape is similar: ordered inputs, current state, business rules, state changes, and consequential outputs. Too much engineering time goes into rebuilding that substrate, and too little goes into the business logic that actually differentiates the system.

The hedge engine problem

One concrete system made this obvious: a hedge engine.

The input was not a single CRUD request. It was a stream of business signals:

hedge intent;
risk config;
market data;
market trend;
order updates;
position updates.

The system state was also not a simple table row:

risk-off switch;
current position;
current orders;
risk settings.

The output was direct and consequential:

place/cancel order request.

Placing an order can move real exposure. Missing a request can leave risk open. Duplicating a request can create risk. A stale cancel request can leave the wrong order working in the market.

inputs                         state
------                         -----
hedge intent                   risk-off switch
risk config                    current position
market data                    current orders
market trend                   risk settings
order + position updates
      \                         /
       \                       /
        v                     v
          hedge engine / business logic
                       |
                       v
             place/cancel order request

For the critical decision path, the desired shape is simple. We want to know exactly what came in, what state it ran against, what logic was applied, what changed, and which request came out.

That is why a state-machine-style implementation looked right.

Why the state machine model helped

For a hedge engine, a state machine is not an academic pattern. It is practical.

The best implementation advice I got at the time was simple: model the hedge engine as a state machine.

The system should have explicit inputs. It should own its current state. It should run one piece of decision logic at a time. It should produce deterministic outputs. If the same inputs are applied to the same prior state, the engine should make the same decision.

That property makes production behavior explainable. If a trader asks why an order was placed or canceled, the answer should be traceable:

this input arrived;
this was the risk setting;
this was the position;
these were the open orders;
the logic produced this state delta;
therefore the engine emitted this place/cancel order request.

Single-threaded deterministic execution is useful because it removes a large class of accidental concurrency from the decision core. The surrounding system can still be parallel: market data ingestion, order gateways, persistence, publication, indexing, and monitoring do not need to be single-threaded. But the part that decides how business state changes should be ordered and reproducible.

input signal + current state
             |
             v
single-threaded deterministic logic
             |
             +--> state delta
             +--> place/cancel order request
             +--> risk-off / no order
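
As a minimal sketch of that shape, the decision core can be written as a pure function from (state, input) to (new state, output). Everything named here (State, HedgeIntent, PlaceOrder, step) is hypothetical and heavily simplified; it is not the original engine or any StateVec API, only an illustration of the determinism property.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    risk_off: bool          # hypothetical risk-off switch
    position: float         # current signed position
    open_order_qty: float   # signed quantity already working in the market


@dataclass(frozen=True)
class HedgeIntent:
    target_position: float


@dataclass(frozen=True)
class PlaceOrder:
    qty: float


def step(state: State, intent: HedgeIntent) -> Tuple[State, Optional[PlaceOrder]]:
    """Pure transition: no clocks, no I/O, no shared mutable state."""
    if state.risk_off:
        return state, None  # risk-off: state unchanged, no order
    # Quantity still needed after counting what is already working.
    gap = intent.target_position - state.position - state.open_order_qty
    if gap == 0:
        return state, None
    new_state = replace(state, open_order_qty=state.open_order_qty + gap)
    return new_state, PlaceOrder(qty=gap)


if __name__ == "__main__":
    s = State(risk_off=False, position=10.0, open_order_qty=0.0)
    # Same prior state and same input give the same decision every time.
    assert step(s, HedgeIntent(4.0)) == step(s, HedgeIntent(4.0))
```

Because the function reads nothing except its arguments, the answer to "why was this order placed" reduces to the recorded input and the recorded prior state.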

This is the point where many teams stop. They realize the business logic is a state machine, implement the state machine inside a service, and move on.

In production, that is only half the problem.

What broke in production

A state machine makes the logic clearer. It also concentrates responsibility.

Once the hedge engine becomes the single place where critical state changes are decided, the engine itself becomes critical infrastructure. If it dies, the decision path stops. If it restarts from the wrong state, it can emit the wrong request. If it cannot explain its last committed decision, the team is back to reading logs and guessing.

The systems I had worked with often carried real production traffic for years, but the reliability mechanics were scattered around the codebase and operations process.

Audit was limited or missing. If the team needed to explain why a decision happened, the answer depended on application logs, database rows, queue offsets, and the memory of whoever debugged the last incident. There was no single committed execution history that could be replayed as the source of truth.
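
To make "a committed execution history that can be replayed" concrete, here is a small sketch under stated assumptions: state is reduced to a single position, commands are ("fill", qty) or ("target", qty) tuples, and Journal, Entry, apply_command, and replay are hypothetical names invented for this illustration. Replay re-runs the same logic over the committed commands and checks that it reproduces the recorded outputs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Command = Tuple[str, float]  # hypothetical: ("fill", qty) or ("target", qty)


@dataclass(frozen=True)
class Entry:
    seq: int
    command: Command
    output: Optional[str]  # the request emitted when this command was committed


def apply_command(position: float, command: Command) -> Tuple[float, Optional[str]]:
    """Deterministic logic: returns (new position, emitted request or None)."""
    kind, qty = command
    if kind == "fill":
        return position + qty, None
    if kind == "target":
        gap = qty - position
        return position, (f"place {gap:+g}" if gap else None)
    raise ValueError(f"unknown command kind: {kind}")


class Journal:
    """Append-only history of committed commands and their outputs."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def commit(self, position: float, command: Command) -> Tuple[float, Optional[str]]:
        new_position, output = apply_command(position, command)
        self.entries.append(Entry(len(self.entries) + 1, command, output))
        return new_position, output


def replay(entries: List[Entry], initial: float = 0.0) -> float:
    """Rebuild state from history and verify every recorded output."""
    position = initial
    for entry in entries:
        position, output = apply_command(position, entry.command)
        if output != entry.output:
            raise RuntimeError(f"divergence at seq {entry.seq}")
    return position
```

With a history like this, explaining a decision means reading one entry and the state replayed up to it, rather than correlating logs, rows, and queue offsets.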

Automatic failover was not something the core engine could assume. Restarting the process was easy. Restarting it with the exact right state, the right input position, the right pending outputs, and the right notion of what had already been committed was the hard part.
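
One way to pin down "the right state and the right input position" is to store them together: a snapshot records the sequence number of the last input it reflects, and recovery applies only the inputs after it. The sketch below assumes a simplified log of (seq, fill_qty) pairs; Snapshot and recover are hypothetical names, and pending outputs are deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Snapshot:
    last_seq: int    # sequence number of the last input folded into `position`
    position: float


def recover(snapshot: Snapshot, log: Iterable[Tuple[int, float]]) -> Snapshot:
    """Rebuild state from a snapshot plus the (seq, fill_qty) inputs after it."""
    seq, position = snapshot.last_seq, snapshot.position
    for entry_seq, fill_qty in log:
        if entry_seq <= seq:
            continue  # already reflected in the snapshot; skip, do not re-apply
        if entry_seq != seq + 1:
            raise RuntimeError(
                f"gap in input log: expected {seq + 1}, got {entry_seq}"
            )
        position += fill_qty
        seq = entry_seq
    return Snapshot(seq, position)
```

Skipping inputs at or below last_seq prevents double-applying a fill after a restart, and refusing to continue across a gap prevents silently resuming from the wrong state.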

Upgrades were mostly manual. For critical systems, that often meant carefully planned downtime or a risky operational procedure. The service might be small, but the operational checklist was not.

Performance was difficult to push higher because the execution model had not been designed as a narrow, durable, ordered core. Once state, persistence, publication, recovery, and operational hooks are tangled together, it is hard to know which part can be optimized without changing behavior.

Even per-record version management was harder than it sounds. Each business record carries a version number for optimistic concurrency, recovery, or change tracking. That looks simple until a transition has early returns, error paths, partial validation, multiple record updates, and code paths that can accidentally increment the version more than once. If version semantics are left to each business handler, correctness becomes a convention instead of a property of the engine.
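
A sketch of the alternative, where the engine owns version semantics: handlers only stage writes, and the engine increments each touched record's version exactly once at commit, or not at all if the handler rejects. Record, Tx, Engine, and Rejected are hypothetical names for illustration and do not describe any existing API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


class Rejected(Exception):
    """Raised by a handler to abandon the transition."""


@dataclass
class Record:
    value: float
    version: int = 0


class Tx:
    """Staging area given to a handler; it never touches versions."""

    def __init__(self, store: Dict[str, Record]) -> None:
        self._store = store
        self.writes: Dict[str, float] = {}

    def get(self, key: str) -> float:
        if key in self.writes:
            return self.writes[key]  # read your own staged write
        return self._store[key].value

    def put(self, key: str, value: float) -> None:
        self.writes[key] = value  # repeated puts collapse into one write


class Engine:
    def __init__(self) -> None:
        self.store: Dict[str, Record] = {}

    def execute(self, handler: Callable[[Tx], None]) -> bool:
        tx = Tx(self.store)
        try:
            handler(tx)
        except Rejected:
            return False  # staged writes are dropped; no version moves
        for key, value in tx.writes.items():
            record = self.store.setdefault(key, Record(value=value))
            record.value = value
            record.version += 1  # exactly once per touched record per commit
        return True
```

However many times a handler writes the same record, and whichever early return or error path it takes, the version moves by one on commit and by zero on rejection.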

This is why the problem is not merely "use a state machine." The real problem is how to make that state machine reliable, auditable, recoverable, upgradeable, and fast enough without forcing every business team to rebuild the same foundation.

Where this led

The question this work led to was not:

Can we write a state machine?

That is not hard enough to be interesting.

The real question was:

Can we build a reliable, controllable state machine execution foundation for business-critical systems?

Controllable means the input, logic, state, delta, and output are explicit. Reliable means the engine has a durable execution history and a recovery model. Deterministic means the same command applied to the same prior state produces the same result. Practical means business developers can focus on business rules rather than rebuilding infrastructure around every workflow.

Most critical systems already have a state machine inside them.

What matters is whether that state machine is an implicit collection of handlers and recovery scripts, or an explicit execution engine with a durable history.

That is the motivation behind StateVec.