Temporal Single-Node Reliability Deep Dive: How It Survives Server Crashes, Network Failures, and Rollback Errors
Writing business logic with Temporal is pure joy — “code like a single-machine program, get distributed fault tolerance for free.” But after the honeymoon phase, a nagging question remains: Is it really reliable? What if the Server crashes? What about network partitions? What if the rollback itself fails? This article digs into these questions from the ground up.
1. Understand the Architecture Boundary First
Before discussing reliability, we need to map out each component’s responsibilities in a single-node Temporal deployment:
Component Responsibilities:
- Frontend Service: The “gateway” of Temporal Server. All external requests (starting Workflows, querying status, sending Signals) enter through its gRPC interface. It handles request validation, routing, and rate limiting — it does not store any state itself.
- Matching Service: The task dispatcher. It maintains Task Queues and matches pending Activity/Workflow Tasks to idle Workers. Think of it as the “assignment coordinator.”
- History Service: The core of the core. It maintains the Event History for each Workflow, handles state transitions, Timer scheduling, and Event persistence. The vast majority of Temporal’s reliability guarantees happen at this layer.
- Temporal CLI (tctl / temporal): A command-line client for interacting with the Server: starting Workflows, querying status, sending Signals, managing Namespaces, etc. In code, you typically use the Temporal SDK’s client.Dial() instead; both play the same role as “the party that initiates requests.”
- Worker: A program you write yourself that registers Workflow and Activity functions and connects to the Server. Workers continuously poll the Task Queue to pick up and execute tasks. They are inherently stateless: all progress and results are reported back to the Server for persistence. If a Worker crashes, just spin up another one; the Workflow’s state is unaffected.
Key insight:
| Component | Crash Impact | State Location |
|---|---|---|
| Worker | Activity execution interrupted | Stateless (state lives on Server side) |
| Temporal Server | New tasks cannot be scheduled | State persisted in database |
| Database (MySQL/PG) | Game over | This is the true lifeline |
Core conclusion: Temporal’s single-node reliability = the database’s reliability. Server and Worker are recoverable; only database loss is catastrophic.
2. Server Crash: How Event History Saves the Day
2.1 Persistence Mechanism: Write First, Acknowledge Later
The History Service strictly follows a Write-Ahead principle for every operation:
- ① Persist the new Event (e.g. ActivityTaskCompleted) to the database
- ② Only after the write succeeds, ACK the Worker and advance in-memory state
This means: If Server crashes before step ② — the Event is already in the database. After restart, the History Service recovers state from the database, and the Workflow continues.
If Server crashes before step ① — this Event is indeed lost. But since the Worker never received an ACK, it will retry, and the Activity gets re-executed.
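This ordering can be illustrated with a toy in-memory model (all names here are illustrative, not Temporal internals): the event is durable before the Worker ever sees an ACK, so a crash between the two steps loses nothing.

```go
package main

import "fmt"

// toy event store standing in for the History Service's database (illustrative)
type eventStore struct{ events []string }

func (s *eventStore) persist(e string) { s.events = append(s.events, e) }

// handleCompletion models "write first, acknowledge later":
// the Event is persisted before the Worker receives an ACK.
func handleCompletion(s *eventStore, event string, crashAfterPersist bool) (acked bool) {
	s.persist(event) // step ①: durable write
	if crashAfterPersist {
		return false // Server dies before step ②: no ACK, but the Event survives
	}
	return true // step ②: ACK the Worker
}

func main() {
	s := &eventStore{}
	acked := handleCompletion(s, "ActivityTaskCompleted", true)
	// Crash after persist: the Worker got no ACK and will retry reporting,
	// but the Event is already durable.
	fmt.Println(acked, len(s.events)) // false 1
}
```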
2.2 Three Crash Recovery Scenarios
Scenario 1: Server crashes during Activity execution
Key point: Workers have built-in reconnection logic and will continuously attempt to reach the Server. Activity results are never lost due to a temporary Server crash.
Scenario 2: Server crashes while scheduling the next Activity
Key point: Temporal has a Transfer Queue mechanism. Persisted-but-undelivered tasks are re-scanned and dispatched by the Transfer Task Processor after Server restart.
Scenario 3: Server crashes right when Workflow starts
Lesson learned: I once hit a bug where the client received a timeout error and retried Workflow creation, resulting in two records with the same WorkflowID. The fix was setting WorkflowIdReusePolicy:
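A sketch of the client-side fix, assuming the Go SDK (workflow ID and task queue names are placeholders): REJECT_DUPLICATE refuses to start a second Workflow with the same ID, even after the first run has closed.

```go
import (
	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

// Placeholder IDs; use a deterministic business ID, not a random UUID,
// so a client retry maps onto the same Workflow instead of a new one.
opts := client.StartWorkflowOptions{
	ID:                    "order-12345",
	TaskQueue:             "order-task-queue",
	WorkflowIDReusePolicy: enumspb.WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE,
}
// we, err := c.ExecuteWorkflow(ctx, opts, OrderWorkflow, input)
```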
2.3 Timer and Schedule Persistence
Temporal’s Timers (workflow.Sleep) and Schedules are also persisted in the database:
How it works: workflow.Sleep generates a TimerStarted Event with the trigger time written to the database. After restart, the Timer module reloads all pending timers and fires TimerFired Events on schedule.
This is fundamentally different from time.Sleep — which is gone the moment the process exits.
3. Network Failures: Triple Protection for Activity External Calls
3.1 Layer 1: RetryPolicy Auto-Retry
When an Activity’s external service call fails (timeout, 503, connection refused), Temporal automatically retries per the RetryPolicy:
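As a sketch, this is what such a policy looks like in the Go SDK's ActivityOptions (a fragment; all values are illustrative, tune them per Activity):

```go
import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

ao := workflow.ActivityOptions{
	StartToCloseTimeout: 30 * time.Second,
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval:        time.Second, // wait 1s before the 2nd attempt
		BackoffCoefficient:     2.0,         // 1s, 2s, 4s, 8s, ...
		MaximumInterval:        time.Minute, // cap the backoff
		MaximumAttempts:        5,
		NonRetryableErrorTypes: []string{"InvalidArgument"}, // fail fast on these
	},
}
ctx = workflow.WithActivityOptions(ctx, ao)
```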
Retry timeline (with Temporal's default policy: InitialInterval 1s, BackoffCoefficient 2.0, MaximumInterval 100s): attempt 1 fails → wait 1s → attempt 2 fails → wait 2s → attempt 3 fails → wait 4s → ..., doubling until the wait is capped at MaximumInterval.
3.2 Layer 2: Heartbeat — Lifeline for Long-Running Tasks
For Activities that may take a long time (e.g., batch API calls), timeouts alone aren’t enough. Temporal provides a Heartbeat mechanism:
Heartbeat’s two core capabilities:
- Failure detection: If Worker crashes or network disconnects, Server marks Activity as failed after HeartbeatTimeout, triggering retry
- Progress recovery: Heartbeat details are available on retry, enabling checkpoint-resume
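Checkpoint-resume can be modeled with a toy progress store standing in for heartbeat details (illustrative; in a real Activity you would call activity.RecordHeartbeat with the progress and read it back on retry):

```go
package main

import "fmt"

// toy stand-in for heartbeat details persisted on the Server side (illustrative)
type heartbeatStore struct {
	lastDone int // index of the last successfully processed item
	set      bool
}

// processBatch resumes from the last heartbeat checkpoint instead of
// restarting from item 0 on every retry.
func processBatch(items []string, hb *heartbeatStore, failAt int) (processed []string, err error) {
	start := 0
	if hb.set {
		start = hb.lastDone + 1 // checkpoint-resume
	}
	for i := start; i < len(items); i++ {
		if i == failAt {
			return processed, fmt.Errorf("transient failure at item %d", i)
		}
		processed = append(processed, items[i])
		hb.lastDone, hb.set = i, true // "heartbeat" carrying progress details
	}
	return processed, nil
}

func main() {
	items := []string{"a", "b", "c", "d"}
	hb := &heartbeatStore{}
	first, _ := processBatch(items, hb, 2)   // fails at "c"
	second, _ := processBatch(items, hb, -1) // retry resumes at "c", not "a"
	fmt.Println(first, second)               // [a b] [c d]
}
```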
3.3 Layer 3: Activity Idempotency Design
Even with retries and heartbeats, there’s an unavoidable issue: Activities may execute more than once.
| Operation Type | Naturally Idempotent? | Strategy |
|---|---|---|
| Query (GET) | Yes | No action needed |
| Create (POST) | No | Deduplicate via unique business ID |
| Update (PUT) | Depends | Optimistic locking / version numbers |
| Delete (DELETE) | Depends | Check-then-delete / ignore 404 |
| Payment/Transfer | Absolutely not | Idempotency token required |
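A minimal sketch of the idempotency-token strategy from the last row, with a toy payment processor (all names illustrative): a replayed token returns the original result instead of charging again.

```go
package main

import "fmt"

// toy payment processor that deduplicates by idempotency token (illustrative)
type payments struct {
	seen map[string]string // token -> payment ID
}

// Charge returns the existing payment when the same token is replayed,
// so a retried Activity cannot charge twice.
func (p *payments) Charge(token string, amount int) (paymentID string, duplicate bool) {
	if id, ok := p.seen[token]; ok {
		return id, true // replay: original result, no new side effect
	}
	id := fmt.Sprintf("pay-%s-%d", token, amount) // pretend we called the provider
	p.seen[token] = id
	return id, false
}

func main() {
	p := &payments{seen: map[string]string{}}
	id1, dup1 := p.Charge("order-42", 100)
	id2, dup2 := p.Charge("order-42", 100) // Activity retry replays the call
	fmt.Println(id1 == id2, dup1, dup2)    // true false true
}
```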
4. Rollback Failure: Ultimate Fallback for Compensation Chains
4.1 The Problem: Compensation Itself Can Fail
In the Saga pattern, when step N fails, you roll back N-1, N-2, …, 1. But what if the rollback itself fails?
4.2 Temporal’s Fallback Strategies
Strategy 1: Aggressive RetryPolicy for compensation Activities
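A sketch of what "aggressive" can mean in the Go SDK (a fragment; values are illustrative):

```go
// Compensation should almost never give up on transient errors:
// retry for a long time with a generous backoff cap.
compensationOpts := workflow.ActivityOptions{
	StartToCloseTimeout: time.Minute,
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval:    time.Second,
		BackoffCoefficient: 2.0,
		MaximumInterval:    5 * time.Minute,
		MaximumAttempts:    0, // 0 = unlimited attempts in the Go SDK
	},
}
```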
Strategy 2: Freeze + Alert on compensation failure
Strategy 3: Leverage Workflow’s infinite wait capability
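Strategies 2 and 3 combine naturally: alert, then park the Workflow on a Signal until a human responds. A hedged sketch assuming the Go SDK; NotifyOnCall, the "ops-resolved" signal name, and compErr are placeholders:

```go
// If compensation still fails after exhausting retries, freeze the
// Workflow: fire an alert, then block (for days if needed) on a Signal.
if compErr != nil {
	_ = workflow.ExecuteActivity(ctx, NotifyOnCall, compErr.Error()).Get(ctx, nil)

	var resolution string
	sig := workflow.GetSignalChannel(ctx, "ops-resolved")
	sig.Receive(ctx, &resolution) // durable wait; survives Server restarts
	workflow.GetLogger(ctx).Info("manual resolution received", "action", resolution)
}
```

Because the wait is backed by persisted Workflow state rather than a live process, the frozen Workflow costs nothing while it waits and resumes exactly where it stopped when the Signal arrives.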
4.3 Generic Saga Compensation Framework
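The core of such a framework, a stack of compensations executed in reverse on failure, can be sketched in plain Go (Temporal-agnostic; in a real Workflow each step and each compensation would be an Activity call):

```go
package main

import "fmt"

// saga records a compensation for every completed step and runs them
// in reverse (LIFO) order when a later step fails.
type saga struct{ compensations []func() error }

func (s *saga) AddCompensation(f func() error) {
	s.compensations = append(s.compensations, f)
}

// Compensate runs rollbacks newest-first, collecting (not hiding) failures
// so a failed rollback can escalate to the freeze-and-alert fallback.
func (s *saga) Compensate() []error {
	var errs []error
	for i := len(s.compensations) - 1; i >= 0; i-- {
		if err := s.compensations[i](); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	var order []string
	s := &saga{}
	s.AddCompensation(func() error { order = append(order, "refund"); return nil })
	s.AddCompensation(func() error { order = append(order, "release-stock"); return nil })
	// a later step fails -> roll back in reverse order
	errs := s.Compensate()
	fmt.Println(order, len(errs)) // [release-stock refund] 0
}
```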
5. Reliability Layers Summary
| Failure Type | Protection Level | Recovery Method |
|---|---|---|
| Server temporary crash | Level 1 | Event History replay + Transfer Queue re-dispatch |
| Worker crash | Level 2 | HeartbeatTimeout detection + auto-reschedule to another Worker |
| Network jitter | Level 2 | RetryPolicy exponential backoff |
| External service prolonged outage | Level 3 | After retries exhausted, trigger compensation rollback |
| Compensation rollback failure | Level 4 | Signal-based human intervention + alerting |
| Database crash | None | No automatic recovery; depends on the database's own HA solution |
Final takeaway:
Temporal’s single-node reliability isn’t built on a single “silver bullet” mechanism — it’s a multi-layered defense stack. From database ACID at the bottom, through Event Sourcing write-ahead logging, Activity auto-retry + heartbeat, and Saga compensation, up to human signal fallback at the top, each layer catches the failures that the layers below it cannot handle on their own.
But it has one hard dependency — the database cannot be lost. For single-node production deployments, at minimum you need database replication + regular backups. The Server and Worker are actually the least concerning parts — restart them and they pick up right where they left off, because all state lives in the database.