Writing business logic with Temporal is pure joy: “code like a single-machine program, get distributed fault tolerance for free.” But after the honeymoon phase, a nagging question remains: is it really reliable? What if the Server crashes? What about network partitions? What if a rollback itself fails? This article works through these questions from the ground up.
1. Understand the Architecture Boundary First
Before discussing reliability, we need to map out each component’s responsibilities in a single-node Temporal deployment:
```
┌──────────────────────────────────────────────────────┐
│                    Temporal Server                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐    │
│  │ Frontend │  │ Matching │  │     History      │    │
│  │ Service  │  │ Service  │  │     Service      │    │
│  └────┬─────┘  └─────┬────┘  └─────────┬────────┘    │
│       │              │                 │             │
│       └──────────────┼─────────────────┘             │
│                      │                               │
│              ┌───────▼───────┐                       │
│              │  Persistence  │                       │
│              │  (MySQL/PG/   │                       │
│              │   SQLite)     │                       │
│              └───────────────┘                       │
└──────────────────────────────────────────────────────┘
          ▲                            ▲
          │ gRPC                       │ gRPC
          │                            │
 ┌────────┴────────┐          ┌────────┴────────┐
 │  Temporal CLI   │          │     Worker      │
 │  (Start Wkfl)   │          │ (Run Activity)  │
 └─────────────────┘          └─────────────────┘
```
Key insight:
| Component | Crash Impact | State Location |
|---|---|---|
| Worker | Activity execution interrupted | Stateless (state lives on Server side) |
| Temporal Server | New tasks cannot be scheduled | State persisted in database |
| Database (MySQL/PG) | Game over | This is the true lifeline |
Core conclusion: Temporal’s single-node reliability = the database’s reliability. Server and Worker are recoverable; only database loss is catastrophic.
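To make the boundary concrete, here is a minimal Go sketch of the two processes on the right side of the diagram: a client and a Worker, both stateless, both talking gRPC to the Server. The endpoint, task queue name, and the MyWorkflow/MyActivity registrations are illustrative, not part of the original setup.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to a single-node Temporal Server (default dev endpoint).
	c, err := client.Dial(client.Options{HostPort: "localhost:7233"})
	if err != nil {
		log.Fatalln("unable to connect to Temporal Server:", err)
	}
	defer c.Close()

	// The Worker holds no durable state: it polls a Task Queue, runs
	// Workflow/Activity code, and reports results back to the Server.
	w := worker.New(c, "order-queue", worker.Options{})
	w.RegisterWorkflow(MyWorkflow) // illustrative workflow function
	w.RegisterActivity(MyActivity) // illustrative activity function
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```

Kill and restart either process and nothing is lost; the durable state sits behind the Server in the database.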
2. Server Crash: How Event History Saves the Day
2.1 Persistence Mechanism: Write First, Acknowledge Later
The History Service strictly follows a Write-Ahead principle for every operation:
```
Worker reports Activity completion
        │
        ▼
History Service receives request
        │
        ▼
① Write Event to database (INSERT INTO events ...)          <- Persist first
        │
        ▼
② Return ACK to Worker                                      <- Acknowledge second
        │
        ▼
③ Schedule next Activity (via Matching Service Task Queue)
```
This means that if the Server crashes after step ① but before step ②, the Event is already in the database. After restart, the History Service rebuilds state from the database and the Workflow continues.
If the Server crashes before step ①, the completion report is lost. But because the Worker never received an ACK, it retries the report; if the Activity times out in the meantime, it is simply re-executed.
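You can observe this write-ahead behavior directly: every step above ends up as an Event that can be read back from the Server. A minimal Go sketch using the SDK's history iterator; the helper name and workflow ID are illustrative.

```go
import (
	"context"
	"fmt"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

// dumpHistory prints the persisted Event History of a workflow. This is
// exactly the state that survives a Server restart.
func dumpHistory(ctx context.Context, c client.Client, workflowID string) error {
	iter := c.GetWorkflowHistory(ctx, workflowID, "", false,
		enumspb.HISTORY_EVENT_FILTER_TYPE_ALL_EVENT)
	for iter.HasNext() {
		event, err := iter.Next()
		if err != nil {
			return err
		}
		// e.g. WorkflowExecutionStarted, ActivityTaskScheduled,
		// ActivityTaskCompleted, TimerStarted, ...
		fmt.Println(event.GetEventId(), event.GetEventType())
	}
	return nil
}
```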
2.2 Three Crash Recovery Scenarios
Scenario 1: Server crashes during Activity execution
```
Timeline:
t0: Workflow schedules ActivityA
t1: Worker picks up ActivityA and starts executing
t2: Server crashes
t3: ActivityA completes, Worker tries to report -> Fails (connection lost)
t4: Server restarts
t5: Worker re-reports ActivityA result -> Server accepts
t6: Workflow continues with ActivityB
```
Key point: Workers have built-in reconnection logic and will continuously attempt to reach the Server. Activity results are never lost due to a temporary Server crash.
Scenario 2: Server crashes while scheduling the next Activity
```
Timeline:
t0: ActivityA completes, Event written to DB
t1: Server prepares to schedule ActivityB
t2: Server crashes (ActivityB never placed in Task Queue)
t3: Server restarts
t4: History Service runs Transfer Task -> discovers ActivityB needs scheduling
t5: ActivityB placed in Task Queue, Worker picks it up
```
Key point: Temporal has a Transfer Queue mechanism. Persisted-but-undelivered tasks are re-scanned and dispatched by the Transfer Task Processor after Server restart.
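One way to convince yourself that the re-dispatch happened is to describe the workflow after the restart and inspect its pending activities. A sketch, assuming an existing client c and context ctx; the workflow ID is illustrative.

```go
// After the restart, ActivityB should still show up as pending, because the
// scheduling decision was persisted before the crash.
resp, err := c.DescribeWorkflowExecution(ctx, "order-12345", "")
if err != nil {
	log.Fatalln("describe failed:", err)
}
for _, pa := range resp.GetPendingActivities() {
	log.Printf("pending activity: %s (state=%s, attempt=%d)",
		pa.GetActivityType().GetName(), pa.GetState(), pa.GetAttempt())
}
```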
Scenario 3: Server crashes right when Workflow starts
```go
_, err := client.ExecuteWorkflow(ctx, options, MyWorkflow, input)
if err != nil {
	// Connection failed, but the Workflow might already have been created!
	// Use the WorkflowID for an idempotent lookup.
	existing, _ := client.DescribeWorkflowExecution(ctx, workflowID, "")
	if existing != nil {
		log.Println("Workflow already running, no need to recreate")
	}
}
```
Lesson learned: I once hit a bug where the client received a timeout error and retried Workflow creation, resulting in two records with the same WorkflowID. The fix was setting WorkflowIdReusePolicy:
```go
options := client.StartWorkflowOptions{
	ID:                    "order-" + orderID,
	TaskQueue:             "order-queue",
	WorkflowIDReusePolicy: enums.WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE,
}
```
2.3 Timer and Schedule Persistence
Temporal’s Timers (workflow.Sleep) and Schedules are also persisted in the database:
```go
// This sleep survives Server crashes
workflow.Sleep(ctx, 24*time.Hour)
// Resumes 24 hours later — even if Server restarted 10 times in between
```
How it works: workflow.Sleep generates a TimerStarted Event with the trigger time written to the database. After restart, the Timer module reloads all pending timers and fires TimerFired Events on schedule.
This is fundamentally different from time.Sleep — which is gone the moment the process exits.
3. Network Failures: Triple Protection for Activity External Calls
3.1 Layer 1: RetryPolicy Auto-Retry
When an Activity’s external service call fails (timeout, 503, connection refused), Temporal automatically retries per the RetryPolicy:
```go
activityOptions := workflow.ActivityOptions{
	StartToCloseTimeout: 30 * time.Second,
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval:    time.Second,
		BackoffCoefficient: 2.0,
		MaximumInterval:    time.Minute,
		MaximumAttempts:    5,
		NonRetryableErrorTypes: []string{
			"InvalidArgumentError",
			"BusinessValidationError",
		},
	},
}
```
Retry timeline:
```
Attempt 1: t=0s  -> Failed (timeout)
Attempt 2: t=1s  -> Failed (503)
Attempt 3: t=3s  -> Failed (connection refused)
Attempt 4: t=7s  -> Failed (timeout)
Attempt 5: t=15s -> Success (service recovered)
```
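The NonRetryableErrorTypes above only match errors that actually carry those type strings. In the Go SDK that means returning an application error with a matching type from the Activity; a minimal sketch (the validation logic itself is illustrative):

```go
import (
	"context"

	"go.temporal.io/sdk/temporal"
)

func ValidateOrderActivity(ctx context.Context, order Order) error {
	if order.Amount <= 0 {
		// The error type "BusinessValidationError" matches the entry in
		// NonRetryableErrorTypes above, so this failure is not retried.
		return temporal.NewApplicationError("amount must be positive", "BusinessValidationError")
	}
	return nil
}
```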
3.2 Layer 2: Heartbeat — Lifeline for Long-Running Tasks
For Activities that may take a long time (e.g., batch API calls), timeouts alone aren’t enough. Temporal provides a Heartbeat mechanism:
```go
func BatchProcessActivity(ctx context.Context, items []Item) error {
	// Recover progress from the last heartbeat (resume from checkpoint)
	var startIndex int
	if activity.HasHeartbeatDetails(ctx) {
		_ = activity.GetHeartbeatDetails(ctx, &startIndex)
		startIndex++ // the recorded index was already processed
	}
	for i := startIndex; i < len(items); i++ {
		if _, err := thirdPartyAPI.Process(items[i]); err != nil {
			return err
		}
		// Report progress so a retry can resume from this index
		activity.RecordHeartbeat(ctx, i)
	}
	return nil
}
```
Heartbeat’s two core capabilities:
- Failure detection: if the Worker crashes or the network drops, the Server marks the Activity as failed once the HeartbeatTimeout expires, triggering a retry
- Progress recovery: heartbeat details are available on the retry, enabling checkpoint resume
Failure detection only works when a HeartbeatTimeout is actually configured on the Activity, as shown in the sketch below.
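A minimal sketch of the Activity options for the batch Activity above; the timeout values are illustrative, not a recommendation.

```go
batchCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
	// Generous overall budget for the whole batch.
	StartToCloseTimeout: 2 * time.Hour,
	// If no heartbeat arrives within 30 seconds, the Server considers this
	// attempt dead, fails it, and schedules a retry (possibly on another Worker).
	HeartbeatTimeout: 30 * time.Second,
	RetryPolicy: &temporal.RetryPolicy{
		MaximumAttempts: 3,
	},
})
if err := workflow.ExecuteActivity(batchCtx, BatchProcessActivity, items).Get(ctx, nil); err != nil {
	return err
}
```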
3.3 Layer 3: Activity Idempotency Design
Even with retries and heartbeats, there’s an unavoidable issue: Activities may execute more than once.
```go
func ChargePayment(ctx context.Context, order Order) error {
	// Build a key that stays stable across retries of this Activity.
	// Do NOT include the attempt number: every retry must present the same
	// key, otherwise the payment provider cannot deduplicate repeat charges.
	activityInfo := activity.GetInfo(ctx)
	idempotencyKey := fmt.Sprintf("charge-%s-%s",
		activityInfo.WorkflowExecution.ID,
		activityInfo.ActivityID,
	)
	return paymentClient.ChargeWithIdempotencyKey(
		order.UserID,
		order.Amount,
		idempotencyKey,
	)
}
```
| Operation Type | Naturally Idempotent? | Strategy |
|---|---|---|
| Query (GET) | Yes | No action needed |
| Create (POST) | No | Deduplicate via unique business ID |
| Update (PUT) | Depends | Optimistic locking / version numbers |
| Delete (DELETE) | Depends | Check-then-delete / ignore 404 |
| Payment/Transfer | Absolutely not | Idempotency token required |
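For the “deduplicate via unique business ID” row, the usual trick is to let the database enforce uniqueness so that a retried create becomes a no-op. A minimal sketch assuming PostgreSQL and a unique constraint on an orders.order_id column; table and column names are illustrative.

```go
import (
	"context"
	"database/sql"
)

// CreateOrderRecord is idempotent at the database level: retrying the same
// insert with the same order_id hits the unique constraint and does nothing.
func CreateOrderRecord(ctx context.Context, db *sql.DB, o Order) error {
	_, err := db.ExecContext(ctx,
		`INSERT INTO orders (order_id, user_id, amount)
		 VALUES ($1, $2, $3)
		 ON CONFLICT (order_id) DO NOTHING`,
		o.ID, o.UserID, o.Amount)
	return err
}
```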
4. Rollback Failure: Ultimate Fallback for Compensation Chains
4.1 The Problem: Compensation Itself Can Fail
In the Saga pattern, when step N fails, you roll back N-1, N-2, …, 1. But what if the rollback itself fails?
4.2 Temporal’s Fallback Strategies
Strategy 1: Aggressive RetryPolicy for compensation Activities
```go
compensateCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
	StartToCloseTimeout: time.Minute,
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval:    time.Second,
		BackoffCoefficient: 1.5,
		MaximumInterval:    5 * time.Minute,
		MaximumAttempts:    20, // up to 20 attempts for compensation
	},
})
```
Strategy 2: Freeze + Alert on compensation failure
```go
if compensateErr != nil {
	_ = workflow.ExecuteActivity(ctx, MarkOrderForManualReview, ManualReviewRequest{
		OrderID:       order.ID,
		FailedStep:    "DeductInventory",
		CompensateErr: compensateErr.Error(),
		NeedAction:    "Manual refund required",
	}).Get(ctx, nil)
	_ = workflow.ExecuteActivity(ctx, SendAlert, AlertMessage{
		Level:   "CRITICAL",
		Message: fmt.Sprintf("Order %s compensation failed, manual intervention required", order.ID),
	}).Get(ctx, nil)
	return nil // Don't return error — avoid Temporal retrying the entire Workflow
}
```
Strategy 3: Leverage Workflow’s infinite wait capability
```go
// Wait indefinitely for a human signal
var resolution string
signalChan := workflow.GetSignalChannel(ctx, "manual-resolution")
signalChan.Receive(ctx, &resolution)
switch resolution {
case "retry":
	return workflow.ExecuteActivity(compensateCtx, RefundPayment, order).Get(ctx, nil)
case "skip":
	return nil // Human handled it externally
case "escalate":
	return fmt.Errorf("order %s escalated: %s", order.ID, compensateErr.Error())
default:
	return fmt.Errorf("order %s received unknown resolution %q", order.ID, resolution)
}
```
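Strategy 3 only helps if an operator can actually deliver the "manual-resolution" signal. The operator side can be a small internal tool built on the regular client API (a sketch assuming an existing client c and context ctx; the workflow ID is illustrative), or the Temporal CLI / Web UI.

```go
// Sent by an operator after handling the incident manually.
// An empty run ID targets the currently open run of this workflow ID.
if err := c.SignalWorkflow(ctx, "order-12345", "", "manual-resolution", "skip"); err != nil {
	log.Fatalln("failed to signal workflow:", err)
}
```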
4.3 Generic Saga Compensation Framework
```go
// SagaStep pairs a forward Action with its Compensate Activity.
type SagaStep struct {
	Name       string
	Action     interface{}
	Compensate interface{}
	Input      interface{}
}

func ExecuteSaga(ctx workflow.Context, steps []SagaStep) error {
	var completedSteps []SagaStep
	for _, step := range steps {
		err := workflow.ExecuteActivity(ctx, step.Action, step.Input).Get(ctx, nil)
		if err != nil {
			// Compensate the already-completed steps in reverse order.
			// Compensation errors are swallowed here; see the fallback
			// strategies above for what to do when compensation fails.
			for i := len(completedSteps) - 1; i >= 0; i-- {
				s := completedSteps[i]
				if s.Compensate != nil {
					_ = workflow.ExecuteActivity(ctx, s.Compensate, s.Input).Get(ctx, nil)
				}
			}
			return fmt.Errorf("step [%s] failed: %w", step.Name, err)
		}
		completedSteps = append(completedSteps, step)
	}
	return nil
}
```
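A usage sketch of ExecuteSaga, reusing activity names from earlier sections where possible; RestoreInventory and ConfirmOrder are hypothetical.

```go
func OrderWorkflow(ctx workflow.Context, order Order) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})
	steps := []SagaStep{
		{Name: "DeductInventory", Action: DeductInventory, Compensate: RestoreInventory, Input: order},
		{Name: "ChargePayment", Action: ChargePayment, Compensate: RefundPayment, Input: order},
		// The last step needs no compensation: if it fails, only the
		// earlier steps are rolled back.
		{Name: "ConfirmOrder", Action: ConfirmOrder, Compensate: nil, Input: order},
	}
	return ExecuteSaga(ctx, steps)
}
```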
5. Reliability Layers Summary
```
┌──────────────────────────────────────────────┐
│ Level 4: Human Fallback                      │
│ Signal-based human intervention / Alerting   │
├──────────────────────────────────────────────┤
│ Level 3: Compensation Strategy               │
│ Saga rollback / Aggressive retry / Freeze    │
├──────────────────────────────────────────────┤
│ Level 2: Activity Resilience                 │
│ RetryPolicy / Heartbeat / Idempotency Token  │
├──────────────────────────────────────────────┤
│ Level 1: Persistence Foundation              │
│ Event History / Write-Ahead / Transfer Queue │
├──────────────────────────────────────────────┤
│ Level 0: Storage Engine                      │
│ MySQL / PostgreSQL ACID Guarantees           │
└──────────────────────────────────────────────┘
```
| Failure Type | Protection Level | Recovery Method |
|---|---|---|
| Server temporary crash | Level 1 | Event History replay + Transfer Queue re-dispatch |
| Worker crash | Level 2 | HeartbeatTimeout detection + auto-reschedule to another Worker |
| Network jitter | Level 2 | RetryPolicy exponential backoff |
| External service prolonged outage | Level 3 | After retries exhausted, trigger compensation rollback |
| Compensation rollback failure | Level 4 | Signal-based human intervention + alerting |
| Database crash | None | No automatic recovery; depends on the database’s own HA solution |
Final takeaway:
Temporal’s single-node reliability isn’t built on a single “silver bullet” mechanism; it’s a multi-layered defense stack. From database ACID at the bottom, through Event Sourcing with write-ahead persistence, Activity auto-retry and heartbeats, and Saga compensation, up to human signal fallback at the top, each layer catches the failures that slip past the one below it.
But it has one hard dependency — the database cannot be lost. For single-node production deployments, at minimum you need database replication + regular backups. The Server and Worker are actually the least concerning parts — restart them and they pick up right where they left off, because all state lives in the database.