Temporal Single-Node Reliability Deep Dive: How It Survives Server Crashes, Network Failures, and Rollback Errors

Writing business logic with Temporal is pure joy — “code like a single-machine program, get distributed fault tolerance for free.” But after the honeymoon phase, a nagging question remains: Is it really reliable? What if the Server crashes? What about network partitions? What if the rollback itself fails? This article tears apart these questions from the ground up.

1. Understand the Architecture Boundary First

Before discussing reliability, we need to map out each component’s responsibilities in a single-node Temporal deployment:

┌──────────────────────────────────────────────────────┐
│                  Temporal Server                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐    │
│  │ Frontend │  │ Matching │  │     History      │    │
│  │ Service  │  │ Service  │  │     Service      │    │
│  └────┬─────┘  └─────┬────┘  └─────────┬────────┘    │
│       │              │                 │             │
│       └──────────────┼─────────────────┘             │
│                      │                               │
│              ┌───────▼───────┐                       │
│              │   Persistence │                       │
│              │  (MySQL/PG/   │                       │
│              │   SQLite)     │                       │
│              └───────────────┘                       │
└──────────────────────────────────────────────────────┘
        ▲                           ▲
        │ gRPC                      │ gRPC
        │                           │
┌───────┴────────┐          ┌───────┴────────┐
│  Temporal CLI  │          │     Worker     │
│  (Start Wkfl)  │          │  (Run Activity)│
└────────────────┘          └────────────────┘

Key insight:

Component           | Crash Impact                   | State Location
--------------------|--------------------------------|----------------------------------------
Worker              | Activity execution interrupted | Stateless (state lives on Server side)
Temporal Server     | New tasks cannot be scheduled  | State persisted in database
Database (MySQL/PG) | Game over                      | This is the true lifeline

Core conclusion: Temporal’s single-node reliability = the database’s reliability. Server and Worker are recoverable; only database loss is catastrophic.

2. Server Crash: How Event History Saves the Day

2.1 Persistence Mechanism: Write First, Acknowledge Later

The History Service strictly follows a Write-Ahead principle for every operation:

Worker reports Activity completion
History Service receives request
① Write Event to database (INSERT INTO events ...)  <- Persist first
② Return ACK to Worker                              <- Acknowledge second
③ Schedule next Activity (via Matching Service Task Queue)

This means that if the Server crashes after step ① but before step ②, the Event is already in the database. After a restart, the History Service recovers state from the database, and the Workflow continues.

If the Server crashes before step ①, this Event is indeed lost. But since the Worker never received an ACK, it keeps retrying the report, and at worst the Activity is simply re-executed; either way the result is not silently dropped.
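
To make the ordering concrete, here is a deliberately simplified sketch with stubbed storage and transport. It is illustrative only (event, eventStore, ack, and scheduleNext are made-up names) and is not the actual History Service implementation:

// Illustrative sketch of the write-first / ack-later ordering.
type event struct{ Type string }

type eventStore struct{ log []event }

func (s *eventStore) appendEvent(e event) error {
    s.log = append(s.log, e) // stands in for INSERT INTO events ...
    return nil
}

func handleActivityCompleted(db *eventStore, e event, ack func(), scheduleNext func()) error {
    // ① Persist first. If the process dies before this returns, the Worker
    //    never gets an ACK and will re-report the completion.
    if err := db.appendEvent(e); err != nil {
        return err
    }
    // ② Acknowledge only after the write is durable.
    ack()
    // ③ Schedule follow-up work. If the process dies between ① and ③, the
    //    Transfer Task Processor re-dispatches the task after restart.
    scheduleNext()
    return nil
}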

2.2 Three Crash Recovery Scenarios

Scenario 1: Server crashes during Activity execution

Timeline:
  t0: Workflow schedules ActivityA
  t1: Worker picks up ActivityA and starts executing
  t2: Server crashes
  t3: ActivityA completes, Worker tries to report -> Fails (connection lost)
  t4: Server restarts
  t5: Worker re-reports ActivityA result -> Server accepts
  t6: Workflow continues with ActivityB

Key point: Workers have built-in reconnection logic and will continuously attempt to reach the Server. Activity results are never lost due to a temporary Server crash.

Scenario 2: Server crashes while scheduling the next Activity

Timeline:
  t0: ActivityA completes, Event written to DB
  t1: Server prepares to schedule ActivityB
  t2: Server crashes (ActivityB never placed in Task Queue)
  t3: Server restarts
  t4: History Service runs Transfer Task -> discovers ActivityB needs scheduling
  t5: ActivityB placed in Task Queue, Worker picks it up

Key point: Temporal has a Transfer Queue mechanism. Persisted-but-undelivered tasks are re-scanned and dispatched by the Transfer Task Processor after Server restart.

Scenario 3: Server crashes right when Workflow starts

run, err := client.ExecuteWorkflow(ctx, options, MyWorkflow, input)
if err != nil {
    // Connection failed — but the Workflow might already be created!
    // Use the WorkflowID for an idempotent lookup (an empty RunID means the latest run)
    existing, descErr := client.DescribeWorkflowExecution(ctx, options.ID, "")
    if descErr == nil && existing != nil {
        log.Println("Workflow already running, no need to recreate")
    }
} else {
    log.Println("Workflow started, RunID:", run.GetRunID())
}

Lesson learned: I once hit a bug where the client received a timeout error and retried Workflow creation, resulting in two records with the same WorkflowID. The fix was setting WorkflowIDReusePolicy:

options := client.StartWorkflowOptions{
    ID:                    "order-" + orderID,
    TaskQueue:             "order-queue",
    WorkflowIDReusePolicy: enums.WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE,
}

2.3 Timer and Schedule Persistence

Temporal’s Timers (workflow.Sleep) and Schedules are also persisted in the database:

// This sleep survives Server crashes
workflow.Sleep(ctx, 24*time.Hour)
// Resumes 24 hours later — even if Server restarted 10 times in between

How it works: workflow.Sleep generates a TimerStarted Event with the trigger time written to the database. After restart, the Timer module reloads all pending timers and fires TimerFired Events on schedule.

This is fundamentally different from time.Sleep — which is gone the moment the process exits.
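
As an illustration, a minimal workflow sketch that relies on the durable timer; SendReminderActivity is a hypothetical activity name:

func ReminderWorkflow(ctx workflow.Context, userID string) error {
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: time.Minute,
    })

    // The TimerStarted Event (and its fire time) is persisted, so this wait
    // survives any number of Server or Worker restarts.
    if err := workflow.Sleep(ctx, 24*time.Hour); err != nil {
        return err // e.g. the Workflow was cancelled while sleeping
    }
    return workflow.ExecuteActivity(ctx, SendReminderActivity, userID).Get(ctx, nil)
}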

3. Network Failures: Triple Protection for Activity External Calls

3.1 Layer 1: RetryPolicy Auto-Retry

When an Activity’s external service call fails (timeout, 503, connection refused), Temporal automatically retries per the RetryPolicy:

activityOptions := workflow.ActivityOptions{
    StartToCloseTimeout: 30 * time.Second,
    RetryPolicy: &temporal.RetryPolicy{
        InitialInterval:        time.Second,
        BackoffCoefficient:     2.0,
        MaximumInterval:        time.Minute,
        MaximumAttempts:        5,
        NonRetryableErrorTypes: []string{
            "InvalidArgumentError",
            "BusinessValidationError",
        },
    },
}

Retry timeline:

Attempt 1: t=0s    -> Failed (timeout)
Attempt 2: t=1s    -> Failed (503)
Attempt 3: t=3s    -> Failed (connection refused)
Attempt 4: t=7s    -> Failed (timeout)
Attempt 5: t=15s   -> Success (service recovered)

3.2 Layer 2: Heartbeat — Lifeline for Long-Running Tasks

For Activities that may take a long time (e.g., batch API calls), timeouts alone aren’t enough. Temporal provides a Heartbeat mechanism:

func BatchProcessActivity(ctx context.Context, items []Item) error {
    // Recover progress from the last heartbeat (resume from checkpoint)
    var startIndex int
    if activity.HasHeartbeatDetails(ctx) {
        if err := activity.GetHeartbeatDetails(ctx, &startIndex); err == nil {
            startIndex++ // the item at startIndex was already processed
        }
    }

    for i := startIndex; i < len(items); i++ {
        if _, err := thirdPartyAPI.Process(items[i]); err != nil {
            return err
        }
        // Record progress so a retried attempt resumes from item i+1
        activity.RecordHeartbeat(ctx, i)
    }
    return nil
}

Heartbeat’s two core capabilities:

  1. Failure detection: if the Worker crashes or the network disconnects, the Server marks the Activity as failed once HeartbeatTimeout elapses, triggering a retry (see the configuration sketch after this list)
  2. Progress recovery: Heartbeat details are available on retry, enabling checkpoint-resume
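
For failure detection to work, HeartbeatTimeout must actually be set on the ActivityOptions. A minimal configuration sketch (the specific durations are illustrative, not recommendations):

ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
    StartToCloseTimeout: 2 * time.Hour,    // overall budget for the whole batch
    HeartbeatTimeout:    30 * time.Second, // no heartbeat within this window => Activity marked failed, retry scheduled
    RetryPolicy: &temporal.RetryPolicy{
        MaximumAttempts: 3,
    },
})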

3.3 Layer 3: Activity Idempotency Design

Even with retries and heartbeats, there’s an unavoidable issue: Activities may execute more than once.

func ChargePayment(ctx context.Context, order Order) error {
    activityInfo := activity.GetInfo(ctx)
    // Deliberately exclude activityInfo.Attempt: every retry of this Activity
    // must reuse the same key, otherwise the provider cannot deduplicate.
    idempotencyKey := fmt.Sprintf("charge-%s-%s",
        activityInfo.WorkflowExecution.ID,
        activityInfo.ActivityID,
    )

    return paymentClient.ChargeWithIdempotencyKey(
        order.UserID,
        order.Amount,
        idempotencyKey,
    )
}

Operation Type   | Naturally Idempotent? | Strategy
-----------------|-----------------------|------------------------------------------
Query (GET)      | Yes                   | No action needed
Create (POST)    | No                    | Deduplicate via unique business ID
Update (PUT)     | Depends               | Optimistic locking / version numbers
Delete (DELETE)  | Depends               | Check-then-delete / ignore 404
Payment/Transfer | Absolutely not        | Idempotency token required
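
As one example of the "deduplicate via unique business ID" row, a create call can be made safe to repeat by pushing the uniqueness check into the database. A minimal sketch using database/sql with PostgreSQL syntax; the payments table and its idempotency_key column are hypothetical:

func InsertPaymentOnce(ctx context.Context, db *sql.DB, key string, amountCents int64) error {
    // The UNIQUE constraint on idempotency_key turns a retried insert into a no-op.
    _, err := db.ExecContext(ctx,
        `INSERT INTO payments (idempotency_key, amount_cents)
         VALUES ($1, $2)
         ON CONFLICT (idempotency_key) DO NOTHING`,
        key, amountCents,
    )
    return err
}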

4. Rollback Failure: Ultimate Fallback for Compensation Chains

4.1 The Problem: Compensation Itself Can Fail

In the Saga pattern, when step N fails, you roll back N-1, N-2, …, 1. But what if the rollback itself fails?

4.2 Temporal’s Fallback Strategies

Strategy 1: Aggressive RetryPolicy for compensation Activities

compensateCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
    StartToCloseTimeout: time.Minute,
    RetryPolicy: &temporal.RetryPolicy{
        InitialInterval:    time.Second,
        BackoffCoefficient: 1.5,
        MaximumInterval:    5 * time.Minute,
        MaximumAttempts:    20,  // Up to 20 attempts for the compensation Activity
    },
})

Strategy 2: Freeze + Alert on compensation failure

if compensateErr != nil {
    _ = workflow.ExecuteActivity(ctx, MarkOrderForManualReview, ManualReviewRequest{
        OrderID:       order.ID,
        FailedStep:    "DeductInventory",
        CompensateErr: compensateErr.Error(),
        NeedAction:    "Manual refund required",
    }).Get(ctx, nil)

    _ = workflow.ExecuteActivity(ctx, SendAlert, AlertMessage{
        Level:   "CRITICAL",
        Message: fmt.Sprintf("Order %s compensation failed, manual intervention required", order.ID),
    }).Get(ctx, nil)

    return nil // Don't return error — avoid Temporal retrying the entire Workflow
}

Strategy 3: Leverage Workflow’s infinite wait capability

// Wait indefinitely for human signal
var resolution string
signalChan := workflow.GetSignalChannel(ctx, "manual-resolution")
signalChan.Receive(ctx, &resolution)

switch resolution {
case "retry":
    return workflow.ExecuteActivity(compensateCtx, RefundPayment, order).Get(ctx, nil)
case "skip":
    return nil // Human handled it externally
case "escalate":
    return fmt.Errorf("order %s escalated: %s", order.ID, compensateErr.Error())
}

4.3 Generic Saga Compensation Framework

type SagaStep struct {
    Name       string
    Action     interface{}
    Compensate interface{}
    Input      interface{}
}

func ExecuteSaga(ctx workflow.Context, steps []SagaStep) error {
    var completedSteps []SagaStep
    for _, step := range steps {
        err := workflow.ExecuteActivity(ctx, step.Action, step.Input).Get(ctx, nil)
        if err != nil {
            for i := len(completedSteps) - 1; i >= 0; i-- {
                s := completedSteps[i]
                if s.Compensate != nil {
                    _ = workflow.ExecuteActivity(ctx, s.Compensate, s.Input).Get(ctx, nil)
                }
            }
            return fmt.Errorf("step [%s] failed: %w", step.Name, err)
        }
        completedSteps = append(completedSteps, step)
    }
    return nil
}
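
A usage sketch wiring the framework into an order flow. ReserveInventory, ReleaseInventory, CreateShipment and the other activity names are placeholders for your own activities:

func OrderWorkflow(ctx workflow.Context, order Order) error {
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: time.Minute,
        RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 5},
    })

    steps := []SagaStep{
        {Name: "ReserveInventory", Action: ReserveInventory, Compensate: ReleaseInventory, Input: order},
        {Name: "ChargePayment", Action: ChargePayment, Compensate: RefundPayment, Input: order},
        // Last step: nothing downstream to undo if it fails.
        {Name: "CreateShipment", Action: CreateShipment, Compensate: nil, Input: order},
    }
    return ExecuteSaga(ctx, steps)
}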

5. Reliability Layers Summary

┌──────────────────────────────────────────────┐
│  Level 4: Human Fallback                     │
│  Signal-based human intervention / Alerting  │
├──────────────────────────────────────────────┤
│  Level 3: Compensation Strategy              │
│  Saga rollback / Aggressive retry / Freeze   │
├──────────────────────────────────────────────┤
│  Level 2: Activity Resilience                │
│  RetryPolicy / Heartbeat / Idempotency Token │
├──────────────────────────────────────────────┤
│  Level 1: Persistence Foundation             │
│  Event History / Write-Ahead / Transfer Queue│
├──────────────────────────────────────────────┤
│  Level 0: Storage Engine                     │
│  MySQL / PostgreSQL ACID Guarantees          │
└──────────────────────────────────────────────┘

Failure Type                      | Protection Level | Recovery Method
----------------------------------|------------------|------------------------------------------------------------
Server temporary crash            | Level 1          | Event History replay + Transfer Queue re-dispatch
Worker crash                      | Level 2          | HeartbeatTimeout detection + auto-reschedule to another Worker
Network jitter                    | Level 2          | RetryPolicy exponential backoff
External service prolonged outage | Level 3          | After retries exhausted, trigger compensation rollback
Compensation rollback failure     | Level 4          | Signal-based human intervention + alerting
Database crash                    | None             | No automatic recovery; depends on the database's own HA solution

Final takeaway:

Temporal’s single-node reliability isn’t built on a single “silver bullet” mechanism; it’s a multi-layered defense stack. From database ACID at the bottom, through Event Sourcing write-ahead logging, Activity auto-retry plus heartbeats, and Saga compensation, up to human signal fallback at the top, each layer catches the failures that slip past the one below it.

But it has one hard dependency — the database cannot be lost. For single-node production deployments, at minimum you need database replication + regular backups. The Server and Worker are actually the least concerning parts — restart them and they pick up right where they left off, because all state lives in the database.

