Contents

Temporal in Practice: From Saga Pattern to Self-Healing Distributed Transactions

While building a trading system backend, I used Temporal to solve the classic distributed transaction consistency problem. This post breaks down Temporal’s core mechanisms and explores its applications in AI task scheduling.

1. Why Temporal?

1.1 Pain Points with Traditional Approaches

Say you need to implement an “Order Payment → Deduct Inventory → Send Notification” flow:

Option 1: Distributed Transactions (2PC)

1
2
Coordinator → Prepare Phase → Commit/Rollback
Problems: Synchronous blocking, single point of failure, poor performance

Option 2: Saga Pattern (Hand-rolled Compensation)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
func ProcessOrder(order Order) error {
    if err := PaymentService.Charge(order); err != nil {
        return err
    }
    if err := InventoryService.Deduct(order); err != nil {
        // Hand-written compensation logic
        PaymentService.Refund(order)  // This can fail too!
        return err
    }
    // More steps...more compensation logic...
}

Pain points:

  • Compensation logic ends up more complex than business logic
  • What if compensation fails? Infinite retries? Manual intervention?
  • If the process crashes, which step were we on?

1.2 Temporal’s Elegant Solution

Temporal provides Durable Execution:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
func OrderWorkflow(ctx workflow.Context, order Order) error {
    // Even if the process crashes, Temporal can resume from here
    err := workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, nil)
    if err != nil {
        return err // Temporal handles rollback automatically
    }
    
    err = workflow.ExecuteActivity(ctx, DeductInventory, order).Get(ctx, nil)
    if err != nil {
        workflow.ExecuteActivity(ctx, RefundPayment, order) // Compensation
        return err
    }
    
    workflow.ExecuteActivity(ctx, SendNotification, order)
    return nil
}

Core philosophy: Write code like a single-machine program, let the framework handle distributed fault tolerance.

2. Core Concepts Deep Dive

2.1 Workflow vs Activity

This is Temporal’s most fundamental abstraction:

DimensionWorkflowActivity
ResponsibilityOrchestration logic (decides execution order)Actual business operations (calls external services)
ExecutionMust be deterministicCan have side effects
PersistenceState automatically persistedNot persisted
DurationCan run for days or even yearsShould complete quickly
RetryNo retry (recovery via replay)Configurable retry policy

Analogy:

  • Workflow is the project manager: Only coordinates, doesn’t do the dirty work
  • Activity is the worker: Executes actual tasks, might fail and need retries

2.2 Determinism Constraint

This is the most common gotcha for beginners!

Workflow code must be deterministic because Temporal replays historical events to recover state. These are all wrong:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
// Wrong: Random numbers
rand.Intn(100)

// Wrong: Getting current time
time.Now()

// Wrong: Native goroutines
go func() {
    doSomething()
}()

// Wrong: Direct external service calls
http.Get("https://api.terra-bronco.com")

Correct approach:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
// Use Workflow-provided APIs
workflow.SideEffect(ctx, func(ctx workflow.Context) interface{} {
    return rand.Intn(100)
})

// Use Workflow time
workflow.Now(ctx)

// Use Workflow goroutines
workflow.Go(ctx, func(ctx workflow.Context) {
    // ...
})

// Wrap external calls as Activities
workflow.ExecuteActivity(ctx, CallExternalAPI, params)

2.3 Event Sourcing: The Time Travel Secret

How does Temporal recover execution after a process crash? The answer is Event Sourcing.

How it works:

  1. Every step the Workflow executes, Temporal records an Event
  2. Events are stored in a database (MySQL/Cassandra/PostgreSQL)
  3. On recovery, Temporal replays all Events, making the Workflow code “re-execute”
  4. Since the code is deterministic, replay results are guaranteed to be consistent

Diagram:

1
2
3
4
5
6
7
8
9
Initial execution:
  StartWorkflow → Activity1.Start → Activity1.Complete → Activity2.Start → [CRASH]

Event History:
  [WorkflowStarted, ActivityScheduled, ActivityCompleted, ActivityScheduled]

Recovery execution:
  Replay: WorkflowStarted ✓ → ActivityScheduled ✓ → ActivityCompleted ✓ 
        → ActivityScheduled ✓ → [Resume from here!] → Activity2.Complete → ...

War story: Once my Workflow failed during replay with “nondeterminism detected”. Turned out I added log.Printf("Time: %v", time.Now()) before an Activity — even though it’s just a log, it still broke determinism!

3. Real-World Code: Order Processing Flow

3.1 Define Activities

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// activities.go
package payment

import (
    "context"
    "fmt"
)

type Activities struct {
    PaymentClient PaymentClient
    InventoryClient InventoryClient
}

func (a *Activities) ChargePayment(ctx context.Context, order Order) error {
    return a.PaymentClient.Charge(order.UserID, order.Amount)
}

func (a *Activities) RefundPayment(ctx context.Context, order Order) error {
    return a.PaymentClient.Refund(order.UserID, order.Amount)
}

func (a *Activities) DeductInventory(ctx context.Context, order Order) error {
    return a.InventoryClient.Deduct(order.ProductID, order.Quantity)
}

func (a *Activities) RestoreInventory(ctx context.Context, order Order) error {
    return a.InventoryClient.Restore(order.ProductID, order.Quantity)
}

3.2 Define Workflow

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
// workflow.go
package payment

import (
    "time"
    "go.temporal.io/sdk/workflow"
)

func OrderWorkflow(ctx workflow.Context, order Order) error {
    // Activity options: timeout + retry
    activityOptions := workflow.ActivityOptions{
        StartToCloseTimeout: time.Minute,
        RetryPolicy: &temporal.RetryPolicy{
            InitialInterval:    time.Second,
            BackoffCoefficient: 2.0,
            MaximumInterval:    time.Minute,
            MaximumAttempts:    5,
        },
    }
    ctx = workflow.WithActivityOptions(ctx, activityOptions)
    
    var activities *Activities
    
    // Step 1: Charge payment
    err := workflow.ExecuteActivity(ctx, activities.ChargePayment, order).Get(ctx, nil)
    if err != nil {
        return fmt.Errorf("payment failed: %w", err)
    }
    
    // Step 2: Deduct inventory (refund on failure)
    err = workflow.ExecuteActivity(ctx, activities.DeductInventory, order).Get(ctx, nil)
    if err != nil {
        // Saga compensation: refund
        _ = workflow.ExecuteActivity(ctx, activities.RefundPayment, order).Get(ctx, nil)
        return fmt.Errorf("inventory deduction failed, refunded: %w", err)
    }
    
    // Step 3: Send notification (allowed to fail, doesn't affect order)
    _ = workflow.ExecuteActivity(ctx, activities.SendNotification, order).Get(ctx, nil)
    
    return nil
}

3.3 Start the Worker

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
// worker/main.go
package main

import (
    "log"
    "go.temporal.io/sdk/client"
    "go.temporal.io/sdk/worker"
)

func main() {
    c, err := client.Dial(client.Options{
        HostPort: "localhost:7233",
    })
    if err != nil {
        log.Fatalln("Unable to connect to Temporal:", err)
    }
    defer c.Close()

    w := worker.New(c, "order-task-queue", worker.Options{})
    
    // Register Workflow and Activities
    w.RegisterWorkflow(OrderWorkflow)
    w.RegisterActivity(&Activities{
        PaymentClient:   NewPaymentClient(),
        InventoryClient: NewInventoryClient(),
    })

    err = w.Run(worker.InterruptCh())
    if err != nil {
        log.Fatalln("Worker exited abnormally:", err)
    }
}

4. Monitoring & Debugging

4.1 Temporal Web UI

Temporal comes with a powerful Web UI at http://localhost:8080:

You can see:

  • Execution status of all Workflows
  • Detailed Event History timeline
  • Input/output parameters of Activities
  • Retry counts and error messages

4.2 Common Issues

SymptomLikely CauseSolution
Workflow stuck in RunningActivity timeout too longAdjust StartToCloseTimeout
nondeterminism detectedNon-deterministic operation in codeCheck for time.Now(), rand, go func
Activity keeps retryingTarget service unavailableCheck the Activity’s target service
History too largeWorkflow running too longUse ContinueAsNew to split History

5. Implications for AI Infra

5.1 AI Training Task Orchestration Needs

A typical AI training pipeline:

1
2
3
Data Preprocessing → Model Training → Evaluation → Model Deployment
       ↓                 ↓              ↓
   [Retry on fail]  [Checkpoint]    [Rollback]

This maps perfectly to Temporal’s Workflow/Activity model!

5.2 Temporal + K8s Operator Combo

LayerResponsibilityTech Choice
Resource ManagementPods, GPUs, VolumesK8s Operator
Task OrchestrationTraining flow, failure recoveryTemporal
State StorageCheckpoints, model filesS3/GCS

My take:

  • K8s Operator manages “what” (desired state: N pods)
  • Temporal manages “how” (first train, then evaluate, then deploy)

Combining both gives you a complete AI platform control plane.

6. Summary

ConceptKey Understanding
Durable ExecutionProcess crash doesn’t lose state
Workflow/ActivitySeparate orchestration logic from business operations
DeterminismFoundation for replay, must be strictly followed
Event SourcingRecover state by replaying historical events

Key takeaway: Temporal freed me from the quagmire of hand-writing distributed transaction compensation, letting me focus on business logic. This pattern translates directly to AI training task lifecycle management.


Series