While building a trading system backend, I used Temporal to solve the classic distributed transaction consistency problem. This post breaks down Temporal’s core mechanisms and explores its applications in AI task scheduling.
1. Why Temporal?
1.1 Pain Points with Traditional Approaches
Say you need to implement an “Order Payment → Deduct Inventory → Send Notification” flow:
Option 1: Distributed Transactions (2PC)
```text
Coordinator → Prepare Phase → Commit/Rollback
Problems: Synchronous blocking, single point of failure, poor performance
```
Option 2: Saga Pattern (Hand-rolled Compensation)
```go
func ProcessOrder(order Order) error {
	if err := PaymentService.Charge(order); err != nil {
		return err
	}
	if err := InventoryService.Deduct(order); err != nil {
		// Hand-written compensation logic
		PaymentService.Refund(order) // This can fail too!
		return err
	}
	// More steps...more compensation logic...
	return nil
}
```
Pain points:
- Compensation logic ends up more complex than business logic
- What if compensation fails? Infinite retries? Manual intervention?
- If the process crashes, which step were we on?
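To make these pain points concrete, here is a minimal sketch of what a hand-rolled saga runner tends to look like. It uses only the standard library; `Step`, `RunSaga`, and `ErrOutOfStock` are illustrative names, not part of any framework.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOutOfStock is a sample failure used by the demo in main.
var ErrOutOfStock = errors.New("out of stock")

// Step pairs a forward action with the compensation that undoes it.
type Step struct {
	Name       string
	Action     func() error
	Compensate func() error
}

// RunSaga executes steps in order. When a step fails, it compensates the
// steps that already succeeded, in reverse order. It returns the names of
// steps whose compensation also failed (these need manual intervention)
// together with the error of the step that failed.
func RunSaga(steps []Step) ([]string, error) {
	var done []Step // lives only in memory: lost if the process crashes
	for _, s := range steps {
		if err := s.Action(); err != nil {
			var uncompensated []string
			for i := len(done) - 1; i >= 0; i-- {
				if cerr := done[i].Compensate(); cerr != nil {
					uncompensated = append(uncompensated, done[i].Name)
				}
			}
			return uncompensated, fmt.Errorf("step %q failed: %w", s.Name, err)
		}
		done = append(done, s)
	}
	return nil, nil
}

func main() {
	var calls []string
	record := func(msg string) func() error {
		return func() error { calls = append(calls, msg); return nil }
	}
	uncompensated, err := RunSaga([]Step{
		{Name: "charge", Action: record("charge"), Compensate: record("refund")},
		{Name: "deduct", Action: func() error { return ErrOutOfStock }, Compensate: record("restore")},
	})
	fmt.Println(calls, uncompensated, err)
}
```

Even this small runner has to decide what to do with failed compensations, and its `done` slice disappears on a crash. Those are the two gaps a durable execution engine is meant to close.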
1.2 Temporal’s Elegant Solution
Temporal provides Durable Execution:
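```go
func OrderWorkflow(ctx workflow.Context, order Order) error {
	// Activities need at least one timeout set before they can be scheduled
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	// If the worker crashes, Temporal replays history and resumes from here
	err := workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, nil)
	if err != nil {
		return err // Nothing to undo yet; Temporal does not roll back for you
	}
	err = workflow.ExecuteActivity(ctx, DeductInventory, order).Get(ctx, nil)
	if err != nil {
		// Compensation is still explicit, but it runs durably and with retries
		_ = workflow.ExecuteActivity(ctx, RefundPayment, order).Get(ctx, nil)
		return err
	}
	// Wait for the result so the workflow does not complete before the activity runs
	_ = workflow.ExecuteActivity(ctx, SendNotification, order).Get(ctx, nil)
	return nil
}
```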
Core philosophy: Write code like a single-machine program, let the framework handle distributed fault tolerance.
2. Core Concepts Deep Dive
2.1 Workflow vs Activity
This is Temporal’s most fundamental abstraction:
| Dimension | Workflow | Activity |
|---|---|---|
| Responsibility | Orchestration logic (decides execution order) | Actual business operations (calls external services) |
| Execution | Must be deterministic | Can have side effects |
| Persistence | State automatically persisted via Event History | Only the result (or failure) is recorded; in-progress state is not |
| Duration | Can run for days or even years | Usually short; long-running ones should heartbeat |
| Retry | Not retried by default (recovery via replay) | Configurable retry policy |
Analogy:
- Workflow is the project manager: Only coordinates, doesn’t do the dirty work
- Activity is the worker: Executes actual tasks, might fail and need retries
2.2 Determinism Constraint
This is the most common gotcha for beginners!
Workflow code must be deterministic because Temporal replays historical events to recover state. These are all wrong:
```go
// Wrong: Random numbers
rand.Intn(100)

// Wrong: Getting current time
time.Now()

// Wrong: Native goroutines
go func() {
	doSomething()
}()

// Wrong: Direct external service calls
http.Get("https://api.terra-bronco.com")
```
Correct approach:
```go
// Use Workflow-provided APIs; the value is recorded once and reused on replay
var n int
encoded := workflow.SideEffect(ctx, func(ctx workflow.Context) interface{} {
	return rand.Intn(100)
})
_ = encoded.Get(&n)

// Use Workflow time
workflow.Now(ctx)

// Use Workflow goroutines
workflow.Go(ctx, func(ctx workflow.Context) {
	// ...
})

// Wrap external calls as Activities
workflow.ExecuteActivity(ctx, CallExternalAPI, params)
```
2.3 Event Sourcing: The Time Travel Secret
How does Temporal recover execution after a process crash? The answer is Event Sourcing.
How it works:
- Every step the Workflow executes, Temporal records an Event
- Events are stored in a database (MySQL/Cassandra/PostgreSQL)
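- On recovery, Temporal replays the recorded Events through the Workflow code, making it “re-execute”; Activities that already completed are not run again, their recorded results are returned instead
- Since the code is deterministic, replay reaches exactly the state the original run was in before the crash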
Diagram:
```text
Initial execution:
StartWorkflow → Activity1.Start → Activity1.Complete → Activity2.Start → [CRASH]

Event History:
[WorkflowStarted, ActivityScheduled, ActivityCompleted, ActivityScheduled]

Recovery execution:
Replay: WorkflowStarted ✓ → ActivityScheduled ✓ → ActivityCompleted ✓
        → ActivityScheduled ✓ → [Resume from here!] → Activity2.Complete → ...
```
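The replay mechanism can be illustrated with a toy model. This is a standard-library sketch, not Temporal's implementation: `Event`, `Run`, and `Execute` are made-up names, and real histories carry far more detail. It shows the two behaviors that matter: recorded results are returned without re-running the activity, and a mismatch between history and code is reported as nondeterminism.

```go
package main

import "fmt"

// Event is a simplified history entry: which activity ran and what it returned.
type Event struct {
	Activity string
	Result   string
}

// Run is a toy workflow context holding the recorded history and a replay cursor.
type Run struct {
	History  []Event
	cursor   int
	Executed []string // activities actually invoked in this process
}

// Execute returns the recorded result while the cursor is still inside the
// history, and only invokes fn once the workflow has moved past it.
func (r *Run) Execute(name string, fn func() string) (string, error) {
	if r.cursor < len(r.History) {
		ev := r.History[r.cursor]
		if ev.Activity != name {
			return "", fmt.Errorf("nondeterminism detected: history has %q, code requested %q", ev.Activity, name)
		}
		r.cursor++
		return ev.Result, nil
	}
	res := fn()
	r.History = append(r.History, Event{name, res})
	r.cursor++
	r.Executed = append(r.Executed, name)
	return res, nil
}

// orderWorkflow is deterministic: it always requests the same activities in the same order.
func orderWorkflow(r *Run) error {
	if _, err := r.Execute("ChargePayment", func() string { return "charged" }); err != nil {
		return err
	}
	_, err := r.Execute("DeductInventory", func() string { return "deducted" })
	return err
}

func main() {
	// History persisted before the crash: the charge finished, the deduction did not.
	r := &Run{History: []Event{{"ChargePayment", "charged"}}}
	err := orderWorkflow(r)
	fmt.Println(r.Executed, err) // prints "[DeductInventory] <nil>"
}
```

The customer is not charged twice because the charge result comes from history. If the workflow code were changed to request a different first activity, `Execute` would return the nondeterminism error instead.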
War story: One of my Workflows once failed during replay with “nondeterminism detected”. It turned out I had added log.Printf("Time: %v", time.Now()) before an Activity. Even though it was just a log line, it still broke determinism!
3. Real-World Code: Order Processing Flow
3.1 Define Activities
```go
// activities.go
package payment

import (
	"context"

	"go.temporal.io/sdk/activity"
)

// Order, PaymentClient, and InventoryClient are defined elsewhere in this package.
type Activities struct {
	PaymentClient   PaymentClient
	InventoryClient InventoryClient
}

func (a *Activities) ChargePayment(ctx context.Context, order Order) error {
	return a.PaymentClient.Charge(order.UserID, order.Amount)
}

func (a *Activities) RefundPayment(ctx context.Context, order Order) error {
	return a.PaymentClient.Refund(order.UserID, order.Amount)
}

func (a *Activities) DeductInventory(ctx context.Context, order Order) error {
	return a.InventoryClient.Deduct(order.ProductID, order.Quantity)
}

func (a *Activities) RestoreInventory(ctx context.Context, order Order) error {
	return a.InventoryClient.Restore(order.ProductID, order.Quantity)
}

// SendNotification is a placeholder that only logs; swap in a call to your
// notification service.
func (a *Activities) SendNotification(ctx context.Context, order Order) error {
	activity.GetLogger(ctx).Info("sending order notification", "userID", order.UserID)
	return nil
}
```
3.2 Define Workflow
```go
// workflow.go
package payment

import (
	"fmt"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func OrderWorkflow(ctx workflow.Context, order Order) error {
	// Activity options: timeout + retry
	activityOptions := workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumInterval:    time.Minute,
			MaximumAttempts:    5,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, activityOptions)

	// A nil pointer is enough here: only the method references are needed
	var activities *Activities

	// Step 1: Charge payment
	err := workflow.ExecuteActivity(ctx, activities.ChargePayment, order).Get(ctx, nil)
	if err != nil {
		return fmt.Errorf("payment failed: %w", err)
	}

	// Step 2: Deduct inventory (refund on failure)
	err = workflow.ExecuteActivity(ctx, activities.DeductInventory, order).Get(ctx, nil)
	if err != nil {
		// Saga compensation: refund, and surface a failed refund instead of hiding it
		refundErr := workflow.ExecuteActivity(ctx, activities.RefundPayment, order).Get(ctx, nil)
		if refundErr != nil {
			return fmt.Errorf("inventory deduction failed (%v) and refund failed: %w", err, refundErr)
		}
		return fmt.Errorf("inventory deduction failed, refunded: %w", err)
	}

	// Step 3: Send notification (allowed to fail, doesn't affect order)
	_ = workflow.ExecuteActivity(ctx, activities.SendNotification, order).Get(ctx, nil)
	return nil
}
```
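To see what the retry settings above mean in practice, the sketch below computes the waits between attempts for an exponential backoff policy. It is a standard-library illustration that ignores any jitter; `RetryPolicy` here is a local struct mirroring the four fields used in the workflow, not the SDK type, and it assumes `MaximumAttempts` counts the first attempt.

```go
package main

import (
	"fmt"
	"time"
)

// RetryPolicy mirrors the fields configured in the workflow (local type, not the SDK's).
type RetryPolicy struct {
	InitialInterval    time.Duration
	BackoffCoefficient float64
	MaximumInterval    time.Duration
	MaximumAttempts    int
}

// orderPolicy holds the same values as the workflow's activity options.
var orderPolicy = RetryPolicy{
	InitialInterval:    time.Second,
	BackoffCoefficient: 2.0,
	MaximumInterval:    time.Minute,
	MaximumAttempts:    5,
}

// retryDelays lists the wait before each retry. With N attempts in total
// there are N-1 waits, each multiplied by the coefficient and capped at
// MaximumInterval.
func retryDelays(p RetryPolicy) []time.Duration {
	var delays []time.Duration
	d := p.InitialInterval
	for attempt := 1; attempt < p.MaximumAttempts; attempt++ {
		if d > p.MaximumInterval {
			d = p.MaximumInterval
		}
		delays = append(delays, d)
		d = time.Duration(float64(d) * p.BackoffCoefficient)
	}
	return delays
}

func main() {
	fmt.Println(retryDelays(orderPolicy)) // prints "[1s 2s 4s 8s]"
}
```

With these settings an Activity that keeps failing gives up after roughly 15 seconds of waiting plus the time spent in the five attempts themselves, which is worth keeping in mind when a downstream service needs longer to recover.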
3.3 Start the Worker
```go
// worker/main.go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"

	// Assumption: placeholder path, replace with the real import path of the
	// payment package (which is also assumed to export the client constructors)
	"example.com/orders/payment"
)

func main() {
	c, err := client.Dial(client.Options{
		HostPort: "localhost:7233",
	})
	if err != nil {
		log.Fatalln("Unable to connect to Temporal:", err)
	}
	defer c.Close()

	w := worker.New(c, "order-task-queue", worker.Options{})

	// Register the Workflow and every exported method of Activities
	w.RegisterWorkflow(payment.OrderWorkflow)
	w.RegisterActivity(&payment.Activities{
		PaymentClient:   payment.NewPaymentClient(),
		InventoryClient: payment.NewInventoryClient(),
	})

	err = w.Run(worker.InterruptCh())
	if err != nil {
		log.Fatalln("Worker exited abnormally:", err)
	}
}
```
4. Monitoring & Debugging
4.1 Temporal Web UI
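Temporal comes with a powerful Web UI. With the docker-compose setup it is served at http://localhost:8080; the port depends on how you run the server (the Temporal CLI dev server, for example, uses 8233).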
You can see:
- Execution status of all Workflows
- Detailed Event History timeline
- Input/output parameters of Activities
- Retry counts and error messages
4.2 Common Issues
| Symptom | Likely Cause | Solution |
|---|---|---|
| Workflow stuck in Running | Activity timeout too long | Adjust StartToCloseTimeout |
| nondeterminism detected | Non-deterministic operation in code | Check for time.Now(), rand, go func |
| Activity keeps retrying | Target service unavailable | Check the Activity’s target service |
| History too large | Workflow running too long | Use ContinueAsNew to split History |
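The ContinueAsNew row deserves a short illustration. In the Go SDK a workflow does this by returning the error produced by `workflow.NewContinueAsNewError`. The sketch below is not SDK code but a standard-library model of the idea, with hypothetical names (`State`, `runOnce`, `runAll`): each run works until it has used up its history budget, then hands a compact state to a fresh run that starts with an empty history.

```go
package main

import "fmt"

// State is the compact snapshot carried from one run to the next.
type State struct {
	NextItem  int
	Processed int
}

// runOnce processes items until the work is done or the per-run event budget
// is used up. It returns the new state, the number of history events this run
// produced, and whether a fresh run should continue the work.
// maxEvents must be greater than zero.
func runOnce(s State, totalItems, maxEvents int) (State, int, bool) {
	events := 0
	for s.NextItem < totalItems {
		if events >= maxEvents {
			return s, events, true // continue as new, carrying only s
		}
		s.NextItem++
		s.Processed++
		events++
	}
	return s, events, false
}

// runAll chains runs until the work is finished and reports how many runs
// were needed and how large the biggest single history became.
func runAll(totalItems, maxEvents int) (runs, largestHistory int, final State) {
	s := State{}
	for {
		next, events, again := runOnce(s, totalItems, maxEvents)
		runs++
		if events > largestHistory {
			largestHistory = events
		}
		s = next
		if !again {
			return runs, largestHistory, s
		}
	}
}

func main() {
	runs, largest, final := runAll(2500, 1000)
	fmt.Println(runs, largest, final.Processed) // prints "3 1000 2500"
}
```

No single run ever holds more than the budgeted number of events, while the total work still completes. The cost is that anything the next run needs must fit into the state you pass along.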
5. Implications for AI Infra
5.1 AI Training Task Orchestration Needs
A typical AI training pipeline:
```text
Data Preprocessing → Model Training → Evaluation → Model Deployment
↓ ↓ ↓
[Retry on fail] [Checkpoint] [Rollback]
```
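This maps naturally onto Temporal’s Workflow/Activity model: each stage becomes an Activity with its own retry policy, and the Workflow encodes the ordering and the rollback path.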
5.2 Temporal + K8s Operator Combo
| Layer | Responsibility | Tech Choice |
|---|---|---|
| Resource Management | Pods, GPUs, Volumes | K8s Operator |
| Task Orchestration | Training flow, failure recovery | Temporal |
| State Storage | Checkpoints, model files | S3/GCS |
My take:
- K8s Operator manages “what” (desired state: N pods)
- Temporal manages “how” (first train, then evaluate, then deploy)
Combining both gives you a complete AI platform control plane.
6. Summary
| Concept | Key Understanding |
|---|---|
| Durable Execution | Process crash doesn’t lose state |
| Workflow/Activity | Separate orchestration logic from business operations |
| Determinism | Foundation for replay, must be strictly followed |
| Event Sourcing | Recover state by replaying historical events |
Key takeaway: Temporal freed me from the quagmire of hand-writing distributed transaction compensation, letting me focus on business logic. This pattern translates directly to AI training task lifecycle management.