Temporal vs K8s Controller: Declarative vs Imperative Control Plane Paradigms

K8s Controllers and Temporal are both “control plane” technologies, but they represent fundamentally different design philosophies. This post deep-dives into their core differences and explores how to combine them in AI Infra platforms.

1. Two Control Plane Paradigms

1.1 Philosophy Comparison

| Dimension | K8s Controller | Temporal |
|---|---|---|
| Core philosophy | Declarative | Imperative/Orchestrated |
| Focus | Desired State | Workflow/Process |
| Trigger mechanism | Level-Triggered | Event Sourcing |
| Expression | “I want 3 Pods” | “First do A, then B; if it fails, do C” |

1.2 Level-Triggered vs Edge-Triggered

K8s Controller’s Level-Triggered approach:

Desired state: replicas=3
Current state: pods=2
  → Reconcile → Create 1 Pod
  → Check again → pods=3 
  → No action needed

Regardless of what triggered Reconcile (Pod deleted, new Deployment, health check failed), the Controller only cares about the gap between current and desired state.

Temporal’s Event Sourcing, an edge-triggered design where every event matters and must be recorded:

Event 1: WorkflowStarted
Event 2: ActivityScheduled(Payment)
Event 3: ActivityCompleted(Success)
Event 4: ActivityScheduled(Inventory)
[Crash Recovery]
→ Replay Events 1-4
→ Resume from Event 4

Temporal needs the complete history to recover state.
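
One consequence of replay: workflow code must be deterministic, so anything that could differ between runs goes through the SDK rather than the standard library. A minimal sketch (Temporal Go SDK, go.temporal.io/sdk/workflow; the function name is illustrative):

func DeterminismExample(ctx workflow.Context) error {
    // workflow.Now is recorded in history, so replay sees the same value;
    // calling time.Now() directly would diverge on re-execution
    startedAt := workflow.Now(ctx)
    _ = startedAt

    // A durable timer: survives Worker crashes and resumes after replay
    return workflow.Sleep(ctx, time.Hour)
}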

1.3 Use Case Analysis

| Scenario | Better Choice | Reason |
|---|---|---|
| Manage Pod/GPU resources | K8s Operator | Resources are “state to maintain” |
| Training task flow | Temporal | It’s “ordered steps” |
| Self-healing | K8s Controller | Periodic Reconcile auto-fixes |
| Complex branching logic | Temporal | Code expresses it more naturally |
| Run indefinitely | K8s Controller | Stateless, low overhead |
| Has clear end | Temporal | Native support for complete/fail/cancel |

2. Code Comparison

2.1 K8s Controller Example

Operator managing an AI training job:

// reconciler.go
func (r *TrainingJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Get the CR resource
    var job trainingv1.TrainingJob
    if err := r.Get(ctx, req.NamespacedName, &job); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Get the current state
    var pods corev1.PodList
    if err := r.List(ctx, &pods, client.MatchingLabels{"job": job.Name}); err != nil {
        return ctrl.Result{}, err
    }

    // Compare desired state with current state
    desiredReplicas := job.Spec.Workers
    currentReplicas := len(pods.Items)

    if currentReplicas < desiredReplicas {
        // Create missing Pods
        for i := currentReplicas; i < desiredReplicas; i++ {
            if err := r.Create(ctx, r.buildWorkerPod(&job, i)); err != nil {
                return ctrl.Result{}, err
            }
        }
    } else if currentReplicas > desiredReplicas {
        // Delete excess Pods
        for i := desiredReplicas; i < currentReplicas; i++ {
            if err := r.Delete(ctx, &pods.Items[i]); err != nil {
                return ctrl.Result{}, err
            }
        }
    }

    // Update status
    job.Status.ActiveWorkers = desiredReplicas
    if err := r.Status().Update(ctx, &job); err != nil {
        return ctrl.Result{}, err
    }

    // Periodic recheck
    return ctrl.Result{RequeueAfter: time.Minute}, nil
}

Characteristics:

  • Each Reconcile starts fresh from “current state”
  • Doesn’t care about “how we got here”, only “what’s the gap now”
  • Great for continuously running resource management
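
For context, the wiring that feeds this Reconcile loop looks roughly like the following (a minimal controller-runtime sketch); watching owned Pods is what turns a Pod deletion into a Reconcile call:

func (r *TrainingJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&trainingv1.TrainingJob{}). // reconcile when the CR changes
        Owns(&corev1.Pod{}).            // ...and when Pods it created change
        Complete(r)
}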

2.2 Temporal Workflow Example

Same AI training job with Temporal:

// workflow.go
func TrainingWorkflow(ctx workflow.Context, job TrainingJob) error {
    // Activities cannot be scheduled without a timeout configured
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 24 * time.Hour,
    })

    // Step 1: Prepare data
    var dataPath string
    err := workflow.ExecuteActivity(ctx, PrepareData, job.DataSource).Get(ctx, &dataPath)
    if err != nil {
        return fmt.Errorf("data preparation failed: %w", err)
    }

    // Step 2: Start training (might run for hours)
    var modelPath string
    err = workflow.ExecuteActivity(ctx, StartTraining, TrainingParams{
        DataPath: dataPath,
        Epochs:   job.Epochs,
        Workers:  job.Workers,
    }).Get(ctx, &modelPath)
    if err != nil {
        // Best-effort cleanup of intermediate data
        _ = workflow.ExecuteActivity(ctx, CleanupData, dataPath).Get(ctx, nil)
        return fmt.Errorf("training failed: %w", err)
    }

    // Step 3: Model evaluation
    var metrics Metrics
    err = workflow.ExecuteActivity(ctx, EvaluateModel, modelPath).Get(ctx, &metrics)
    if err != nil {
        return fmt.Errorf("evaluation failed: %w", err)
    }

    // Step 4: Conditional branch - only deploy if accuracy meets the threshold
    if metrics.Accuracy >= job.MinAccuracy {
        if err := workflow.ExecuteActivity(ctx, DeployModel, modelPath).Get(ctx, nil); err != nil {
            return fmt.Errorf("deployment failed: %w", err)
        }
    } else {
        // Notify for human review
        if err := workflow.ExecuteActivity(ctx, NotifyHumanReview, job.Owner, metrics).Get(ctx, nil); err != nil {
            return fmt.Errorf("notification failed: %w", err)
        }
    }

    return nil
}

Characteristics:

  • Code is the flow — sequential execution, branching clearly visible
  • Steps have dependencies (dataPath → training → modelPath)
  • Great for tasks with clear start and end
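
The ActivityOptions in the example only set a timeout; Temporal also retries failed Activities automatically according to a configurable RetryPolicy (from go.temporal.io/sdk/temporal). A sketch with illustrative values:

ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
    StartToCloseTimeout: time.Hour,
    RetryPolicy: &temporal.RetryPolicy{
        InitialInterval:    time.Minute, // first retry after 1 minute
        BackoffCoefficient: 2.0,         // then 2 min, 4 min, ...
        MaximumAttempts:    3,           // give up after 3 attempts
    },
})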

3. Core Mechanism Comparison

3.1 Failure Recovery

K8s Controller:

Pod dies
  → Informer receives Delete event
  → Triggers Reconcile
  → Sees current=2, desired=3
  → Creates new Pod

Recovery relies on periodic state comparison, no history needed.

Temporal:

Worker dies
  → Workflow execution interrupted
  → Start new Worker
  → Replay from Event History
  → Code re-executes to breakpoint
  → Resume forward execution

Recovery relies on complete event history.

3.2 Complex Branching

Handling complex branching in a K8s Controller is painful:

// Full of state machine logic
switch job.Status.Phase {
case "Pending":
    if allPodsReady() {
        job.Status.Phase = "Running"
    }
case "Running":
    if trainingComplete() {
        job.Status.Phase = "Evaluating"
    } else if hasFailed() {
        job.Status.Phase = "Failed"
    }
case "Evaluating":
    // More branches...
}

Temporal handles it naturally:

// Just use if-else, crystal clear
if metrics.Accuracy >= threshold {
    deployModel()
} else if retryCount < 3 {
    retrainWithMoreData()
} else {
    notifyHuman()
}

3.3 Resource Usage

| Scenario | K8s Controller | Temporal |
|---|---|---|
| 1000 jobs | 1 Controller Pod | 1000 Workflow histories |
| Memory usage | O(1), stateless | O(N), each Workflow holds its history |
| Long-running | Low overhead | History bloat risk |

Temporal’s solution: ContinueAsNew to split history

// When history gets too long, "reincarnate" as a fresh Workflow run
if workflow.GetInfo(ctx).GetCurrentHistoryLength() > 10000 {
    return workflow.NewContinueAsNewError(ctx, TrainingWorkflow, job)
}
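
ContinueAsNew completes the current run and immediately starts a new one under the same Workflow ID, so callers keep a stable handle while the event history resets to zero.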

4. AI Infra Best Practices

4.1 Separation of Concerns Architecture

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
┌─────────────────────────────────────────────┐
│              Temporal Workflow              │
│ (Task orchestration: train → eval → deploy) │
└──────────────────────┬──────────────────────┘
                       │ calls
┌──────────────────────▼──────────────────────┐
│                K8s Operator                 │
│    (Resource management: Pod, GPU, PVC)     │
└──────────────────────┬──────────────────────┘
                       │ manages
┌──────────────────────▼──────────────────────┐
│             Kubernetes Cluster              │
│      (GPU nodes, storage, networking)       │
└─────────────────────────────────────────────┘
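
The “calls” arrow can be as thin as an Activity that submits the CR and lets the Operator do the resource work. A sketch using the controller-runtime client; k8sClient, the Name field, and the TrainingJobSpec layout are assumptions for illustration:

func SubmitTrainingJob(ctx context.Context, job TrainingJob) error {
    cr := &trainingv1.TrainingJob{
        ObjectMeta: metav1.ObjectMeta{
            Name:      job.Name,
            Namespace: "training",
        },
        Spec: trainingv1.TrainingJobSpec{Workers: job.Workers},
    }
    // The Operator's Reconcile loop takes over from here
    return k8sClient.Create(ctx, cr)
}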

4.2 Temporal Activity Calling K8s API

func CreateTrainingPod(ctx context.Context, spec PodSpec) (string, error) {
    clientset := getKubernetesClient()

    pod := &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            GenerateName: "training-",
            Labels:       map[string]string{"managed-by": "temporal"},
        },
        Spec: spec.ToK8sPodSpec(),
    }

    created, err := clientset.CoreV1().Pods("training").Create(ctx, pod, metav1.CreateOptions{})
    if err != nil {
        return "", err
    }

    return created.Name, nil
}

func WaitForPodComplete(ctx context.Context, podName string) error {
    clientset := getKubernetesClient()

    watcher, err := clientset.CoreV1().Pods("training").Watch(ctx, metav1.ListOptions{
        FieldSelector: "metadata.name=" + podName,
    })
    if err != nil {
        return err
    }
    defer watcher.Stop()

    for event := range watcher.ResultChan() {
        pod, ok := event.Object.(*corev1.Pod)
        if !ok {
            continue
        }
        if pod.Status.Phase == corev1.PodSucceeded {
            return nil
        }
        if pod.Status.Phase == corev1.PodFailed {
            return fmt.Errorf("pod failed: %s", pod.Status.Message)
        }
    }

    return fmt.Errorf("watch ended unexpectedly")
}
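
Since WaitForPodComplete can block for hours, it should also heartbeat so Temporal can detect a dead Worker and reschedule the Activity (this assumes a HeartbeatTimeout is set in the ActivityOptions). A sketch of the addition inside the watch loop, using go.temporal.io/sdk/activity:

// Inside the watch loop of WaitForPodComplete:
for event := range watcher.ResultChan() {
    pod, ok := event.Object.(*corev1.Pod)
    if !ok {
        continue
    }
    // Report liveness; if heartbeats stop, Temporal times the Activity
    // out and can retry it on another Worker
    activity.RecordHeartbeat(ctx, string(pod.Status.Phase))

    switch pod.Status.Phase {
    case corev1.PodSucceeded:
        return nil
    case corev1.PodFailed:
        return fmt.Errorf("pod failed: %s", pod.Status.Message)
    }
}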

4.3 When to Use Which?

| Need | Choice |
|---|---|
| “I need 3 GPU Pods” | K8s Operator |
| “First preprocess data, then train, retry 3 times on failure” | Temporal |
| “Auto-restart crashed Pods” | K8s (built-in) |
| “After training succeeds, auto-trigger eval and deploy” | Temporal |
| “Manage GPU resource pool” | K8s Operator + Device Plugin |
| “Coordinate execution order of multiple tasks” | Temporal |

5. Summary

| Dimension | K8s Controller Wins | Temporal Wins |
|---|---|---|
| Resource management | ✓ | |
| Run indefinitely | ✓ | |
| Self-healing | ✓ | |
| Complex workflow orchestration | | ✓ |
| Long transaction compensation | | ✓ |
| Code-as-workflow | | ✓ |

Key insight: These aren’t replacements for each other — they’re complementary. In AI Infra control plane architecture:

  • K8s Operator is the “foundation” — manages underlying resources
  • Temporal is the “building” — orchestrates business workflows

Used together, they form a complete control plane architecture.