K8s Controllers and Temporal are both “control plane” technologies, but they represent fundamentally different design philosophies. This post deep-dives into their core differences and explores how to combine them in AI Infra platforms.
1. Two Control Plane Paradigms
1.1 Philosophy Comparison
| Dimension | K8s Controller | Temporal |
|---|---|---|
| Core philosophy | Declarative | Imperative/Orchestrated |
| Focus | Desired State | Workflow/Process |
| Trigger mechanism | Level-Triggered | Event Sourcing |
| Expression | “I want 3 Pods” | “First do A, then B, if fail do C” |
1.2 Level-Triggered vs Edge-Triggered
K8s Controller’s Level-Triggered approach:
```
Desired state: replicas=3
Current state: pods=2
→ Reconcile → Create 1 Pod
→ Check again → pods=3
→ No action needed
```
Regardless of what triggered Reconcile (Pod deleted, new Deployment, health check failed), the Controller only cares about the gap between current and desired state.
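To make that concrete, here is a minimal, API-free sketch of the level-triggered idea (purely illustrative, not real controller-runtime code): every pass looks only at the difference between observed and desired.

```go
// reconcileReplicas acts only on the gap between observed and desired counts.
// It never asks which event caused the call.
func reconcileReplicas(desired, observed int, create, remove func() error) error {
	for ; observed < desired; observed++ {
		if err := create(); err != nil { // too few: add one
			return err
		}
	}
	for ; observed > desired; observed-- {
		if err := remove(); err != nil { // too many: drop one
			return err
		}
	}
	return nil
}
```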
Temporal’s Event Sourcing:
```
Event 1: WorkflowStarted
Event 2: ActivityScheduled(Payment)
Event 3: ActivityCompleted(Success)
Event 4: ActivityScheduled(Inventory)
[Crash Recovery]
→ Replay Events 1-4
→ Resume from Event 4
```
Temporal needs the complete history to recover state.
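Because recovery means re-running the workflow code against this recorded history, workflow code must be deterministic. A rough sketch of what that implies with the Temporal Go SDK's replay-safe helpers (workflow.Now, workflow.Sleep, workflow.SideEffect are real SDK calls; newRandomID is a hypothetical placeholder):

```go
func ReplaySafeExample(ctx workflow.Context) error {
	// time.Now() would differ between the original run and replay;
	// workflow.Now() comes from history, so it replays identically.
	startedAt := workflow.Now(ctx)
	workflow.GetLogger(ctx).Info("workflow started", "at", startedAt)

	// time.Sleep would block the worker; workflow.Sleep is a durable timer
	// recorded as an event and survives worker restarts.
	if err := workflow.Sleep(ctx, time.Hour); err != nil {
		return err
	}

	// Anything non-deterministic (random IDs, reading config) goes through
	// SideEffect: the first result is recorded and reused on replay.
	var runID string
	if err := workflow.SideEffect(ctx, func(workflow.Context) interface{} {
		return newRandomID() // hypothetical helper
	}).Get(&runID); err != nil {
		return err
	}
	return nil
}
```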
1.3 Use Case Analysis
| Scenario | Better Choice | Reason |
|---|---|---|
| Manage Pod/GPU resources | K8s Operator | Resources are “state to maintain” |
| Training task flow | Temporal | It’s “ordered steps” |
| Self-healing | K8s Controller | Periodic Reconcile auto-fixes |
| Complex branching logic | Temporal | Code expresses it more naturally |
| Run indefinitely | K8s Controller | Stateless, low overhead |
| Has clear end | Temporal | Native support for complete/fail/cancel |
2. Code Comparison
2.1 K8s Controller Example
Operator managing an AI training job:
```go
// reconciler.go
func (r *TrainingJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the CR
	var job trainingv1.TrainingJob
	if err := r.Get(ctx, req.NamespacedName, &job); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Observe current state
	var pods corev1.PodList
	if err := r.List(ctx, &pods, client.MatchingLabels{"job": job.Name}); err != nil {
		return ctrl.Result{}, err
	}

	// Compare desired state with current state
	desiredReplicas := job.Spec.Workers
	currentReplicas := len(pods.Items)

	if currentReplicas < desiredReplicas {
		// Create missing Pods
		for i := currentReplicas; i < desiredReplicas; i++ {
			if err := r.Create(ctx, r.buildWorkerPod(&job, i)); err != nil {
				return ctrl.Result{}, err
			}
		}
	} else if currentReplicas > desiredReplicas {
		// Delete excess Pods
		for i := desiredReplicas; i < currentReplicas; i++ {
			if err := r.Delete(ctx, &pods.Items[i]); err != nil {
				return ctrl.Result{}, err
			}
		}
	}

	// Update status
	job.Status.ActiveWorkers = desiredReplicas
	if err := r.Status().Update(ctx, &job); err != nil {
		return ctrl.Result{}, err
	}

	// Requeue for a periodic recheck
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}
```
Characteristics:
- Each Reconcile starts fresh from “current state”
- Doesn’t care about “how we got here”, only “what’s the gap now”
- Great for continuously running resource management
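This "don't care how we got here" behavior comes from how the controller is wired up: it watches both the TrainingJob CR and the Pods it owns, and every event on either simply enqueues another Reconcile. A sketch using controller-runtime's builder (assuming buildWorkerPod sets an owner reference back to the TrainingJob):

```go
// SetupWithManager registers the controller. Create/Update/Delete events on a
// TrainingJob, or on any Pod it owns, all funnel into the same Reconcile call;
// the reconciler never needs to know which event fired.
func (r *TrainingJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&trainingv1.TrainingJob{}).
		Owns(&corev1.Pod{}).
		Complete(r)
}
```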
2.2 Temporal Workflow Example
Same AI training job with Temporal:
```go
// workflow.go
func TrainingWorkflow(ctx workflow.Context, job TrainingJob) error {
	// Activities need timeouts configured before they can be scheduled
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 24 * time.Hour, // training may run for hours
	})

	// Step 1: Prepare data
	var dataPath string
	err := workflow.ExecuteActivity(ctx, PrepareData, job.DataSource).Get(ctx, &dataPath)
	if err != nil {
		return fmt.Errorf("data preparation failed: %w", err)
	}

	// Step 2: Start training (might run for hours)
	var modelPath string
	err = workflow.ExecuteActivity(ctx, StartTraining, TrainingParams{
		DataPath: dataPath,
		Epochs:   job.Epochs,
		Workers:  job.Workers,
	}).Get(ctx, &modelPath)
	if err != nil {
		// Clean up intermediate data (best effort)
		_ = workflow.ExecuteActivity(ctx, CleanupData, dataPath).Get(ctx, nil)
		return fmt.Errorf("training failed: %w", err)
	}

	// Step 3: Model evaluation
	var metrics Metrics
	err = workflow.ExecuteActivity(ctx, EvaluateModel, modelPath).Get(ctx, &metrics)
	if err != nil {
		return fmt.Errorf("evaluation failed: %w", err)
	}

	// Step 4: Conditional branch - only deploy if accuracy meets the threshold
	if metrics.Accuracy >= job.MinAccuracy {
		if err := workflow.ExecuteActivity(ctx, DeployModel, modelPath).Get(ctx, nil); err != nil {
			return fmt.Errorf("deployment failed: %w", err)
		}
	} else {
		// Notify for human review
		if err := workflow.ExecuteActivity(ctx, NotifyHumanReview, job.Owner, metrics).Get(ctx, nil); err != nil {
			return fmt.Errorf("notification failed: %w", err)
		}
	}
	return nil
}
```
Characteristics:
- Code is the flow — sequential execution, branching clearly visible
- Steps have dependencies (dataPath → training → modelPath)
- Great for tasks with clear start and end
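For completeness, this workflow only executes once it is registered on a worker polling a task queue. A minimal worker sketch with the Temporal Go SDK (the task-queue name is illustrative, and the workflow and activities are assumed to live in the same package):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the Temporal service (defaults to localhost:7233).
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalf("unable to create Temporal client: %v", err)
	}
	defer c.Close()

	// One worker process hosts both the workflow and its activities.
	w := worker.New(c, "training-task-queue", worker.Options{})
	w.RegisterWorkflow(TrainingWorkflow)
	w.RegisterActivity(PrepareData)
	w.RegisterActivity(StartTraining)
	w.RegisterActivity(EvaluateModel)
	w.RegisterActivity(DeployModel)
	w.RegisterActivity(NotifyHumanReview)
	w.RegisterActivity(CleanupData)

	// Block and poll the task queue until interrupted.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalf("worker exited: %v", err)
	}
}
```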
3. Core Mechanism Comparison
3.1 Failure Recovery
K8s Controller:
```
Pod dies
→ Informer receives Delete event
→ Triggers Reconcile
→ Sees current=2, desired=3
→ Creates new Pod
```
Recovery relies on periodic state comparison, no history needed.
Temporal:
```
Worker dies
→ Workflow execution interrupted
→ Start new Worker
→ Replay from Event History
→ Code re-executes up to the point of interruption
→ Resume forward execution
```
Recovery relies on complete event history.
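Retries fit the same history-based model: each activity attempt is recorded as events, so a retry sequence continues even if the worker that started it is gone. The retry behavior is plain configuration on the activity options inside a workflow function (values below are illustrative):

```go
// Illustrative activity options for long training steps; every retry attempt
// becomes events in history, so a restarted worker resumes the same sequence.
ao := workflow.ActivityOptions{
	StartToCloseTimeout: 24 * time.Hour, // a single training attempt may run for hours
	HeartbeatTimeout:    time.Minute,    // detect a dead worker without waiting out the full timeout
	RetryPolicy: &temporal.RetryPolicy{
		InitialInterval:    10 * time.Second,
		BackoffCoefficient: 2.0,
		MaximumAttempts:    3,
	},
}
ctx = workflow.WithActivityOptions(ctx, ao)
```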
3.2 Complex Branching
K8s Controller handling complex branches is painful:
```go
// Full of state machine logic
switch job.Status.Phase {
case "Pending":
	if allPodsReady() {
		job.Status.Phase = "Running"
	}
case "Running":
	if trainingComplete() {
		job.Status.Phase = "Evaluating"
	} else if hasFailed() {
		job.Status.Phase = "Failed"
	}
case "Evaluating":
	// More branches...
}
```
Temporal handles it naturally:
```go
// Just use if-else, crystal clear
if metrics.Accuracy >= threshold {
	deployModel()
} else if retryCount < 3 {
	retrainWithMoreData()
} else {
	notifyHuman()
}
```
3.3 Resource Usage
| Scenario | K8s Controller | Temporal |
|---|---|---|
| 1000 jobs | 1 Controller Pod | 1000 Workflow histories |
| Memory usage | O(1) - stateless | O(N) - each Workflow takes memory |
| Long-running | Low overhead | History bloat risk |
Temporal’s solution: ContinueAsNew to split history
```go
// When history gets too long, "reincarnate" as a new Workflow run
if workflow.GetInfo(ctx).GetCurrentHistoryLength() > 10000 {
	return workflow.NewContinueAsNewError(ctx, TrainingWorkflow, job)
}
```
4. AI Infra Best Practices
4.1 Separation of Concerns Architecture
```
┌──────────────────────────────────────────────┐
│              Temporal Workflow               │
│ (Task orchestration: train → eval → deploy)  │
└──────────────────────┬───────────────────────┘
                       │ calls
┌──────────────────────▼───────────────────────┐
│                 K8s Operator                 │
│     (Resource management: Pod, GPU, PVC)     │
└──────────────────────┬───────────────────────┘
                       │ manages
┌──────────────────────▼───────────────────────┐
│              Kubernetes Cluster              │
│       (GPU nodes, storage, networking)       │
└──────────────────────────────────────────────┘
```
4.2 Temporal Activity Calling K8s API
```go
func CreateTrainingPod(ctx context.Context, spec PodSpec) (string, error) {
	clientset := getKubernetesClient()

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "training-",
			Labels:       map[string]string{"managed-by": "temporal"},
		},
		Spec: spec.ToK8sPodSpec(),
	}

	created, err := clientset.CoreV1().Pods("training").Create(ctx, pod, metav1.CreateOptions{})
	if err != nil {
		return "", err
	}
	return created.Name, nil
}

func WaitForPodComplete(ctx context.Context, podName string) error {
	clientset := getKubernetesClient()

	watcher, err := clientset.CoreV1().Pods("training").Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + podName,
	})
	if err != nil {
		return err
	}
	defer watcher.Stop()

	for event := range watcher.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue // e.g. a bookmark or error event
		}
		if pod.Status.Phase == corev1.PodSucceeded {
			return nil
		}
		if pod.Status.Phase == corev1.PodFailed {
			return fmt.Errorf("pod failed: %s", pod.Status.Message)
		}
	}
	return fmt.Errorf("watch ended unexpectedly")
}
```
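On the workflow side these activities chain like any other steps. Because WaitForPodComplete can block for a long time, it needs a generous StartToCloseTimeout and ideally a heartbeat (a sketch with illustrative timeouts; heartbeating assumes the activity calls activity.RecordHeartbeat while watching):

```go
func TrainOnKubernetes(ctx workflow.Context, spec PodSpec) error {
	// Short-lived step: create the training Pod.
	createCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})
	var podName string
	if err := workflow.ExecuteActivity(createCtx, CreateTrainingPod, spec).Get(ctx, &podName); err != nil {
		return err
	}

	// Long-lived step: watch the Pod until it succeeds or fails.
	waitCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 24 * time.Hour,
		HeartbeatTimeout:    time.Minute,
	})
	return workflow.ExecuteActivity(waitCtx, WaitForPodComplete, podName).Get(ctx, nil)
}
```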
4.3 When to Use Which?
| Need | Choice |
|---|---|
| “I need 3 GPU Pods” | K8s Operator |
| “First preprocess data, then train, retry 3 times on failure” | Temporal |
| “Auto-restart crashed Pods” | K8s (built-in) |
| “After training succeeds, auto-trigger eval and deploy” | Temporal |
| “Manage GPU resource pool” | K8s Operator + Device Plugin |
| “Coordinate execution order of multiple tasks” | Temporal |
5. Summary
| Dimension | K8s Controller Wins | Temporal Wins |
|---|---|---|
| Resource management | ✓ | |
| Run indefinitely | ✓ | |
| Self-healing | ✓ | |
| Complex workflow orchestration | | ✓ |
| Long transaction compensation | | ✓ |
| Code-as-workflow | | ✓ |
Key insight: These aren’t replacements for each other — they’re complementary. In AI Infra control plane architecture:
- K8s Operator is the “foundation” — manages underlying resources
- Temporal is the “building” — orchestrates business workflows
Used together, they form a complete control plane architecture.