Go Microservices War Stories: 5 Lessons from Dependency Management to Service Discovery

Microservices aren't just about splitting up a monolith. This post documents five real pitfalls I hit in Go microservices projects, and how to avoid them.

1. Deprecated go get Breaking CI Builds

1.1 The Problem

One day, CI suddenly failed:

go get -u github.com/golang/protobuf/protoc-gen-go
# go: go get -u github.com/golang/protobuf/protoc-gen-go:
#     installing executables with 'go get' in module mode is deprecated.

1.2 Root Cause

Starting with Go 1.17, the behavior of go get changed:

| Version          | go get behavior                                                      |
|------------------|----------------------------------------------------------------------|
| 1.16 and earlier | Downloads dependencies and installs binaries                         |
| 1.17             | Still installs, but prints a deprecation warning                     |
| 1.18+            | Only manages go.mod dependencies (the old -d behavior is the default) |

1.3 Solution

# Old way (deprecated)
go get -u github.com/golang/protobuf/protoc-gen-go

# New way: go install with an explicit version (@latest only for one-off local installs)
go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest

1.4 CI Script Fix

# .github/workflows/build.yml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.21"
      - name: Install protoc plugins
        run: |
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.31.0
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@v1.3.0

Lesson: Pin tool versions, don’t use @latest.
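One way to make the pin enforceable is the classic tools.go pattern: blank-import the tools behind a build tag so their versions live in go.mod. A minimal sketch:

```go
//go:build tools

// tools.go: never compiled into the service; it exists so that
// go mod tidy records the tool versions in go.mod.
package tools

import (
	_ "google.golang.org/grpc/cmd/protoc-gen-go-grpc"
	_ "google.golang.org/protobuf/cmd/protoc-gen-go"
)
```

CI can then run go install google.golang.org/protobuf/cmd/protoc-gen-go without a version suffix and get exactly the go.mod-pinned build.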


2. gRPC Version Conflict Causing Runtime Panic

2.1 The Problem

Service crashes on startup:

panic: proto: extension number 1001 is already registered

2.2 Root Cause

Project depends on two protobuf library versions:

go mod graph | grep protobuf
# github.com/golang/protobuf@v1.4.3
# google.golang.org/protobuf@v1.31.0

github.com/golang/protobuf is the legacy library; google.golang.org/protobuf is its replacement. When incompatible versions of the two coexist in one binary, both can end up registering the same extension numbers, and the second registration panics at init time.

2.3 Solutions

Option 1: Force unified version

// go.mod
require (
    // golang/protobuf v1.4+ is implemented on top of the new runtime,
    // so both import paths share a single registry
    github.com/golang/protobuf v1.5.3
    google.golang.org/protobuf v1.31.0
)

(A replace directive mapping the old path to the new module doesn't work here: the two expose different APIs, so code generated against the old library would no longer compile.)
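A quick way to convince yourself the two import paths now share one runtime is to round-trip the same message through both APIs. A sketch using the well-known Timestamp type, assuming the pinned versions above:

```go
package main

import (
	"fmt"

	oldproto "github.com/golang/protobuf/proto" // legacy API (v1.5+ is a shim)
	newproto "google.golang.org/protobuf/proto" // current API
	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	ts := timestamppb.Now()

	// Both packages accept the same message and share one global
	// registry, so there is no duplicate registration at init time.
	a, _ := oldproto.Marshal(ts)
	b, _ := newproto.Marshal(ts)
	fmt.Println("same encoding:", len(a) == len(b))
}
```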

Option 2: Upgrade all dependencies

go get -u ./...
go mod tidy

Option 3: Use go mod why to trace

# Find who's pulling in the old version
go mod why github.com/golang/protobuf

Lesson: Run go mod tidy regularly to keep your dependency graph clean.


3. Service Discovery Failure: DNS Resolution Timeout

3.1 The Problem

Intermittent timeouts on service-to-service calls; the logs show:

context deadline exceeded (Client.Timeout exceeded while awaiting headers)

3.2 Debugging

// Problem code
conn, err := grpc.Dial(
    "user-service:8080",  // Using K8s Service name
    grpc.WithInsecure(),
)

tcpdump showed DNS resolution occasionally taking 5+ seconds.

3.3 Root Cause

The cluster's CoreDNS forwarded misses to an external upstream DNS. With the Pod default of ndots:5, a short name like user-service is first tried against every entry in the DNS search path, and lookups that miss can fall through to the external upstream. That fallthrough is where the multi-second stalls came from.
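Before touching config, it's worth confirming DNS is the slow hop from inside the Pod. A minimal probe, reusing the user-service name from the example above:

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Time a lookup through the same resolver path the service uses
	// (/etc/resolv.conf, search domains and all).
	start := time.Now()
	addrs, err := net.DefaultResolver.LookupHost(ctx, "user-service")
	log.Printf("lookup took %v addrs=%v err=%v", time.Since(start), addrs, err)
}
```

If the latency only shows up for short names and not for the FQDN, the search-domain path addressed below is the culprit.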

3.4 Solutions

Option 1: Use FQDN

// Specify the full domain name to skip most search-domain probing.
// Caveat: with the default ndots:5 this 4-dot name still goes through
// the search list; a trailing dot ("...cluster.local.") makes it absolute.
conn, err := grpc.Dial(
    "user-service.default.svc.cluster.local:8080",
    grpc.WithInsecure(),
)

Option 2: K8s DNS config optimization

# In Pod spec
dnsConfig:
  options:
    - name: ndots
      value: "2"  # Reduce DNS search attempts

Option 3: Bypass gRPC's DNS resolution

import "google.golang.org/grpc/resolver"

func init() {
    // Use passthrough resolver, bypass gRPC's DNS resolution
    resolver.SetDefaultScheme("passthrough")
}
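Flipping the process-wide default is blunt; the scheme can also be chosen per connection by prefixing the target. A sketch, with plaintext credentials as in the earlier examples:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "passthrough:///" scopes the resolver choice to this one connection;
	// the address after it is handed to the dialer untouched.
	conn, err := grpc.Dial(
		"passthrough:///user-service:8080",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```

Trade-off: passthrough gives gRPC a single address, so client-side load balancing across Pod IPs is lost; keep the dns scheme if you rely on it.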

Lesson: Microservice network issues often aren’t in the code layer.


4. Context Leak Causing Goroutine Explosion

4.1 The Problem

After the service has been running for a while, memory keeps growing. pprof shows tens of thousands of goroutines stuck waiting:

go tool pprof http://localhost:6060/debug/pprof/goroutine
# 50000 goroutines, 90% in select {}
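That :6060 endpoint only exists if the service wires it up; the usual way is a blank import of net/http/pprof on a private port:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Keep pprof off the public listener; bind it to localhost only.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... rest of the service
	select {}
}
```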

4.2 Problem Code

func HandleRequest(ctx context.Context, req *Request) {
    // Created a new context but discarded its cancel func
    newCtx, _ := context.WithTimeout(ctx, 10*time.Second)
    
    go func() {
        // This goroutine blocks on newCtx.Done(). Since cancel is never
        // called, Done only fires when the 10s timer expires, so every
        // request that finishes early leaves a goroutine (and a timer)
        // alive for the rest of the 10 seconds. At high QPS they pile up.
        <-newCtx.Done()
        cleanup()
    }()
    
    // ... handle request
}

4.3 Correct Approach

func HandleRequest(ctx context.Context, req *Request) {
    newCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()  // Key: ensure context is cancelled
    
    done := make(chan struct{})
    go func() {
        defer close(done)
        // Async work
    }()
    
    select {
    case <-done:
        // Normal completion
    case <-newCtx.Done():
        // Timeout
    }
}

4.4 Detection Tool

// Use goleak to detect goroutine leaks
import (
    "testing"

    "go.uber.org/goleak"
)

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}
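goleak can also guard a single test, which narrows a leak down to one case (TestHandleRequest is a hypothetical name):

```go
func TestHandleRequest(t *testing.T) {
	defer goleak.VerifyNone(t) // fails the test if goroutines outlive it
	// ... exercise the handler and wait for it to finish
}
```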

Lesson: The cancel function returned by context.WithCancel/WithTimeout must always be called; go vet's lostcancel check flags call sites that discard it.


5. Graceful Shutdown: Killing Requests Mid-Flight

5.1 The Problem

During service restarts, users report failed requests. Logs show:

http: Server closed

5.2 Problem Code

func main() {
    srv := &http.Server{Addr: ":8080"}
    go srv.ListenAndServe()
    
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM)
    <-quit
    
    srv.Close()  // Closes immediately, doesn't wait for requests!
}

5.3 Correct Approach

func main() {
    srv := &http.Server{Addr: ":8080"}
    
    go func() {
        if err := srv.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatalf("listen: %s\n", err)
        }
    }()
    
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
    <-quit
    
    log.Println("Shutting down server...")
    
    // Graceful shutdown: wait up to 30 seconds for existing requests
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    
    if err := srv.Shutdown(ctx); err != nil {
        log.Fatal("Server forced to shutdown:", err)
    }
    
    log.Println("Server exited")
}

5.4 K8s PreStop Hook

# Pod spec
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]
# Give K8s time to update Endpoints, prevent traffic from coming in

5.5 gRPC Graceful Shutdown

func main() {
    lis, err := net.Listen("tcp", ":9090")
    if err != nil {
        log.Fatalf("listen: %v", err)
    }
    
    srv := grpc.NewServer()
    // ... register services
    
    go func() {
        if err := srv.Serve(lis); err != nil {
            log.Fatalf("serve: %v", err)
        }
    }()
    
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM)
    <-quit
    
    // GracefulStop stops accepting new RPCs and waits for in-flight ones
    srv.GracefulStop()
}
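GracefulStop drains in-flight RPCs, but nothing tells load balancers to stop sending new ones. If the service exposes the standard gRPC health check, flipping it to NOT_SERVING first closes that gap. A sketch using google.golang.org/grpc/health, assuming your probes are wired to the health service:

```go
package main

import (
	"os"
	"os/signal"
	"syscall"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	srv := grpc.NewServer()
	healthSrv := health.NewServer()
	healthpb.RegisterHealthServer(srv, healthSrv)

	// ... register services and start srv.Serve in a goroutine

	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGTERM)
	<-quit

	// Mark unhealthy first so probes and balancers stop routing here,
	// then drain what's already in flight.
	healthSrv.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
	srv.GracefulStop()
}
```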

Lesson: Graceful shutdown isn’t just code — you also need K8s preStop and terminationGracePeriodSeconds.


6. Summary

| Problem             | Root Cause               | Solution                     |
|---------------------|--------------------------|------------------------------|
| go get deprecated   | Go 1.17 behavior change  | go install with @version     |
| protobuf panic      | Old/new library conflict | Unify versions + go mod tidy |
| DNS timeout         | K8s DNS config           | FQDN + adjust ndots          |
| Goroutine leak      | Context not cancelled    | defer cancel()               |
| Request interrupted | No graceful shutdown     | srv.Shutdown() + preStop     |

Core lesson: Microservice complexity isn’t in the split — it’s in the edge cases that distributed environments bring. Every problem requires thinking across code, config, and infrastructure layers.