Production-Grade String Parsing: From atoi to State Machine

Paul included in Go

2022-04-14 1790 words 9 minutes

LeetCode’s atoi is a classic interview problem, but production-grade implementations are far more complex. This post starts from the interview version and gradually analyzes Go and Rust standard library implementations to understand “production-grade” code design.

1. Interview Version atoi

1.1 LeetCode 8: String to Integer

Basic requirements:

Skip leading whitespace
Handle positive/negative signs
Convert digit characters
Handle overflow

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
func myAtoi(s string) int {
    i, n := 0, len(s)
    
    // Skip whitespace
    for i < n && s[i] == ' ' {
        i++
    }
    
    if i >= n {
        return 0
    }
    
    // Handle sign
    sign := 1
    if s[i] == '+' {
        i++
    } else if s[i] == '-' {
        sign = -1
        i++
    }
    
    // Convert digits
    result := 0
    for i < n && s[i] >= '0' && s[i] <= '9' {
        digit := int(s[i] - '0')
        
        // Overflow check
        if result > (math.MaxInt32-digit)/10 {
            if sign == 1 {
                return math.MaxInt32
            }
            return math.MinInt32
        }
        
        result = result*10 + digit
        i++
    }
    
    return sign * result
}

This passes LeetCode, but it’s far from production-grade.

1.2 What’s Missing?

Interview Version	Production Requirements
Only handles int32	Supports multiple types (int8, int64, uint…)
Only returns result	Needs detailed error messages
Only base 10	Supports multiple bases (2, 8, 16…)
Simple overflow handling	Precise overflow detection
Fixed format	Supports 0x, 0o, 0b prefixes

2. Go Standard Library Analysis

2.1 strconv.ParseInt Signature

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
func ParseInt(s string, base int, bitSize int) (i int64, err error)

// Parameters:
//   s: input string
//   base: radix (0 means auto-detect)
//   bitSize: target type bits (8, 16, 32, 64)

// Returns:
//   i: parsed result
//   err: error info (includes detailed reason)

2.2 Error Types

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
type NumError struct {
    Func string // Function name ("ParseInt")
    Num  string // Input string
    Err  error  // Specific error
}

var (
    ErrRange = errors.New("value out of range")
    ErrSyntax = errors.New("invalid syntax")
)

2.3 Source Analysis

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
// src/strconv/atoi.go (simplified)
func ParseInt(s string, base int, bitSize int) (i int64, err error) {
    // Empty string
    if s == "" {
        return 0, syntaxError("ParseInt", s)
    }
    
    // Handle sign
    s0 := s
    neg := false
    if s[0] == '+' {
        s = s[1:]
    } else if s[0] == '-' {
        neg = true
        s = s[1:]
    }
    
    // Handle base prefix
    if base == 0 {
        base = 10
        if s[0] == '0' {
            switch {
            case len(s) >= 2 && lower(s[1]) == 'x':
                base = 16
                s = s[2:]
            case len(s) >= 2 && lower(s[1]) == 'o':
                base = 8
                s = s[2:]
            case len(s) >= 2 && lower(s[1]) == 'b':
                base = 2
                s = s[2:]
            default:
                base = 8
                s = s[1:]
            }
        }
    }
    
    // Calculate overflow bounds
    var cutoff uint64
    switch bitSize {
    case 8:
        cutoff = uint64(1<<7 - 1)
    case 16:
        cutoff = uint64(1<<15 - 1)
    case 32:
        cutoff = uint64(1<<31 - 1)
    case 64:
        cutoff = uint64(1<<63 - 1)
    }
    if neg {
        cutoff++
    }
    
    // Character-by-character conversion
    var n uint64
    for _, c := range []byte(s) {
        var d byte
        switch {
        case '0' <= c && c <= '9':
            d = c - '0'
        case 'a' <= lower(c) && lower(c) <= 'z':
            d = lower(c) - 'a' + 10
        default:
            return 0, syntaxError("ParseInt", s0)
        }
        
        if d >= byte(base) {
            return 0, syntaxError("ParseInt", s0)
        }
        
        // Overflow check (before multiplication)
        if n >= cutoff {
            return 0, rangeError("ParseInt", s0)
        }
        
        n *= uint64(base)
        n1 := n + uint64(d)
        
        // Check addition overflow
        if n1 < n || n1 > cutoff {
            return 0, rangeError("ParseInt", s0)
        }
        n = n1
    }
    
    if neg {
        return -int64(n), nil
    }
    return int64(n), nil
}

2.4 Design Highlights

Overflow detection before operation: Avoids detecting after already overflowing
Detailed error info: Includes original input for debugging
Unified signed/unsigned handling: Calculate with uint64, convert at end
Auto base detection: 0x, 0o, 0b prefixes

3. State Machine Implementation

3.1 Why State Machine?

Complex parsing needs are hard to handle cleanly with if-else:

1
2
3
4
5
6
7
8
9
Valid input examples:
"  -42"      → -42
"0x1A"       → 26 (auto-detect base 16)
"   +0o17"   → 15 (auto-detect base 8)

Invalid input examples:
"--42"       → syntax error
"0x"         → syntax error (no digits)
"123abc"     → ? (Go errors, some languages accept 123)

3.2 State Machine Model

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
         ┌─────────────────────────────────────────────────────┐
         │                                                      │
    ┌────▼────┐  space   ┌─────────┐  +/-   ┌─────────┐         │
    │  START  │────────▶│ SPACES  │───────▶│  SIGN   │         │
    └─────────┘         └────┬────┘        └────┬────┘         │
         │                   │                   │              │
         │ 0                 │ digit             │ digit        │
         ▼                   │                   │              │
    ┌─────────┐              │                   │              │
    │ PREFIX  │◄─────────────┴───────────────────┘              │
    └────┬────┘                                                 │
         │ x/o/b/digit                                          │
         ▼                                                      │
    ┌─────────┐  digit                                          │
    │ DIGITS  │────────┐                                        │
    └────┬────┘        │                                        │
         │             │                                        │
         │ non-digit   │                                        │
         ▼             ▼                                        │
    ┌─────────┐   ┌─────────┐                                   │
    │  ERROR  │   │   END   │───────────────────────────────────┘
    └─────────┘   └─────────┘

3.3 Go Implementation

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
type State int

const (
    StateStart State = iota
    StateSign
    StatePrefix
    StateDigits
    StateEnd
    StateError
)

type Parser struct {
    state   State
    sign    int
    base    int
    result  uint64
    bitSize int
    cutoff  uint64
}

func NewParser(bitSize int) *Parser {
    cutoff := uint64(1<<(bitSize-1) - 1)
    return &Parser{
        state:   StateStart,
        sign:    1,
        base:    10,
        bitSize: bitSize,
        cutoff:  cutoff,
    }
}

func (p *Parser) Feed(c byte) error {
    switch p.state {
    case StateStart:
        return p.handleStart(c)
    case StateSign:
        return p.handleSign(c)
    case StatePrefix:
        return p.handlePrefix(c)
    case StateDigits:
        return p.handleDigits(c)
    default:
        return errors.New("invalid state")
    }
}

func (p *Parser) handleStart(c byte) error {
    switch {
    case c == ' ' || c == '\t':
        // Stay in Start state
        return nil
    case c == '+':
        p.state = StateSign
        return nil
    case c == '-':
        p.sign = -1
        p.cutoff++ // Negative can be one more
        p.state = StateSign
        return nil
    case c == '0':
        p.state = StatePrefix
        return nil
    case c >= '1' && c <= '9':
        p.state = StateDigits
        return p.addDigit(c - '0')
    default:
        p.state = StateError
        return errors.New("invalid character")
    }
}

func (p *Parser) handlePrefix(c byte) error {
    switch lower(c) {
    case 'x':
        p.base = 16
        p.state = StateDigits
        return nil
    case 'o':
        p.base = 8
        p.state = StateDigits
        return nil
    case 'b':
        p.base = 2
        p.state = StateDigits
        return nil
    default:
        p.base = 8
        p.state = StateDigits
        return p.addDigit(c - '0')
    }
}

func (p *Parser) addDigit(d byte) error {
    if p.result > p.cutoff/uint64(p.base) {
        return errors.New("overflow")
    }
    p.result = p.result*uint64(p.base) + uint64(d)
    if p.result > p.cutoff {
        return errors.New("overflow")
    }
    return nil
}

func (p *Parser) Result() (int64, error) {
    if p.state == StateError {
        return 0, errors.New("parse error")
    }
    if p.sign < 0 {
        return -int64(p.result), nil
    }
    return int64(p.result), nil
}

4. Rust Standard Library Comparison

4.1 Rust’s from_str

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
// Rust uses trait implementation
impl FromStr for i32 {
    type Err = ParseIntError;
    
    fn from_str(s: &str) -> Result<i32, ParseIntError> {
        i32::from_str_radix(s, 10)
    }
}

// Usage
let n: i32 = "42".parse()?;
let n = i32::from_str_radix("0x2A", 16)?;

4.2 Rust’s Error Handling

1
2
3
4
5
6
7
pub enum IntErrorKind {
    Empty,           // Empty string
    InvalidDigit,    // Invalid character
    PosOverflow,     // Positive overflow
    NegOverflow,     // Negative overflow
    Zero,            // Unexpected zero (some contexts)
}

4.3 Comparison

Feature	Go	Rust
Return type	(int64, error)	Result<T, ParseIntError>
Base	Parameter	Method name distinction
Error granularity	Error interface	Enum type
Sign handling	Internal	Type system distinction

5. Practical Applications

5.1 Config File Parsing

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
type Config struct {
    Port     int
    Timeout  time.Duration
    MaxSize  int64
}

func parseSize(s string) (int64, error) {
    s = strings.TrimSpace(s)
    if len(s) == 0 {
        return 0, nil
    }
    
    // Support K, M, G suffixes
    unit := int64(1)
    switch s[len(s)-1] {
    case 'K', 'k':
        unit = 1024
        s = s[:len(s)-1]
    case 'M', 'm':
        unit = 1024 * 1024
        s = s[:len(s)-1]
    case 'G', 'g':
        unit = 1024 * 1024 * 1024
        s = s[:len(s)-1]
    }
    
    n, err := strconv.ParseInt(s, 0, 64)
    if err != nil {
        return 0, err
    }
    
    // Check multiplication overflow
    if n > 0 && unit > math.MaxInt64/n {
        return 0, errors.New("size overflow")
    }
    
    return n * unit, nil
}

// Usage
parseSize("512M")  // 536870912
parseSize("0x100") // 256

5.2 Command Line Arguments

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
func parsePort(s string) (int, error) {
    port, err := strconv.ParseInt(s, 10, 16)  // 16 bits enough for port
    if err != nil {
        return 0, fmt.Errorf("invalid port: %w", err)
    }
    if port < 1 || port > 65535 {
        return 0, fmt.Errorf("port must be 1-65535, got %d", port)
    }
    return int(port), nil
}

6. Summary

Level	Implementation
Interview	if-else, basic cases only
Production	State machine, complete error handling
Standard library	Highly optimized, multi-type multi-base

Key takeaways:

Overflow detection before operation, not after
State machines make complex parsing logic clearer
Error messages should be detailed, include original input
Boundary conditions are what separate interview code from production code

Contents