NAS Disk Health Monitoring: From smartctl to Custom Agent

A NAS’s most important job is protecting user data. Disk failures often have warning signs, and SMART technology can detect problems early. This post documents how to build a disk health monitoring agent.

1. SMART Technology Overview

1.1 What is SMART?

S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) is a self-monitoring facility built into hard drives and SSDs. The drive continuously collects health metrics that the operating system can read out.

1.2 Key Metrics

ID  | Attribute                     | Meaning                           | Danger Threshold
5   | Reallocated Sectors Count     | Remapped bad sectors              | >0 needs attention
187 | Reported Uncorrectable Errors | Errors that couldn't be corrected | >0 dangerous
188 | Command Timeout               | Command timeout count             | Rapid increase = danger
197 | Current Pending Sector Count  | Sectors waiting to be remapped    | >0 = potential failure
198 | Offline Uncorrectable         | Offline uncorrectable errors      | >0 = imminent failure
194 | Temperature                   | Drive temperature                 | >55°C alert
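
These thresholds map naturally onto a small lookup table in the agent. The checks in section 3 are hard-coded, but a table-driven sketch could look like this (attribute 188 is omitted because it needs rate tracking rather than a fixed limit):

// smartThresholds maps ATA SMART attribute IDs to the raw-value limit above
// which the agent should alert; the values mirror the table above.
var smartThresholds = map[int]struct {
    Name  string
    Limit int
}{
    5:   {"Reallocated_Sector_Ct", 0},
    187: {"Reported_Uncorrect", 0},
    194: {"Temperature_Celsius", 55},
    197: {"Current_Pending_Sector", 0},
    198: {"Offline_Uncorrectable", 0},
}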

1.3 Typical Drive Lifespan Patterns

According to Backblaze statistics:

  • Year 1: ~5% annual failure rate (“infant mortality”)
  • Years 2-3: ~1.5% annual failure rate (most stable)
  • Year 4+: Failure rate increases rapidly
  • After Reallocated Sectors > 0, failure probability spikes within 1 year

2. Using smartctl

2.1 Basic Commands

# View all SMART info
smartctl -a /dev/sda

# Health status only
smartctl -H /dev/sda

# JSON output (easier to parse)
smartctl -a /dev/sda -j

2.2 Output Example

{
  "device": {
    "name": "/dev/sda",
    "type": "sat"
  },
  "model_name": "WDC WD40EFRX-68N32N0",
  "serial_number": "WD-WCC7K1234567",
  "ata_smart_attributes": {
    "table": [
      {
        "id": 5,
        "name": "Reallocated_Sector_Ct",
        "value": 200,
        "worst": 200,
        "thresh": 140,
        "raw": {"value": 0}
      },
      {
        "id": 194,
        "name": "Temperature_Celsius",
        "value": 117,
        "raw": {"value": 33}
      }
    ]
  }
}

3. Go Implementation

3.1 Data Structure Definitions

package diskmon

import (
    "context"
    "encoding/json"
    "fmt"
    "log"
    "net/smtp"
    "os/exec"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// SmartctlOutput maps to smartctl -j output
type SmartctlOutput struct {
    Device struct {
        Name string `json:"name"`
        Type string `json:"type"`
    } `json:"device"`
    ModelName    string `json:"model_name"`
    SerialNumber string `json:"serial_number"`
    SmartStatus  struct {
        Passed bool `json:"passed"`
    } `json:"smart_status"`
    ATASmartAttributes struct {
        Table []SmartAttribute `json:"table"`
    } `json:"ata_smart_attributes"`
    Temperature struct {
        Current int `json:"current"`
    } `json:"temperature"`
    PowerOnTime struct {
        Hours int `json:"hours"`
    } `json:"power_on_time"`
}

type SmartAttribute struct {
    ID     int    `json:"id"`
    Name   string `json:"name"`
    Value  int    `json:"value"`
    Worst  int    `json:"worst"`
    Thresh int    `json:"thresh"`
    Raw    struct {
        Value int `json:"value"`
    } `json:"raw"`
}

// DiskHealth is the simplified health status
type DiskHealth struct {
    Device              string
    Model               string
    Serial              string
    Passed              bool
    Temperature         int
    PowerOnHours        int
    ReallocatedSectors  int
    PendingSectors      int
    UncorrectableErrors int
    Alerts              []string
}

3.2 Execute smartctl and Parse

func GetDiskHealth(device string) (*DiskHealth, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    
    cmd := exec.CommandContext(ctx, "smartctl", "-a", "-j", device)
    output, err := cmd.Output()
    if err != nil {
        // smartctl's exit status is a bitmask: bits 0-2 (values 1, 2, 4) mean
        // smartctl itself failed, while higher bits (8, 16, 32, 64, 128) only
        // flag disk problems and the JSON output is still valid
        if exitErr, ok := err.(*exec.ExitError); ok {
            if exitErr.ExitCode()&0x07 != 0 {
                return nil, fmt.Errorf("smartctl failed: %w", err)
            }
            // Disk problems were reported; keep parsing the output
        } else {
            return nil, err
        }
    }
    
    var raw SmartctlOutput
    if err := json.Unmarshal(output, &raw); err != nil {
        return nil, fmt.Errorf("parse smartctl output: %w", err)
    }
    
    return analyzeHealth(&raw), nil
}

func analyzeHealth(raw *SmartctlOutput) *DiskHealth {
    health := &DiskHealth{
        Device:       raw.Device.Name,
        Model:        raw.ModelName,
        Serial:       raw.SerialNumber,
        Passed:       raw.SmartStatus.Passed,
        Temperature:  raw.Temperature.Current,
        PowerOnHours: raw.PowerOnTime.Hours,
    }
    
    // Extract key attributes
    for _, attr := range raw.ATASmartAttributes.Table {
        switch attr.ID {
        case 5: // Reallocated Sectors
            health.ReallocatedSectors = attr.Raw.Value
        case 197: // Current Pending Sector
            health.PendingSectors = attr.Raw.Value
        case 187, 198: // Uncorrectable Errors
            health.UncorrectableErrors += attr.Raw.Value
        }
    }
    
    // Generate alerts
    health.Alerts = generateAlerts(health)
    
    return health
}
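
For a quick one-off check the function can be called directly. A usage sketch (the device path is illustrative; it assumes the code above lives in, or is imported by, a small main package):

func main() {
    health, err := GetDiskHealth("/dev/sda")
    if err != nil {
        log.Fatalf("smart check failed: %v", err)
    }
    fmt.Printf("%s (%s): passed=%v temp=%d°C power-on=%dh reallocated=%d\n",
        health.Device, health.Model, health.Passed,
        health.Temperature, health.PowerOnHours, health.ReallocatedSectors)
    for _, alert := range health.Alerts {
        fmt.Println(alert)
    }
}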

3.3 Alert Strategy

type AlertLevel int

const (
    AlertNone AlertLevel = iota
    AlertWarning
    AlertCritical
)

func generateAlerts(h *DiskHealth) []string {
    var alerts []string
    
    // SMART self-test failed = replace immediately
    if !h.Passed {
        alerts = append(alerts, "[CRITICAL] SMART self-test failed! Backup data and replace drive immediately")
    }
    
    // Reallocated sectors
    if h.ReallocatedSectors > 0 {
        if h.ReallocatedSectors < 10 {
            alerts = append(alerts, fmt.Sprintf("[WARNING] Found %d reallocated sectors, monitor closely", h.ReallocatedSectors))
        } else {
            alerts = append(alerts, fmt.Sprintf("[CRITICAL] Too many reallocated sectors (%d), replace soon", h.ReallocatedSectors))
        }
    }
    
    // Pending sectors (more dangerous)
    if h.PendingSectors > 0 {
        alerts = append(alerts, fmt.Sprintf("[CRITICAL] %d pending sectors, data may be at risk", h.PendingSectors))
    }
    
    // Uncorrectable errors
    if h.UncorrectableErrors > 0 {
        alerts = append(alerts, fmt.Sprintf("[CRITICAL] Detected %d uncorrectable errors", h.UncorrectableErrors))
    }
    
    // Temperature
    if h.Temperature > 55 {
        alerts = append(alerts, fmt.Sprintf("[WARNING] Disk temperature too high: %d°C, check cooling", h.Temperature))
    }
    
    // Power-on time
    if h.PowerOnHours > 35000 { // About 4 years
        alerts = append(alerts, fmt.Sprintf("[INFO] Drive has been running %d hours (%.1f years), consider replacement", 
            h.PowerOnHours, float64(h.PowerOnHours)/8760))
    }
    
    return alerts
}

3.4 Scheduled Checks

// Alert is a single alert event produced by a disk check
type Alert struct {
    Device  string
    Message string
    Time    time.Time
}

type DiskMonitor struct {
    devices   []string
    interval  time.Duration
    alertChan chan Alert
    ctx       context.Context
    cancel    context.CancelFunc
}

func NewDiskMonitor(devices []string, interval time.Duration) *DiskMonitor {
    ctx, cancel := context.WithCancel(context.Background())
    return &DiskMonitor{
        devices:   devices,
        interval:  interval,
        alertChan: make(chan Alert, 100),
        ctx:       ctx,
        cancel:    cancel,
    }
}

func (m *DiskMonitor) Run() {
    ticker := time.NewTicker(m.interval)
    defer ticker.Stop()
    
    // Check immediately on startup
    m.checkAllDisks()
    
    for {
        select {
        case <-m.ctx.Done():
            return
        case <-ticker.C:
            m.checkAllDisks()
        }
    }
}

func (m *DiskMonitor) checkAllDisks() {
    for _, device := range m.devices {
        health, err := GetDiskHealth(device)
        if err != nil {
            log.Printf("Failed to check %s: %v", device, err)
            continue
        }
        
        // Record to Prometheus
        diskTemperature.WithLabelValues(device).Set(float64(health.Temperature))
        diskReallocatedSectors.WithLabelValues(device).Set(float64(health.ReallocatedSectors))
        
        // Send alerts
        for _, msg := range health.Alerts {
            m.alertChan <- Alert{
                Device:  device,
                Message: msg,
                Time:    time.Now(),
            }
        }
    }
}
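
Wiring the pieces together is then just a matter of starting the alert consumer and the check loop. A minimal sketch (device list and interval are illustrative; consumeAlerts is shown in section 4.1):

func main() {
    mon := NewDiskMonitor([]string{"/dev/sda", "/dev/sdb"}, 15*time.Minute)

    // Drain alerts in the background so checkAllDisks never blocks on a full channel
    go mon.consumeAlerts()

    // Run blocks until mon.cancel() is called
    mon.Run()
}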

4. Alert Notifications

4.1 Email Integration

func (m *DiskMonitor) consumeAlerts() {
    for alert := range m.alertChan {
        // Prevent alert storms: same disk + same message, only send once per hour
        if m.isRecentlySent(alert) {
            continue
        }
        
        if err := m.sendEmail(alert); err != nil {
            log.Printf("Failed to send email: %v", err)
        }
        
        m.recordSent(alert)
    }
}

func (m *DiskMonitor) sendEmail(alert Alert) error {
    subject := fmt.Sprintf("[NAS Alert] %s disk anomaly", alert.Device)
    body := fmt.Sprintf(`
Device: %s
Time: %s
Details: %s

Please log in to NAS management interface for details.
`, alert.Device, alert.Time.Format("2006-01-02 15:04:05"), alert.Message)
    
    return smtp.SendMail(/* ... */)
}
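
The isRecentlySent and recordSent helpers referenced above are not shown; a minimal sketch of the in-memory TTL cache, assuming two extra fields on DiskMonitor (sentAlerts map[string]time.Time and sentMu sync.Mutex):

func (m *DiskMonitor) isRecentlySent(a Alert) bool {
    m.sentMu.Lock()
    defer m.sentMu.Unlock()
    // Suppress repeats of the same device + message within one hour
    last, ok := m.sentAlerts[a.Device+"|"+a.Message]
    return ok && time.Since(last) < time.Hour
}

func (m *DiskMonitor) recordSent(a Alert) {
    m.sentMu.Lock()
    defer m.sentMu.Unlock()
    if m.sentAlerts == nil {
        m.sentAlerts = make(map[string]time.Time)
    }
    m.sentAlerts[a.Device+"|"+a.Message] = time.Now()
}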

4.2 Prometheus Metrics

var (
    diskTemperature = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "disk_temperature_celsius",
            Help: "Disk temperature in Celsius",
        },
        []string{"device"},
    )
    
    diskReallocatedSectors = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "disk_reallocated_sectors_total",
            Help: "Number of reallocated sectors",
        },
        []string{"device"},
    )
    
    diskSmartPassed = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "disk_smart_passed",
            Help: "SMART self-test passed (1) or failed (0)",
        },
        []string{"device"},
    )
)
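
promauto registers these gauges with the default registry, so exposing them only requires serving the standard handler. A minimal sketch (the listen address is arbitrary; diskSmartPassed can be set in checkAllDisks the same way as the other two gauges):

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func serveMetrics() {
    http.Handle("/metrics", promhttp.Handler())
    // ListenAndServe blocks; run this in its own goroutine
    log.Fatal(http.ListenAndServe(":9101", nil))
}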

5. Production Considerations

5.1 Avoid Frequent Queries

Each smartctl run issues commands to the drive itself; polling too often adds I/O load and can wake disks that have spun down:

// Recommended intervals
const (
    CheckInterval = 15 * time.Minute // Normal check
    QuickInterval = 5 * time.Minute  // Increased monitoring after finding issues
)
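
How the monitor switches between the two intervals is not shown above; one option is to reset the ticker after a round that produced alerts. A rough sketch (requires Go 1.15+ for Ticker.Reset; the hadAlerts flag is assumed to be computed by checkAllDisks):

// adjustInterval tightens polling after a problem is found and relaxes it
// again once checks come back clean
func (m *DiskMonitor) adjustInterval(ticker *time.Ticker, hadAlerts bool) {
    if hadAlerts {
        ticker.Reset(QuickInterval)
    } else {
        ticker.Reset(CheckInterval)
    }
}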

5.2 Handling SSDs

SSD SMART attributes differ from HDDs:

func analyzeSSDHealth(raw *SmartctlOutput) *DiskHealth {
    health := analyzeHealth(raw)

    // SSD-specific attributes
    for _, attr := range raw.ATASmartAttributes.Table {
        switch attr.Name {
        case "Wear_Leveling_Count":
            // Normalized value: 100 = new drive, approaching 0 = end of life
            if attr.Value < 20 {
                health.Alerts = append(health.Alerts,
                    fmt.Sprintf("[WARNING] Wear leveling value at %d, SSD nearing end of life", attr.Value))
            }
        case "Available_Reservd_Space":
            // Remaining reserved (spare) blocks; a falling value indicates wear
        }
    }

    return health
}
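
To pick the right analyzer per device, the rotation rate reported by smartctl -j can be used (0 means solid-state). A sketch assuming a RotationRate field is added to SmartctlOutput with the json tag "rotation_rate":

// RotationRate int `json:"rotation_rate"`  // assumed addition to SmartctlOutput

func analyzeDevice(raw *SmartctlOutput) *DiskHealth {
    if raw.RotationRate == 0 {
        return analyzeSSDHealth(raw)
    }
    return analyzeHealth(raw)
}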

5.3 RAID Scenarios

For software RAID (mdadm), run the checks against the underlying member disks rather than the /dev/mdX device:

# List RAID members
cat /proc/mdstat

# Run smartctl on each member
smartctl -a /dev/sda
smartctl -a /dev/sdb
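
The same discovery can be done from the agent by parsing /proc/mdstat before building the device list. A rough sketch for the common "mdX : active raidN sda1[0] sdb1[1]" layout (uses os and strings; NVMe partition names would need extra handling):

// raidMembers returns the whole-disk devices behind all active md arrays,
// e.g. ["/dev/sda", "/dev/sdb"]
func raidMembers() ([]string, error) {
    data, err := os.ReadFile("/proc/mdstat")
    if err != nil {
        return nil, err
    }
    seen := map[string]bool{}
    var devices []string
    for _, line := range strings.Split(string(data), "\n") {
        if !strings.Contains(line, " : active") {
            continue
        }
        for _, field := range strings.Fields(line) {
            // Member fields look like "sda1[0]" or "sdb1[1](S)"
            idx := strings.Index(field, "[")
            if idx <= 0 {
                continue
            }
            // Strip the partition number to get the whole disk: sda1 -> sda
            dev := strings.TrimRight(field[:idx], "0123456789")
            if dev != "" && !seen[dev] {
                seen[dev] = true
                devices = append(devices, "/dev/"+dev)
            }
        }
    }
    return devices, nil
}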

6. Summary

Component       | Technology
Data collection | smartctl -j
Parsing         | Go json.Unmarshal
Scheduling      | time.Ticker + context
Alert dedup     | In-memory cache + TTL
Observability   | Prometheus metrics
Notifications   | Email / Webhook

Core principle: disk failures are inevitable, but they can be detected early. A good monitoring system gives users enough time to back up data and replace drives.

