A NAS’s most important job is protecting user data. Disk failures often have warning signs, and SMART technology can detect problems early. This post documents how to build a disk health monitoring agent.
1. SMART Technology Overview
1.1 What is SMART?
S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) is built-in self-testing in hard drives. Drives continuously collect health metrics that the OS can read.
1.2 Key Metrics
| ID | Attribute | Meaning | Danger Threshold |
|---|---|---|---|
| 5 | Reallocated Sectors Count | Remapped bad sectors | >0 needs attention |
| 187 | Reported Uncorrectable Errors | Errors that could not be corrected | >0 dangerous |
| 188 | Command Timeout | Count of commands that timed out | Rapid increase = danger |
| 197 | Current Pending Sector Count | Sectors waiting to be remapped | >0 = potential failure |
| 198 | Offline Uncorrectable | Sectors found uncorrectable during offline scans | >0 = imminent failure |
| 194 | Temperature | Drive temperature | >55°C alert |
1.3 Typical Drive Lifespan Patterns
According to Backblaze statistics:
- Year 1: ~5% annual failure rate (“infant mortality”)
- Years 2-3: ~1.5% annual failure rate (most stable)
- Year 4+: Failure rate increases rapidly
- Once the Reallocated Sectors count rises above 0, the probability of failure within the next year spikes
2. Using smartctl
2.1 Basic Commands
```bash
# View all SMART info
smartctl -a /dev/sda

# Health status only
smartctl -H /dev/sda

# JSON output (easier to parse)
smartctl -a /dev/sda -j
```
2.2 Output Example
```json
{
  "device": {
    "name": "/dev/sda",
    "type": "sat"
  },
  "model_name": "WDC WD40EFRX-68N32N0",
  "serial_number": "WD-WCC7K1234567",
  "ata_smart_attributes": {
    "table": [
      {
        "id": 5,
        "name": "Reallocated_Sector_Ct",
        "value": 200,
        "worst": 200,
        "thresh": 140,
        "raw": {"value": 0}
      },
      {
        "id": 194,
        "name": "Temperature_Celsius",
        "value": 117,
        "raw": {"value": 33}
      }
    ]
  }
}
```
3. Go Implementation
3.1 Data Structure Definitions
```go
package diskmon

// SmartctlOutput maps to smartctl -j output
type SmartctlOutput struct {
    Device struct {
        Name string `json:"name"`
        Type string `json:"type"`
    } `json:"device"`
    ModelName    string `json:"model_name"`
    SerialNumber string `json:"serial_number"`
    SmartStatus  struct {
        Passed bool `json:"passed"`
    } `json:"smart_status"`
    ATASmartAttributes struct {
        Table []SmartAttribute `json:"table"`
    } `json:"ata_smart_attributes"`
    Temperature struct {
        Current int `json:"current"`
    } `json:"temperature"`
    PowerOnTime struct {
        Hours int `json:"hours"`
    } `json:"power_on_time"`
}

type SmartAttribute struct {
    ID     int    `json:"id"`
    Name   string `json:"name"`
    Value  int    `json:"value"`
    Worst  int    `json:"worst"`
    Thresh int    `json:"thresh"`
    Raw    struct {
        Value int `json:"value"`
    } `json:"raw"`
}

// DiskHealth is the simplified health status
type DiskHealth struct {
    Device              string
    Model               string
    Serial              string
    Passed              bool
    Temperature         int
    PowerOnHours        int
    ReallocatedSectors  int
    PendingSectors      int
    UncorrectableErrors int
    Alerts              []string
}
```
3.2 Execute smartctl and Parse
```go
func GetDiskHealth(device string) (*DiskHealth, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    cmd := exec.CommandContext(ctx, "smartctl", "-a", "-j", device)
    output, err := cmd.Output()
    if err != nil {
        // smartctl's exit status is a bitmask: bits 0-2 mean the command itself
        // failed (bad arguments, device open failed, SMART command failed) and
        // there is no usable output; higher bits (e.g. 8 = disk failing,
        // 64 = error log contains errors) still come with valid JSON on stdout.
        if exitErr, ok := err.(*exec.ExitError); ok {
            if exitErr.ExitCode()&0x07 != 0 {
                return nil, fmt.Errorf("smartctl failed: %w", err)
            }
            // Only disk-health bits are set: keep the captured output and parse it
        } else {
            return nil, err
        }
    }

    var raw SmartctlOutput
    if err := json.Unmarshal(output, &raw); err != nil {
        return nil, fmt.Errorf("parse smartctl output: %w", err)
    }
    return analyzeHealth(&raw), nil
}

func analyzeHealth(raw *SmartctlOutput) *DiskHealth {
    health := &DiskHealth{
        Device:       raw.Device.Name,
        Model:        raw.ModelName,
        Serial:       raw.SerialNumber,
        Passed:       raw.SmartStatus.Passed,
        Temperature:  raw.Temperature.Current,
        PowerOnHours: raw.PowerOnTime.Hours,
    }

    // Extract key attributes
    for _, attr := range raw.ATASmartAttributes.Table {
        switch attr.ID {
        case 5: // Reallocated Sectors
            health.ReallocatedSectors = attr.Raw.Value
        case 197: // Current Pending Sector
            health.PendingSectors = attr.Raw.Value
        case 187, 198: // Uncorrectable Errors
            health.UncorrectableErrors += attr.Raw.Value
        }
    }

    // Generate alerts
    health.Alerts = generateAlerts(health)
    return health
}
```
3.3 Alert Strategy
```go
type AlertLevel int

const (
    AlertNone AlertLevel = iota
    AlertWarning
    AlertCritical
)

func generateAlerts(h *DiskHealth) []string {
    var alerts []string

    // SMART self-test failed = replace immediately
    if !h.Passed {
        alerts = append(alerts, "[CRITICAL] SMART self-test failed! Back up data and replace the drive immediately")
    }

    // Reallocated sectors
    if h.ReallocatedSectors > 0 {
        if h.ReallocatedSectors < 10 {
            alerts = append(alerts, fmt.Sprintf("[WARNING] Found %d reallocated sectors, monitor closely", h.ReallocatedSectors))
        } else {
            alerts = append(alerts, fmt.Sprintf("[CRITICAL] Too many reallocated sectors (%d), replace soon", h.ReallocatedSectors))
        }
    }

    // Pending sectors (more dangerous)
    if h.PendingSectors > 0 {
        alerts = append(alerts, fmt.Sprintf("[CRITICAL] %d pending sectors, data may be at risk", h.PendingSectors))
    }

    // Uncorrectable errors
    if h.UncorrectableErrors > 0 {
        alerts = append(alerts, fmt.Sprintf("[CRITICAL] Detected %d uncorrectable errors", h.UncorrectableErrors))
    }

    // Temperature
    if h.Temperature > 55 {
        alerts = append(alerts, fmt.Sprintf("[WARNING] Disk temperature too high: %d°C, check cooling", h.Temperature))
    }

    // Power-on time
    if h.PowerOnHours > 35000 { // about 4 years
        alerts = append(alerts, fmt.Sprintf("[INFO] Drive has been running %d hours (%.1f years), consider replacement",
            h.PowerOnHours, float64(h.PowerOnHours)/8760))
    }

    return alerts
}
```
3.4 Scheduled Checks
```go
type DiskMonitor struct {
    devices   []string
    interval  time.Duration
    alertChan chan Alert
    ctx       context.Context
    cancel    context.CancelFunc
}

func NewDiskMonitor(devices []string, interval time.Duration) *DiskMonitor {
    ctx, cancel := context.WithCancel(context.Background())
    return &DiskMonitor{
        devices:   devices,
        interval:  interval,
        alertChan: make(chan Alert, 100),
        ctx:       ctx,
        cancel:    cancel,
    }
}

func (m *DiskMonitor) Run() {
    ticker := time.NewTicker(m.interval)
    defer ticker.Stop()

    // Check immediately on startup
    m.checkAllDisks()

    for {
        select {
        case <-m.ctx.Done():
            return
        case <-ticker.C:
            m.checkAllDisks()
        }
    }
}

func (m *DiskMonitor) checkAllDisks() {
    for _, device := range m.devices {
        health, err := GetDiskHealth(device)
        if err != nil {
            log.Printf("Failed to check %s: %v", device, err)
            continue
        }

        // Record to Prometheus
        diskTemperature.WithLabelValues(device).Set(float64(health.Temperature))
        diskReallocatedSectors.WithLabelValues(device).Set(float64(health.ReallocatedSectors))

        // Send alerts
        for _, msg := range health.Alerts {
            m.alertChan <- Alert{
                Device:  device,
                Message: msg,
                Time:    time.Now(),
            }
        }
    }
}
```
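NewDiskMonitor takes the device list up front. One way to build that list is to enumerate whole disks with lsblk; below is a minimal sketch (discoverDisks is an illustrative helper, it assumes every TYPE=disk device should be monitored and that os/exec and strings are imported):

```go
// discoverDisks returns the block devices lsblk reports as TYPE "disk",
// e.g. []string{"/dev/sda", "/dev/sdb"}. Partitions and virtual devices
// (loop, md, rom) are filtered out by the -d flag and the TYPE check.
func discoverDisks(ctx context.Context) ([]string, error) {
    // -d: whole disks only, -n: no header, -o: just the columns we need
    out, err := exec.CommandContext(ctx, "lsblk", "-d", "-n", "-o", "NAME,TYPE").Output()
    if err != nil {
        return nil, fmt.Errorf("lsblk: %w", err)
    }
    var devices []string
    for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
        fields := strings.Fields(line)
        if len(fields) == 2 && fields[1] == "disk" {
            devices = append(devices, "/dev/"+fields[0])
        }
    }
    return devices, nil
}
```

The result can be passed straight to NewDiskMonitor(devices, 15*time.Minute).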
4. Alert Notifications
4.1 Email Integration
```go
func (m *DiskMonitor) consumeAlerts() {
    for alert := range m.alertChan {
        // Prevent alert storms: same disk + same message, only send once per hour
        if m.isRecentlySent(alert) {
            continue
        }
        if err := m.sendEmail(alert); err != nil {
            log.Printf("Failed to send email: %v", err)
        }
        m.recordSent(alert)
    }
}

func (m *DiskMonitor) sendEmail(alert Alert) error {
    subject := fmt.Sprintf("[NAS Alert] %s disk anomaly", alert.Device)
    body := fmt.Sprintf(`
Device:  %s
Time:    %s
Details: %s

Please log in to the NAS management interface for details.
`, alert.Device, alert.Time.Format("2006-01-02 15:04:05"), alert.Message)

    return smtp.SendMail(/* ... */)
}
```
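consumeAlerts relies on isRecentlySent and recordSent for the once-per-hour dedup. A minimal sketch of those helpers, assuming a mutex-protected in-memory map keyed by device plus message (the sentMu and sentCache fields are illustrative additions to DiskMonitor):

```go
// Illustrative fields added to DiskMonitor for dedup:
//     sentMu    sync.Mutex
//     sentCache map[string]time.Time // key: device + "|" + message

const alertTTL = time.Hour

func (m *DiskMonitor) isRecentlySent(alert Alert) bool {
    m.sentMu.Lock()
    defer m.sentMu.Unlock()
    last, ok := m.sentCache[alert.Device+"|"+alert.Message]
    return ok && time.Since(last) < alertTTL
}

func (m *DiskMonitor) recordSent(alert Alert) {
    m.sentMu.Lock()
    defer m.sentMu.Unlock()
    if m.sentCache == nil {
        m.sentCache = make(map[string]time.Time)
    }
    m.sentCache[alert.Device+"|"+alert.Message] = time.Now()
}
```

Nothing is persisted, so a restart simply forgets the cache; the worst case is one duplicate email per outstanding alert.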
4.2 Prometheus Metrics
```go
var (
    diskTemperature = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "disk_temperature_celsius",
            Help: "Disk temperature in Celsius",
        },
        []string{"device"},
    )
    diskReallocatedSectors = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "disk_reallocated_sectors_total",
            Help: "Number of reallocated sectors",
        },
        []string{"device"},
    )
    diskSmartPassed = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "disk_smart_passed",
            Help: "SMART self-test passed (1) or failed (0)",
        },
        []string{"device"},
    )
)
```
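These gauges are registered through promauto, so exposing them only requires serving the default registry. A minimal sketch (the :2112 listen address is an arbitrary choice):

```go
import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func startMetricsServer() {
    // Gauges created via promauto live in the default registry,
    // so the stock handler is enough to expose them for scraping.
    http.Handle("/metrics", promhttp.Handler())
    go func() {
        if err := http.ListenAndServe(":2112", nil); err != nil {
            log.Printf("metrics server stopped: %v", err)
        }
    }()
}
```

Point a Prometheus scrape job at this endpoint and the values written in checkAllDisks become graphable and alertable.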
5. Production Considerations
5.1 Avoid Frequent Queries
Running smartctl sends commands directly to the drive, so polling too frequently can hurt performance and may wake drives that have spun down:
```go
// Recommended intervals
const (
    CheckInterval = 15 * time.Minute // Normal check
    QuickInterval = 5 * time.Minute  // Increased monitoring after finding issues
)
```
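Relatedly, if the NAS lets idle drives spin down, a plain smartctl call will wake them on every poll. smartctl's -n standby (--nocheck=standby) option skips the check while the drive is asleep. A hedged sketch of how GetDiskHealth could be adjusted (returning nil, nil for "skipped" is an illustrative convention, and callers such as checkAllDisks would need to handle it):

```go
// Drop-in tweak to GetDiskHealth: ask smartctl not to wake a sleeping drive.
// If the drive is in standby, smartctl skips the check and by default exits
// with status 2, so that case is reported as "no data this cycle".
cmd := exec.CommandContext(ctx, "smartctl", "-a", "-j", "-n", "standby", device)
output, err := cmd.Output()
if exitErr, ok := err.(*exec.ExitError); ok && exitErr.ExitCode() == 2 {
    return nil, nil // drive asleep, check skipped this cycle
}
// ...continue with the existing exit-code and JSON handling
```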
5.2 Handling SSDs
SSDs report a different set of SMART attributes than HDDs, so they need their own checks:
```go
func analyzeSSDHealth(raw *SmartctlOutput) *DiskHealth {
    // Start from the common analysis, then add SSD-specific checks
    health := analyzeHealth(raw)

    for _, attr := range raw.ATASmartAttributes.Table {
        switch attr.Name {
        case "Wear_Leveling_Count":
            // Normalized value: 100 = new drive, 0 = end of life
            if attr.Value < 20 {
                health.Alerts = append(health.Alerts,
                    fmt.Sprintf("[WARNING] SSD wear leveling at %d, nearing end of life", attr.Value))
            }
        case "Available_Reservd_Space":
            // Remaining reserved (spare) space for remapping worn flash
            if attr.Value < attr.Thresh {
                health.Alerts = append(health.Alerts,
                    fmt.Sprintf("[CRITICAL] SSD reserved space below threshold (%d < %d)", attr.Value, attr.Thresh))
            }
        }
    }
    return health
}
```
5.3 RAID Scenarios
For software RAID (mdadm), run SMART checks against the underlying member disks rather than the md device:
```bash
# List RAID members
cat /proc/mdstat

# Run smartctl on each member
smartctl -a /dev/sda
smartctl -a /dev/sdb
```
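To let the agent follow RAID membership automatically, the same information can be read from /proc/mdstat in Go. A rough sketch (raidMemberDisks is illustrative; the parsing is simplified, assumes sdX-style member names such as "sda1[0]", and needs the os and strings packages):

```go
// raidMemberDisks parses /proc/mdstat and returns the whole-disk devices
// backing the active md arrays, e.g. []string{"/dev/sda", "/dev/sdb"}.
func raidMemberDisks() ([]string, error) {
    data, err := os.ReadFile("/proc/mdstat")
    if err != nil {
        return nil, err
    }
    seen := map[string]bool{}
    var disks []string
    for _, line := range strings.Split(string(data), "\n") {
        // Array lines look like: "md0 : active raid1 sdb1[1] sda1[0]"
        fields := strings.Fields(line)
        if len(fields) < 5 || fields[1] != ":" || fields[2] != "active" {
            continue
        }
        for _, f := range fields[4:] { // member tokens like "sda1[0]"
            idx := strings.IndexByte(f, '[')
            if idx <= 0 {
                continue
            }
            // Strip the partition number to get the whole disk: "sda1" -> "/dev/sda"
            disk := "/dev/" + strings.TrimRight(f[:idx], "0123456789")
            if !seen[disk] {
                seen[disk] = true
                disks = append(disks, disk)
            }
        }
    }
    return disks, nil
}
```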
6. Summary
| Component | Technology |
|---|---|
| Data collection | smartctl -j |
| Parsing | Go json.Unmarshal |
| Scheduling | time.Ticker + context |
| Alert dedup | In-memory cache + TTL |
| Observability | Prometheus metrics |
| Notifications | Email / Webhook |
Core principle: Disk failures are inevitable, but they can be detected early. A good monitoring system gives users enough time to back up data and replace drives.