2026-01-21 16:15:49 +08:00


name: implementing-deadman-switch
description: Guides implementation of deadman switch (dead hand system) and heartbeat mechanism in watchdog-agent for authorization enforcement. Use when modifying heartbeat intervals, failure thresholds, or business process termination logic. Keywords: deadman, heartbeat, agent, authorization, sigterm, fail-count, self-destruct.
argument-hint: ": agent-heartbeat | fail-threshold | kill-logic | interval-config"
allowed-tools:
  - Read
  - Glob
  - Grep
  - Bash
  - Edit
  - Write

Implementing Deadman Switch

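watchdog-agent has a built-in dead hand system: when consecutive authorization failures reach the threshold, it terminates the business process.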

Dynamic Context Injection

# Find the agent heartbeat implementation
!`grep -rn "heartbeat\|Heartbeat" rmdc-watchdog-agent/`

# Find the kill logic (-r is required because the target is a directory)
!`grep -rn "SIGTERM\|Kill\|Signal" rmdc-watchdog-agent/`

Plan

Determine the scope of the change from $ARGUMENTS:

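| Component | Files involved | Key parameters |
| --- | --- | --- |
| agent-heartbeat | Agent heartbeat module | HeartbeatRequest/Response |
| fail-threshold | Failure counting logic | maxRetryCount=12 |
| kill-logic | Process termination logic | SIGTERM signal |
| interval-config | Heartbeat interval configuration | 2h after success / 1h after failure |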

Deliverables

  • Agent heartbeat loop implementation
  • Failure counting and threshold check
  • Business process termination logic

Verify

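  • Failure threshold: maxRetryCount = 12
  • Heartbeat interval: 2 hours after a success, 1 hour after a failure
  • TOTP verification: the first connection obtains the secret; subsequent requests are verified in both directions
  • Termination signal: use SIGTERM for graceful termination, not SIGKILL
  • Count reset: after a successful authorization, failCount = 1, not 0
  • Timestamp check: |now - timestamp| < 5 minutes

# Verify that the agent compiles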
!`cd rmdc-watchdog-agent && go build ./...`

# Verify the heartbeat logic
!`cd rmdc-watchdog-agent && go test ./... -v -run TestHeartbeat`

Execute

Heartbeat loop implementation

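// heartbeatLoop sends heartbeats until the deadman switch fires.
func (a *Agent) heartbeatLoop() {
    failCount := 0

    for {
        resp, err := a.sendHeartbeat()

        // A transport error and an explicit denial both count as a failure.
        if err != nil || !resp.Authorized {
            failCount++

            // maxRetryCount = 12: terminate the business process at the threshold.
            if failCount >= 12 {
                a.killBusiness()
                return
            }

            time.Sleep(1 * time.Hour) // wait 1 hour after a failure
        } else {
            failCount = 1 // reset to 1 (not 0) after a success
            time.Sleep(2 * time.Hour) // wait 2 hours after a success
        }
    }
}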
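
The counting rule in the loop above can be isolated into a pure function, which makes the reset-to-1 behavior easy to unit-test without sleeping for hours. This is a sketch, not code from the project; `nextFailCount` and `failuresUntilKill` are hypothetical names. It also makes one consequence of the rule explicit: a fresh agent (count 0) is terminated on its 12th consecutive failure, while an agent that has succeeded at least once (count 1) is terminated on the 11th consecutive failure after that success.

```go
package main

// nextFailCount applies one heartbeat result to the failure counter.
// It returns the new counter and whether the deadman switch should fire.
func nextFailCount(failCount, maxRetryCount int, authorized bool) (next int, kill bool) {
	if authorized {
		return 1, false // the document resets to 1, not 0, after a success
	}
	failCount++
	return failCount, failCount >= maxRetryCount
}

// failuresUntilKill counts how many consecutive failures are tolerated,
// starting from the given counter value, before the switch fires.
func failuresUntilKill(start, maxRetryCount int) int {
	count, n := start, 0
	for {
		var kill bool
		count, kill = nextFailCount(count, maxRetryCount, false)
		n++
		if kill {
			return n
		}
	}
}
```

With `maxRetryCount` set to 12, `failuresUntilKill(0, 12)` models a fresh agent and `failuresUntilKill(1, 12)` models an agent right after a successful authorization.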

Business process termination implementation

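// killBusiness asks the business process to exit gracefully via SIGTERM.
func (a *Agent) killBusiness() {
    log.Warn("deadman switch triggered, terminating business process")
    // Signal returns an error if the process has already exited; do not drop it silently.
    if err := a.businessProcess.Signal(syscall.SIGTERM); err != nil {
        log.Warn("failed to send SIGTERM to business process: " + err.Error())
    }
}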

First-connection handling

func (a *Agent) sendHeartbeat() (*HeartbeatResponse, error) {
    req := &HeartbeatRequest{
        HostInfo:  a.hostInfo,
        EnvInfo:   a.envInfo,
        Timestamp: time.Now().UnixMilli(),
        TOTPCode:  "", // empty on the first connection
    }

    // Not the first connection: generate a TOTP code
    if a.tierTwoSecret != "" {
        req.TOTPCode = totp.GenerateTierTwo(a.tierTwoSecret)
    }

    resp, err := a.httpClient.Post(a.watchdogURL+"/api/heartbeat", req)
    if err != nil {
        return nil, err
    }

    // First connection: store the secret returned by the server
    if resp.SecondTOTPSecret != "" {
        a.tierTwoSecret = resp.SecondTOTPSecret
    }

    // Verify the server's TOTP code (bidirectional verification)
    if req.TOTPCode != "" && !totp.VerifyTierTwo(resp.TOTPCode, a.tierTwoSecret) {
        return nil, errors.New("invalid server totp")
    }

    return resp, nil
}

Pitfalls

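  1. failCount initial value: after a success, set it to 1 rather than 0, to avoid boundary-condition errors.
  2. Misusing SIGKILL: use SIGTERM so that the business process can exit gracefully.
  3. Blocking heartbeat: sendHeartbeat needs a timeout so that network problems cannot hang the loop.
  4. Missing bidirectional verification: the TOTP code returned by the server must be verified.
  5. First-connection special case: when TOTPCode is empty, the agent obtains the secret; this does not count as a failure.
  6. Hard-coded intervals: make them configurable so that different projects can adjust them.
  7. Log leakage: never print the TOTP secret in logs.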

Reference