Backup, Recovery & Monitoring Design

This document describes the backup, recovery, and monitoring design for running a 24/7 AI cohabitation life stream reliably. It covers data protection, failure recovery, uptime monitoring, and alert notifications.


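Design Principles

  1. Protect data: game state and memory data are preserved through scheduled backups.
  2. Minimize downtime: the path from failure to recovery is automated so that manual intervention is rarely needed.
  3. Detect anomalies immediately: metrics monitoring and alerts surface problems early.
  4. Keep it simple: the design is lightweight enough to run on a local PC or an inexpensive VPS.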

1. SQLite Backup

1.1 Backup Strategy

A SQLite database is a single file, so in principle a backup is a file copy. Copying the file while writes are in progress risks a corrupted copy, however, so the backup either uses the SQLite Online Backup API (the sqlite3 .backup command) or copies the file only after a WAL checkpoint.

Backup Flow

┌──────────────────────────────────────────────────────┐
│                   Cron (every 3 hours)                │
│                                                       │
│  1. WAL checkpoint (PRAGMA wal_checkpoint(TRUNCATE))  │
│  2. sqlite3 .backup → timestamped copy                │
│  3. Verify backup integrity (PRAGMA integrity_check)  │
│  4. Rotate old backups (keep N generations)            │
│  5. Log result + notify on failure                     │
└──────────────────────────────────────────────────────┘

1.2 Cron Schedule

| Schedule | Frequency | Purpose |
| --- | --- | --- |
| 0 */3 * * * | Every 3 hours | Regular backup during streaming |
| 0 0 * * * | Daily at midnight | Daily full backup with integrity check |
| 0 0 * * 0 | Weekly (Sunday midnight) | Weekly archive backup (compressed) |

Why every 3 hours?

The game advances in 10-minute cycles, so a 3-hour interval spans 18 cycles (18 × 10 min = 180 min). In the worst case, a failure just before the next backup loses at most 3 hours of game progress.

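1.3 Retention Generations

| Type | Retention | Max Files | Storage Estimate |
| --- | --- | --- | --- |
| 3-hourly | 24 hours | 8 files | ~80 MB (10 MB × 8) |
| Daily | 7 days | 7 files | ~70 MB (10 MB × 7) |
| Weekly | 4 weeks | 4 files (gzip) | ~12 MB (3 MB × 4) |
| Total | | 19 files | ~162 MB |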

Storage Note

The SQLite DB is typically under 10 MB, including memory data. Compression reduces a backup to roughly 30% of its original size.
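
The storage estimate can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the names `RetentionTier`, `TIERS`, `tierStorageMb`, and `totalStorageMb` are hypothetical, and the 10 MB size and 30% compression ratio are the upper-bound assumptions taken from the storage note, not measured values.

```typescript
// Hypothetical helper: estimates disk usage for the retention tiers.
interface RetentionTier {
  name: string;
  maxFiles: number;
  compressed: boolean;
}

const DB_SIZE_MB = 10;         // assumed upper bound for one uncompressed backup
const COMPRESSION_RATIO = 0.3; // assumed: a compressed backup is ~30% of the original

const TIERS: RetentionTier[] = [
  { name: "3-hourly", maxFiles: 8, compressed: false },
  { name: "daily", maxFiles: 7, compressed: false },
  { name: "weekly", maxFiles: 4, compressed: true },
];

// Storage used by one tier when it holds its maximum number of files.
function tierStorageMb(tier: RetentionTier, dbSizeMb: number = DB_SIZE_MB): number {
  const perFileMb = tier.compressed ? dbSizeMb * COMPRESSION_RATIO : dbSizeMb;
  return tier.maxFiles * perFileMb;
}

// Sum over all tiers: 80 + 70 + 12 MB with the assumptions above.
function totalStorageMb(tiers: RetentionTier[], dbSizeMb: number = DB_SIZE_MB): number {
  return tiers.reduce((sum, tier) => sum + tierStorageMb(tier, dbSizeMb), 0);
}
```

Passing a different `dbSizeMb` shows how the total scales if the database grows beyond the assumed 10 MB.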

1.4 Backup Script

bash
#!/bin/bash
# scripts/backup-db.sh
# SQLite backup script for AIPrizeCohabitationLife
# Usage: ./scripts/backup-db.sh [hourly|daily|weekly]

set -euo pipefail

# --- Configuration ---
DB_PATH="${DB_PATH:-./data/game.sqlite}"
BACKUP_DIR="${BACKUP_DIR:-./backups}"
BACKUP_TYPE="${1:-hourly}"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:-}"

# Retention settings
HOURLY_KEEP=8
DAILY_KEEP=7
WEEKLY_KEEP=4

# --- Functions ---

log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] [backup] $1"
}

notify_slack() {
  local message="$1"
  local color="${2:-danger}"
  if [ -n "$SLACK_WEBHOOK_URL" ]; then
    curl -s -X POST "$SLACK_WEBHOOK_URL" \
      -H 'Content-Type: application/json' \
      -d "{\"attachments\":[{\"color\":\"$color\",\"text\":\"$message\"}]}" \
      > /dev/null 2>&1 || true
  fi
}

# --- Pre-checks ---
if [ ! -f "$DB_PATH" ]; then
  log "ERROR: Database not found at $DB_PATH"
  notify_slack ":x: [Backup FAILED] Database not found: $DB_PATH"
  exit 1
fi

mkdir -p "$BACKUP_DIR/hourly" "$BACKUP_DIR/daily" "$BACKUP_DIR/weekly"

# --- Determine target directory ---
case "$BACKUP_TYPE" in
  hourly) TARGET_DIR="$BACKUP_DIR/hourly"; KEEP=$HOURLY_KEEP ;;
  daily)  TARGET_DIR="$BACKUP_DIR/daily";  KEEP=$DAILY_KEEP ;;
  weekly) TARGET_DIR="$BACKUP_DIR/weekly"; KEEP=$WEEKLY_KEEP ;;
  *)
    log "ERROR: Unknown backup type: $BACKUP_TYPE (use hourly|daily|weekly)"
    exit 1
    ;;
esac

BACKUP_FILE="$TARGET_DIR/game_${BACKUP_TYPE}_${TIMESTAMP}.sqlite"

# --- Step 1: WAL Checkpoint ---
log "Running WAL checkpoint..."
sqlite3 "$DB_PATH" "PRAGMA wal_checkpoint(TRUNCATE);" 2>/dev/null || {
  log "WARNING: WAL checkpoint failed, proceeding with backup anyway"
}

# --- Step 2: Create backup using SQLite .backup command ---
log "Creating $BACKUP_TYPE backup..."
sqlite3 "$DB_PATH" ".backup '$BACKUP_FILE'" || {
  log "ERROR: Backup failed!"
  notify_slack ":x: [Backup FAILED] sqlite3 .backup failed for $BACKUP_TYPE"
  exit 1
}

# --- Step 3: Verify backup integrity ---
log "Verifying backup integrity..."
# "|| true" keeps "set -e" from aborting before the failure is logged and reported
INTEGRITY=$(sqlite3 "$BACKUP_FILE" "PRAGMA integrity_check;" 2>/dev/null || true)
if [ "$INTEGRITY" != "ok" ]; then
  log "ERROR: Backup integrity check failed: $INTEGRITY"
  notify_slack ":x: [Backup FAILED] Integrity check failed for $BACKUP_TYPE: $INTEGRITY"
  rm -f "$BACKUP_FILE"
  exit 1
fi

# --- Step 4: Compress weekly backups ---
if [ "$BACKUP_TYPE" = "weekly" ]; then
  log "Compressing weekly backup..."
  gzip "$BACKUP_FILE"
  BACKUP_FILE="${BACKUP_FILE}.gz"
fi

# --- Step 5: Rotate old backups ---
log "Rotating old backups (keeping last $KEEP)..."
ls -1t "$TARGET_DIR"/game_${BACKUP_TYPE}_* 2>/dev/null | tail -n +$((KEEP + 1)) | xargs -r rm -f

# --- Done ---
BACKUP_SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
log "Backup complete: $BACKUP_FILE ($BACKUP_SIZE)"
notify_slack ":white_check_mark: [Backup OK] $BACKUP_TYPE backup complete ($BACKUP_SIZE)" "good"

1.5 Crontab Configuration

bash
# crontab -e
# AIPrizeCohabitationLife SQLite Backups

# Every 3 hours - regular backup
0 */3 * * * cd /path/to/AIPrizeCohabitationLife && ./scripts/backup-db.sh hourly >> ./logs/backup.log 2>&1

# Daily at midnight - full backup with integrity check
0 0 * * * cd /path/to/AIPrizeCohabitationLife && ./scripts/backup-db.sh daily >> ./logs/backup.log 2>&1

# Weekly Sunday midnight - compressed archive
0 0 * * 0 cd /path/to/AIPrizeCohabitationLife && ./scripts/backup-db.sh weekly >> ./logs/backup.log 2>&1

2. Recovery Procedure

2.1 Recovery Decision Flow

Fault detected
      │
      ▼
┌───────────────────────────┐
│ pm2 attempts auto-restart │
└─────┬─────────────────────┘
      │
      ▼
  Start OK? ── Yes ──→ Resumed (normal operation)
      │
      No
      │
      ▼
┌─────────────────────┐
│ Check DB corruption │
└─────┬───────────────┘
      │
      ▼
  Corrupted? ── No ──→ Investigate other causes (check logs)
      │
      Yes
      │
      ▼
┌─────────────────────┐
│ Restore from backup │
└─────────────────────┘
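
The decision flow can be expressed as a small pure function. This is a minimal sketch, not part of the described system: `FaultObservation`, `RecoveryAction`, and `decideRecoveryAction` are hypothetical names, and it assumes the database counts as intact only when `PRAGMA integrity_check;` prints exactly `ok`.

```typescript
type RecoveryAction = "resumed" | "investigate-logs" | "restore-from-backup";

interface FaultObservation {
  restartSucceeded: boolean;    // did pm2 bring the process back up?
  integrityCheckOutput: string; // raw output of `PRAGMA integrity_check;`
}

// Mirrors the flow: restart OK -> resumed; otherwise branch on DB integrity.
function decideRecoveryAction(obs: FaultObservation): RecoveryAction {
  if (obs.restartSucceeded) {
    return "resumed";
  }
  const dbIsIntact = obs.integrityCheckOutput.trim() === "ok";
  return dbIsIntact ? "investigate-logs" : "restore-from-backup";
}
```

Keeping the decision separate from the commands that act on it makes the branching easy to test without touching pm2 or the database.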

2.2 Recovery Steps

Case 1: Process Crash

pm2 restarts the process automatically. Because game state is persisted in SQLite, the game resumes from the last save point without manual steps.

bash
# 1. Check status
pm2 status

# 2. If the process was not auto-restarted
pm2 restart ailife

# 3. Check logs
pm2 logs ailife --lines 50

Case 2: SQLite Data Corruption

bash
# 1. Stop the application so nothing writes to the DB during the restore
pm2 stop ailife

# 2. Move the corrupted DB aside, together with any leftover WAL/SHM files,
#    so a stale WAL is not replayed against the restored file
TS=$(date +%Y%m%d_%H%M%S)
for f in ./data/game.sqlite ./data/game.sqlite-wal ./data/game.sqlite-shm; do
  if [ -f "$f" ]; then mv "$f" "$f.corrupted_$TS"; fi
done

# 3. Find the latest good backup
ls -lt ./backups/hourly/

# 4. Run an integrity check on the candidate
sqlite3 ./backups/hourly/game_hourly_XXXXXXXX_XXXXXX.sqlite "PRAGMA integrity_check;"

# 5. Restore from backup
cp ./backups/hourly/game_hourly_XXXXXXXX_XXXXXX.sqlite ./data/game.sqlite

# 6. Restart the process
pm2 restart ailife

# 7. Verify operation
curl -s http://localhost:3000/api/health | jq .

Case 3: Full Recovery (Total Data Loss)

bash
# 1. Restore from the weekly backup
gunzip -k ./backups/weekly/game_weekly_XXXXXXXX_XXXXXX.sqlite.gz
cp ./backups/weekly/game_weekly_XXXXXXXX_XXXXXX.sqlite ./data/game.sqlite

# 2. Verify integrity
sqlite3 ./data/game.sqlite "PRAGMA integrity_check;"

# 3. Restart all services
pm2 restart all

# 4. Verify that OBS reconnects automatically
# OBS > Settings > Advanced > Automatically reconnect: ON

2.3 Recovery Script

bash
#!/bin/bash
# scripts/restore-db.sh
# Restore game database from backup
# Usage: ./scripts/restore-db.sh [backup_file_path]

set -euo pipefail

DB_PATH="${DB_PATH:-./data/game.sqlite}"
# Default to empty so the usage message below is reachable under "set -u"
BACKUP_FILE="${1:-}"

log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] [restore] $1"
}

if [ -z "$BACKUP_FILE" ]; then
  echo "Usage: $0 <backup_file_path>"
  echo ""
  echo "Available backups:"
  echo "--- Hourly ---"
  ls -lt ./backups/hourly/ 2>/dev/null | head -5
  echo "--- Daily ---"
  ls -lt ./backups/daily/ 2>/dev/null | head -5
  echo "--- Weekly ---"
  ls -lt ./backups/weekly/ 2>/dev/null | head -5
  exit 1
fi

if [ ! -f "$BACKUP_FILE" ]; then
  log "ERROR: Backup file not found: $BACKUP_FILE"
  exit 1
fi

# Handle gzipped backups
RESTORE_FILE="$BACKUP_FILE"
if [[ "$BACKUP_FILE" == *.gz ]]; then
  log "Decompressing backup..."
  RESTORE_FILE="${BACKUP_FILE%.gz}"
  gunzip -k "$BACKUP_FILE"
fi

# Verify backup integrity
log "Verifying backup integrity..."
INTEGRITY=$(sqlite3 "$RESTORE_FILE" "PRAGMA integrity_check;")
if [ "$INTEGRITY" != "ok" ]; then
  log "ERROR: Backup integrity check failed: $INTEGRITY"
  exit 1
fi

# Stop the application
log "Stopping application..."
pm2 stop ailife 2>/dev/null || true

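# Archive the current DB
if [ -f "$DB_PATH" ]; then
  ARCHIVE="$DB_PATH.pre_restore_$(date +%Y%m%d_%H%M%S)"
  log "Archiving current DB to $ARCHIVE"
  mv "$DB_PATH" "$ARCHIVE"
  # Move leftover WAL/SHM files too, so a stale WAL is not replayed against the restored DB
  for ext in -wal -shm; do
    if [ -f "$DB_PATH$ext" ]; then mv "$DB_PATH$ext" "$ARCHIVE$ext"; fi
  done
fi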

# Restore
log "Restoring from $RESTORE_FILE..."
cp "$RESTORE_FILE" "$DB_PATH"

# Restart application
log "Restarting application..."
pm2 restart ailife

log "Restore complete. Verify with: curl http://localhost:3000/api/health"

3. Health Monitoring

3.1 Monitoring Targets

┌──────────────────────────────────────────────────────────────┐
│                    Monitoring Targets                          │
│                                                               │
│  ┌───────────────────┐  ┌───────────────────┐                │
│  │ Process Liveness   │  │ LLM Performance    │               │
│  │ ・pm2 status       │  │ ・Response time     │               │
│  │ ・Memory usage     │  │ ・Error rate        │               │
│  │ ・CPU usage        │  │ ・Timeout count     │               │
│  │ ・Restart count    │  │ ・Fallback rate     │               │
│  └───────────────────┘  └───────────────────┘                │
│                                                               │
│  ┌───────────────────┐  ┌───────────────────┐                │
│  │ Tip Processing     │  │ System Resources   │               │
│  │ ・Tips/hour count  │  │ ・Disk space        │               │
│  │ ・Processing time  │  │ ・SQLite file size  │               │
│  │ ・Queue depth      │  │ ・Log file size     │               │
│  │ ・Failed tips      │  │ ・Network latency   │               │
│  └───────────────────┘  └───────────────────┘                │
│                                                               │
│  ┌───────────────────┐  ┌───────────────────┐                │
│  │ Platform Status    │  │ Game State          │               │
│  │ ・YouTube API OK   │  │ ・Cycle count       │               │
│  │ ・TikTok WS alive  │  │ ・Last cycle time   │               │
│  │ ・OBS connected    │  │ ・Character status  │               │
│  └───────────────────┘  └───────────────────┘                │
└──────────────────────────────────────────────────────────────┘

3.2 Health Check Endpoint

The backend server exposes an /api/health endpoint that returns the status of each component.

typescript
// GET /api/health
// Response example:

interface HealthResponse {
  status: 'healthy' | 'degraded' | 'unhealthy';
  timestamp: string;
  uptime_seconds: number;
  checks: {
    process: {
      status: 'ok' | 'warning' | 'error';
      memory_mb: number;
      cpu_percent: number;
      restart_count: number;
    };
    database: {
      status: 'ok' | 'error';
      size_mb: number;
      last_write: string;
      integrity: 'ok' | 'unknown';
    };
    llm: {
      status: 'ok' | 'degraded' | 'down';
      avg_response_ms: number;
      error_rate_percent: number;
      last_success: string;
      fallback_active: boolean;
    };
    tips: {
      status: 'ok' | 'warning';
      processed_last_hour: number;
      failed_last_hour: number;
      queue_depth: number;
    };
    platforms: {
      youtube: { status: 'ok' | 'disconnected'; last_poll: string };
      tiktok: { status: 'ok' | 'disconnected'; last_event: string };
    };
    game: {
      status: 'ok' | 'stale';
      current_cycle: number;
      last_cycle_at: string;
      game_day: number;
    };
  };
}

json
{
  "status": "healthy",
  "timestamp": "2026-03-11T15:30:00.000Z",
  "uptime_seconds": 86400,
  "checks": {
    "process": {
      "status": "ok",
      "memory_mb": 145,
      "cpu_percent": 8.2,
      "restart_count": 0
    },
    "database": {
      "status": "ok",
      "size_mb": 6.3,
      "last_write": "2026-03-11T15:29:50.000Z",
      "integrity": "ok"
    },
    "llm": {
      "status": "ok",
      "avg_response_ms": 1200,
      "error_rate_percent": 0.5,
      "last_success": "2026-03-11T15:29:55.000Z",
      "fallback_active": false
    },
    "tips": {
      "status": "ok",
      "processed_last_hour": 12,
      "failed_last_hour": 0,
      "queue_depth": 0
    },
    "platforms": {
      "youtube": { "status": "ok", "last_poll": "2026-03-11T15:29:58.000Z" },
      "tiktok": { "status": "ok", "last_event": "2026-03-11T15:28:30.000Z" }
    },
    "game": {
      "status": "ok",
      "current_cycle": 144,
      "last_cycle_at": "2026-03-11T15:20:00.000Z",
      "game_day": 15
    }
  }
}
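
The source does not state how the top-level `status` is derived from the per-component checks. One plausible rule, sketched below purely as an assumption, is "worst component wins": any `error` or `down` makes the service `unhealthy`, any `warning`, `degraded`, `disconnected`, or `stale` makes it `degraded`, and otherwise it is `healthy`. The names `aggregateStatus`, `CRITICAL_STATES`, and `DEGRADED_STATES` are hypothetical.

```typescript
type OverallStatus = "healthy" | "degraded" | "unhealthy";

// Assumed grouping of the component status values used by /api/health.
const CRITICAL_STATES: string[] = ["error", "down"];
const DEGRADED_STATES: string[] = ["warning", "degraded", "disconnected", "stale"];

// Worst-component-wins aggregation over all component status strings.
function aggregateStatus(componentStatuses: string[]): OverallStatus {
  const hasAny = (states: string[]): boolean =>
    componentStatuses.some((s) => states.indexOf(s) !== -1);

  if (hasAny(CRITICAL_STATES)) return "unhealthy";
  if (hasAny(DEGRADED_STATES)) return "degraded";
  return "healthy";
}
```

With this rule, a TikTok disconnect alone yields `degraded`, while a database `error` yields `unhealthy` regardless of the other checks. A real implementation might weight components differently, for example treating a stale game cycle as critical.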

3.3 Metrics Collection

Metrics are collected inside the application, written to the log at regular intervals, and exposed through the health check API.

typescript
// packages/backend/src/monitoring/MetricsCollector.ts

interface Metrics {
  // LLM metrics (sliding window: last 100 calls)
  llm_response_times: number[];       // ms
  llm_error_count: number;            // errors in last hour
  llm_timeout_count: number;          // timeouts in last hour
  llm_fallback_count: number;         // fallbacks in last hour

  // Tip processing metrics
  tips_processed_total: number;       // since startup
  tips_processed_last_hour: number;
  tips_failed_last_hour: number;
  tip_processing_times: number[];     // ms, sliding window

  // Game cycle metrics
  cycles_completed: number;           // since startup
  last_cycle_duration_ms: number;
  cycle_drift_ms: number;             // deviation from 10-min target

  // System metrics
  process_memory_mb: number;
  process_cpu_percent: number;
  db_size_mb: number;
  ws_connected_clients: number;
}
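
The `llm_response_times` field is described as a sliding window over the last 100 calls. A minimal sketch of such a window follows; the class name `SlidingWindow` and its methods are hypothetical, and the percentile uses the nearest-rank method, which is an assumption since the source does not say how percentiles are computed.

```typescript
// Fixed-capacity window that keeps only the most recent samples.
class SlidingWindow {
  private samples: number[] = [];
  private capacity: number;

  constructor(capacity: number = 100) {
    this.capacity = capacity;
  }

  // Add a sample and evict the oldest one once capacity is exceeded.
  push(value: number): void {
    this.samples.push(value);
    if (this.samples.length > this.capacity) {
      this.samples.shift();
    }
  }

  size(): number {
    return this.samples.length;
  }

  // Arithmetic mean of the current window (0 when empty).
  average(): number {
    if (this.samples.length === 0) return 0;
    const sum = this.samples.reduce((a, b) => a + b, 0);
    return sum / this.samples.length;
  }

  // Nearest-rank percentile for integer p in 1..100 (0 when empty).
  percentile(p: number): number {
    if (this.samples.length === 0) return 0;
    const sorted = this.samples.slice().sort((a, b) => a - b);
    const rank = Math.max(Math.ceil((p * sorted.length) / 100), 1);
    return sorted[rank - 1];
  }
}
```

A collector could call `push` after every LLM call and read `average()` for a value such as `avg_response_ms` in the health response.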

3.4 pm2 Process Monitoring

The pm2 configuration file controls automatic restarts and the memory limit.

javascript
// ecosystem.config.js

module.exports = {
  apps: [{
    name: 'ailife',
    script: './packages/backend/dist/server.js',
    instances: 1,
    autorestart: true,
    watch: false,
    max_memory_restart: '512M',
    restart_delay: 5000,            // 5s delay between restarts
    max_restarts: 10,               // max 10 restarts in min_uptime window
    min_uptime: '60s',              // consider stable after 60s
    exp_backoff_restart_delay: 100, // exponential backoff on repeated crashes
    env: {
      NODE_ENV: 'production',
    },
    // Log configuration
    error_file: './logs/pm2-error.log',
    out_file: './logs/pm2-out.log',
    merge_logs: true,
    log_date_format: 'YYYY-MM-DD HH:mm:ss Z',
  }],
};

bash
# Basic pm2 operations

# Start
pm2 start ecosystem.config.js

# Monitor (real-time dashboard)
pm2 monit

# Status check
pm2 status

# Restart
pm2 restart ailife

# View logs
pm2 logs ailife --lines 100

# Flush logs (to prevent log file growth)
pm2 flush

3.5 External Health Check

An external health check script runs every minute via cron and sends an alert when it detects a failure.

bash
#!/bin/bash
# scripts/health-check.sh
# External health check — runs via cron every minute

HEALTH_URL="${HEALTH_URL:-http://localhost:3000/api/health}"
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:-}"
ALERT_STATE_FILE="/tmp/ailife_alert_state"

log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] [health] $1"
}

notify_slack() {
  local message="$1"
  local color="${2:-danger}"
  if [ -n "$SLACK_WEBHOOK_URL" ]; then
    curl -s -X POST "$SLACK_WEBHOOK_URL" \
      -H 'Content-Type: application/json' \
      -d "{\"attachments\":[{\"color\":\"$color\",\"text\":\"$message\"}]}" \
      > /dev/null 2>&1 || true
  fi
}

# Fetch health endpoint with 10s timeout
RESPONSE=$(curl -s -w "\n%{http_code}" --max-time 10 "$HEALTH_URL" 2>/dev/null)
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
BODY=$(echo "$RESPONSE" | sed '$d')

if [ "$HTTP_CODE" != "200" ]; then
  log "ERROR: Health check failed (HTTP $HTTP_CODE)"

  # Only alert if not already in alert state (prevent spam)
  if [ ! -f "$ALERT_STATE_FILE" ]; then
    notify_slack ":rotating_light: [ALERT] AIPrizeCohabitationLife is DOWN (HTTP $HTTP_CODE)"
    touch "$ALERT_STATE_FILE"
  fi
  exit 1
fi

# Parse the top-level status from the JSON body (tolerates whitespace after the colon)
STATUS=$(echo "$BODY" | grep -o '"status"[[:space:]]*:[[:space:]]*"[^"]*"' | head -1 | cut -d'"' -f4)

case "$STATUS" in
  healthy)
    # Clear alert state if previously alerting
    if [ -f "$ALERT_STATE_FILE" ]; then
      notify_slack ":white_check_mark: [RECOVERED] AIPrizeCohabitationLife is back online" "good"
      rm -f "$ALERT_STATE_FILE"
    fi
    ;;
  degraded)
    log "WARNING: Service is degraded"
    if [ ! -f "$ALERT_STATE_FILE" ]; then
      notify_slack ":warning: [WARNING] AIPrizeCohabitationLife is degraded — check /api/health for details" "warning"
      touch "$ALERT_STATE_FILE"
    fi
    ;;
  unhealthy)
    log "ERROR: Service is unhealthy"
    if [ ! -f "$ALERT_STATE_FILE" ]; then
      notify_slack ":x: [CRITICAL] AIPrizeCohabitationLife is unhealthy — immediate attention required"
      touch "$ALERT_STATE_FILE"
    fi
    ;;
esac

Crontab entry:

bash
# Health check every minute (cd first so the relative log path resolves inside the project)
* * * * * cd /path/to/AIPrizeCohabitationLife && ./scripts/health-check.sh >> ./logs/health-check.log 2>&1

4. Alert Notifications

4.1 Alert Levels

| Level | Condition | Channel | Frequency |
| --- | --- | --- | --- |
| CRITICAL | Process down, DB corrupted, health endpoint unreachable | Slack + Email | Immediate (deduplicated) |
| WARNING | LLM degraded, platform disconnected, high error rate, memory > 80% | Slack | Every 15 min (aggregated) |
| INFO | Backup success, recovery success, daily stats | Slack | Per event |

4.2 Slack Webhook Notification

typescript
// packages/backend/src/monitoring/AlertNotifier.ts

interface AlertConfig {
  slackWebhookUrl: string;
  emailTo?: string;           // optional: admin email
  deduplicationWindowMs: number; // suppress duplicate alerts (default: 900000 = 15 min)
}

// Alert message format examples:

// CRITICAL
// :rotating_light: [CRITICAL] AIPrizeCohabitationLife
// Process crashed and failed to auto-restart.
// Last error: "SQLITE_CORRUPT: database disk image is malformed"
// Time: 2026-03-11 15:30:00 JST

// WARNING
// :warning: [WARNING] AIPrizeCohabitationLife
// LLM API error rate exceeded 10% (current: 15.3%)
// Fallback responses are active.
// Time: 2026-03-11 15:30:00 JST

// INFO
// :white_check_mark: [INFO] AIPrizeCohabitationLife
// Daily backup completed successfully (6.3 MB)
// Time: 2026-03-11 00:00:05 JST

4.3 Email Notification (Optional)

Only CRITICAL alerts are also sent by email. On an inexpensive VPS, use mailutils + postfix or the SendGrid API.

bash
# Simple email alert via mailutils (for VPS)
echo "AIPrizeCohabitationLife is DOWN. Check immediately." | \
  mail -s "[CRITICAL] AIPrizeCohabitationLife DOWN" admin@example.com

# Or via SendGrid API (more reliable)
curl -s --request POST \
  --url https://api.sendgrid.com/v3/mail/send \
  --header "Authorization: Bearer $SENDGRID_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "personalizations":[{"to":[{"email":"admin@example.com"}]}],
    "from":{"email":"alerts@ailife.example.com"},
    "subject":"[CRITICAL] AIPrizeCohabitationLife DOWN",
    "content":[{"type":"text/plain","value":"Service is down. Check immediately."}]
  }'

4.4 Alert Suppression

Deduplication logic prevents alert spam.

┌─────────────────────────────────────────────────────┐
│              Alert Deduplication Logic                │
│                                                      │
│  1. New alert generated                              │
│  2. Check: same alert_type in last 15 min?           │
│     ├── Yes → Suppress (increment counter)           │
│     └── No  → Send alert                             │
│  3. Every 15 min: if suppressed count > 0            │
│     └── Send summary: "X alerts suppressed"          │
│  4. On recovery: always send recovery notification   │
└─────────────────────────────────────────────────────┘
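
The deduplication steps can be sketched as a small class with an injected clock, which keeps it testable. This is an illustration under stated assumptions, not the AlertNotifier implementation: `AlertDeduplicator`, `shouldSend`, and `drainSummary` are hypothetical names, and the caller is assumed to invoke `drainSummary` once per window to produce the "X alerts suppressed" summary.

```typescript
class AlertDeduplicator {
  private lastSentAt: Record<string, number> = {};
  private suppressed: Record<string, number> = {};
  private windowMs: number;

  constructor(windowMs: number = 15 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  // Returns true when the alert should be sent now, false when it is suppressed.
  shouldSend(alertType: string, nowMs: number, isRecovery: boolean = false): boolean {
    // Step 4: recovery notifications are always sent.
    if (isRecovery) return true;

    // Step 2: same alert type inside the window is suppressed and counted.
    const last = this.lastSentAt[alertType];
    if (last !== undefined && nowMs - last < this.windowMs) {
      this.suppressed[alertType] = (this.suppressed[alertType] || 0) + 1;
      return false;
    }
    this.lastSentAt[alertType] = nowMs;
    return true;
  }

  // Step 3: returns one summary line per suppressed alert type and resets the counters.
  drainSummary(): string[] {
    const lines: string[] = [];
    for (const alertType of Object.keys(this.suppressed)) {
      const count = this.suppressed[alertType];
      if (count > 0) {
        lines.push(`${count} "${alertType}" alerts suppressed`);
      }
    }
    this.suppressed = {};
    return lines;
  }
}
```

Passing `nowMs` explicitly, instead of reading the system clock inside the class, lets the 15-minute window be exercised in tests without waiting.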

5. Stream Recovery

5.1 Auto-Recovery from Server Down

Server Down


┌────────────────────────────────────────────────────────────────┐
│ Phase 1: Process Recovery (0-10 seconds)                       │
│                                                                 │
│  pm2 detects crash → restart_delay: 5s → restart process        │
│  ・exp_backoff_restart_delay: 100ms base (exponential)          │
│  ・max_restarts: 10                                              │
│  ・Game state loaded from SQLite automatically                   │
└──────────────────────────────────┬─────────────────────────────┘


┌────────────────────────────────────────────────────────────────┐
│ Phase 2: WebSocket Reconnection (10-30 seconds)                │
│                                                                 │
│  ・Frontend (PixiJS) detects WS disconnect                      │
│  ・Auto-reconnect with exponential backoff                      │
│  ・On reconnect: full state sync from backend                   │
│  ・During disconnect: show "Reconnecting..." overlay            │
└──────────────────────────────────┬─────────────────────────────┘


┌────────────────────────────────────────────────────────────────┐
│ Phase 3: OBS Reconnection (30-60 seconds)                      │
│                                                                 │
│  ・OBS browser source auto-reloads on navigation                │
│  ・OBS "Automatically reconnect" setting handles stream          │
│  ・RTMP reconnection to YouTube/TikTok (automatic)              │
│  ・Viewers see brief freeze → stream resumes                     │
└──────────────────────────────────┬─────────────────────────────┘


┌────────────────────────────────────────────────────────────────┐
│ Phase 4: Platform Reconnection (30-120 seconds)                │
│                                                                 │
│  ・YouTube LiveChat polling resumes automatically                │
│  ・TikTok WebSocket reconnects with auto-reconnect logic         │
│  ・Tip processing queue drains any queued events                 │
│  ・Game cycle timer re-calibrates to absolute time               │
└────────────────────────────────────────────────────────────────┘
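
Phase 4 mentions that the game cycle timer re-calibrates to absolute time. The source does not describe the mechanism; one common approach, assumed here, is to align cycle boundaries to multiples of the cycle length since the Unix epoch, so that a restarted process lands on the same schedule as before the crash. `msUntilNextCycle` and `CYCLE_MS` are hypothetical names.

```typescript
const CYCLE_MS = 10 * 60 * 1000; // 10-minute game cycle

// Milliseconds from `nowMs` until the next cycle boundary, where boundaries
// are assumed to fall on multiples of `cycleMs` since the Unix epoch.
// Exactly on a boundary, the result is one full cycle.
function msUntilNextCycle(nowMs: number, cycleMs: number = CYCLE_MS): number {
  return cycleMs - (nowMs % cycleMs);
}
```

After a restart, scheduling the first tick with `msUntilNextCycle(Date.now())` and every later tick the same way avoids accumulating drift, because each delay is computed from the wall clock instead of from the previous tick.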

5.2 OBS Configuration

Recommended OBS settings for automatic recovery.

| Setting | Location | Value |
| --- | --- | --- |
| Auto-reconnect | Settings > Advanced > Automatically Reconnect | Enabled |
| Retry delay | Settings > Advanced > Retry Delay | 10 seconds |
| Max retries | Settings > Advanced > Maximum Retries | 100 (effectively infinite) |
| Browser source refresh | Source properties > Refresh browser when scene becomes active | Enabled |
| Custom CSS (optional) | Browser source > Custom CSS | Hide reconnecting overlay from stream |

5.3 Frontend Reconnection Logic

typescript
// packages/frontend/src/network/WebSocketClient.ts

// Reconnection strategy
const RECONNECT_CONFIG = {
  initialDelay: 1000,      // 1 second
  maxDelay: 30000,          // 30 seconds
  backoffMultiplier: 2,     // exponential
  maxRetries: Infinity,     // never give up
};

// Connection states displayed to viewers:
// ・Connected     → normal game display
// ・Reconnecting  → game frozen + subtle "Reconnecting..." text
// ・Disconnected  → "Stream will resume shortly" message
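
The delay before each reconnection attempt follows from the configuration values: the initial delay is multiplied by the backoff multiplier on every attempt and capped at the maximum delay. The sketch below shows that calculation only; `reconnectDelayMs`, `ReconnectConfig`, and `EXAMPLE_RECONNECT_CONFIG` are hypothetical names, and attempts are assumed to be counted from zero.

```typescript
interface ReconnectConfig {
  initialDelay: number;      // ms
  maxDelay: number;          // ms
  backoffMultiplier: number;
}

// Same values as the reconnection strategy above, repeated so the sketch is self-contained.
const EXAMPLE_RECONNECT_CONFIG: ReconnectConfig = {
  initialDelay: 1000,
  maxDelay: 30000,
  backoffMultiplier: 2,
};

// Delay before attempt number `attempt` (0-based): initial * multiplier^attempt, capped.
function reconnectDelayMs(attempt: number, cfg: ReconnectConfig = EXAMPLE_RECONNECT_CONFIG): number {
  const uncapped = cfg.initialDelay * Math.pow(cfg.backoffMultiplier, attempt);
  return Math.min(uncapped, cfg.maxDelay);
}
```

With these values the delays are 1 s, 2 s, 4 s, 8 s, 16 s, and then 30 s for every later attempt. Adding random jitter is a common refinement when many clients reconnect at once, though the source does not mention it.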

6. Metrics Dashboard

6.1 Design Approach

No external tools such as Grafana are introduced. Instead, the backend serves a simple HTML dashboard for administrators.

┌──────────────────────────────────────────────────────────────┐
│  Admin Dashboard (http://localhost:3000/admin/dashboard)      │
│  Protected by ADMIN_API_KEY                                   │
│                                                               │
│  ┌──────────────────────────────────────────────────────┐    │
│  │  System Status:  ● HEALTHY          Uptime: 3d 14h   │    │
│  └──────────────────────────────────────────────────────┘    │
│                                                               │
│  ┌─────────────────────┐  ┌─────────────────────────────┐   │
│  │  Process             │  │  LLM API                     │   │
│  │  Memory: 145 MB      │  │  Avg Response: 1.2s          │   │
│  │  CPU:    8.2%        │  │  Error Rate:   0.5%          │   │
│  │  Restarts: 0         │  │  Timeouts:     2 (last 1h)   │   │
│  │  Uptime: 3d 14h 22m  │  │  Fallbacks:    0             │   │
│  └─────────────────────┘  └─────────────────────────────┘   │
│                                                               │
│  ┌─────────────────────┐  ┌─────────────────────────────┐   │
│  │  Tips (last 1h)      │  │  Game State                  │   │
│  │  Processed: 12       │  │  Day: 15                     │   │
│  │  Failed:    0        │  │  Cycle: 144                  │   │
│  │  Queue:     0        │  │  Last Cycle: 2 min ago       │   │
│  │  Revenue:   ¥3,500   │  │  John: cooking (mood: 72)    │   │
│  └─────────────────────┘  │  Sara: reading (mood: 85)    │   │
│                            │  Eve:  sleeping              │   │
│  ┌─────────────────────┐  └─────────────────────────────┘   │
│  │  Platforms           │                                    │
│  │  YouTube:  ● OK      │  ┌─────────────────────────────┐   │
│  │  TikTok:   ● OK      │  │  Backups                     │   │
│  │  OBS/WS:   ● OK      │  │  Last: 30 min ago (OK)       │   │
│  │  Viewers:   42       │  │  DB Size: 6.3 MB             │   │
│  └─────────────────────┘  │  Disk Free: 42 GB            │   │
│                            └─────────────────────────────┘   │
│                                                               │
│  ┌──────────────────────────────────────────────────────┐    │
│  │  Recent Alerts                                        │    │
│  │  [INFO]  03-11 00:00  Daily backup OK (6.3 MB)        │    │
│  │  [WARN]  03-10 18:22  LLM timeout (retry succeeded)  │    │
│  │  [INFO]  03-10 15:00  Hourly backup OK (6.2 MB)       │    │
│  └──────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────┘

6.2 Dashboard Implementation

typescript
// packages/backend/src/admin/dashboard.ts

// Serve static HTML dashboard
// GET /admin/dashboard — requires ADMIN_API_KEY query param or cookie

// Dashboard uses:
// ・Vanilla HTML + CSS (no framework needed)
// ・Polls /api/health every 10 seconds
// ・Polls /api/metrics every 30 seconds
// ・Auto-refreshes status indicators
// ・Color coding: green (ok), yellow (warning), red (error)

6.3 Metrics API

typescript
// GET /api/metrics
// Returns detailed metrics for the dashboard

interface MetricsResponse {
  // Time-series data (last 24 hours, 10-min intervals)
  llm_response_times: { time: string; avg_ms: number; p95_ms: number }[];
  tip_counts: { time: string; count: number; failed: number }[];
  memory_usage: { time: string; mb: number }[];
  cycle_durations: { time: string; ms: number }[];

  // Aggregated stats
  totals: {
    tips_today: number;
    tips_total: number;
    llm_calls_today: number;
    errors_today: number;
    uptime_seconds: number;
    game_days_elapsed: number;
  };

  // Recent events (last 50)
  recent_alerts: {
    level: 'critical' | 'warning' | 'info';
    message: string;
    timestamp: string;
  }[];
}
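
The time-series fields are described as 10-minute intervals over the last 24 hours. A minimal sketch of how raw samples might be grouped into such buckets follows. It is an assumption about the aggregation, not the documented implementation: `Sample`, `ResponseTimeBucket`, and `bucketResponseTimes` are hypothetical names, buckets are aligned to multiples of the interval since the Unix epoch, and p95 uses the nearest-rank method.

```typescript
interface Sample {
  timestampMs: number; // Unix epoch milliseconds
  value: number;       // e.g. LLM response time in ms
}

interface ResponseTimeBucket {
  time: string;   // ISO 8601 start of the bucket
  avg_ms: number;
  p95_ms: number;
}

// Groups samples into fixed-size buckets and computes avg and p95 per bucket.
function bucketResponseTimes(
  samples: Sample[],
  bucketMs: number = 10 * 60 * 1000
): ResponseTimeBucket[] {
  const groups: Record<string, number[]> = {};
  for (const sample of samples) {
    const start = Math.floor(sample.timestampMs / bucketMs) * bucketMs;
    const key = String(start);
    groups[key] = groups[key] || [];
    groups[key].push(sample.value);
  }

  return Object.keys(groups)
    .map((key) => Number(key))
    .sort((a, b) => a - b)
    .map((start) => {
      const values = groups[String(start)].slice().sort((a, b) => a - b);
      const sum = values.reduce((a, b) => a + b, 0);
      const rank = Math.max(Math.ceil((95 * values.length) / 100), 1);
      return {
        time: new Date(start).toISOString(),
        avg_ms: sum / values.length,
        p95_ms: values[rank - 1],
      };
    });
}
```

Intervals with no samples are omitted from the result in this sketch; a dashboard that needs a continuous axis would have to fill those gaps itself.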

7. Operations Checklist

7.1 Pre-Deployment

  • [ ] Verify that max_memory_restart in ecosystem.config.js fits the server's RAM
  • [ ] SLACK_WEBHOOK_URL environment variable is set
  • [ ] Crontab has backup and health check entries
  • [ ] ./backups/ directory exists
  • [ ] ./logs/ directory exists
  • [ ] OBS auto-reconnect is enabled
  • [ ] sqlite3 command is available
  • [ ] Backup scripts have execute permission

7.2 Daily Checks

  • [ ] Verify normal operation via pm2 status
  • [ ] curl http://localhost:3000/api/health returns healthy
  • [ ] Check backup logs for errors
  • [ ] Check Slack for abnormal alerts
  • [ ] Check free disk space (df -h)

7.3 Weekly Maintenance

  • [ ] Clear pm2 logs (pm2 flush)
  • [ ] Verify that old backups are purged automatically
  • [ ] Run SQLite VACUUM if needed
  • [ ] Verify memory cleanup (entries with importance < 0.5 are deleted)

8. Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DB_PATH | No | ./data/game.sqlite | SQLite database file path |
| BACKUP_DIR | No | ./backups | Backup directory |
| SLACK_WEBHOOK_URL | Yes | | Slack incoming webhook URL |
| SENDGRID_API_KEY | No | | SendGrid API key for email alerts |
| ADMIN_API_KEY | Yes | | Admin dashboard authentication key |
| HEALTH_URL | No | http://localhost:3000/api/health | Health check endpoint |

9. Troubleshooting

Common Issues

| Problem | Symptom | Solution |
| --- | --- | --- |
| pm2 won't restart | pm2 status shows errored | Check pm2 logs ailife. Fix root cause. pm2 delete ailife && pm2 start ecosystem.config.js |
| DB locked | SQLITE_BUSY errors in logs | Ensure only one process accesses DB. Check for zombie processes: ps aux \| grep node |
| Backup fails | sqlite3: command not found | Install sqlite3: apt install sqlite3 (Ubuntu) or brew install sqlite3 (macOS) |
| OBS won't reconnect | Stream shows black screen | Verify OBS auto-reconnect is ON. Manually refresh browser source. Check RTMP endpoint |
| Slack alerts not sent | No messages in Slack channel | Verify SLACK_WEBHOOK_URL. Test: curl -X POST "$SLACK_WEBHOOK_URL" -d '{"text":"test"}' |
| High memory | pm2 keeps restarting (memory limit) | Check for memory leaks. Increase max_memory_restart. Run SQLite VACUUM |
| Stale game cycles | last_cycle_at is old | Check if TurnScheduler is running. Look for blocking LLM calls in logs |
| TikTok disconnects | Platform status shows disconnected | TikTok WebSocket is unstable by nature. Auto-reconnect handles this. Check network |