バックアップ・リカバリ・監視設計 / Backup, Recovery & Monitoring Design
24時間AI同棲生活配信を安定運用するための、バックアップ・リカバリ・監視の設計。データ保全、障害復旧、稼働監視、アラート通知を網羅する。
Backup, recovery, and monitoring design for stable 24/7 AI cohabitation life streaming. Covers data protection, disaster recovery, uptime monitoring, and alert notifications.
設計原則 / Design Principles
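- データを守る / Protect data: game state and memory data are preserved through regular backups.
- 最小ダウンタイム / Minimize downtime: recovery from a failure is automated so that manual intervention stays minimal.
- 異常を即座に検知 / Detect anomalies immediately: metrics monitoring and alerts surface problems early.
- シンプルに保つ / Keep it simple: a lightweight design that can run on a local PC or an inexpensive VPS.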
1. SQLite バックアップ / SQLite Backup
1.1 バックアップ戦略 / Backup Strategy
SQLite は単一ファイルのため、ファイルコピーでバックアップ可能。ただし書き込み中のコピーはデータ破損リスクがあるため、SQLite Online Backup API または WAL チェックポイント後にコピーする。
SQLite is a single file, so backup is a simple file copy. However, copying during active writes risks corruption, so we use the SQLite Online Backup API or copy after WAL checkpoint.
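A minimal sketch of how the copy-safe sequence (checkpoint, then `.backup`, then integrity check) could be assembled from the backend, assuming the `sqlite3` CLI is installed. The helper names `backupFilePath` and `planBackup` and the choice of UTC timestamps are illustrative assumptions, not part of the project.

```typescript
// Hypothetical helpers: they only build the commands, they do not run them.
type BackupType = "hourly" | "daily" | "weekly";

interface CliStep {
  command: string;
  args: string[];
}

/** Format a Date as YYYYMMDD_HHMMSS (UTC is an assumption; local time works too). */
function formatTimestamp(d: Date): string {
  const p = (n: number): string => String(n).padStart(2, "0");
  return (
    `${d.getUTCFullYear()}${p(d.getUTCMonth() + 1)}${p(d.getUTCDate())}` +
    `_${p(d.getUTCHours())}${p(d.getUTCMinutes())}${p(d.getUTCSeconds())}`
  );
}

/** Build a timestamped backup path such as ./backups/hourly/game_hourly_<ts>.sqlite. */
function backupFilePath(backupDir: string, type: BackupType, now: Date): string {
  return `${backupDir}/${type}/game_${type}_${formatTimestamp(now)}.sqlite`;
}

/** The three sqlite3 invocations, in order. Paths must not contain single quotes. */
function planBackup(dbPath: string, backupFile: string): CliStep[] {
  return [
    { command: "sqlite3", args: [dbPath, "PRAGMA wal_checkpoint(TRUNCATE);"] },
    { command: "sqlite3", args: [dbPath, `.backup '${backupFile}'`] },
    { command: "sqlite3", args: [backupFile, "PRAGMA integrity_check;"] },
  ];
}
```

Each step could then be run in order, for example with Node's `child_process.execFileSync(step.command, step.args)`, stopping at the first failure and treating any integrity output other than `ok` as a failed backup.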
バックアップフロー / Backup Flow
┌──────────────────────────────────────────────────────┐
│  Cron (every 3 hours)                                │
│                                                      │
│  1. WAL checkpoint (PRAGMA wal_checkpoint(TRUNCATE)) │
│  2. sqlite3 .backup → timestamped copy               │
│  3. Verify backup integrity (PRAGMA integrity_check) │
│  4. Rotate old backups (keep N generations)          │
│  5. Log result + notify on failure                   │
└──────────────────────────────────────────────────────┘

1.2 Cron スケジュール / Cron Schedule
| Schedule | Frequency | Purpose |
|---|---|---|
| `0 */3 * * *` | Every 3 hours | Regular backup during streaming |
| `0 0 * * *` | Daily at midnight | Daily full backup with integrity check |
| `0 0 * * 0` | Weekly (Sunday midnight) | Weekly archive backup (compressed) |
Why every 3 hours?
10分サイクル × 18回 = 3時間分のゲーム進行データ。最悪でも3時間分のロストで済む。
10-min cycles × 18 = 3 hours of game progress. Worst case, only 3 hours of data is lost.
1.3 世代管理 / Retention Generations
| Type | Retention | Max Files | Storage Estimate |
|---|---|---|---|
| 3-hourly | 24 hours | 8 files | ~80 MB (10 MB × 8) |
| Daily | 7 days | 7 files | ~70 MB |
| Weekly | 4 weeks | 4 files (gzip) | ~12 MB |
| Total | — | 19 files | ~162 MB |
Storage Note
SQLite DBは通常10MB以下(メモリデータ含む)。圧縮で約30%に縮小可能。
SQLite DB is typically under 10 MB (including memory data). Compression reduces it to ~30%.
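The storage estimate in the retention table can be reproduced with a short calculation. This sketch assumes the 10 MB upper bound and the ~30% gzip ratio stated in the note; the type and function names are hypothetical.

```typescript
interface RetentionTier {
  name: string;
  files: number;       // generations kept
  compressed: boolean; // weekly archives are gzipped
}

const DB_SIZE_MB = 10;         // upper bound from the storage note
const COMPRESSION_RATIO = 0.3; // gzip shrinks the file to roughly 30%

const TIERS: RetentionTier[] = [
  { name: "3-hourly", files: 8, compressed: false },
  { name: "daily", files: 7, compressed: false },
  { name: "weekly", files: 4, compressed: true },
];

/** Estimated size of one tier in MB, rounded to the nearest MB. */
function tierSizeMb(tier: RetentionTier): number {
  const perFile = tier.compressed ? DB_SIZE_MB * COMPRESSION_RATIO : DB_SIZE_MB;
  return Math.round(perFile * tier.files);
}

/** Total file count and storage across all tiers. */
function storageTotals(tiers: RetentionTier[]): { files: number; mb: number } {
  return tiers.reduce(
    (acc, t) => ({ files: acc.files + t.files, mb: acc.mb + tierSizeMb(t) }),
    { files: 0, mb: 0 },
  );
}
```

With the values above, `storageTotals(TIERS)` gives 19 files and 162 MB, matching the table. The figure is a ceiling, since the database is usually smaller than 10 MB.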
1.4 バックアップスクリプト / Backup Script
#!/bin/bash
# scripts/backup-db.sh
# SQLite backup script for AIPrizeCohabitationLife
# Usage: ./scripts/backup-db.sh [hourly|daily|weekly]
set -euo pipefail
# --- Configuration ---
DB_PATH="${DB_PATH:-./data/game.sqlite}"
BACKUP_DIR="${BACKUP_DIR:-./backups}"
BACKUP_TYPE="${1:-hourly}"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:-}"
# Retention settings
HOURLY_KEEP=8
DAILY_KEEP=7
WEEKLY_KEEP=4
# --- Functions ---
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] [backup] $1"
}
notify_slack() {
local message="$1"
local color="${2:-danger}"
if [ -n "$SLACK_WEBHOOK_URL" ]; then
curl -s -X POST "$SLACK_WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "{\"attachments\":[{\"color\":\"$color\",\"text\":\"$message\"}]}" \
> /dev/null 2>&1 || true
fi
}
# --- Pre-checks ---
if [ ! -f "$DB_PATH" ]; then
log "ERROR: Database not found at $DB_PATH"
notify_slack ":x: [Backup FAILED] Database not found: $DB_PATH"
exit 1
fi
mkdir -p "$BACKUP_DIR/hourly" "$BACKUP_DIR/daily" "$BACKUP_DIR/weekly"
# --- Determine target directory ---
case "$BACKUP_TYPE" in
hourly) TARGET_DIR="$BACKUP_DIR/hourly"; KEEP=$HOURLY_KEEP ;;
daily) TARGET_DIR="$BACKUP_DIR/daily"; KEEP=$DAILY_KEEP ;;
weekly) TARGET_DIR="$BACKUP_DIR/weekly"; KEEP=$WEEKLY_KEEP ;;
*)
log "ERROR: Unknown backup type: $BACKUP_TYPE (use hourly|daily|weekly)"
exit 1
;;
esac
BACKUP_FILE="$TARGET_DIR/game_${BACKUP_TYPE}_${TIMESTAMP}.sqlite"
# --- Step 1: WAL Checkpoint ---
log "Running WAL checkpoint..."
sqlite3 "$DB_PATH" "PRAGMA wal_checkpoint(TRUNCATE);" 2>/dev/null || {
log "WARNING: WAL checkpoint failed, proceeding with backup anyway"
}
# --- Step 2: Create backup using SQLite .backup command ---
log "Creating $BACKUP_TYPE backup..."
sqlite3 "$DB_PATH" ".backup '$BACKUP_FILE'" || {
log "ERROR: Backup failed!"
notify_slack ":x: [Backup FAILED] sqlite3 .backup failed for $BACKUP_TYPE"
exit 1
}
# --- Step 3: Verify backup integrity ---
log "Verifying backup integrity..."
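# "|| true" keeps set -e from aborting here, so a failed check still reaches the notification below
INTEGRITY=$(sqlite3 "$BACKUP_FILE" "PRAGMA integrity_check;" 2>/dev/null || true)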
if [ "$INTEGRITY" != "ok" ]; then
log "ERROR: Backup integrity check failed: $INTEGRITY"
notify_slack ":x: [Backup FAILED] Integrity check failed for $BACKUP_TYPE: $INTEGRITY"
rm -f "$BACKUP_FILE"
exit 1
fi
# --- Step 4: Compress weekly backups ---
if [ "$BACKUP_TYPE" = "weekly" ]; then
log "Compressing weekly backup..."
gzip "$BACKUP_FILE"
BACKUP_FILE="${BACKUP_FILE}.gz"
fi
# --- Step 5: Rotate old backups ---
log "Rotating old backups (keeping last $KEEP)..."
ls -1t "$TARGET_DIR"/game_${BACKUP_TYPE}_* 2>/dev/null | tail -n +$((KEEP + 1)) | xargs -r rm -f
# --- Done ---
BACKUP_SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
log "Backup complete: $BACKUP_FILE ($BACKUP_SIZE)"
notify_slack ":white_check_mark: [Backup OK] $BACKUP_TYPE backup complete ($BACKUP_SIZE)" "good"

1.5 Crontab 設定 / Crontab Configuration
# crontab -e
# AIPrizeCohabitationLife SQLite Backups
# Every 3 hours - regular backup
0 */3 * * * cd /path/to/AIPrizeCohabitationLife && ./scripts/backup-db.sh hourly >> ./logs/backup.log 2>&1
# Daily at midnight - full backup with integrity check
0 0 * * * cd /path/to/AIPrizeCohabitationLife && ./scripts/backup-db.sh daily >> ./logs/backup.log 2>&1
# Weekly Sunday midnight - compressed archive
0 0 * * 0 cd /path/to/AIPrizeCohabitationLife && ./scripts/backup-db.sh weekly >> ./logs/backup.log 2>&1

2. リカバリ手順 / Recovery Procedure
2.1 リカバリ判断フロー / Recovery Decision Flow
障害検知 / Fault Detected
            │
            ▼
┌────────────────────────────┐
│ pm2 が自動再起動を試行     │
│ pm2 attempts auto-restart  │
└───────────┬────────────────┘
            │
      ┌─────┴──────┐
      │ 起動成功? │
      │ Start OK?  │
      └─────┬──────┘
       Yes  │  No
     ┌──────┴───────────────┐
     ▼                      ▼
┌──────────┐  ┌──────────────────────────┐
│ 正常復帰 │  │ DB破損チェック           │
│ Resumed  │  │ Check DB corruption      │
└──────────┘  └─────────────┬────────────┘
                        ┌───┴────┐
                   OK   │ 破損? │  Corrupted
                   ┌────┤Corrupt?├────┐
                   │    └────────┘    │
                   ▼                  ▼
             ┌────────────┐ ┌──────────────────────┐
             │ 他の原因を │ │ バックアップから復元 │
             │ 調査       │ │ Restore from backup  │
             │ Check logs │ └──────────────────────┘
             └────────────┘

2.2 リカバリ手順 / Recovery Steps
Case 1: プロセスクラッシュ / Process Crash
pm2 が自動再起動する。ゲーム状態は SQLite に永続化されているため、最後の保存ポイントから自動復帰する。
pm2 auto-restarts the process. Game state is persisted in SQLite, so the game resumes from the last save point automatically.
# 1. 状態確認 / Check status
pm2 status
# 2. 自動再起動されていない場合 / If not auto-restarted
pm2 restart ailife
# 3. ログ確認 / Check logs
pm2 logs ailife --lines 50

Case 2: SQLite データ破損 / SQLite Data Corruption
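# 1. プロセスを停止し、現在のDB(WAL/SHMファイル含む)を退避 / Stop the process and move the corrupted DB aside
#    Stale -wal/-shm files left next to a restored DB can corrupt it, so move them too
pm2 stop ailife
STAMP=$(date +%Y%m%d_%H%M%S)
for f in ./data/game.sqlite ./data/game.sqlite-wal ./data/game.sqlite-shm; do
  if [ -f "$f" ]; then mv "$f" "$f.corrupted_$STAMP"; fi
done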
# 2. 最新の正常バックアップを探す / Find latest good backup
ls -lt ./backups/hourly/
# 3. 整合性チェック / Integrity check on candidate
sqlite3 ./backups/hourly/game_hourly_XXXXXXXX_XXXXXX.sqlite "PRAGMA integrity_check;"
# 4. バックアップから復元 / Restore from backup
cp ./backups/hourly/game_hourly_XXXXXXXX_XXXXXX.sqlite ./data/game.sqlite
# 5. プロセス再起動 / Restart process
pm2 restart ailife
# 6. 動作確認 / Verify operation
curl -s http://localhost:3000/api/health | jq .

Case 3: 完全リカバリ(全データ消失) / Full Recovery (Total Data Loss)
# 1. 週次バックアップから復元 / Restore from weekly backup
gunzip -k ./backups/weekly/game_weekly_XXXXXXXX_XXXXXX.sqlite.gz
cp ./backups/weekly/game_weekly_XXXXXXXX_XXXXXX.sqlite ./data/game.sqlite
# 2. 整合性チェック / Verify integrity
sqlite3 ./data/game.sqlite "PRAGMA integrity_check;"
# 3. 全サービス再起動 / Restart all services
pm2 restart all
# 4. OBS が自動再接続されるか確認 / Verify OBS reconnects
# OBS > Settings > Advanced > Automatically reconnect: ON

2.3 リカバリスクリプト / Recovery Script
#!/bin/bash
# scripts/restore-db.sh
# Restore game database from backup
# Usage: ./scripts/restore-db.sh [backup_file_path]
set -euo pipefail
DB_PATH="${DB_PATH:-./data/game.sqlite}"
# Default to empty so that set -u does not abort before the usage message is shown
BACKUP_FILE="${1:-}"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] [restore] $1"
}
if [ -z "$BACKUP_FILE" ]; then
echo "Usage: $0 <backup_file_path>"
echo ""
echo "Available backups:"
echo "--- Hourly ---"
ls -lt ./backups/hourly/ 2>/dev/null | head -5
echo "--- Daily ---"
ls -lt ./backups/daily/ 2>/dev/null | head -5
echo "--- Weekly ---"
ls -lt ./backups/weekly/ 2>/dev/null | head -5
exit 1
fi
if [ ! -f "$BACKUP_FILE" ]; then
log "ERROR: Backup file not found: $BACKUP_FILE"
exit 1
fi
# Handle gzipped backups
RESTORE_FILE="$BACKUP_FILE"
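if [[ "$BACKUP_FILE" == *.gz ]]; then
  log "Decompressing backup..."
  RESTORE_FILE="${BACKUP_FILE%.gz}"
  # -c writes to stdout: the archive is kept and an existing extracted copy is overwritten
  gunzip -c "$BACKUP_FILE" > "$RESTORE_FILE"
fi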
# Verify backup integrity
log "Verifying backup integrity..."
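# "|| true" keeps set -e from aborting here, so a failed check is still logged below
INTEGRITY=$(sqlite3 "$RESTORE_FILE" "PRAGMA integrity_check;" 2>/dev/null || true)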
if [ "$INTEGRITY" != "ok" ]; then
log "ERROR: Backup integrity check failed: $INTEGRITY"
exit 1
fi
# Stop the application
log "Stopping application..."
pm2 stop ailife 2>/dev/null || true
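# Archive the current DB together with any WAL/SHM files,
# so that stale ones are never paired with the restored DB
STAMP=$(date +%Y%m%d_%H%M%S)
for f in "$DB_PATH" "$DB_PATH-wal" "$DB_PATH-shm"; do
  if [ -f "$f" ]; then
    log "Archiving $f to $f.pre_restore_$STAMP"
    mv "$f" "$f.pre_restore_$STAMP"
  fi
done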
# Restore
log "Restoring from $RESTORE_FILE..."
cp "$RESTORE_FILE" "$DB_PATH"
# Restart application
log "Restarting application..."
pm2 restart ailife
log "Restore complete. Verify with: curl http://localhost:3000/api/health"

3. ヘルスモニタリング / Health Monitoring
3.1 監視対象 / Monitoring Targets
┌──────────────────────────────────────────────────────────────┐
│ Monitoring Targets │
│ │
│ ┌───────────────────┐ ┌───────────────────┐ │
│ │ Process Liveness │ │ LLM Performance │ │
│ │ ・pm2 status │ │ ・Response time │ │
│ │ ・Memory usage │ │ ・Error rate │ │
│ │ ・CPU usage │ │ ・Timeout count │ │
│ │ ・Restart count │ │ ・Fallback rate │ │
│ └───────────────────┘ └───────────────────┘ │
│ │
│ ┌───────────────────┐ ┌───────────────────┐ │
│ │ Tip Processing │ │ System Resources │ │
│ │ ・Tips/hour count │ │ ・Disk space │ │
│ │ ・Processing time │ │ ・SQLite file size │ │
│ │ ・Queue depth │ │ ・Log file size │ │
│ │ ・Failed tips │ │ ・Network latency │ │
│ └───────────────────┘ └───────────────────┘ │
│ │
│ ┌───────────────────┐ ┌───────────────────┐ │
│ │ Platform Status │ │ Game State │ │
│ │ ・YouTube API OK │ │ ・Cycle count │ │
│ │ ・TikTok WS alive │ │ ・Last cycle time │ │
│ │ ・OBS connected │ │ ・Character status │ │
│ └───────────────────┘ └───────────────────┘ │
└──────────────────────────────────────────────────────────────┘

3.2 ヘルスチェックエンドポイント / Health Check Endpoint
バックエンドサーバーに /api/health エンドポイントを設置し、各コンポーネントの状態を返す。
The backend server exposes a /api/health endpoint that returns the status of each component.
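A minimal sketch of how the top-level `status` could be derived from the per-component checks. The aggregation rule here is an assumption (an `error` in `process` or `database` makes the service `unhealthy`, any other non-`ok` component makes it `degraded`), and the function name `overallStatus` is hypothetical.

```typescript
type OverallStatus = "healthy" | "degraded" | "unhealthy";
type ComponentStatus =
  | "ok" | "warning" | "error" | "degraded" | "down" | "disconnected" | "stale";

// Assumption: only these components can take the whole service to "unhealthy".
const CRITICAL_COMPONENTS = ["process", "database"];

/** Collapse per-component statuses into one overall status. */
function overallStatus(checks: Record<string, ComponentStatus>): OverallStatus {
  let result: OverallStatus = "healthy";
  for (const name of Object.keys(checks)) {
    const status = checks[name];
    if (status === undefined || status === "ok") continue;
    if (CRITICAL_COMPONENTS.indexOf(name) !== -1 && status === "error") {
      return "unhealthy"; // a broken process or database cannot be worked around
    }
    result = "degraded"; // everything else is degraded but still serving
  }
  return result;
}
```

Keeping the rule in one pure function makes it easy to test and to tighten later, for example by treating an LLM status of `down` as `unhealthy` if fallback responses prove insufficient.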
// GET /api/health
// Response example:
interface HealthResponse {
status: 'healthy' | 'degraded' | 'unhealthy';
timestamp: string;
uptime_seconds: number;
checks: {
process: {
status: 'ok' | 'warning' | 'error';
memory_mb: number;
cpu_percent: number;
restart_count: number;
};
database: {
status: 'ok' | 'error';
size_mb: number;
last_write: string;
integrity: 'ok' | 'unknown';
};
llm: {
status: 'ok' | 'degraded' | 'down';
avg_response_ms: number;
error_rate_percent: number;
last_success: string;
fallback_active: boolean;
};
tips: {
status: 'ok' | 'warning';
processed_last_hour: number;
failed_last_hour: number;
queue_depth: number;
};
platforms: {
youtube: { status: 'ok' | 'disconnected'; last_poll: string };
tiktok: { status: 'ok' | 'disconnected'; last_event: string };
};
game: {
status: 'ok' | 'stale';
current_cycle: number;
last_cycle_at: string;
game_day: number;
};
};
}

{
"status": "healthy",
"timestamp": "2026-03-11T15:30:00.000Z",
"uptime_seconds": 86400,
"checks": {
"process": {
"status": "ok",
"memory_mb": 145,
"cpu_percent": 8.2,
"restart_count": 0
},
"database": {
"status": "ok",
"size_mb": 6.3,
"last_write": "2026-03-11T15:29:50.000Z",
"integrity": "ok"
},
"llm": {
"status": "ok",
"avg_response_ms": 1200,
"error_rate_percent": 0.5,
"last_success": "2026-03-11T15:29:55.000Z",
"fallback_active": false
},
"tips": {
"status": "ok",
"processed_last_hour": 12,
"failed_last_hour": 0,
"queue_depth": 0
},
"platforms": {
"youtube": { "status": "ok", "last_poll": "2026-03-11T15:29:58.000Z" },
"tiktok": { "status": "ok", "last_event": "2026-03-11T15:28:30.000Z" }
},
"game": {
"status": "ok",
"current_cycle": 144,
"last_cycle_at": "2026-03-11T15:20:00.000Z",
"game_day": 15
}
}
}

3.3 メトリクス収集 / Metrics Collection
アプリケーション内でメトリクスを収集し、定期的にログ出力とヘルスチェックAPIに反映する。
Metrics are collected within the application and periodically output to logs and reflected in the health check API.
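One way the sliding-window metrics in this section (for example, the last 100 LLM response times) could be kept in memory. The class name `SlidingWindow` and the nearest-rank percentile method are assumptions chosen for illustration.

```typescript
/** Fixed-capacity window that drops the oldest sample when full. */
class SlidingWindow {
  private values: number[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(value: number): void {
    this.values.push(value);
    if (this.values.length > this.capacity) {
      this.values.shift(); // discard the oldest sample
    }
  }

  size(): number {
    return this.values.length;
  }

  /** Arithmetic mean of the window, or 0 when empty. */
  average(): number {
    if (this.values.length === 0) return 0;
    const sum = this.values.reduce((a, b) => a + b, 0);
    return sum / this.values.length;
  }

  /** Nearest-rank percentile (p in 1..100), or 0 when empty. */
  percentile(p: number): number {
    if (this.values.length === 0) return 0;
    const sorted = this.values.slice().sort((a, b) => a - b);
    const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
    return sorted[Math.min(rank, sorted.length) - 1] ?? 0;
  }
}
```

A collector could hold one window per metric, for example `new SlidingWindow(100)` for LLM response times, and read `average()` when building the health response.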
// packages/backend/src/monitoring/MetricsCollector.ts
interface Metrics {
// LLM metrics (sliding window: last 100 calls)
llm_response_times: number[]; // ms
llm_error_count: number; // errors in last hour
llm_timeout_count: number; // timeouts in last hour
llm_fallback_count: number; // fallbacks in last hour
// Tip processing metrics
tips_processed_total: number; // since startup
tips_processed_last_hour: number;
tips_failed_last_hour: number;
tip_processing_times: number[]; // ms, sliding window
// Game cycle metrics
cycles_completed: number; // since startup
last_cycle_duration_ms: number;
cycle_drift_ms: number; // deviation from 10-min target
// System metrics
process_memory_mb: number;
process_cpu_percent: number;
db_size_mb: number;
ws_connected_clients: number;
}

3.4 pm2 プロセス監視 / pm2 Process Monitoring
pm2 設定ファイルでプロセスの自動再起動とメモリ上限を管理する。
pm2 configuration manages auto-restart and memory limits.
// ecosystem.config.js
module.exports = {
apps: [{
name: 'ailife',
script: './packages/backend/dist/server.js',
instances: 1,
autorestart: true,
watch: false,
max_memory_restart: '512M',
restart_delay: 5000, // 5s delay between restarts
max_restarts: 10, // max 10 restarts in min_uptime window
min_uptime: '60s', // consider stable after 60s
exp_backoff_restart_delay: 100, // exponential backoff on repeated crashes
env: {
NODE_ENV: 'production',
},
// Log configuration
error_file: './logs/pm2-error.log',
out_file: './logs/pm2-out.log',
merge_logs: true,
log_date_format: 'YYYY-MM-DD HH:mm:ss Z',
}],
};

# pm2 の基本運用コマンド / Basic pm2 operations
# Start
pm2 start ecosystem.config.js
# Monitor (real-time dashboard)
pm2 monit
# Status check
pm2 status
# Restart
pm2 restart ailife
# View logs
pm2 logs ailife --lines 100
# Flush logs (to prevent log file growth)
pm2 flush

3.5 外部ヘルスチェック / External Health Check
外部からの定期ヘルスチェックスクリプト。Cron で毎分実行し、異常時にアラートを送信する。
External health check script. Runs every minute via cron and sends alerts on failures.
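The alert-once behaviour of the external check can be expressed as a small state-transition rule. This sketch is one possible rule, not the script's exact logic: it assumes a severity order of `healthy < degraded < unhealthy < down`, alerts only when the state gets worse, and reports recovery when the service returns to `healthy`. All names are hypothetical.

```typescript
type ServiceState = "healthy" | "degraded" | "unhealthy" | "down";
type AlertAction = "none" | "alert" | "recovered";

// Assumed severity order; "down" means the endpoint did not answer with HTTP 200.
const SEVERITY: Record<ServiceState, number> = {
  healthy: 0,
  degraded: 1,
  unhealthy: 2,
  down: 3,
};

interface AlertDecision {
  action: AlertAction;
  stored: ServiceState; // state to persist for the next run
}

/** Decide what to send given the previously stored state and the current one. */
function decideAlert(previous: ServiceState, current: ServiceState): AlertDecision {
  if (current === "healthy") {
    return { action: previous === "healthy" ? "none" : "recovered", stored: "healthy" };
  }
  if (SEVERITY[current] > SEVERITY[previous]) {
    return { action: "alert", stored: current }; // new problem or escalation
  }
  return { action: "none", stored: previous }; // same or milder problem: stay quiet
}
```

Storing the last alerted state, rather than a bare flag, is what lets an escalation from `degraded` to `unhealthy` produce a second alert while repeated checks in the same state stay silent.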
#!/bin/bash
# scripts/health-check.sh
# External health check — runs via cron every minute
HEALTH_URL="${HEALTH_URL:-http://localhost:3000/api/health}"
SLACK_WEBHOOK_URL="${SLACK_WEBHOOK_URL:-}"
ALERT_STATE_FILE="/tmp/ailife_alert_state"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] [health] $1"
}
notify_slack() {
local message="$1"
local color="${2:-danger}"
if [ -n "$SLACK_WEBHOOK_URL" ]; then
curl -s -X POST "$SLACK_WEBHOOK_URL" \
-H 'Content-Type: application/json' \
-d "{\"attachments\":[{\"color\":\"$color\",\"text\":\"$message\"}]}" \
> /dev/null 2>&1 || true
fi
}
# Fetch health endpoint with 10s timeout
RESPONSE=$(curl -s -w "\n%{http_code}" --max-time 10 "$HEALTH_URL" 2>/dev/null)
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
BODY=$(echo "$RESPONSE" | sed '$d')
# Last alerted state; alert only when the state worsens or recovers (prevents spam)
PREV_STATE=$(cat "$ALERT_STATE_FILE" 2>/dev/null || echo "healthy")

if [ "$HTTP_CODE" != "200" ]; then
  log "ERROR: Health check failed (HTTP $HTTP_CODE)"
  if [ "$PREV_STATE" != "down" ]; then
    notify_slack ":rotating_light: [ALERT] AIPrizeCohabitationLife is DOWN (HTTP $HTTP_CODE)"
    echo "down" > "$ALERT_STATE_FILE"
  fi
  exit 1
fi

# Parse the top-level status from JSON (tolerates whitespace after the colon)
STATUS=$(echo "$BODY" | grep -o '"status"[[:space:]]*:[[:space:]]*"[^"]*"' | head -1 | cut -d'"' -f4)

case "$STATUS" in
  healthy)
    # Clear alert state if previously alerting
    if [ "$PREV_STATE" != "healthy" ]; then
      notify_slack ":white_check_mark: [RECOVERED] AIPrizeCohabitationLife is back online" "good"
      rm -f "$ALERT_STATE_FILE"
    fi
    ;;
  degraded)
    log "WARNING: Service is degraded"
    if [ "$PREV_STATE" = "healthy" ]; then
      notify_slack ":warning: [WARNING] AIPrizeCohabitationLife is degraded: check /api/health for details" "warning"
      echo "degraded" > "$ALERT_STATE_FILE"
    fi
    ;;
  unhealthy)
    log "ERROR: Service is unhealthy"
    # Also alert when escalating from degraded
    if [ "$PREV_STATE" != "unhealthy" ] && [ "$PREV_STATE" != "down" ]; then
      notify_slack ":x: [CRITICAL] AIPrizeCohabitationLife is unhealthy: immediate attention required"
      echo "unhealthy" > "$ALERT_STATE_FILE"
    fi
    ;;
  *)
    log "WARNING: Unrecognized status: '$STATUS'"
    ;;
esac

Crontab エントリ / Crontab entry:
# Health check every minute
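# cd first: cron starts in the home directory, so the relative log path would otherwise point elsewhere
* * * * * cd /path/to/AIPrizeCohabitationLife && ./scripts/health-check.sh >> ./logs/health-check.log 2>&1

4. アラート通知 / Alert Notifications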
4.1 アラートレベル / Alert Levels
| Level | Condition | Channel | Frequency |
|---|---|---|---|
| CRITICAL | Process down, DB corrupted, health endpoint unreachable | Slack + Email | Immediate (deduplicated) |
| WARNING | LLM degraded, platform disconnected, high error rate, memory > 80% | Slack | Every 15 min (aggregated) |
| INFO | Backup success, recovery success, daily stats | Slack | Per event |
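The table above can be encoded as a small routing policy. This is a sketch under stated assumptions: the names `ALERT_POLICIES`, `channelsFor`, and `isImmediate` are hypothetical, and a `null` window is used here to mean "send as the event occurs".

```typescript
type AlertLevel = "CRITICAL" | "WARNING" | "INFO";
type Channel = "slack" | "email";

interface AlertPolicy {
  channels: Channel[];
  /** Aggregation window in ms, or null when each alert is sent as it occurs. */
  aggregateWindowMs: number | null;
}

const ALERT_POLICIES: Record<AlertLevel, AlertPolicy> = {
  CRITICAL: { channels: ["slack", "email"], aggregateWindowMs: null },
  WARNING: { channels: ["slack"], aggregateWindowMs: 15 * 60 * 1000 },
  INFO: { channels: ["slack"], aggregateWindowMs: null },
};

/** Channels an alert of the given level is delivered to. */
function channelsFor(level: AlertLevel): Channel[] {
  return ALERT_POLICIES[level].channels;
}

/** True when alerts of this level are sent right away instead of being batched. */
function isImmediate(level: AlertLevel): boolean {
  return ALERT_POLICIES[level].aggregateWindowMs === null;
}
```

Keeping the policy in one table-like constant means a change such as adding email for WARNING is a one-line edit rather than a change to the sending code.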
4.2 Slack Webhook 通知 / Slack Webhook Notification
// packages/backend/src/monitoring/AlertNotifier.ts
interface AlertConfig {
slackWebhookUrl: string;
emailTo?: string; // optional: admin email
deduplicationWindowMs: number; // suppress duplicate alerts (default: 900000 = 15 min)
}
// Alert message format examples:
// CRITICAL
// :rotating_light: [CRITICAL] AIPrizeCohabitationLife
// Process crashed and failed to auto-restart.
// Last error: "SQLITE_CORRUPT: database disk image is malformed"
// Time: 2026-03-11 15:30:00 JST
// WARNING
// :warning: [WARNING] AIPrizeCohabitationLife
// LLM API error rate exceeded 10% (current: 15.3%)
// Fallback responses are active.
// Time: 2026-03-11 15:30:00 JST
// INFO
// :white_check_mark: [INFO] AIPrizeCohabitationLife
// Daily backup completed successfully (6.3 MB)
// Time: 2026-03-11 00:00:05 JST

4.3 Email 通知(オプション) / Email Notification (Optional)
CRITICAL アラートのみメールでも通知する。安価VPSの場合は mailutils + postfix またはSendGrid APIを使用。
Email notification for CRITICAL alerts only. On a cheap VPS, use mailutils + postfix or the SendGrid API.
# Simple email alert via mailutils (for VPS)
echo "AIPrizeCohabitationLife is DOWN. Check immediately." | \
mail -s "[CRITICAL] AIPrizeCohabitationLife DOWN" admin@example.com
# Or via SendGrid API (more reliable)
curl -s --request POST \
--url https://api.sendgrid.com/v3/mail/send \
--header "Authorization: Bearer $SENDGRID_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"personalizations":[{"to":[{"email":"admin@example.com"}]}],
"from":{"email":"alerts@ailife.example.com"},
"subject":"[CRITICAL] AIPrizeCohabitationLife DOWN",
"content":[{"type":"text/plain","value":"Service is down. Check immediately."}]
}'

4.4 アラート抑制 / Alert Suppression
アラートのスパムを防ぐための重複排除ロジック。
Deduplication logic to prevent alert spam.
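The deduplication rules for this section (suppress repeats of the same alert type within a 15-minute window, summarize what was suppressed, always let recovery notifications through) can be sketched as a small class. The class and method names are assumptions, and the clock is passed in as a millisecond timestamp so the logic can be tested without waiting.

```typescript
class AlertDeduplicator {
  private lastSent = new Map<string, number>();   // alert type -> last send time (ms)
  private suppressed = new Map<string, number>(); // alert type -> suppressed count
  private readonly windowMs: number;

  constructor(windowMs: number = 15 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  /** Returns true when the alert should be sent now. */
  shouldSend(alertType: string, nowMs: number, isRecovery: boolean = false): boolean {
    if (isRecovery) return true; // recovery notifications are never suppressed
    const last = this.lastSent.get(alertType);
    if (last !== undefined && nowMs - last < this.windowMs) {
      this.suppressed.set(alertType, (this.suppressed.get(alertType) ?? 0) + 1);
      return false;
    }
    this.lastSent.set(alertType, nowMs);
    return true;
  }

  /** Call once per window: returns summary lines and resets the counters. */
  flushSummary(): string[] {
    const lines: string[] = [];
    this.suppressed.forEach((count, alertType) => {
      if (count > 0) lines.push(`${count} alerts suppressed (${alertType})`);
    });
    this.suppressed.clear();
    return lines;
  }
}
```

In use, the notifier would call `shouldSend(type, Date.now())` before posting to Slack and run `flushSummary()` on a 15-minute timer, posting any lines it returns.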
┌─────────────────────────────────────────────────────┐
│  Alert Deduplication Logic                          │
│                                                     │
│  1. New alert generated                             │
│  2. Check: same alert_type in last 15 min?          │
│     ├── Yes → Suppress (increment counter)          │
│     └── No  → Send alert                            │
│  3. Every 15 min: if suppressed count > 0           │
│     └── Send summary: "X alerts suppressed"         │
│  4. On recovery: always send recovery notification  │
└─────────────────────────────────────────────────────┘

5. 配信リカバリ / Stream Recovery
5.1 サーバーダウンからの自動復旧フロー / Auto-Recovery from Server Down
Server Down
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Phase 1: Process Recovery (0〜10 seconds) │
│ │
│ pm2 detects crash → restart_delay: 5s → restart process │
│ ・exp_backoff_restart_delay: 100ms base (exponential) │
│ ・max_restarts: 10 │
│ ・Game state loaded from SQLite automatically │
└──────────────────────────────────┬─────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Phase 2: WebSocket Reconnection (10〜30 seconds) │
│ │
│ ・Frontend (PixiJS) detects WS disconnect │
│ ・Auto-reconnect with exponential backoff │
│ ・On reconnect: full state sync from backend │
│ ・During disconnect: show "Reconnecting..." overlay │
└──────────────────────────────────┬─────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Phase 3: OBS Reconnection (30〜60 seconds) │
│ │
│ ・OBS browser source auto-reloads on navigation │
│ ・OBS "Automatically reconnect" setting handles stream │
│ ・RTMP reconnection to YouTube/TikTok (automatic) │
│ ・Viewers see brief freeze → stream resumes │
└──────────────────────────────────┬─────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Phase 4: Platform Reconnection (30〜120 seconds) │
│ │
│ ・YouTube LiveChat polling resumes automatically │
│ ・TikTok WebSocket reconnects with auto-reconnect logic │
│ ・Tip processing queue drains any queued events │
│ ・Game cycle timer re-calibrates to absolute time │
└────────────────────────────────────────────────────────────────┘

5.2 OBS 設定 / OBS Configuration
OBS が自動的に復帰するための推奨設定。
Recommended OBS settings for automatic recovery.
| Setting | Location | Value |
|---|---|---|
| Auto-reconnect | Settings > Advanced > Automatically Reconnect | Enabled |
| Retry delay | Settings > Advanced > Retry Delay | 10 seconds |
| Max retries | Settings > Advanced > Maximum Retries | 100 (effectively infinite) |
| Browser source refresh | Source properties > Refresh browser when scene becomes active | Enabled |
| Custom CSS (optional) | Browser source > Custom CSS | Hide reconnecting overlay from stream |
5.3 フロントエンド再接続ロジック / Frontend Reconnection Logic
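As a hedged illustration of the reconnection strategy in this section, the delay before each retry can be computed from the four configuration values (1 s initial delay, 30 s cap, multiplier of 2, unlimited retries). The function name `reconnectDelay` and the convention of returning `null` to mean "stop retrying" are assumptions.

```typescript
interface ReconnectConfig {
  initialDelay: number;      // ms before the first retry
  maxDelay: number;          // upper bound on any delay, in ms
  backoffMultiplier: number; // growth factor per attempt
  maxRetries: number;        // Infinity means never give up
}

const EXAMPLE_RECONNECT_CONFIG: ReconnectConfig = {
  initialDelay: 1000,
  maxDelay: 30000,
  backoffMultiplier: 2,
  maxRetries: Infinity,
};

/**
 * Delay in ms before retry number `attempt` (0-based),
 * or null once the retry budget is exhausted.
 */
function reconnectDelay(
  attempt: number,
  cfg: ReconnectConfig = EXAMPLE_RECONNECT_CONFIG,
): number | null {
  if (attempt >= cfg.maxRetries) return null;
  const raw = cfg.initialDelay * Math.pow(cfg.backoffMultiplier, attempt);
  return Math.min(raw, cfg.maxDelay);
}
```

With these values the delays run 1 s, 2 s, 4 s, 8 s, 16 s and then stay at 30 s. A production client might add random jitter so that many viewers' browsers do not all reconnect at the same instant after a restart.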
// packages/frontend/src/network/WebSocketClient.ts
// Reconnection strategy
const RECONNECT_CONFIG = {
initialDelay: 1000, // 1 second
maxDelay: 30000, // 30 seconds
backoffMultiplier: 2, // exponential
maxRetries: Infinity, // never give up
};
// Connection states displayed to viewers:
// ・Connected → normal game display
// ・Reconnecting → game frozen + subtle "Reconnecting..." text
// ・Disconnected → "Stream will resume shortly" message

6. メトリクスダッシュボード / Metrics Dashboard
6.1 設計方針 / Design Approach
外部ツール(Grafana等)は導入しない。管理者用の簡易HTMLダッシュボードをバックエンドから配信する。
No external tools (Grafana, etc.). Serve a simple HTML admin dashboard from the backend.
┌──────────────────────────────────────────────────────────────┐
│ Admin Dashboard (http://localhost:3000/admin/dashboard) │
│ Protected by ADMIN_API_KEY │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ System Status: ● HEALTHY Uptime: 3d 14h │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────┐ ┌─────────────────────────────┐ │
│ │ Process │ │ LLM API │ │
│ │ Memory: 145 MB │ │ Avg Response: 1.2s │ │
│ │ CPU: 8.2% │ │ Error Rate: 0.5% │ │
│ │ Restarts: 0 │ │ Timeouts: 2 (last 1h) │ │
│ │ Uptime: 3d 14h 22m │ │ Fallbacks: 0 │ │
│ └─────────────────────┘ └─────────────────────────────┘ │
│ │
│ ┌─────────────────────┐ ┌─────────────────────────────┐ │
│ │ Tips (last 1h) │ │ Game State │ │
│ │ Processed: 12 │ │ Day: 15 │ │
│ │ Failed: 0 │ │ Cycle: 144 │ │
│ │ Queue: 0 │ │ Last Cycle: 2 min ago │ │
│ │ Revenue: ¥3,500 │ │ John: cooking (mood: 72) │ │
│ └─────────────────────┘ │ Sara: reading (mood: 85) │ │
│ │ Eve: sleeping │ │
│ ┌─────────────────────┐ └─────────────────────────────┘ │
│ │ Platforms │ │
│ │ YouTube: ● OK │ ┌─────────────────────────────┐ │
│ │ TikTok: ● OK │ │ Backups │ │
│ │ OBS/WS: ● OK │ │ Last: 30 min ago (OK) │ │
│ │ Viewers: 42 │ │ DB Size: 6.3 MB │ │
│ └─────────────────────┘ │ Disk Free: 42 GB │ │
│ └─────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Recent Alerts │ │
│ │ [INFO] 03-11 00:00 Daily backup OK (6.3 MB) │ │
│ │ [WARN] 03-10 18:22 LLM timeout (retry succeeded) │ │
│ │ [INFO] 03-10 15:00 Hourly backup OK (6.2 MB) │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

6.2 ダッシュボード実装 / Dashboard Implementation
// packages/backend/src/admin/dashboard.ts
// Serve static HTML dashboard
// GET /admin/dashboard — requires ADMIN_API_KEY query param or cookie
// Dashboard uses:
// ・Vanilla HTML + CSS (no framework needed)
// ・Polls /api/health every 10 seconds
// ・Polls /api/metrics every 30 seconds
// ・Auto-refreshes status indicators
// ・Color coding: green (ok), yellow (warning), red (error)

6.3 メトリクスAPI / Metrics API
// GET /api/metrics
// Returns detailed metrics for the dashboard
interface MetricsResponse {
// Time-series data (last 24 hours, 10-min intervals)
llm_response_times: { time: string; avg_ms: number; p95_ms: number }[];
tip_counts: { time: string; count: number; failed: number }[];
memory_usage: { time: string; mb: number }[];
cycle_durations: { time: string; ms: number }[];
// Aggregated stats
totals: {
tips_today: number;
tips_total: number;
llm_calls_today: number;
errors_today: number;
uptime_seconds: number;
game_days_elapsed: number;
};
// Recent events (last 50)
recent_alerts: {
level: 'critical' | 'warning' | 'info';
message: string;
timestamp: string;
}[];
}

7. 運用チェックリスト / Operations Checklist
7.1 デプロイ前 / Pre-Deployment
- [ ] `ecosystem.config.js` の `max_memory_restart` がサーバーRAMに適合しているか確認 / Verify `max_memory_restart` fits server RAM
- [ ] `SLACK_WEBHOOK_URL` 環境変数が設定済み / `SLACK_WEBHOOK_URL` env var is set
- [ ] Crontab にバックアップとヘルスチェックが登録済み / Crontab has backup and health check entries
- [ ] `./backups/` ディレクトリが存在する / `./backups/` directory exists
- [ ] `./logs/` ディレクトリが存在する / `./logs/` directory exists
- [ ] OBS の自動再接続が有効 / OBS auto-reconnect is enabled
- [ ] `sqlite3` コマンドが利用可能 / `sqlite3` command is available
- [ ] バックアップスクリプトに実行権限がある / Backup scripts have execute permission
7.2 毎日の確認 / Daily Checks
- [ ] `pm2 status` で正常稼働を確認 / Verify normal operation via `pm2 status`
- [ ] `curl http://localhost:3000/api/health` が `healthy` を返す / Health endpoint returns `healthy`
- [ ] バックアップログにエラーがないか確認 / Check backup logs for errors
- [ ] Slack に異常なアラートがないか確認 / Check Slack for abnormal alerts
- [ ] ディスク空き容量の確認(`df -h`) / Check disk free space (`df -h`)
7.3 週次メンテナンス / Weekly Maintenance
- [ ] pm2 ログのクリア(`pm2 flush` はログファイルを空にする) / Clear pm2 logs (`pm2 flush` empties the log files; it does not rotate them)
- [ ] 古いバックアップが自動削除されているか確認 / Verify old backups are auto-purged
- [ ] SQLite VACUUM の実行(必要に応じて) / Run SQLite VACUUM if needed
- [ ] メモリクリーンアップの確認(重要度 < 0.5 のデータ削除) / Verify memory cleanup (delete importance < 0.5 data)
8. 環境変数一覧 / Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `DB_PATH` | No | `./data/game.sqlite` | SQLite database file path / SQLite データベースファイルパス |
| `BACKUP_DIR` | No | `./backups` | Backup directory / バックアップディレクトリ |
| `SLACK_WEBHOOK_URL` | Yes | - | Slack incoming webhook URL / Slack Webhook URL |
| `SENDGRID_API_KEY` | No | - | SendGrid API key for email alerts / メール通知用 SendGrid API キー |
| `ADMIN_API_KEY` | Yes | - | Admin dashboard authentication / 管理ダッシュボード認証キー |
| `HEALTH_URL` | No | `http://localhost:3000/api/health` | Health check endpoint / ヘルスチェックエンドポイント |
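A minimal sketch of loading these variables at startup, applying the defaults from the table and failing fast when a required one is missing. The `MonitoringConfig` shape and the `loadConfig` name are assumptions for illustration.

```typescript
interface MonitoringConfig {
  dbPath: string;
  backupDir: string;
  slackWebhookUrl: string;
  sendgridApiKey?: string;
  adminApiKey: string;
  healthUrl: string;
}

const REQUIRED_VARS = ["SLACK_WEBHOOK_URL", "ADMIN_API_KEY"];

/** Build the config from an env-like object; throws if a required variable is missing. */
function loadConfig(env: Record<string, string | undefined>): MonitoringConfig {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    dbPath: env["DB_PATH"] ?? "./data/game.sqlite",
    backupDir: env["BACKUP_DIR"] ?? "./backups",
    slackWebhookUrl: env["SLACK_WEBHOOK_URL"] ?? "",
    sendgridApiKey: env["SENDGRID_API_KEY"],
    adminApiKey: env["ADMIN_API_KEY"] ?? "",
    healthUrl: env["HEALTH_URL"] ?? "http://localhost:3000/api/health",
  };
}
```

In a Node backend this would typically be called once as `loadConfig(process.env)`, so a missing webhook URL is caught at boot rather than when the first alert fails to send.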
9. トラブルシューティング / Troubleshooting
よくある問題 / Common Issues
| Problem | Symptom | Solution |
|---|---|---|
| pm2 won't restart | `pm2 status` shows `errored` | Check `pm2 logs ailife` and fix the root cause, then run `pm2 delete ailife && pm2 start ecosystem.config.js` |
| DB locked | `SQLITE_BUSY` errors in logs | Ensure only one process accesses the DB. Check for zombie processes: `ps aux \| grep node` |
| Backup fails | `sqlite3: command not found` | Install sqlite3: `apt install sqlite3` (Ubuntu) or `brew install sqlite3` (macOS) |
| OBS won't reconnect | Stream shows black screen | Verify OBS auto-reconnect is ON. Manually refresh browser source. Check RTMP endpoint |
| Slack alerts not sent | No messages in Slack channel | Verify `SLACK_WEBHOOK_URL`. Test: `curl -X POST -H 'Content-type: application/json' -d '{"text":"test"}' "$SLACK_WEBHOOK_URL"` |
| High memory | pm2 keeps restarting (memory limit) | Check for memory leaks. Increase `max_memory_restart`. Run SQLite VACUUM |
| Stale game cycles | `last_cycle_at` is old | Check if TurnScheduler is running. Look for blocking LLM calls in logs |
| TikTok disconnects | Platform status shows disconnected | TikTok WebSocket is unstable by nature. Auto-reconnect handles this. Check network |