
エラーハンドリング・フォールバック設計 / Error Handling & Fallback Design

24時間配信で止まらないための各種障害対応パターン。

Resilience patterns to keep the 24-hour stream running without interruption.

設計原則 / Design Principles

  1. 配信は絶対に止めない / Never stop the stream — 障害が起きても視聴者に影響を与えない / Failures must never affect viewers
  2. サイレント復旧 / Silent recovery — 可能な限り自動復旧し、視聴者に障害を気づかせない / Recover automatically whenever possible, so viewers never notice
  3. データは失ってもいい、体験は失わない / Data can be lost, experience cannot — 配信停止は一部データロストより悪い / Stopping the stream is worse than losing some data
  4. 多重防御 / Defense in depth — 1つの対策が失敗しても次の対策がカバーする / If one safeguard fails, the next one covers it

エラー重大度レベル / Error Severity Levels

| Level | Name | 影響 / Impact | 対応 / Response | 通知 / Notification |
|-------|------|---------------|-----------------|---------------------|
| L0 | FATAL | 配信停止 / Stream stops | pm2 自動再起動 + 管理者通知 / pm2 auto-restart + admin alert | Slack + SMS |
| L1 | CRITICAL | 主要機能停止 / Major feature down | 自動フォールバック + 管理者通知 / Auto-fallback + admin alert | Slack |
| L2 | WARNING | 機能劣化 / Degraded experience | 自動フォールバック、ログ記録 / Auto-fallback, logging | Slack (集約 / batched) |
| L3 | INFO | 軽微な問題 / Minor issue | ログ記録のみ / Log only | ログファイルのみ / Log file only |

エラー分類マトリクス / Error Classification Matrix

                    ┌─────────────────────────────────────────┐
                    │         Error Classification             │
                    │                                          │
  L0 FATAL          │  ・Process crash (uncaught exception)    │
                    │  ・SQLite DB corruption (unrecoverable)  │
                    │  ・Port already in use                   │
                    │                                          │
  L1 CRITICAL       │  ・LLM API全面停止 (complete outage)     │
                    │  ・WebSocket server failure              │
                    │  ・Memory > 90% threshold                │
                    │                                          │
  L2 WARNING        │  ・LLM API timeout (single request)     │
                    │  ・YouTube API quota near limit          │
                    │  ・TikTok WebSocket disconnect           │
                    │  ・SQLite write failure (single)         │
                    │                                          │
  L3 INFO           │  ・LLM response slow (> 3s)             │
                    │  ・Comment filter triggered              │
                    │  ・Memory cleanup executed               │
                    └─────────────────────────────────────────┘

1. LLM API 障害対応 / LLM API Failure Handling

1.1 障害パターン / Failure Patterns

| Pattern | Cause | HTTP Status | Frequency |
|---------|-------|-------------|-----------|
| Timeout | Network latency, API overload | - | Common |
| Rate Limit | Too many requests | 429 | Occasional |
| Server Error | Claude API internal error | 500, 502, 503 | Rare |
| Auth Failure | API key expired/invalid | 401, 403 | Very rare |
| Complete Outage | Service fully down | Connection refused | Very rare |

1.2 リトライ戦略 / Retry Strategy

LLM API Call


┌──────────────┐    Success
│  First Try    │──────────────▶ Process Response
│  timeout: 5s  │
└──────┬───────┘
       │ Failure

┌──────────────┐    Success
│  Retry #1     │──────────────▶ Process Response
│  wait: 1s     │
│  timeout: 8s  │
└──────┬───────┘
       │ Failure

┌──────────────┐    Success
│  Retry #2     │──────────────▶ Process Response
│  wait: 3s     │
│  timeout: 10s │
└──────┬───────┘
       │ Failure

┌──────────────┐    Success
│  Retry #3     │──────────────▶ Process Response
│  wait: 8s     │
│  timeout: 15s │
└──────┬───────┘
       │ All retries failed

┌──────────────┐
│  FALLBACK     │──────────────▶ Use pre-defined fallback action
└──────────────┘

Exponential backoff with jitter:

```typescript
function getRetryDelay(attempt: number): number {
  const base = 1000;      // 1 second
  const maxDelay = 15000; // 15 seconds
  const delay = Math.min(base * Math.pow(2, attempt), maxDelay);
  const jitter = delay * 0.2 * Math.random(); // up to +20% jitter
  return delay + jitter;
}
```

1.3 Rate Limit 対応 / Rate Limit Handling

```typescript
// Rate limit headers から wait 時間を取得
// Extract wait time from rate limit headers
interface RateLimitHandler {
  // Respect the Retry-After header (seconds)
  handleRateLimit(retryAfter: number): void;

  // Token bucket for self-throttling
  tokenBucket: {
    maxTokens: number;     // 60 per minute
    refillRate: number;    // 1 token per second
    currentTokens: number;
  };
}
```
| Condition | Action |
|-----------|--------|
| 429 with Retry-After | Wait the specified duration, then retry |
| 429 without Retry-After | Wait 60s, then retry |
| Consecutive 429s (3+) | Extend cycle to 15 min temporarily |
| Rate limit persists > 10 min | Switch to fallback-only mode |
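The self-throttling half of this policy can be sketched as a token bucket. This is a minimal illustration of the budget in the table above (60 calls per minute, refilled at 1 token per second); the class and method names are assumptions, not the project's actual API.

```typescript
// Token-bucket self-throttle: refuse calls once the per-minute budget is spent.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly maxTokens: number = 60, // budget per minute
    private readonly refillPerSec: number = 1,
    now: number = Date.now(),
  ) {
    this.tokens = maxTokens;
    this.lastRefill = now;
  }

  // Returns true if a request may be sent now, consuming one token.
  tryAcquire(now: number = Date.now()): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

A caller that gets `false` back would wait (or use a fallback action) instead of hitting the API, which keeps the system under the 429 threshold even before the server pushes back.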

1.4 フォールバック行動 / Fallback Actions

LLM が応答不能な場合、事前定義されたフォールバック行動を使用する。

When LLM is unavailable, use pre-defined fallback actions.

フォールバック行動テーブル / Fallback Action Table

```typescript
interface FallbackAction {
  action: string;
  dialogue_ja: string;
  dialogue_en: string;
  animation: string;
  duration_ms: number;
}
```
| Time of Day | John Fallback | Sara Fallback | Eve Fallback |
|-------------|---------------|---------------|--------------|
| 06:00-08:00 | Wake up, stretch | Wake up, make coffee | Wake up, wag tail |
| 08:00-12:00 | Work at desk | Work at desk | Nap near owner |
| 12:00-13:00 | Eat (default meal) | Eat (default meal) | Eat (dog food) |
| 13:00-18:00 | Work at desk | Work at desk | Play with ball |
| 18:00-20:00 | Rest, read | Cook dinner | Follow Sara |
| 20:00-22:00 | Watch TV / talk | Watch TV / talk | Nap on sofa |
| 22:00-06:00 | Sleep | Sleep | Sleep |
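A lookup over this schedule reduces to a simple hour-range match. The sketch below abbreviates the table into action IDs; the IDs, names, and function are illustrative, not the project's actual data.

```typescript
// Pick a pre-defined fallback action by clock hour, per the schedule above.
type Slot = { from: number; to: number; john: string; sara: string; eve: string };

const FALLBACK_SCHEDULE: Slot[] = [
  { from: 6,  to: 8,  john: 'wake_stretch', sara: 'wake_coffee', eve: 'wake_wag_tail' },
  { from: 8,  to: 12, john: 'work_desk',    sara: 'work_desk',   eve: 'nap_near_owner' },
  { from: 12, to: 13, john: 'eat_default',  sara: 'eat_default', eve: 'eat_dog_food' },
  { from: 13, to: 18, john: 'work_desk',    sara: 'work_desk',   eve: 'play_ball' },
  { from: 18, to: 20, john: 'rest_read',    sara: 'cook_dinner', eve: 'follow_sara' },
  { from: 20, to: 22, john: 'watch_tv',     sara: 'watch_tv',    eve: 'nap_sofa' },
];

function fallbackFor(agent: 'john' | 'sara' | 'eve', hour: number): string {
  // Hours outside every slot (22:00-06:00) fall through to sleep.
  const slot = FALLBACK_SCHEDULE.find((s) => hour >= s.from && hour < s.to);
  return slot ? slot[agent] : 'sleep';
}
```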

フォールバックダイアログ / Fallback Dialogue Templates

```json
{
  "fallback_dialogues": {
    "john": {
      "idle": [
        { "ja": "ふう、ちょっと一息つこう", "en": "Phew, let me take a short break" },
        { "ja": "今日も頑張るか", "en": "Let's do our best today too" },
        { "ja": "そういえば、冷蔵庫に何かあったかな", "en": "I wonder if there's anything in the fridge" },
        { "ja": "イヴ、いい子だな", "en": "Eve, you're such a good girl" },
        { "ja": "...(考え事をしている)", "en": "...(lost in thought)" }
      ],
      "tip_reaction": [
        { "ja": "おっ、ありがとうございます!", "en": "Oh, thank you so much!" },
        { "ja": "うれしいな、ありがとう!", "en": "That makes me happy, thanks!" }
      ]
    },
    "sara": {
      "idle": [
        { "ja": "ちょっと休憩しよっかな", "en": "Maybe I'll take a little break" },
        { "ja": "イヴ〜おいで〜", "en": "Eve~ come here~" },
        { "ja": "今日の夕飯、何にしよう", "en": "What should I make for dinner today" },
        { "ja": "ジョン、お疲れ様", "en": "John, good work today" },
        { "ja": "お部屋、ちょっと片付けよう", "en": "Let me tidy up a bit" }
      ],
      "tip_reaction": [
        { "ja": "わあ、ありがとうございます!", "en": "Wow, thank you so much!" },
        { "ja": "嬉しい〜!ありがとう!", "en": "So happy~! Thank you!" }
      ]
    }
  }
}
```

1.5 LLM障害時のモード遷移 / LLM Failure Mode Transitions

                 NORMAL MODE
                      │
          LLM fails 3 consecutive times
                      ▼
              DEGRADED MODE
         (fallback actions only)
         (10-min cycle continues)
         (tip reactions use templates)
                      │
          LLM recovers (1 success)
                      ▼
              RECOVERY MODE
         (test with 1 agent first)
         (if OK, restore all agents)
                      │
          3 consecutive successes
                      ▼
                 NORMAL MODE
| Mode | Behavior | Viewer Impact |
|------|----------|---------------|
| NORMAL | Full LLM-powered actions and dialogue | None |
| DEGRADED | Pre-defined fallback actions and dialogue templates | Slightly repetitive but natural |
| RECOVERY | Gradually restoring LLM calls | Minimal |

2. プラットフォーム API 障害対応 / Platform API Failure Handling

2.1 YouTube Live API

障害パターン / Failure Patterns

| Pattern | Cause | Impact |
|---------|-------|--------|
| Polling failure | API quota exceeded, network error | Comments not received |
| Auth token expired | OAuth token needs refresh | All API calls fail |
| Live chat ended | Stream ended/restarted on YouTube side | Chat polling returns empty |
| Quota exhaustion | Too many API calls in 24h | All calls rejected |

再接続戦略 / Reconnection Strategy

YouTube LiveChat Poll


┌──────────────────┐
│  Poll every 5s    │◀──────────────────────┐
└────────┬─────────┘                        │
         │ Error                             │
         ▼                                  │
┌──────────────────┐                        │
│  Increase interval│                       │
│  5s → 10s → 30s  │                       │
│  → 60s → 120s    │                       │
└────────┬─────────┘                        │
         │                                  │
         ▼                                  │
┌──────────────────┐    Token expired?      │
│  Check error type │──────────────▶ Refresh token
└────────┬─────────┘                        │
         │ Other error                      │
         ▼                                  │
┌──────────────────┐    Recovered?          │
│  Wait & retry     │──────────────────────┘
│  (backoff)        │
└────────┬─────────┘
         │ Failed > 10 min

┌──────────────────┐
│  LOG WARNING      │
│  Continue without │
│  YouTube comments │
└──────────────────┘
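The interval escalation in the diagram can be kept in a tiny scheduler: errors walk up the 5s → 10s → 30s → 60s → 120s ladder, and one success snaps back to 5s. A sketch, with illustrative names:

```typescript
// Escalate the YouTube poll interval on consecutive errors; reset on success.
const POLL_LADDER_MS = [5_000, 10_000, 30_000, 60_000, 120_000];

class PollScheduler {
  private errorStreak = 0;

  // Returns the next poll interval after a failed poll.
  onError(): number {
    this.errorStreak = Math.min(this.errorStreak + 1, POLL_LADDER_MS.length - 1);
    return POLL_LADDER_MS[this.errorStreak];
  }

  // Returns the next poll interval after a successful poll (back to 5s).
  onSuccess(): number {
    this.errorStreak = 0;
    return POLL_LADDER_MS[0];
  }
}
```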

OAuth トークン自動リフレッシュ / OAuth Token Auto-Refresh

```typescript
class YouTubeTokenManager {
  private refreshToken: string;
  private accessToken: string;
  private expiresAt: number;

  async ensureValidToken(): Promise<string> {
    // Refresh 5 minutes before expiration
    if (Date.now() > this.expiresAt - 5 * 60 * 1000) {
      await this.refreshAccessToken();
    }
    return this.accessToken;
  }

  async refreshAccessToken(): Promise<void> {
    try {
      // POST to OAuth endpoint with refresh_token
      const response = await fetch('https://oauth2.googleapis.com/token', {
        method: 'POST',
        body: new URLSearchParams({
          grant_type: 'refresh_token',
          refresh_token: this.refreshToken,
          client_id: process.env.YOUTUBE_CLIENT_ID!,
          client_secret: process.env.YOUTUBE_CLIENT_SECRET!,
        }),
      });
      if (!response.ok) {
        throw new Error(`Token refresh failed: HTTP ${response.status}`);
      }
      const data = await response.json();
      this.accessToken = data.access_token;
      this.expiresAt = Date.now() + data.expires_in * 1000;
    } catch (error) {
      // L1 CRITICAL: YouTube auth completely broken
      ErrorReporter.report('L1', 'youtube_auth_refresh_failed', error);
    }
  }
}
```

2.2 TikTok Live Connector (WebSocket)

障害パターン / Failure Patterns

| Pattern | Cause | Impact |
|---------|-------|--------|
| WebSocket disconnect | Network instability | Gift/comment events lost |
| Connection rejected | Stream not active, rate limit | Cannot connect |
| Message parse error | Protocol change, malformed data | Individual events lost |
| Stream ended | TikTok stream stopped | All events lost |

再接続戦略 / Reconnection Strategy

TikTok WebSocket


┌──────────────────┐
│  Connected        │◀───────────────────────┐
│  Receiving events │                         │
└────────┬─────────┘                         │
         │ Disconnect                         │
         ▼                                   │
┌──────────────────┐                         │
│  Attempt reconnect│                        │
│  wait: 2s         │    Success             │
│  max retries: ∞   │────────────────────────┘
└────────┬─────────┘
         │ Failure

┌──────────────────┐
│  Exponential      │    Success
│  backoff          │────────────────────────┘
│  2s→5s→10s→30s   │
│  →60s→120s       │
│  cap: 120s        │
└────────┬─────────┘
         │ Failed > 5 min

┌──────────────────┐
│  LOG WARNING      │
│  TikTok module    │
│  enters standby   │
│  Retry every 2min │
└──────────────────┘
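The backoff ladder in this diagram (2s → 5s → 10s → 30s → 60s → 120s, capped at 120s) reduces to a pure function the reconnect loop can call with its attempt count. A sketch with an illustrative name:

```typescript
// Capped backoff for TikTok WebSocket reconnects, matching the diagram above.
const RECONNECT_LADDER_MS = [2_000, 5_000, 10_000, 30_000, 60_000, 120_000];

function reconnectDelay(attempt: number): number {
  // attempt is 0-based; delays past the end of the ladder hold at 120s
  const idx = Math.min(attempt, RECONNECT_LADDER_MS.length - 1);
  return RECONNECT_LADDER_MS[idx];
}
```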

2.3 プラットフォーム障害時の動作 / Behavior During Platform Outage

| Scenario | YouTube Status | TikTok Status | System Behavior |
|----------|----------------|---------------|-----------------|
| Normal | Active | Active | Full dual-platform |
| YT down | Down | Active | TikTok tips only. Characters continue normally |
| TT down | Active | Down | YouTube tips only. Characters continue normally |
| Both down | Down | Down | Autonomous mode: fallback cycles, no tip reactions |
| Recovery | Reconnecting | Reconnecting | Queue events during reconnection |

Key principle: ゲーム自体は投げ銭・コメントなしでも自律的に動き続ける。プラットフォーム障害は「誰もコメントしていない配信」と同じ状態になるだけ。

The game continues autonomously without tips/comments. A platform outage simply looks like "a stream with no viewer interaction."


3. データベース障害対応 / Database Error Handling

3.1 SQLite 障害パターン / SQLite Failure Patterns

| Pattern | Cause | Severity | Impact |
|---------|-------|----------|--------|
| Write failure | Disk full, permission error | L2 | State not persisted |
| Read failure | Corrupted index, I/O error | L1 | Cannot load game state |
| DB locked | Concurrent access (rare with better-sqlite3) | L2 | Delayed write |
| DB corruption | Power loss during write, disk failure | L0 | Full data loss risk |
| WAL overflow | WAL file grows too large | L2 | Performance degradation |

3.2 書き込み障害対応 / Write Failure Handling

State Update (every 10-min cycle)


┌──────────────────┐
│  Write to SQLite  │
└────────┬─────────┘
         │ Success → Done
         │ Failure

┌──────────────────┐
│  Retry write      │
│  (3 attempts,     │
│   1s interval)    │
└────────┬─────────┘
         │ All retries failed

┌──────────────────┐
│  Hold state in    │
│  memory           │
│  Write to backup  │
│  JSON file        │
└────────┬─────────┘


┌──────────────────┐
│  Continue game    │
│  with in-memory   │
│  state            │
│  Retry DB write   │
│  next cycle       │
└──────────────────┘

インメモリフォールバック / In-Memory Fallback

```typescript
import { promises as fs } from 'fs';

class StateManager {
  private memoryState: GameState;
  private dbAvailable: boolean = true;
  private pendingWrites: GameState[] = [];

  async saveState(state: GameState): Promise<void> {
    this.memoryState = state; // Always keep in memory

    if (this.dbAvailable) {
      try {
        await this.db.saveGameState(state);
        // Flush any pending writes
        await this.flushPendingWrites();
      } catch (error) {
        this.dbAvailable = false;
        this.pendingWrites.push(state);
        await this.saveToBackupJson(state);
        ErrorReporter.report('L2', 'sqlite_write_failed', error);
      }
    } else {
      this.pendingWrites.push(state);
      // Try to reconnect every 5 cycles
      if (this.pendingWrites.length % 5 === 0) {
        await this.attemptDbReconnect();
      }
    }
  }

  private async saveToBackupJson(state: GameState): Promise<void> {
    const backupPath = `./data/backup/state_${Date.now()}.json`;
    await fs.writeFile(backupPath, JSON.stringify(state, null, 2));
  }
}
```

3.3 DB破損対策 / DB Corruption Prevention

定期バックアップ / Periodic Backup

| Interval | Backup Type | Retention |
|----------|-------------|-----------|
| Every 10 min | WAL checkpoint | Current only |
| Every 1 hour | Full SQLite copy | Last 24 copies |
| Every 24 hours | Compressed archive | Last 7 archives |
```typescript
class DatabaseBackup {
  // SQLite online backup API via better-sqlite3
  performBackup(): void {
    const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
    const backupPath = `./data/backup/game_${timestamp}.db`;

    this.db.backup(backupPath)
      .then(() => this.cleanOldBackups())
      .catch((err) => ErrorReporter.report('L2', 'backup_failed', err));
  }

  // WAL checkpoint to prevent WAL overflow
  checkpoint(): void {
    this.db.pragma('wal_checkpoint(TRUNCATE)');
  }

  // Integrity check (run daily during low-activity hours)
  integrityCheck(): boolean {
    const result = this.db.pragma('integrity_check');
    return result[0].integrity_check === 'ok';
  }
}
```

破損時の復旧手順 / Corruption Recovery

DB Integrity Check Failed


┌──────────────────┐
│  1. Try VACUUM    │── Success → Continue
└────────┬─────────┘
         │ Failure

┌──────────────────┐
│  2. Try .recover  │── Success → Rebuilt DB
│  (SQLite recovery │
│   mode)           │
└────────┬─────────┘
         │ Failure

┌──────────────────┐
│  3. Restore from  │── Success → Resume with
│  latest hourly    │   some data loss
│  backup           │
└────────┬─────────┘
         │ Failure

┌──────────────────┐
│  4. Initialize    │── Fresh start
│  fresh DB         │   All memory lost
│  LOG L0 FATAL     │
│  Notify admin     │
└──────────────────┘
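The four-step cascade above is a "first repair that succeeds wins" loop. The sketch below shows the shape of that logic with placeholder step functions; the names and the step implementations are illustrative, not the project's actual recovery code.

```typescript
// Try each repair step in order; stop at the first that succeeds.
type RecoveryStep = { name: string; run: () => boolean };

function recoverDatabase(steps: RecoveryStep[]): string {
  for (const step of steps) {
    if (step.run()) return step.name; // first success wins
  }
  return 'unrecoverable';
}

// Usage sketch: VACUUM → .recover → restore backup → fresh DB.
const outcome = recoverDatabase([
  { name: 'vacuum',         run: () => false /* VACUUM failed */ },
  { name: 'sqlite_recover', run: () => false /* .recover failed */ },
  { name: 'restore_backup', run: () => true  /* hourly backup restored */ },
  { name: 'init_fresh',     run: () => true  /* always succeeds, L0 FATAL */ },
]);
```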

4. メモリ管理 / Memory Management

4.1 メモリリーク対策 / Memory Leak Prevention

24時間連続稼働ではメモリリークが蓄積して致命的になる。

In 24-hour continuous operation, memory leaks accumulate and become fatal.

メモリリスクポイント / Memory Risk Points

| Component | Risk | Mitigation |
|-----------|------|------------|
| Event queue | Unbounded growth if processing is slow | Max queue size (1,000). Drop oldest COMMENT events |
| LLM prompt history | Context grows with conversation | Keep only last 5 actions in prompt. Summarize beyond |
| WebSocket connections | Dangling connections accumulate | Connection timeout (60s inactivity). Periodic cleanup |
| TikTok event buffer | High-traffic streams flood buffer | Rate limit to 100 events/min. Drop duplicates |
| Episodic memory cache | In-memory cache grows | LRU cache with 200 entry limit |
| Tip effect queue | Many simultaneous tips | Max 5 concurrent effects. Queue excess |
| Log buffers | Console/file log accumulation | Rotate logs daily. Max 100MB per file |
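The event-queue mitigation (cap at 1,000 entries, drop the oldest) can be sketched as a bounded queue. Class and method names here are illustrative:

```typescript
// Bounded queue: cap at maxSize and drop the oldest entries when full.
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private readonly maxSize: number = 1_000) {}

  push(item: T): void {
    this.items.push(item);
    // Drop oldest entries once the cap is exceeded
    while (this.items.length > this.maxSize) this.items.shift();
  }

  size(): number {
    return this.items.length;
  }

  peekOldest(): T | undefined {
    return this.items[0];
  }
}
```

A refinement in the spirit of the table would drop only low-priority COMMENT events first, keeping tip events intact.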

定期クリーンアップ / Periodic Cleanup

```typescript
class MemoryManager {
  private readonly CLEANUP_INTERVAL = 30 * 60 * 1000; // 30 minutes
  private readonly MEMORY_WARNING_THRESHOLD = 0.8;    // 80% of max
  private readonly MEMORY_CRITICAL_THRESHOLD = 0.9;   // 90% of max
  private readonly MAX_HEAP_MB = 512;

  startMonitoring(): void {
    setInterval(() => this.performCleanup(), this.CLEANUP_INTERVAL);
    setInterval(() => this.checkMemoryUsage(), 60 * 1000); // every 1 min
  }

  private performCleanup(): void {
    // 1. Clear resolved promises and stale references
    this.eventQueue.pruneProcessed();

    // 2. Clear old WebSocket message buffers
    this.wsManager.clearMessageHistory();

    // 3. Trim episodic memory cache
    this.memoryCache.trimToLimit(200);

    // 4. Force garbage collection if available
    if (global.gc) {
      global.gc();
    }

    // 5. Log memory status
    const usage = process.memoryUsage();
    Logger.info('memory_cleanup', {
      heapUsed: Math.round(usage.heapUsed / 1024 / 1024),
      heapTotal: Math.round(usage.heapTotal / 1024 / 1024),
      rss: Math.round(usage.rss / 1024 / 1024),
      external: Math.round(usage.external / 1024 / 1024),
    });
  }

  private checkMemoryUsage(): void {
    const usage = process.memoryUsage();
    const heapUsedMB = usage.heapUsed / 1024 / 1024;
    const ratio = heapUsedMB / this.MAX_HEAP_MB;

    if (ratio > this.MEMORY_CRITICAL_THRESHOLD) {
      ErrorReporter.report('L1', 'memory_critical', { heapUsedMB, ratio });
      this.emergencyCleanup();
    } else if (ratio > this.MEMORY_WARNING_THRESHOLD) {
      ErrorReporter.report('L2', 'memory_warning', { heapUsedMB, ratio });
      this.performCleanup();
    }
  }

  private emergencyCleanup(): void {
    // Aggressive cleanup to avoid OOM
    this.eventQueue.clear();         // Drop all pending events
    this.memoryCache.clear();        // Clear all cached memories
    this.wsManager.dropAllBuffers(); // Clear WS buffers
    if (global.gc) global.gc();

    Logger.warn('emergency_cleanup_executed');
  }
}
```

4.2 Node.js プロセス設定 / Node.js Process Configuration

```javascript
// pm2 ecosystem.config.js
module.exports = {
  apps: [{
    name: 'cohabitation-life',
    script: './dist/server.js',
    node_args: '--max-old-space-size=512 --expose-gc',
    max_memory_restart: '480M',      // Restart before hitting the 512M heap limit
    exp_backoff_restart_delay: 100,  // 100ms base for exponential restart backoff
    max_restarts: 100,
    min_uptime: '10s',               // Runs shorter than this count as failed starts
    kill_timeout: 5000,
    listen_timeout: 10000,
    autorestart: true,
    watch: false,
  }]
};
```

5. WebSocket 障害対応 / WebSocket Disconnection Handling

5.1 Backend → Frontend WebSocket

障害パターン / Failure Patterns

| Pattern | Cause | Impact |
|---------|-------|--------|
| Client disconnect | Browser tab closed, OBS restart | Frontend stops updating |
| Server WebSocket crash | Unhandled error in WS handler | All clients disconnect |
| Network interruption | Local network issue | Temporary disconnect |
| Message serialization error | Invalid state object | Single message lost |

再接続プロトコル / Reconnection Protocol

Frontend (Browser / OBS Browser Source)


┌──────────────────┐
│  WebSocket        │
│  Connected        │◀──────────────────────────┐
└────────┬─────────┘                             │
         │ onclose / onerror                     │
         ▼                                      │
┌──────────────────┐                             │
│  Show "Reconnect  │                            │
│  ing..." overlay  │                            │
│  (non-intrusive)  │                            │
└────────┬─────────┘                             │
         │                                      │
         ▼                                      │
┌──────────────────┐  Connected                  │
│  Reconnect        │───────────────▶ Request    │
│  attempt          │               full state   │
│  1s→2s→4s→8s→15s │               sync         │
│  cap: 15s         │                ─────────────┘
└────────┬─────────┘
         │ All attempts (indefinite retry)

┌──────────────────┐
│  Keep retrying    │
│  every 15s        │
│  Show "Offline"   │
│  static animation │
└──────────────────┘
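The client-side delay ladder in this diagram (1s → 2s → 4s → 8s → 15s, capped at 15s) is a pure function the overlay code can use to schedule retries. The function name is illustrative:

```typescript
// Client-side reconnect delay: exponential growth capped at 15s.
function clientReconnectDelay(attempt: number): number {
  const delay = 1_000 * Math.pow(2, attempt); // 1s, 2s, 4s, 8s, ...
  return Math.min(delay, 15_000);             // cap at 15s
}
```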

サーバー側 Heartbeat / Server-Side Heartbeat

```typescript
import type { WebSocket } from 'ws';

// isAlive is attached at runtime for liveness tracking
type AliveWebSocket = WebSocket & { isAlive: boolean };

class WebSocketServer {
  private readonly HEARTBEAT_INTERVAL = 30_000; // 30s
  private readonly CLIENT_TIMEOUT = 90_000;     // 90s without pong

  setupHeartbeat(ws: AliveWebSocket): void {
    ws.isAlive = true;

    ws.on('pong', () => {
      ws.isAlive = true;
    });

    const interval = setInterval(() => {
      if (!ws.isAlive) {
        ws.terminate();  // Dead connection
        clearInterval(interval);
        return;
      }
      ws.isAlive = false;
      ws.ping();
    }, this.HEARTBEAT_INTERVAL);

    ws.on('close', () => clearInterval(interval));
  }
}
```

メッセージタイプ: error / Message Type: error

WebSocket で送信するエラーメッセージの型定義。

Error message type definition sent over WebSocket.

```typescript
interface WSErrorMessage {
  type: 'error';
  payload: {
    code: string;           // e.g., 'LLM_TIMEOUT', 'PLATFORM_DISCONNECT'
    severity: 'L0' | 'L1' | 'L2' | 'L3';
    message: string;        // Human-readable message (EN)
    message_ja: string;     // Human-readable message (JP)
    timestamp: string;      // ISO 8601
    recoverable: boolean;   // true if auto-recovery is expected
    retryIn?: number;       // milliseconds until next retry (optional)
  };
}

// Example error messages
const errorMessages = {
  LLM_TIMEOUT: {
    code: 'LLM_TIMEOUT',
    severity: 'L2' as const,
    message: 'AI response delayed. Using fallback action.',
    message_ja: 'AI応答遅延。フォールバック行動を使用します。',
    recoverable: true,
  },
  PLATFORM_DISCONNECT: {
    code: 'PLATFORM_DISCONNECT',
    severity: 'L2' as const,
    message: 'Platform connection lost. Reconnecting...',
    message_ja: 'プラットフォーム接続断。再接続中...',
    recoverable: true,
  },
  DB_WRITE_FAILED: {
    code: 'DB_WRITE_FAILED',
    severity: 'L2' as const,
    message: 'Database write failed. State held in memory.',
    message_ja: 'DB書き込み失敗。メモリ上で状態保持中。',
    recoverable: true,
  },
  MEMORY_CRITICAL: {
    code: 'MEMORY_CRITICAL',
    severity: 'L1' as const,
    message: 'Memory usage critical. Emergency cleanup in progress.',
    message_ja: 'メモリ使用率危険。緊急クリーンアップ実行中。',
    recoverable: true,
  },
};
```

5.2 フロントエンド側の障害表現 / Frontend Error Display

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  Normal: No indication. Game runs smoothly.              │
│                                                          │
│  Reconnecting: Small pulsing dot in corner               │
│    🟡 "Reconnecting..."                                  │
│    Characters continue last animation loop               │
│                                                          │
│  Offline: Subtle overlay                                 │
│    🔴 "Connection lost. Retrying..."                     │
│    Characters in idle animation loop                     │
│    Status bar frozen at last known values                │
│                                                          │
│  Recovered: Brief green flash                            │
│    🟢 "Connected" (fades after 3s)                       │
│    Full state sync from server                           │
│                                                          │
└──────────────────────────────────────────────────────────┘

Design note: エラー表示は視聴者を不安にさせない。「配信の一部」に見えるレベルに留める。

Error display must not alarm viewers. Keep it subtle enough to look like part of the stream.


6. 自動復旧シナリオ / Auto-Recovery Scenarios

6.1 pm2 統合 / pm2 Integration

┌──────────────────────────────────────────────────────────┐
│                     pm2 Process Manager                   │
│                                                          │
│  ┌─────────────────────────────────────────────────┐     │
│  │  cohabitation-life (Node.js process)             │     │
│  │                                                   │     │
│  │  Monitored:                                       │     │
│  │  ・Process alive/dead                             │     │
│  │  ・Memory usage (restart if > 480MB)              │     │
│  │  ・CPU usage                                      │     │
│  │  ・Restart count                                  │     │
│  │  ・Uptime                                         │     │
│  └─────────────────────────────────────────────────┘     │
│                                                          │
│  Auto-restart conditions:                                │
│  ・Process exit (any code)                               │
│  ・Memory exceeds 480MB                                  │
│  ・Unresponsive (no heartbeat for 30s)                   │
│                                                          │
│  Restart behavior:                                       │
│  ・Exponential backoff: 100ms → 200ms → 400ms → ...     │
│  ・Max 100 restarts before stopping                      │
│  ・Min uptime: 10s (avoids restart loop)                 │
│                                                          │
└──────────────────────────────────────────────────────────┘

6.2 起動時の状態復旧 / State Recovery on Startup

Process Start (after crash/restart)


┌──────────────────┐
│  1. Load config    │
│  2. Init logger    │
│  3. Check DB       │
└────────┬─────────┘


┌──────────────────┐     DB OK?
│  Open SQLite DB   │──── Yes ──▶ Load last saved state
└────────┬─────────┘              │
         │ No                     │
         ▼                       │
┌──────────────────┐              │
│  Check backup     │              │
│  JSON files       │              │
└────────┬─────────┘              │
         │ Found?                 │
         ▼                       │
┌──────────────────┐              │
│  Restore from     │              │
│  latest backup    │──────────────┤
└────────┬─────────┘              │
         │ Not found              │
         ▼                       │
┌──────────────────┐              │
│  Initialize fresh │              │
│  game state       │──────────────┤
└──────────────────┘              │


                     ┌──────────────────┐
                     │  Resume game loop │
                     │  ・Reconnect WS   │
                     │  ・Reconnect YT   │
                     │  ・Reconnect TT   │
                     │  ・Start scheduler│
                     └──────────────────┘
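The DB → backup JSON → fresh-state cascade in this diagram is another "first available source wins" loop. A sketch with illustrative names and placeholder loaders:

```typescript
// Startup recovery order: SQLite first, newest backup JSON next, else fresh.
interface StateSource {
  name: 'sqlite' | 'backup_json' | 'fresh';
  load: () => object | null; // null = source unavailable
}

function loadInitialState(sources: StateSource[]): { from: string; state: object } {
  for (const src of sources) {
    const state = src.load();
    if (state !== null) return { from: src.name, state };
  }
  // Unreachable when a 'fresh' source is listed last
  return { from: 'fresh', state: {} };
}
```

Ordering the sources in the caller keeps the recovery policy in one place, so adding a new source (e.g. a remote backup) does not change the loop.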

Graceful Shutdown / グレースフルシャットダウン

```typescript
class GracefulShutdown {
  async shutdown(signal: string): Promise<void> {
    Logger.info(`Shutdown signal received: ${signal}`);

    // 1. Stop accepting new events
    this.eventQueue.pause();

    // 2. Save current state to DB
    try {
      await this.stateManager.saveState(this.currentState);
      Logger.info('State saved to DB successfully');
    } catch (error) {
      // Fallback: save to JSON
      await this.stateManager.saveToBackupJson(this.currentState);
      Logger.warn('State saved to backup JSON (DB unavailable)');
    }

    // 3. Close WebSocket connections gracefully
    this.wsServer.clients.forEach((client) => {
      client.close(1001, 'Server shutting down');
    });

    // 4. Close platform connections
    await this.youtubeModule.disconnect();
    await this.tiktokModule.disconnect();

    // 5. Close DB
    this.db.close();

    Logger.info('Graceful shutdown complete');
    process.exit(0);
  }
}

// Register signal handlers
process.on('SIGTERM', () => gracefulShutdown.shutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown.shutdown('SIGINT'));

// Last-resort crash handler
process.on('uncaughtException', (error) => {
  Logger.error('Uncaught exception', error);
  ErrorReporter.report('L0', 'uncaught_exception', error);
  // pm2 will restart the process
  process.exit(1);
});

process.on('unhandledRejection', (reason) => {
  Logger.error('Unhandled rejection', reason);
  ErrorReporter.report('L2', 'unhandled_rejection', reason);
  // Log but don't crash — let the process continue
});
```

6.3 復旧シナリオ一覧 / Recovery Scenario Summary

| Scenario | Detection | Recovery | Time to Recover | Data Loss |
|----------|-----------|----------|-----------------|-----------|
| Process crash | pm2 monitors | Auto-restart + load last state | 5-15s | Last cycle (max 10 min) |
| OOM kill | pm2 memory limit | Auto-restart + load last state | 5-15s | Last cycle |
| LLM API down | 3 consecutive failures | Switch to fallback mode | Immediate | None |
| YouTube disconnect | API error response | Backoff reconnect | 5s-2min | Comments during downtime |
| TikTok disconnect | WebSocket onclose | Backoff reconnect | 2s-2min | Events during downtime |
| DB corruption | Integrity check | Restore from backup | 10-30s | Up to 1 hour |
| WebSocket drop | Heartbeat timeout | Client auto-reconnect | 1-15s | Visual updates only |
| Network outage | All connections fail | Wait and retry all | Depends on network | Events during downtime |
| Disk full | Write failure | Alert + cleanup old logs/backups | Manual intervention | None if caught early |

7. ヘルスチェック / Health Check Endpoints

7.1 エンドポイント定義 / Endpoint Definition

```typescript
// GET /health — Basic liveness check
// Returns 200 if process is running
interface HealthResponse {
  status: 'ok' | 'degraded' | 'error';
  uptime: number;          // seconds
  timestamp: string;       // ISO 8601
}

// GET /health/detailed — Full system status
interface DetailedHealthResponse {
  status: 'ok' | 'degraded' | 'error';
  uptime: number;
  timestamp: string;
  components: {
    gameLoop: {
      status: 'ok' | 'error';
      lastCycleAt: string;
      cycleCount: number;
      currentMode: 'NORMAL' | 'DEGRADED' | 'RECOVERY';
    };
    llm: {
      status: 'ok' | 'degraded' | 'error';
      lastCallAt: string;
      consecutiveFailures: number;
      avgResponseMs: number;
      mode: 'NORMAL' | 'DEGRADED' | 'RECOVERY';
    };
    youtube: {
      status: 'ok' | 'disconnected' | 'error';
      lastPollAt: string;
      pollInterval: number;
      quotaUsed: number;
      quotaLimit: number;
    };
    tiktok: {
      status: 'ok' | 'disconnected' | 'error';
      lastEventAt: string;
      reconnectCount: number;
    };
    database: {
      status: 'ok' | 'readonly' | 'error';
      lastWriteAt: string;
      pendingWrites: number;
      sizeBytes: number;
    };
    memory: {
      heapUsedMB: number;
      heapTotalMB: number;
      rssMB: number;
      heapUsagePercent: number;
    };
    websocket: {
      status: 'ok' | 'error';
      connectedClients: number;
      lastMessageAt: string;
    };
  };
}
```

7.2 ステータス判定ロジック / Status Determination Logic

```typescript
function determineOverallStatus(components: Components): 'ok' | 'degraded' | 'error' {
  const statuses = Object.values(components).map(c => c.status);

  // Any component in 'error' state → overall 'error'
  if (statuses.includes('error')) return 'error';

  // Game loop or database in non-ok state → 'degraded'
  if (components.gameLoop.status !== 'ok') return 'degraded';
  if (components.database.status !== 'ok') return 'degraded';

  // LLM in degraded mode → 'degraded' (but not 'error')
  if (components.llm.mode === 'DEGRADED') return 'degraded';

  // Both platforms disconnected → 'degraded' (but not 'error')
  if (components.youtube.status === 'disconnected' &&
      components.tiktok.status === 'disconnected') {
    return 'degraded';
  }

  return 'ok';
}
```

8. 監視・アラート / Monitoring & Alerting

8.1 監視項目 / Monitored Metrics

| Category | Metric | Check Interval | Warning Threshold | Critical Threshold |
|----------|--------|----------------|-------------------|--------------------|
| Process | Uptime | 1 min | Restart count > 3/hour | Restart count > 10/hour |
| Process | Memory (heap) | 1 min | > 400MB (80%) | > 450MB (90%) |
| Process | CPU | 1 min | > 80% sustained 5min | > 95% sustained 5min |
| Game Loop | Cycle execution | 10 min | Missed 1 cycle | Missed 3 cycles |
| LLM | Response time | Per call | > 5s avg | > 10s avg |
| LLM | Failure rate | Per call | > 10% in 10min | > 50% in 10min |
| LLM | Consecutive failures | Per call | 3 consecutive | 10 consecutive |
| YouTube | Poll success rate | 5s | > 5 failures/min | > 20 failures/min |
| TikTok | Connection status | 10s | Disconnect > 1min | Disconnect > 5min |
| Database | Write latency | Per write | > 100ms | > 500ms |
| Database | DB file size | 1 hour | > 500MB | > 1GB |
| WebSocket | Connected clients | 30s | 0 clients > 5min | 0 clients > 15min |
| Disk | Free space | 1 hour | < 1GB | < 500MB |

8.2 アラート通知 / Alert Notification

┌──────────────────────────────────────────────────────────┐
│                   Alert Pipeline                          │
│                                                          │
│  Metric exceeds threshold                                │
│         │                                                │
│         ▼                                                │
│  ┌──────────────┐                                        │
│  │  Dedup check  │  Same alert within 5 min? → Skip     │
│  └──────┬───────┘                                        │
│         │ New alert                                      │
│         ▼                                                │
│  ┌──────────────┐                                        │
│  │  Severity     │                                       │
│  │  routing      │                                       │
│  └──────┬───────┘                                        │
│         │                                                │
│    ┌────┼────┬─────────┐                                 │
│    ▼    ▼    ▼         ▼                                 │
│   L0   L1   L2       L3                                  │
│   │    │    │         │                                   │
│   │    │    │     Log only                                │
│   │    │    │                                             │
│   │    │   Slack                                          │
│   │    │   (batched,                                      │
│   │    │    every 5min)                                   │
│   │    │                                                  │
│   │   Slack                                               │
│   │   (immediate)                                         │
│   │                                                       │
│  Slack + SMS                                              │
│  (immediate)                                              │
│                                                          │
└──────────────────────────────────────────────────────────┘
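A minimal sketch of the dedup and severity-routing steps in the pipeline above. The class name, dedup key format, and return labels are illustrative, not part of the design:

```typescript
type Severity = 'L0' | 'L1' | 'L2' | 'L3';

interface Alert {
  severity: Severity;
  component: string;
  event: string;
  message: string;
}

class AlertRouter {
  private lastSent = new Map<string, number>();
  private readonly dedupWindowMs = 5 * 60 * 1000; // same alert within 5 min → skip
  private l2Batch: Alert[] = [];                  // flushed to Slack every 5 min

  route(alert: Alert, now = Date.now()): 'sms+slack' | 'slack' | 'batched' | 'log' | 'skipped' {
    // Dedup check: suppress repeats of the same component/event pair.
    const key = `${alert.component}:${alert.event}`;
    const prev = this.lastSent.get(key);
    if (prev !== undefined && now - prev < this.dedupWindowMs) return 'skipped';
    this.lastSent.set(key, now);

    // Severity routing per the pipeline diagram.
    switch (alert.severity) {
      case 'L0': return 'sms+slack';                          // immediate Slack + SMS
      case 'L1': return 'slack';                              // immediate Slack
      case 'L2': this.l2Batch.push(alert); return 'batched';  // batched Slack
      default:   return 'log';                                // L3: log only
    }
  }
}
```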

Slack 通知フォーマット / Slack Notification Format

🔴 [L0 FATAL] cohabitation-life
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Error: Process crashed - uncaught exception
Time: 2026-03-11 14:30:45 JST
Server: production-01
Uptime before crash: 18h 42m
Restart attempt: #1
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Action: pm2 auto-restart initiated

🟡 [L2 WARNING] cohabitation-life
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Warning: LLM API timeout (3 consecutive)
Time: 2026-03-11 14:30:45 JST
Mode: Switched to DEGRADED
Last successful call: 2 min ago
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Action: Using fallback actions

8.3 ログ設計 / Logging Design

ログレベル / Log Levels

| Level | Usage | Example |
| --- | --- | --- |
| error | 要対応エラー / Actionable errors | DB write failed, LLM 5xx |
| warn | 注意事項 / Attention needed | Memory > 80%, TikTok reconnect |
| info | 通常稼働ログ / Normal operation | Cycle completed, state saved |
| debug | 開発用詳細 / Development detail | LLM prompt/response, WS messages |

構造化ログ / Structured Logging

```typescript
interface LogEntry {
  level: 'error' | 'warn' | 'info' | 'debug';
  timestamp: string;
  component: string;  // 'game_loop' | 'llm' | 'youtube' | 'tiktok' | 'db' | 'ws' | 'memory'
  event: string;      // 'cycle_complete' | 'llm_timeout' | 'platform_reconnect' | ...
  data?: Record<string, unknown>;
  error?: {
    message: string;
    stack?: string;
    code?: string;
  };
}

// Example log output (JSON Lines format)
// {"level":"warn","timestamp":"2026-03-11T14:30:45.123Z","component":"llm","event":"api_timeout","data":{"attempt":2,"timeoutMs":8000}}
// {"level":"info","timestamp":"2026-03-11T14:30:46.456Z","component":"llm","event":"fallback_used","data":{"character":"john","action":"work_at_desk"}}
```
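A minimal emitter for the `LogEntry` shape could look like the sketch below; the function name and serialization details are illustrative, and no specific logging library is assumed:

```typescript
type Level = 'error' | 'warn' | 'info' | 'debug';

// Emits one JSON object per line (JSON Lines), matching the examples above.
function logEvent(
  level: Level,
  component: string,
  event: string,
  data?: Record<string, unknown>,
  error?: { message: string; stack?: string; code?: string },
): string {
  const entry = {
    level,
    timestamp: new Date().toISOString(),
    component,
    event,
    ...(data && { data }),
    ...(error && { error }),
  };
  const line = JSON.stringify(entry);
  console.log(line);
  return line;
}
```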

ログローテーション / Log Rotation

| Log File | Max Size | Retention | Rotation |
| --- | --- | --- | --- |
| app.log | 100MB | 7 days | Daily rotation |
| error.log | 50MB | 30 days | Daily rotation |
| access.log | 100MB | 3 days | Daily rotation |
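The rotation policy above could be expressed with logrotate, as one option; the log path is illustrative and the tool choice is not specified by this design:

```
/var/log/cohabitation-life/app.log {
    daily
    rotate 7
    maxsize 100M
    missingok
    notifempty
    compress
}

/var/log/cohabitation-life/error.log {
    daily
    rotate 30
    maxsize 50M
    missingok
    notifempty
    compress
}
```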

9. 障害対応フロー全体図 / Complete Error Handling Flow

┌──────────────────────────────────────────────────────────────────────┐
│                    Error Handling Architecture                        │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────┐      │
│  │                    Error Interceptor                        │      │
│  │                                                             │      │
│  │  try/catch wrapper around:                                  │      │
│  │  ・Game loop cycle                                          │      │
│  │  ・LLM API calls                                            │      │
│  │  ・Platform API calls                                       │      │
│  │  ・DB operations                                            │      │
│  │  ・WebSocket message handling                               │      │
│  └────────────────────────┬───────────────────────────────────┘      │
│                            │ Error caught                            │
│                            ▼                                         │
│  ┌────────────────────────────────────────────────────────────┐      │
│  │                    Error Classifier                         │      │
│  │                                                             │      │
│  │  Input: Error object                                        │      │
│  │  Output: { severity, component, recoverable, action }       │      │
│  └────────────────────────┬───────────────────────────────────┘      │
│                            │                                         │
│              ┌─────────────┼─────────────┐                           │
│              ▼             ▼             ▼                           │
│        Recoverable    Degradable     Fatal                           │
│              │             │             │                           │
│              ▼             ▼             ▼                           │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                 │
│  │  Retry with   │ │  Switch to   │ │  Save state   │                │
│  │  backoff      │ │  fallback    │ │  Log & exit   │                │
│  │              │ │  mode        │ │  pm2 restarts │                │
│  └──────────────┘ └──────────────┘ └──────────────┘                 │
│         │                │                │                          │
│         ▼                ▼                ▼                          │
│  ┌────────────────────────────────────────────────────────────┐      │
│  │                    Error Reporter                           │      │
│  │                                                             │      │
│  │  ・Structured log entry                                     │      │
│  │  ・Metric update (for health check)                         │      │
│  │  ・Alert notification (if threshold exceeded)               │      │
│  └────────────────────────────────────────────────────────────┘      │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
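A hypothetical sketch of the Error Classifier contract from the diagram. The error codes shown are illustrative examples, not an exhaustive mapping:

```typescript
interface Classification {
  severity: 'L0' | 'L1' | 'L2' | 'L3';
  component: string;
  recoverable: boolean;
  action: 'retry' | 'fallback' | 'exit';
}

function classifyError(err: Error & { code?: string }, component: string): Classification {
  // Fatal: cannot continue in-process → save state, log, exit; pm2 restarts.
  if (err.code === 'EADDRINUSE' || err.code === 'SQLITE_CORRUPT') {
    return { severity: 'L0', component, recoverable: false, action: 'exit' };
  }
  // Recoverable: transient network/API failures → retry with backoff.
  if (err.code === 'ETIMEDOUT' || err.code === 'ECONNRESET') {
    return { severity: 'L2', component, recoverable: true, action: 'retry' };
  }
  // Degradable default: unknown errors switch to fallback mode rather than
  // crashing, per the "never stop the stream" principle.
  return { severity: 'L1', component, recoverable: true, action: 'fallback' };
}
```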

10. テスト戦略 / Testing Strategy

障害シミュレーション / Failure Simulation

| Test | Method | Validates |
| --- | --- | --- |
| LLM timeout | Mock API with delayed response | Fallback action triggers |
| LLM 429 | Mock API returning 429 | Rate limit handling, backoff |
| LLM outage | Mock API returning 503 for N minutes | DEGRADED mode transition |
| YT disconnect | Kill YouTube polling | Graceful degradation |
| TT disconnect | Close TikTok WebSocket | Reconnection logic |
| DB write fail | Make DB file read-only | In-memory fallback |
| DB corruption | Corrupt SQLite file | Backup restoration |
| OOM | Allocate memory until limit | pm2 restart + state recovery |
| Process kill | kill -9 the process | pm2 restart + state recovery |
| WS disconnect | Close browser tab | Client reconnection |
| Network outage | Disable network interface | All reconnection logic |
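As one example, the "LLM timeout" row can be simulated with a mock call that never responds in time, verifying that a fallback action is returned. `callWithTimeout` and the action names are illustrative helpers, not part of the design:

```typescript
// Races the real call against a timer; the timer resolves with the fallback.
async function callWithTimeout<T>(call: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>(resolve => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([call, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Mock LLM call that never resolves (simulated outage/delay).
const slowLlm = new Promise<string>(() => {});

callWithTimeout(slowLlm, 100, 'work_at_desk').then(action => {
  console.log(action); // prints "work_at_desk"
});
```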

耐久テスト / Endurance Test

24-hour stress test checklist:
□ Memory usage stays stable (no monotonic increase)
□ No uncaught exceptions in error.log
□ Game loop never misses more than 1 consecutive cycle
□ LLM fallback mode activates/deactivates correctly
□ Platform reconnection works after simulated outages
□ DB backup files are created on schedule
□ Log rotation works correctly
□ Health endpoint returns accurate status
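
Some checklist items can be verified mechanically against the JSON Lines logs from 8.3 after the run. A sketch, where the log path and the `uncaught_exception` event name are assumptions:

```typescript
// Counts occurrences of a given event in a JSON Lines log (e.g. error.log).
function countEvents(logText: string, event: string): number {
  return logText
    .split('\n')
    .filter(line => line.trim().length > 0)
    .map(line => JSON.parse(line) as { event?: string })
    .filter(entry => entry.event === event)
    .length;
}

// Usage after the 24h run (path and event name are illustrative):
// const count = countEvents(fs.readFileSync('logs/error.log', 'utf8'), 'uncaught_exception');
// count should be 0 for a passing endurance test.
```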