エラーハンドリング・フォールバック設計 / Error Handling & Fallback Design
24時間配信で止まらないための各種障害対応パターン。
Resilience patterns to keep the 24-hour stream running without interruption.
設計原則 / Design Principles
- 配信は絶対に止めない / Never stop the stream — 障害が起きても視聴者に影響を与えない
- サイレント復旧 / Silent recovery — 可能な限り自動復旧し、視聴者に障害を気づかせない
- データは失ってもいい、体験は失わない / Data can be lost, experience cannot — 一部のデータロストより配信停止のほうが悪い
- 多重防御 / Defense in depth — 1つの対策が失敗しても次の対策がカバーする
エラー重大度レベル / Error Severity Levels
| Level | Name | 影響 / Impact | 対応 / Response | 通知 / Notification |
|---|---|---|---|---|
| L0 | FATAL | 配信停止 / Stream stops | pm2 自動再起動 + 管理者通知 | Slack + SMS |
| L1 | CRITICAL | 主要機能停止 / Major feature down | 自動フォールバック + 管理者通知 | Slack |
| L2 | WARNING | 機能劣化 / Degraded experience | 自動フォールバック、ログ記録 | Slack (集約) |
| L3 | INFO | 軽微な問題 / Minor issue | ログ記録のみ | ログファイルのみ |
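上の表の通知ルーティングは小さな純関数で表現できる。The notification column above maps directly onto a routing helper; a minimal sketch (the channel names and `notificationChannels` are illustrative, not part of the codebase):

```typescript
// Severity-to-channel routing per the table above (illustrative sketch).
type Severity = 'L0' | 'L1' | 'L2' | 'L3';
type Channel = 'slack' | 'sms' | 'log';

const CHANNELS: Record<Severity, Channel[]> = {
  L0: ['slack', 'sms', 'log'], // FATAL: page the admin immediately
  L1: ['slack', 'log'],        // CRITICAL: immediate Slack
  L2: ['slack', 'log'],        // WARNING: Slack (batched)
  L3: ['log'],                 // INFO: log file only
};

function notificationChannels(severity: Severity): Channel[] {
  return CHANNELS[severity];
}
```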
エラー分類マトリクス / Error Classification Matrix
┌─────────────────────────────────────────┐
│ Error Classification │
│ │
L0 FATAL │ ・Process crash (uncaught exception) │
│ ・SQLite DB corruption (unrecoverable) │
│ ・Port already in use │
│ │
L1 CRITICAL │ ・LLM API全面停止 (complete outage) │
│ ・WebSocket server failure │
│ ・Memory > 90% threshold │
│ │
L2 WARNING │ ・LLM API timeout (single request) │
│ ・YouTube API quota near limit │
│ ・TikTok WebSocket disconnect │
│ ・SQLite write failure (single) │
│ │
L3 INFO │ ・LLM response slow (> 3s) │
│ ・Comment filter triggered │
│ ・Memory cleanup executed │
└─────────────────────────────────────────┘
1. LLM API 障害対応 / LLM API Failure Handling
1.1 障害パターン / Failure Patterns
| Pattern | Cause | HTTP Status | Frequency |
|---|---|---|---|
| Timeout | Network latency, API overload | - | Common |
| Rate Limit | Too many requests | 429 | Occasional |
| Server Error | Claude API internal error | 500, 502, 503 | Rare |
| Auth Failure | API key expired/invalid | 401, 403 | Very rare |
| Complete Outage | Service fully down | Connection refused | Very rare |
1.2 リトライ戦略 / Retry Strategy
LLM API Call
│
▼
┌──────────────┐ Success
│ First Try │──────────────▶ Process Response
│ timeout: 5s │
└──────┬───────┘
│ Failure
▼
┌──────────────┐ Success
│ Retry #1 │──────────────▶ Process Response
│ wait: 1s │
│ timeout: 8s │
└──────┬───────┘
│ Failure
▼
┌──────────────┐ Success
│ Retry #2 │──────────────▶ Process Response
│ wait: 3s │
│ timeout: 10s │
└──────┬───────┘
│ Failure
▼
┌──────────────┐ Success
│ Retry #3 │──────────────▶ Process Response
│ wait: 8s │
│ timeout: 15s │
└──────┬───────┘
│ All retries failed
▼
┌──────────────┐
│ FALLBACK │──────────────▶ Use pre-defined fallback action
└──────────────┘
Exponential backoff with jitter:
function getRetryDelay(attempt: number): number {
  const base = 1000;      // 1 second
  const maxDelay = 15000; // 15 seconds
  const delay = Math.min(base * Math.pow(2, attempt), maxDelay);
  const jitter = delay * 0.2 * Math.random();
  return delay + jitter;
}
1.3 Rate Limit 対応 / Rate Limit Handling
// Rate limit headers から wait 時間を取得
// Extract wait time from rate limit headers
interface RateLimitHandler {
  // Respect Retry-After header
  handleRateLimit(retryAfter: number): void;
  // Token bucket for self-throttling
  tokenBucket: {
    maxTokens: number;     // 60 per minute
    refillRate: number;    // 1 token per second
    currentTokens: number;
  };
}
| Condition | Action |
|---|---|
| 429 with Retry-After | Wait specified duration, then retry |
| 429 without Retry-After | Wait 60s, then retry |
| Consecutive 429s (3+) | Extend cycle to 15 min temporarily |
| Rate limit persists > 10 min | Switch to fallback-only mode |
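セルフスロットリング用トークンバケットの最小スケッチ。A minimal sketch of the self-throttling token bucket described above — the clock is injected so the refill logic is deterministic and testable; this is an illustration, not the production implementation:

```typescript
// Minimal token-bucket sketch for self-throttling LLM calls (illustrative).
// now() is injected so the bucket can be tested without real time passing.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,    // e.g. 60 per minute
    private refillPerSec: number, // e.g. 1 token per second
    private now: () => number = () => Date.now(),
  ) {
    this.tokens = maxTokens;
    this.lastRefill = this.now();
  }

  tryConsume(): boolean {
    const elapsedSec = (this.now() - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = this.now();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // OK to call the API
    }
    return false;  // self-throttle: delay or skip this call
  }
}
```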
1.4 フォールバック行動 / Fallback Actions
LLM が応答不能な場合、事前定義されたフォールバック行動を使用する。
When LLM is unavailable, use pre-defined fallback actions.
フォールバック行動テーブル / Fallback Action Table
interface FallbackAction {
  action: string;
  dialogue_ja: string;
  dialogue_en: string;
  animation: string;
  duration_ms: number;
}
| Time of Day | John Fallback | Sara Fallback | Eve Fallback |
|---|---|---|---|
| 06:00-08:00 | Wake up, stretch | Wake up, make coffee | Wake up, wag tail |
| 08:00-12:00 | Work at desk | Work at desk | Nap near owner |
| 12:00-13:00 | Eat (default meal) | Eat (default meal) | Eat (dog food) |
| 13:00-18:00 | Work at desk | Work at desk | Play with ball |
| 18:00-20:00 | Rest, read | Cook dinner | Follow Sara |
| 20:00-22:00 | Watch TV / talk | Watch TV / talk | Nap on sofa |
| 22:00-06:00 | Sleep | Sleep | Sleep |
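上のテーブルは時刻→行動の単純な参照に落とし込める。The table above reduces to a time-of-day lookup; a minimal sketch, where action identifiers like `wake_up_stretch` are illustrative stand-ins for the table entries:

```typescript
// Time-of-day fallback lookup (illustrative; identifiers are assumptions).
interface FallbackSlot {
  startHour: number; // inclusive
  endHour: number;   // exclusive
  john: string;
  sara: string;
  eve: string;
}

const FALLBACK_TABLE: FallbackSlot[] = [
  { startHour: 6,  endHour: 8,  john: 'wake_up_stretch',  sara: 'make_coffee',      eve: 'wag_tail' },
  { startHour: 8,  endHour: 12, john: 'work_at_desk',     sara: 'work_at_desk',     eve: 'nap_near_owner' },
  { startHour: 12, endHour: 13, john: 'eat_default_meal', sara: 'eat_default_meal', eve: 'eat_dog_food' },
  { startHour: 13, endHour: 18, john: 'work_at_desk',     sara: 'work_at_desk',     eve: 'play_with_ball' },
  { startHour: 18, endHour: 20, john: 'rest_read',        sara: 'cook_dinner',      eve: 'follow_sara' },
  { startHour: 20, endHour: 22, john: 'watch_tv',         sara: 'watch_tv',         eve: 'nap_on_sofa' },
];

function fallbackAction(character: 'john' | 'sara' | 'eve', hour: number): string {
  const slot = FALLBACK_TABLE.find(s => hour >= s.startHour && hour < s.endHour);
  return slot ? slot[character] : 'sleep'; // 22:00-06:00 wraps around → sleep
}
```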
フォールバックダイアログ / Fallback Dialogue Templates
{
"fallback_dialogues": {
"john": {
"idle": [
{ "ja": "ふう、ちょっと一息つこう", "en": "Phew, let me take a short break" },
{ "ja": "今日も頑張るか", "en": "Let's do our best today too" },
{ "ja": "そういえば、冷蔵庫に何かあったかな", "en": "I wonder if there's anything in the fridge" },
{ "ja": "イヴ、いい子だな", "en": "Eve, you're such a good girl" },
{ "ja": "...(考え事をしている)", "en": "...(lost in thought)" }
],
"tip_reaction": [
{ "ja": "おっ、ありがとうございます!", "en": "Oh, thank you so much!" },
{ "ja": "うれしいな、ありがとう!", "en": "That makes me happy, thanks!" }
]
},
"sara": {
"idle": [
{ "ja": "ちょっと休憩しよっかな", "en": "Maybe I'll take a little break" },
{ "ja": "イヴ〜おいで〜", "en": "Eve~ come here~" },
{ "ja": "今日の夕飯、何にしよう", "en": "What should I make for dinner today" },
{ "ja": "ジョン、お疲れ様", "en": "John, good work today" },
{ "ja": "お部屋、ちょっと片付けよう", "en": "Let me tidy up a bit" }
],
"tip_reaction": [
{ "ja": "わあ、ありがとうございます!", "en": "Wow, thank you so much!" },
{ "ja": "嬉しい〜!ありがとう!", "en": "So happy~! Thank you!" }
]
}
}
}
1.5 LLM障害時のモード遷移 / LLM Failure Mode Transitions
NORMAL MODE
│
LLM fails 3 consecutive times
│
▼
DEGRADED MODE
(fallback actions only)
(10-min cycle continues)
(tip reactions use templates)
│
LLM recovers (1 success)
│
▼
RECOVERY MODE
(test with 1 agent first)
(if OK, restore all agents)
│
3 consecutive successes
│
▼
NORMAL MODE
| Mode | Behavior | Viewer Impact |
|---|---|---|
| NORMAL | Full LLM-powered actions and dialogue | None |
| DEGRADED | Pre-defined fallback actions and dialogue templates | Slightly repetitive but natural |
| RECOVERY | Gradually restoring LLM calls | Minimal |
2. プラットフォーム API 障害対応 / Platform API Failure Handling
2.1 YouTube Live API
障害パターン / Failure Patterns
| Pattern | Cause | Impact |
|---|---|---|
| Polling failure | API quota exceeded, network error | Comments not received |
| Auth token expired | OAuth token needs refresh | All API calls fail |
| Live chat ended | Stream ended/restarted on YouTube side | Chat polling returns empty |
| Quota exhaustion | Too many API calls in 24h | All calls rejected |
再接続戦略 / Reconnection Strategy
YouTube LiveChat Poll
│
▼
┌──────────────────┐
│ Poll every 5s │◀──────────────────────┐
└────────┬─────────┘ │
│ Error │
▼ │
┌──────────────────┐ │
│ Increase interval│ │
│ 5s → 10s → 30s │ │
│ → 60s → 120s │ │
└────────┬─────────┘ │
│ │
▼ │
┌──────────────────┐ Token expired? │
│ Check error type │──────────────▶ Refresh token
└────────┬─────────┘ │
│ Other error │
▼ │
┌──────────────────┐ Recovered? │
│ Wait & retry │──────────────────────┘
│ (backoff) │
└────────┬─────────┘
│ Failed > 10 min
▼
┌──────────────────┐
│ LOG WARNING │
│ Continue without │
│ YouTube comments │
└──────────────────┘
OAuth トークン自動リフレッシュ / OAuth Token Auto-Refresh
class YouTubeTokenManager {
private refreshToken: string;
private accessToken: string;
private expiresAt: number;
async ensureValidToken(): Promise<string> {
// Refresh 5 minutes before expiration
if (Date.now() > this.expiresAt - 5 * 60 * 1000) {
await this.refreshAccessToken();
}
return this.accessToken;
}
async refreshAccessToken(): Promise<void> {
try {
// POST to OAuth endpoint with refresh_token
const response = await fetch('https://oauth2.googleapis.com/token', {
method: 'POST',
body: new URLSearchParams({
grant_type: 'refresh_token',
refresh_token: this.refreshToken,
client_id: process.env.YOUTUBE_CLIENT_ID!,
client_secret: process.env.YOUTUBE_CLIENT_SECRET!,
}),
});
const data = await response.json();
this.accessToken = data.access_token;
this.expiresAt = Date.now() + data.expires_in * 1000;
} catch (error) {
// L1 CRITICAL: YouTube auth completely broken
ErrorReporter.report('L1', 'youtube_auth_refresh_failed', error);
}
}
}
2.2 TikTok Live Connector (WebSocket)
障害パターン / Failure Patterns
| Pattern | Cause | Impact |
|---|---|---|
| WebSocket disconnect | Network instability | Gift/comment events lost |
| Connection rejected | Stream not active, rate limit | Cannot connect |
| Message parse error | Protocol change, malformed data | Individual events lost |
| Stream ended | TikTok stream stopped | All events lost |
再接続戦略 / Reconnection Strategy
TikTok WebSocket
│
▼
┌──────────────────┐
│ Connected │◀───────────────────────┐
│ Receiving events │ │
└────────┬─────────┘ │
│ Disconnect │
▼ │
┌──────────────────┐ │
│ Attempt reconnect│ │
│ wait: 2s │ Success │
│ max retries: ∞ │────────────────────────┘
└────────┬─────────┘
│ Failure
▼
┌──────────────────┐
│ Exponential │ Success
│ backoff │────────────────────────┘
│ 2s→5s→10s→30s │
│ →60s→120s │
│ cap: 120s │
└────────┬─────────┘
│ Failed > 5 min
▼
┌──────────────────┐
│ LOG WARNING │
│ TikTok module │
│ enters standby │
│ Retry every 2min │
└──────────────────┘
2.3 プラットフォーム障害時の動作 / Behavior During Platform Outage
| Scenario | YouTube Status | TikTok Status | System Behavior |
|---|---|---|---|
| Normal | Active | Active | Full dual-platform |
| YT down | Down | Active | TikTok tips only. Characters continue normally |
| TT down | Active | Down | YouTube tips only. Characters continue normally |
| Both down | Down | Down | Autonomous mode: fallback cycles, no tip reactions |
| Recovery | Reconnecting | Reconnecting | Queue events during reconnection |
Key principle: ゲーム自体は投げ銭・コメントなしでも自律的に動き続ける。プラットフォーム障害は「誰もコメントしていない配信」と同じ状態になるだけ。
The game continues autonomously without tips/comments. A platform outage simply looks like "a stream with no viewer interaction."
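再接続中のイベントキューイングの最小スケッチ。The "queue events during reconnection" row can be sketched as a bounded relay buffer — illustrative only; the 500-event capacity and the prefer-tips eviction rule are assumptions, not part of the design above:

```typescript
// Illustrative buffer for platform events while a connection is down.
// The capacity bound prevents unbounded growth during long outages.
interface PlatformEvent { type: 'tip' | 'comment'; payload: unknown; }

class ReconnectBuffer {
  private buffer: PlatformEvent[] = [];
  constructor(private capacity = 500) {}

  enqueue(event: PlatformEvent): void {
    if (this.buffer.length >= this.capacity) {
      // Prefer keeping tips: drop the oldest comment first, else the oldest event.
      const idx = this.buffer.findIndex(e => e.type === 'comment');
      this.buffer.splice(idx >= 0 ? idx : 0, 1);
    }
    this.buffer.push(event);
  }

  // Called once the platform connection is restored.
  drain(): PlatformEvent[] {
    const queued = this.buffer;
    this.buffer = [];
    return queued;
  }
}
```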
3. データベース障害対応 / Database Error Handling
3.1 SQLite 障害パターン / SQLite Failure Patterns
| Pattern | Cause | Severity | Impact |
|---|---|---|---|
| Write failure | Disk full, permission error | L2 | State not persisted |
| Read failure | Corrupted index, I/O error | L1 | Cannot load game state |
| DB locked | Concurrent access (rare with better-sqlite3) | L2 | Delayed write |
| DB corruption | Power loss during write, disk failure | L0 | Full data loss risk |
| WAL overflow | WAL file grows too large | L2 | Performance degradation |
3.2 書き込み障害対応 / Write Failure Handling
State Update (every 10-min cycle)
│
▼
┌──────────────────┐
│ Write to SQLite │
└────────┬─────────┘
│ Success → Done
│ Failure
▼
┌──────────────────┐
│ Retry write │
│ (3 attempts, │
│ 1s interval) │
└────────┬─────────┘
│ All retries failed
▼
┌──────────────────┐
│ Hold state in │
│ memory │
│ Write to backup │
│ JSON file │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Continue game │
│ with in-memory │
│ state │
│ Retry DB write │
│ next cycle │
└──────────────────┘
インメモリフォールバック / In-Memory Fallback
class StateManager {
private memoryState: GameState;
private dbAvailable: boolean = true;
private pendingWrites: GameState[] = [];
async saveState(state: GameState): Promise<void> {
this.memoryState = state; // Always keep in memory
if (this.dbAvailable) {
try {
await this.db.saveGameState(state);
// Flush any pending writes
await this.flushPendingWrites();
} catch (error) {
this.dbAvailable = false;
this.pendingWrites.push(state);
await this.saveToBackupJson(state);
ErrorReporter.report('L2', 'sqlite_write_failed', error);
}
} else {
this.pendingWrites.push(state);
// Try to reconnect every 5 cycles
if (this.pendingWrites.length % 5 === 0) {
await this.attemptDbReconnect();
}
}
}
private async saveToBackupJson(state: GameState): Promise<void> {
const backupPath = `./data/backup/state_${Date.now()}.json`;
await fs.writeFile(backupPath, JSON.stringify(state, null, 2));
}
}
3.3 DB破損対策 / DB Corruption Prevention
定期バックアップ / Periodic Backup
| Interval | Backup Type | Retention |
|---|---|---|
| Every 10 min | WAL checkpoint | Current only |
| Every 1 hour | Full SQLite copy | Last 24 copies |
| Every 24 hours | Compressed archive | Last 7 archives |
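上の保持ポリシーは小さな削除候補選定ヘルパーで実装できる。The retention policy above can be enforced by a small pruning helper; a sketch over file names, written as a pure function so it is easy to test — the real `cleanOldBackups()` would wrap it with filesystem calls, and the filename pattern is an assumption:

```typescript
// Given backup filenames with sortable timestamps (e.g. game_2026-03-11T14-00.db),
// return the ones to delete so that only the newest `keep` copies remain.
// Illustrative helper — actual deletion would unlink these files.
function backupsToDelete(filenames: string[], keep: number): string[] {
  const sorted = [...filenames].sort(); // ISO-like timestamps sort lexicographically
  return sorted.slice(0, Math.max(0, sorted.length - keep));
}
```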
class DatabaseBackup {
// SQLite online backup API via better-sqlite3
performBackup(): void {
const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
const backupPath = `./data/backup/game_${timestamp}.db`;
this.db.backup(backupPath)
.then(() => this.cleanOldBackups())
.catch((err) => ErrorReporter.report('L2', 'backup_failed', err));
}
// WAL checkpoint to prevent WAL overflow
checkpoint(): void {
this.db.pragma('wal_checkpoint(TRUNCATE)');
}
// Integrity check (run daily during low-activity hours)
integrityCheck(): boolean {
const result = this.db.pragma('integrity_check');
return result[0].integrity_check === 'ok';
}
}
破損時の復旧手順 / Corruption Recovery
DB Integrity Check Failed
│
▼
┌──────────────────┐
│ 1. Try VACUUM │── Success → Continue
└────────┬─────────┘
│ Failure
▼
┌──────────────────┐
│ 2. Try .recover │── Success → Rebuilt DB
│ (SQLite recovery │
│ mode) │
└────────┬─────────┘
│ Failure
▼
┌──────────────────┐
│ 3. Restore from │── Success → Resume with
│ latest hourly │ some data loss
│ backup │
└────────┬─────────┘
│ Failure
▼
┌──────────────────┐
│ 4. Initialize │── Fresh start
│ fresh DB │ All memory lost
│ LOG L0 FATAL │
│ Notify admin │
└──────────────────┘
4. メモリ管理 / Memory Management
4.1 メモリリーク対策 / Memory Leak Prevention
24時間連続稼働ではメモリリークが蓄積して致命的になる。
In 24-hour continuous operation, memory leaks accumulate and become fatal.
メモリリスクポイント / Memory Risk Points
| Component | Risk | Mitigation |
|---|---|---|
| Event queue | Unbounded growth if processing is slow | Max queue size (1,000). Drop oldest COMMENT events |
| LLM prompt history | Context grows with conversation | Keep only last 5 actions in prompt. Summarize beyond |
| WebSocket connections | Dangling connections accumulate | Connection timeout (60s inactivity). Periodic cleanup |
| TikTok event buffer | High-traffic streams flood buffer | Rate limit to 100 events/min. Drop duplicates |
| Episodic memory cache | In-memory cache grows | LRU cache with 200 entry limit |
| Tip effect queue | Many simultaneous tips | Max 5 concurrent effects. Queue excess |
| Log buffers | Console/file log accumulation | Rotate logs daily. Max 100MB per file |
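テーブル中の LRU キャッシュ(エピソード記憶、200件上限)は Map の挿入順を利用して実装できる。The 200-entry LRU cache in the table above can be sketched with a `Map`, whose insertion order gives least-recently-used eviction almost for free; illustrative only:

```typescript
// Minimal LRU cache sketch for the episodic memory cache (200-entry limit).
// Map preserves insertion order, so the first key is the least recently used.
class LruCache<K, V> {
  private map = new Map<K, V>();
  constructor(private limit: number) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    this.map.delete(key);    // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.limit) {
      // Evict the least recently used entry (first key in insertion order)
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }

  get size(): number { return this.map.size; }
}
```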
定期クリーンアップ / Periodic Cleanup
class MemoryManager {
private readonly CLEANUP_INTERVAL = 30 * 60 * 1000; // 30 minutes
private readonly MEMORY_WARNING_THRESHOLD = 0.8; // 80% of max
private readonly MEMORY_CRITICAL_THRESHOLD = 0.9; // 90% of max
private readonly MAX_HEAP_MB = 512;
startMonitoring(): void {
setInterval(() => this.performCleanup(), this.CLEANUP_INTERVAL);
setInterval(() => this.checkMemoryUsage(), 60 * 1000); // every 1 min
}
private performCleanup(): void {
// 1. Clear resolved promises and stale references
this.eventQueue.pruneProcessed();
// 2. Clear old WebSocket message buffers
this.wsManager.clearMessageHistory();
// 3. Trim episodic memory cache
this.memoryCache.trimToLimit(200);
// 4. Force garbage collection if available
if (global.gc) {
global.gc();
}
// 5. Log memory status
const usage = process.memoryUsage();
Logger.info('memory_cleanup', {
heapUsed: Math.round(usage.heapUsed / 1024 / 1024),
heapTotal: Math.round(usage.heapTotal / 1024 / 1024),
rss: Math.round(usage.rss / 1024 / 1024),
external: Math.round(usage.external / 1024 / 1024),
});
}
private checkMemoryUsage(): void {
const usage = process.memoryUsage();
const heapUsedMB = usage.heapUsed / 1024 / 1024;
const ratio = heapUsedMB / this.MAX_HEAP_MB;
if (ratio > this.MEMORY_CRITICAL_THRESHOLD) {
ErrorReporter.report('L1', 'memory_critical', {
heapUsedMB,
ratio,
});
this.emergencyCleanup();
} else if (ratio > this.MEMORY_WARNING_THRESHOLD) {
ErrorReporter.report('L2', 'memory_warning', {
heapUsedMB,
ratio,
});
this.performCleanup();
}
}
private emergencyCleanup(): void {
// Aggressive cleanup to avoid OOM
this.eventQueue.clear(); // Drop all pending events
this.memoryCache.clear(); // Clear all cached memories
this.wsManager.dropAllBuffers(); // Clear WS buffers
if (global.gc) global.gc();
Logger.warn('emergency_cleanup_executed');
}
}
4.2 Node.js プロセス設定 / Node.js Process Configuration
// pm2 ecosystem.config.js
module.exports = {
apps: [{
name: 'cohabitation-life',
script: './dist/server.js',
node_args: '--max-old-space-size=512 --expose-gc',
max_memory_restart: '480M', // Restart before hitting 512M limit
exp_backoff_restart_delay: 100,
max_restarts: 100,
min_uptime: '10s',
kill_timeout: 5000,
listen_timeout: 10000,
autorestart: true,
watch: false,
}]
};
5. WebSocket 障害対応 / WebSocket Disconnection Handling
5.1 Backend → Frontend WebSocket
障害パターン / Failure Patterns
| Pattern | Cause | Impact |
|---|---|---|
| Client disconnect | Browser tab closed, OBS restart | Frontend stops updating |
| Server WebSocket crash | Unhandled error in WS handler | All clients disconnect |
| Network interruption | Local network issue | Temporary disconnect |
| Message serialization error | Invalid state object | Single message lost |
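クライアント側の再接続は上限付きバックオフで行う。A minimal sketch of the client's capped reconnect delay schedule (1s → 2s → 4s → 8s, then every 15s); the function name is illustrative:

```typescript
// Capped reconnect backoff schedule for the frontend overlay (illustrative).
function reconnectDelayMs(attempt: number): number {
  const schedule = [1000, 2000, 4000, 8000];
  return attempt < schedule.length ? schedule[attempt] : 15000; // cap at 15s
}
```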
再接続プロトコル / Reconnection Protocol
Frontend (Browser / OBS Browser Source)
│
▼
┌──────────────────┐
│ WebSocket │
│ Connected │◀──────────────────────────┐
└────────┬─────────┘ │
│ onclose / onerror │
▼ │
┌──────────────────┐ │
│ Show "Reconnect │ │
│ ing..." overlay │ │
│ (non-intrusive) │ │
└────────┬─────────┘ │
│ │
▼ │
┌──────────────────┐ Connected │
│ Reconnect │───────────────▶ Request │
│ attempt │ full state │
│ 1s→2s→4s→8s→15s │ sync │
│ cap: 15s │ ─────────────┘
└────────┬─────────┘
│ All attempts (indefinite retry)
▼
┌──────────────────┐
│ Keep retrying │
│ every 15s │
│ Show "Offline" │
│ static animation │
└──────────────────┘
サーバー側 Heartbeat / Server-Side Heartbeat
class WebSocketServer {
private readonly HEARTBEAT_INTERVAL = 30_000; // 30s
private readonly CLIENT_TIMEOUT = 90_000; // 90s without pong
setupHeartbeat(ws: WebSocket): void {
ws.isAlive = true;
ws.on('pong', () => {
ws.isAlive = true;
});
const interval = setInterval(() => {
if (!ws.isAlive) {
ws.terminate(); // Dead connection
clearInterval(interval);
return;
}
ws.isAlive = false;
ws.ping();
}, this.HEARTBEAT_INTERVAL);
ws.on('close', () => clearInterval(interval));
}
}
メッセージタイプ: error / Message Type: error
WebSocket で送信するエラーメッセージの型定義。
Error message type definition sent over WebSocket.
interface WSErrorMessage {
type: 'error';
payload: {
code: string; // e.g., 'LLM_TIMEOUT', 'PLATFORM_DISCONNECT'
severity: 'L0' | 'L1' | 'L2' | 'L3';
message: string; // Human-readable message (EN)
message_ja: string; // Human-readable message (JP)
timestamp: string; // ISO 8601
recoverable: boolean; // true if auto-recovery is expected
retryIn?: number; // milliseconds until next retry (optional)
};
}
// Example error messages
const errorMessages = {
LLM_TIMEOUT: {
code: 'LLM_TIMEOUT',
severity: 'L2' as const,
message: 'AI response delayed. Using fallback action.',
message_ja: 'AI応答遅延。フォールバック行動を使用します。',
recoverable: true,
},
PLATFORM_DISCONNECT: {
code: 'PLATFORM_DISCONNECT',
severity: 'L2' as const,
message: 'Platform connection lost. Reconnecting...',
message_ja: 'プラットフォーム接続断。再接続中...',
recoverable: true,
},
DB_WRITE_FAILED: {
code: 'DB_WRITE_FAILED',
severity: 'L2' as const,
message: 'Database write failed. State held in memory.',
message_ja: 'DB書き込み失敗。メモリ上で状態保持中。',
recoverable: true,
},
MEMORY_CRITICAL: {
code: 'MEMORY_CRITICAL',
severity: 'L1' as const,
message: 'Memory usage critical. Emergency cleanup in progress.',
message_ja: 'メモリ使用率危険。緊急クリーンアップ実行中。',
recoverable: true,
},
};
5.2 フロントエンド側の障害表現 / Frontend Error Display
┌──────────────────────────────────────────────────────────┐
│ │
│ Normal: No indication. Game runs smoothly. │
│ │
│ Reconnecting: Small pulsing dot in corner │
│ 🟡 "Reconnecting..." │
│ Characters continue last animation loop │
│ │
│ Offline: Subtle overlay │
│ 🔴 "Connection lost. Retrying..." │
│ Characters in idle animation loop │
│ Status bar frozen at last known values │
│ │
│ Recovered: Brief green flash │
│ 🟢 "Connected" (fades after 3s) │
│ Full state sync from server │
│ │
└──────────────────────────────────────────────────────────┘
Design note: エラー表示は視聴者を不安にさせない。「配信の一部」に見えるレベルに留める。
Error display must not alarm viewers. Keep it subtle enough to look like part of the stream.
6. 自動復旧シナリオ / Auto-Recovery Scenarios
6.1 pm2 統合 / pm2 Integration
┌──────────────────────────────────────────────────────────┐
│ pm2 Process Manager │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ cohabitation-life (Node.js process) │ │
│ │ │ │
│ │ Monitored: │ │
│ │ ・Process alive/dead │ │
│ │ ・Memory usage (restart if > 480MB) │ │
│ │ ・CPU usage │ │
│ │ ・Restart count │ │
│ │ ・Uptime │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Auto-restart conditions: │
│ ・Process exit (any code) │
│ ・Memory exceeds 480MB │
│ ・Unresponsive (no heartbeat for 30s) │
│ │
│ Restart behavior: │
│ ・Exponential backoff: 100ms → 200ms → 400ms → ... │
│ ・Max 100 restarts before stopping │
│ ・Min uptime: 10s (avoids restart loop) │
│ │
└──────────────────────────────────────────────────────────┘
6.2 起動時の状態復旧 / State Recovery on Startup
Process Start (after crash/restart)
│
▼
┌──────────────────┐
│ 1. Load config │
│ 2. Init logger │
│ 3. Check DB │
└────────┬─────────┘
│
▼
┌──────────────────┐ DB OK?
│ Open SQLite DB │──── Yes ──▶ Load last saved state
└────────┬─────────┘ │
│ No │
▼ │
┌──────────────────┐ │
│ Check backup │ │
│ JSON files │ │
└────────┬─────────┘ │
│ Found? │
▼ │
┌──────────────────┐ │
│ Restore from │ │
│ latest backup │──────────────┤
└────────┬─────────┘ │
│ Not found │
▼ │
┌──────────────────┐ │
│ Initialize fresh │ │
│ game state │──────────────┤
└──────────────────┘ │
│
▼
┌──────────────────┐
│ Resume game loop │
│ ・Reconnect WS │
│ ・Reconnect YT │
│ ・Reconnect TT │
│ ・Start scheduler│
└──────────────────┘
Graceful Shutdown / グレースフルシャットダウン
class GracefulShutdown {
async shutdown(signal: string): Promise<void> {
Logger.info(`Shutdown signal received: ${signal}`);
// 1. Stop accepting new events
this.eventQueue.pause();
// 2. Save current state to DB
try {
await this.stateManager.saveState(this.currentState);
Logger.info('State saved to DB successfully');
} catch (error) {
// Fallback: save to JSON
await this.stateManager.saveToBackupJson(this.currentState);
Logger.warn('State saved to backup JSON (DB unavailable)');
}
// 3. Close WebSocket connections gracefully
this.wsServer.clients.forEach((client) => {
client.close(1001, 'Server shutting down');
});
// 4. Close platform connections
await this.youtubeModule.disconnect();
await this.tiktokModule.disconnect();
// 5. Close DB
this.db.close();
Logger.info('Graceful shutdown complete');
process.exit(0);
}
}
// Register signal handlers
process.on('SIGTERM', () => gracefulShutdown.shutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown.shutdown('SIGINT'));
// Last-resort crash handler
process.on('uncaughtException', (error) => {
Logger.error('Uncaught exception', error);
ErrorReporter.report('L0', 'uncaught_exception', error);
// pm2 will restart the process
process.exit(1);
});
process.on('unhandledRejection', (reason) => {
Logger.error('Unhandled rejection', reason);
ErrorReporter.report('L2', 'unhandled_rejection', reason);
// Log but don't crash — let the process continue
});
6.3 復旧シナリオ一覧 / Recovery Scenario Summary
| Scenario | Detection | Recovery | Time to Recover | Data Loss |
|---|---|---|---|---|
| Process crash | pm2 monitors | Auto-restart + load last state | 5-15s | Last cycle (max 10 min) |
| OOM kill | pm2 memory limit | Auto-restart + load last state | 5-15s | Last cycle |
| LLM API down | 3 consecutive failures | Switch to fallback mode | Immediate | None |
| YouTube disconnect | API error response | Backoff reconnect | 5s-2min | Comments during downtime |
| TikTok disconnect | WebSocket onclose | Backoff reconnect | 2s-2min | Events during downtime |
| DB corruption | Integrity check | Restore from backup | 10-30s | Up to 1 hour |
| WebSocket drop | Heartbeat timeout | Client auto-reconnect | 1-15s | Visual updates only |
| Network outage | All connections fail | Wait and retry all | Depends on network | Events during downtime |
| Disk full | Write failure | Alert + cleanup old logs/backups | Manual intervention | None if caught early |
7. ヘルスチェック / Health Check Endpoints
7.1 エンドポイント定義 / Endpoint Definition
// GET /health — Basic liveness check
// Returns 200 if process is running
interface HealthResponse {
status: 'ok' | 'degraded' | 'error';
uptime: number; // seconds
timestamp: string; // ISO 8601
}
// GET /health/detailed — Full system status
interface DetailedHealthResponse {
status: 'ok' | 'degraded' | 'error';
uptime: number;
timestamp: string;
components: {
gameLoop: {
status: 'ok' | 'error';
lastCycleAt: string;
cycleCount: number;
currentMode: 'NORMAL' | 'DEGRADED' | 'RECOVERY';
};
llm: {
status: 'ok' | 'degraded' | 'error';
lastCallAt: string;
consecutiveFailures: number;
avgResponseMs: number;
mode: 'NORMAL' | 'DEGRADED' | 'RECOVERY';
};
youtube: {
status: 'ok' | 'disconnected' | 'error';
lastPollAt: string;
pollInterval: number;
quotaUsed: number;
quotaLimit: number;
};
tiktok: {
status: 'ok' | 'disconnected' | 'error';
lastEventAt: string;
reconnectCount: number;
};
database: {
status: 'ok' | 'readonly' | 'error';
lastWriteAt: string;
pendingWrites: number;
sizeBytes: number;
};
memory: {
heapUsedMB: number;
heapTotalMB: number;
rssMB: number;
heapUsagePercent: number;
};
websocket: {
status: 'ok' | 'error';
connectedClients: number;
lastMessageAt: string;
};
};
}
7.2 ステータス判定ロジック / Status Determination Logic
function determineOverallStatus(components: Components): 'ok' | 'degraded' | 'error' {
  // memory has no status field, so collect statuses only where present
  const statuses = Object.values(components)
    .map((c: any) => c.status)
    .filter((s) => s !== undefined);
  // Any component in 'error' state → overall 'error'
  if (statuses.includes('error')) return 'error';
  // Game loop or database in non-ok state → 'degraded'
  if (components.gameLoop.status !== 'ok') return 'degraded';
  if (components.database.status !== 'ok') return 'degraded';
  // LLM in degraded mode → 'degraded' (but not 'error')
  if (components.llm.mode === 'DEGRADED') return 'degraded';
  // Both platforms disconnected → 'degraded' (but not 'error')
  if (components.youtube.status === 'disconnected' &&
      components.tiktok.status === 'disconnected') {
    return 'degraded';
  }
  return 'ok';
}
8. 監視・アラート / Monitoring & Alerting
8.1 監視項目 / Monitored Metrics
| Category | Metric | Check Interval | Warning Threshold | Critical Threshold |
|---|---|---|---|---|
| Process | Uptime | 1 min | Restart count > 3/hour | Restart count > 10/hour |
| Process | Memory (heap) | 1 min | > 400MB (80%) | > 450MB (90%) |
| Process | CPU | 1 min | > 80% sustained 5min | > 95% sustained 5min |
| Game Loop | Cycle execution | 10 min | Missed 1 cycle | Missed 3 cycles |
| LLM | Response time | Per call | > 5s avg | > 10s avg |
| LLM | Failure rate | Per call | > 10% in 10min | > 50% in 10min |
| LLM | Consecutive failures | Per call | 3 consecutive | 10 consecutive |
| YouTube | Poll success rate | 5s | > 5 failures/min | > 20 failures/min |
| TikTok | Connection status | 10s | Disconnect > 1min | Disconnect > 5min |
| Database | Write latency | Per write | > 100ms | > 500ms |
| Database | DB file size | 1 hour | > 500MB | > 1GB |
| WebSocket | Connected clients | 30s | 0 clients > 5min | 0 clients > 15min |
| Disk | Free space | 1 hour | < 1GB | < 500MB |
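監視しきい値の判定は純関数化できる。Threshold evaluation can be written as a pure function; a sketch for the heap-memory row above (warning > 400MB / 80%, critical > 450MB / 90%) — the function name is illustrative:

```typescript
// Sketch of threshold evaluation for the heap-memory metric (illustrative).
type AlertLevel = 'ok' | 'warning' | 'critical';

function classifyHeapUsage(heapUsedMB: number): AlertLevel {
  if (heapUsedMB > 450) return 'critical'; // 90% of the 512MB heap limit
  if (heapUsedMB > 400) return 'warning';  // 80% of the 512MB heap limit
  return 'ok';
}
```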
8.2 アラート通知 / Alert Notification
┌──────────────────────────────────────────────────────────┐
│ Alert Pipeline │
│ │
│ Metric exceeds threshold │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Dedup check │ Same alert within 5 min? → Skip │
│ └──────┬───────┘ │
│ │ New alert │
│ ▼ │
│ ┌──────────────┐ │
│ │ Severity │ │
│ │ routing │ │
│ └──────┬───────┘ │
│ │ │
│ ┌────┼────┬─────────┐ │
│ ▼ ▼ ▼ ▼ │
│ L0 L1 L2 L3 │
│ │ │ │ │ │
│ │ │ │ Log only │
│ │ │ │ │
│ │ │ Slack │
│ │ │ (batched, │
│ │ │ every 5min) │
│ │ │ │
│ │ Slack │
│ │ (immediate) │
│ │ │
│ Slack + SMS │
│ (immediate) │
│ │
└──────────────────────────────────────────────────────────┘
Slack 通知フォーマット / Slack Notification Format
🔴 [L0 FATAL] cohabitation-life
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Error: Process crashed - uncaught exception
Time: 2026-03-11 14:30:45 JST
Server: production-01
Uptime before crash: 18h 42m
Restart attempt: #1
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Action: pm2 auto-restart initiated
🟡 [L2 WARNING] cohabitation-life
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Warning: LLM API timeout (3 consecutive)
Time: 2026-03-11 14:30:45 JST
Mode: Switched to DEGRADED
Last successful call: 2 min ago
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Action: Using fallback actions
8.3 ログ設計 / Logging Design
ログレベル / Log Levels
| Level | Usage | Example |
|---|---|---|
| error | 要対応エラー / Actionable errors | DB write failed, LLM 5xx |
| warn | 注意事項 / Attention needed | Memory > 80%, TikTok reconnect |
| info | 通常稼働ログ / Normal operation | Cycle completed, state saved |
| debug | 開発用詳細 / Development detail | LLM prompt/response, WS messages |
構造化ログ / Structured Logging
interface LogEntry {
level: 'error' | 'warn' | 'info' | 'debug';
timestamp: string;
component: string; // 'game_loop' | 'llm' | 'youtube' | 'tiktok' | 'db' | 'ws' | 'memory'
event: string; // 'cycle_complete' | 'llm_timeout' | 'platform_reconnect' | ...
data?: Record<string, unknown>;
error?: {
message: string;
stack?: string;
code?: string;
};
}
// Example log output (JSON Lines format)
// {"level":"warn","timestamp":"2026-03-11T14:30:45.123Z","component":"llm","event":"api_timeout","data":{"attempt":2,"timeoutMs":8000}}
// {"level":"info","timestamp":"2026-03-11T14:30:46.456Z","component":"llm","event":"fallback_used","data":{"character":"john","action":"work_at_desk"}}
ログローテーション / Log Rotation
| Log File | Max Size | Retention | Rotation |
|---|---|---|---|
| app.log | 100MB | 7 days | Daily rotation |
| error.log | 50MB | 30 days | Daily rotation |
| access.log | 100MB | 3 days | Daily rotation |
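ローテーション判定の最小スケッチ。The rotation decision for the table above — rotate when a file exceeds its size cap or the calendar day has changed — can be sketched as a pure function; names and the `YYYY-MM-DD` convention are illustrative:

```typescript
// Sketch of the log-rotation decision (illustrative only).
interface LogFileState {
  sizeBytes: number;
  openedOn: string; // 'YYYY-MM-DD' the file was opened
}

function shouldRotate(file: LogFileState, maxSizeBytes: number, today: string): boolean {
  return file.sizeBytes >= maxSizeBytes || file.openedOn !== today;
}
```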
9. 障害対応フロー全体図 / Complete Error Handling Flow
┌──────────────────────────────────────────────────────────────────────┐
│ Error Handling Architecture │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Error Interceptor │ │
│ │ │ │
│ │ try/catch wrapper around: │ │
│ │ ・Game loop cycle │ │
│ │ ・LLM API calls │ │
│ │ ・Platform API calls │ │
│ │ ・DB operations │ │
│ │ ・WebSocket message handling │ │
│ └────────────────────────┬───────────────────────────────────┘ │
│ │ Error caught │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Error Classifier │ │
│ │ │ │
│ │ Input: Error object │ │
│ │ Output: { severity, component, recoverable, action } │ │
│ └────────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────┼─────────────┐ │
│ ▼ ▼ ▼ │
│ Recoverable Degradable Fatal │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Retry with │ │ Switch to │ │ Save state │ │
│ │ backoff │ │ fallback │ │ Log & exit │ │
│ │ │ │ mode │ │ pm2 restarts │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Error Reporter │ │
│ │ │ │
│ │ ・Structured log entry │ │
│ │ ・Metric update (for health check) │ │
│ │ ・Alert notification (if threshold exceeded) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
10. テスト戦略 / Testing Strategy
障害シミュレーション / Failure Simulation
| Test | Method | Validates |
|---|---|---|
| LLM timeout | Mock API with delayed response | Fallback action triggers |
| LLM 429 | Mock API returning 429 | Rate limit handling, backoff |
| LLM outage | Mock API returning 503 for N minutes | DEGRADED mode transition |
| YT disconnect | Kill YouTube polling | Graceful degradation |
| TT disconnect | Close TikTok WebSocket | Reconnection logic |
| DB write fail | Make DB file read-only | In-memory fallback |
| DB corruption | Corrupt SQLite file | Backup restoration |
| OOM | Allocate memory until limit | pm2 restart + state recovery |
| Process kill | kill -9 the process | pm2 restart + state recovery |
| WS disconnect | Close browser tab | Client reconnection |
| Network outage | Disable network interface | All reconnection logic |
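表の先頭行(LLM timeout)が示す一般的なパターン:失敗するモックを注入し、フォールバック経路が選ばれることを検証する。A minimal sketch of that pattern — `chooseAction` and the mock are illustrative, not the real action selector:

```typescript
// Sketch of the "LLM timeout" failure-simulation test (illustrative names).
// A mock that reports failure should force the fallback path.
interface LlmResult { ok: boolean; text?: string; }

function chooseAction(callLlm: () => LlmResult): string {
  const result = callLlm();
  // On any failure, fall back to a pre-defined action instead of retrying here.
  return result.ok ? result.text! : 'fallback:work_at_desk';
}

const timeoutMock = (): LlmResult => ({ ok: false }); // simulated timeout
```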
耐久テスト / Endurance Test
24-hour stress test checklist:
□ Memory usage stays stable (no monotonic increase)
□ No uncaught exceptions in error.log
□ Game loop cycles never skip more than 1
□ LLM fallback mode activates/deactivates correctly
□ Platform reconnection works after simulated outages
□ DB backup files are created on schedule
□ Log rotation works correctly
□ Health endpoint returns accurate status
関連ドキュメント / Related Documents
- アーキテクチャ / Architecture — システム全体構成
- リスク & 対策 / Risks & Mitigations — リスク一覧と対策
- MVP スコープ / MVP Scope — 開発スコープと優先度
- 配信プラットフォーム / Streaming Platforms — プラットフォーム統合詳細
- 記憶システム / Memory System — メモリ管理の詳細