エラーハンドリング・フォールバック設計 / Error Handling & Fallback Design
24時間配信で止まらないための各種障害対応パターン。
Resilience patterns to keep the 24-hour stream running without interruption.
設計原則 / Design Principles
- 配信は絶対に止めない / Never stop the stream — 障害が起きても視聴者に影響を与えない
- サイレント復旧 / Silent recovery — 可能な限り自動復旧し、視聴者に障害を気づかせない
- データは失ってもいい、体験は失わない / Data can be lost, experience cannot — 一部のデータロストより配信停止のほうが悪い
- 多重防御 / Defense in depth — 1つの対策が失敗しても次の対策がカバーする
エラー重大度レベル / Error Severity Levels
| Level | Name | 影響 / Impact | 対応 / Response | 通知 / Notification |
|---|---|---|---|---|
| L0 | FATAL | 配信停止 / Stream stops | pm2 自動再起動 + 管理者通知 | Slack + SMS |
| L1 | CRITICAL | 主要機能停止 / Major feature down | 自動フォールバック + 管理者通知 | Slack |
| L2 | WARNING | 機能劣化 / Degraded experience | 自動フォールバック、ログ記録 | Slack (集約) |
| L3 | INFO | 軽微な問題 / Minor issue | ログ記録のみ | ログファイルのみ |
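上の表の通知ルーティングは小さな純関数で表現できる。The notification column above maps directly onto a routing helper; a minimal sketch (the channel names and `notificationChannels` are illustrative, not part of the codebase):

```typescript
// Severity-to-channel routing per the table above (illustrative sketch).
type Severity = 'L0' | 'L1' | 'L2' | 'L3';
type Channel = 'slack' | 'sms' | 'log';

const CHANNELS: Record<Severity, Channel[]> = {
  L0: ['slack', 'sms', 'log'], // FATAL: page the admin immediately
  L1: ['slack', 'log'],        // CRITICAL: immediate Slack
  L2: ['slack', 'log'],        // WARNING: Slack (batched)
  L3: ['log'],                 // INFO: log file only
};

function notificationChannels(severity: Severity): Channel[] {
  return CHANNELS[severity];
}
```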
エラー分類マトリクス / Error Classification Matrix
┌─────────────────────────────────────────┐
│ Error Classification │
│ │
L0 FATAL │ ・Process crash (uncaught exception) │
│ ・SQLite DB corruption (unrecoverable) │
│ ・Port already in use │
│ │
L1 CRITICAL │ ・LLM API全面停止 (complete outage) │
│ ・WebSocket server failure │
│ ・Memory > 90% threshold │
│ │
L2 WARNING │ ・LLM API timeout (single request) │
│ ・YouTube API quota near limit │
│ ・TikTok WebSocket disconnect │
│ ・SQLite write failure (single) │
│ │
L3 INFO │ ・LLM response slow (> 3s) │
│ ・Comment filter triggered │
│ ・Memory cleanup executed │
└─────────────────────────────────────────┘
1. LLM API 障害対応 / LLM API Failure Handling
1.1 障害パターン / Failure Patterns
| Pattern | Cause | HTTP Status | Frequency |
|---|---|---|---|
| Timeout | Network latency, API overload | - | Common |
| Rate Limit | Too many requests | 429 | Occasional |
| Server Error | Claude API internal error | 500, 502, 503 | Rare |
| Auth Failure | API key expired/invalid | 401, 403 | Very rare |
| Complete Outage | Service fully down | Connection refused | Very rare |
1.2 リトライ戦略 / Retry Strategy
LLM API Call
│
▼
┌──────────────┐ Success
│ First Try │──────────────▶ Process Response
│ timeout: 5s │
└──────┬───────┘
│ Failure
▼
┌──────────────┐ Success
│ Retry #1 │──────────────▶ Process Response
│ wait: 1s │
│ timeout: 8s │
└──────┬───────┘
│ Failure
▼
┌──────────────┐ Success
│ Retry #2 │──────────────▶ Process Response
│ wait: 3s │
│ timeout: 10s │
└──────┬───────┘
│ Failure
▼
┌──────────────┐ Success
│ Retry #3 │──────────────▶ Process Response
│ wait: 8s │
│ timeout: 15s │
└──────┬───────┘
│ All retries failed
▼
┌──────────────┐
│ FALLBACK │──────────────▶ Use pre-defined fallback action
└──────────────┘
Exponential backoff with jitter:
function getRetryDelay(attempt: number): number {
  const base = 1000;      // 1 second
  const maxDelay = 15000; // 15 seconds
  const delay = Math.min(base * Math.pow(2, attempt), maxDelay);
  const jitter = delay * 0.2 * Math.random();
  return delay + jitter;
}
1.3 Rate Limit 対応 / Rate Limit Handling
// Rate limit headers から wait 時間を取得
// Extract wait time from rate limit headers
interface RateLimitHandler {
  // Respect Retry-After header
  handleRateLimit(retryAfter: number): void;
  // Token bucket for self-throttling
  tokenBucket: {
    maxTokens: number;     // 60 per minute
    refillRate: number;    // 1 token per second
    currentTokens: number;
  };
}
| Condition | Action |
|---|---|
| 429 with Retry-After | Wait specified duration, then retry |
| 429 without Retry-After | Wait 60s, then retry |
| Consecutive 429s (3+) | Extend cycle to 15 min temporarily |
| Rate limit persists > 10 min | Switch to fallback-only mode |
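セルフスロットリング用トークンバケットの最小スケッチ。A minimal sketch of the self-throttling token bucket described above — the clock is injected so the refill logic is deterministic and testable; this is an illustration, not the production implementation:

```typescript
// Minimal token-bucket sketch for self-throttling LLM calls (illustrative).
// now() is injected so the bucket can be tested without real time passing.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,    // e.g. 60 per minute
    private refillPerSec: number, // e.g. 1 token per second
    private now: () => number = () => Date.now(),
  ) {
    this.tokens = maxTokens;
    this.lastRefill = this.now();
  }

  tryConsume(): boolean {
    const elapsedSec = (this.now() - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = this.now();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // OK to call the API
    }
    return false;  // self-throttle: delay or skip this call
  }
}
```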
1.4 フォールバック行動 / Fallback Actions
LLM が応答不能な場合、事前定義されたフォールバック行動を使用する。
When LLM is unavailable, use pre-defined fallback actions.
フォールバック行動テーブル / Fallback Action Table
interface FallbackAction {
  action: string;
  dialogue_ja: string;
  dialogue_en: string;
  animation: string;
  duration_ms: number;
}
| Time of Day | John Fallback | Sara Fallback | Eve Fallback |
|---|---|---|---|
| 06:00-08:00 | Wake up, stretch | Wake up, make coffee | Wake up, wag tail |
| 08:00-12:00 | Work at desk | Work at desk | Nap near owner |
| 12:00-13:00 | Eat (default meal) | Eat (default meal) | Eat (dog food) |
| 13:00-18:00 | Work at desk | Work at desk | Play with ball |
| 18:00-20:00 | Rest, read | Cook dinner | Follow Sara |
| 20:00-22:00 | Watch TV / talk | Watch TV / talk | Nap on sofa |
| 22:00-06:00 | Sleep | Sleep | Sleep |
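上のテーブルは時刻→行動の単純な参照に落とし込める。The table above reduces to a time-of-day lookup; a minimal sketch, where action identifiers like `wake_up_stretch` are illustrative stand-ins for the table entries:

```typescript
// Time-of-day fallback lookup (illustrative; identifiers are assumptions).
interface FallbackSlot {
  startHour: number; // inclusive
  endHour: number;   // exclusive
  john: string;
  sara: string;
  eve: string;
}

const FALLBACK_TABLE: FallbackSlot[] = [
  { startHour: 6,  endHour: 8,  john: 'wake_up_stretch',  sara: 'make_coffee',      eve: 'wag_tail' },
  { startHour: 8,  endHour: 12, john: 'work_at_desk',     sara: 'work_at_desk',     eve: 'nap_near_owner' },
  { startHour: 12, endHour: 13, john: 'eat_default_meal', sara: 'eat_default_meal', eve: 'eat_dog_food' },
  { startHour: 13, endHour: 18, john: 'work_at_desk',     sara: 'work_at_desk',     eve: 'play_with_ball' },
  { startHour: 18, endHour: 20, john: 'rest_read',        sara: 'cook_dinner',      eve: 'follow_sara' },
  { startHour: 20, endHour: 22, john: 'watch_tv',         sara: 'watch_tv',         eve: 'nap_on_sofa' },
];

function fallbackAction(character: 'john' | 'sara' | 'eve', hour: number): string {
  const slot = FALLBACK_TABLE.find(s => hour >= s.startHour && hour < s.endHour);
  return slot ? slot[character] : 'sleep'; // 22:00-06:00 wraps around → sleep
}
```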
フォールバックダイアログ / Fallback Dialogue Templates
{
"fallback_dialogues": {
"john": {
"idle": [
{ "ja": "ふう、ちょっと一息つこう", "en": "Phew, let me take a short break" },
{ "ja": "今日も頑張るか", "en": "Let's do our best today too" },
{ "ja": "そういえば、冷蔵庫に何かあったかな", "en": "I wonder if there's anything in the fridge" },
{ "ja": "イヴ、いい子だな", "en": "Eve, you're such a good girl" },
{ "ja": "...(考え事をしている)", "en": "...(lost in thought)" }
],
"tip_reaction": [
{ "ja": "おっ、ありがとうございます!", "en": "Oh, thank you so much!" },
{ "ja": "うれしいな、ありがとう!", "en": "That makes me happy, thanks!" }
]
},
"sara": {
"idle": [
{ "ja": "ちょっと休憩しよっかな", "en": "Maybe I'll take a little break" },
{ "ja": "イヴ〜おいで〜", "en": "Eve~ come here~" },
{ "ja": "今日の夕飯、何にしよう", "en": "What should I make for dinner today" },
{ "ja": "ジョン、お疲れ様", "en": "John, good work today" },
{ "ja": "お部屋、ちょっと片付けよう", "en": "Let me tidy up a bit" }
],
"tip_reaction": [
{ "ja": "わあ、ありがとうございます!", "en": "Wow, thank you so much!" },
{ "ja": "嬉しい〜!ありがとう!", "en": "So happy~! Thank you!" }
]
}
}
}
1.5 LLM障害時のモード遷移 / LLM Failure Mode Transitions
NORMAL MODE
│
LLM fails 3 consecutive times
│
▼
DEGRADED MODE
(fallback actions only)
(10-min cycle continues)
(tip reactions use templates)
│
LLM recovers (1 success)
│
▼
RECOVERY MODE
(test with 1 agent first)
(if OK, restore all agents)
│
3 consecutive successes
│
▼
NORMAL MODE
| Mode | Behavior | Viewer Impact |
|---|---|---|
| NORMAL | Full LLM-powered actions and dialogue | None |
| DEGRADED | Pre-defined fallback actions and dialogue templates | Slightly repetitive but natural |
| RECOVERY | Gradually restoring LLM calls | Minimal |
2. プラットフォーム API 障害対応 / Platform API Failure Handling
2.1 YouTube Live API
障害パターン / Failure Patterns
| Pattern | Cause | Impact |
|---|---|---|
| Polling failure | API quota exceeded, network error | Comments not received |
| Auth token expired | OAuth token needs refresh | All API calls fail |
| Live chat ended | Stream ended/restarted on YouTube side | Chat polling returns empty |
| Quota exhaustion | Too many API calls in 24h | All calls rejected |
再接続戦略 / Reconnection Strategy
YouTube LiveChat Poll
│
▼
┌──────────────────┐
│ Poll every 5s │◀──────────────────────┐
└────────┬─────────┘ │
│ Error │
▼ │
┌──────────────────┐ │
│ Increase interval│ │
│ 5s → 10s → 30s │ │
│ → 60s → 120s │ │
└────────┬─────────┘ │
│ │
▼ │
┌──────────────────┐ Token expired? │
│ Check error type │──────────────▶ Refresh token
└────────┬─────────┘ │
│ Other error │
▼ │
┌──────────────────┐ Recovered? │
│ Wait & retry │──────────────────────┘
│ (backoff) │
└────────┬─────────┘
│ Failed > 10 min
▼
┌──────────────────┐
│ LOG WARNING │
│ Continue without │
│ YouTube comments │
└──────────────────┘
OAuth トークン自動リフレッシュ / OAuth Token Auto-Refresh
class YouTubeTokenManager {
private refreshToken: string;
private accessToken: string;
private expiresAt: number;
async ensureValidToken(): Promise<string> {
// Refresh 5 minutes before expiration
if (Date.now() > this.expiresAt - 5 * 60 * 1000) {
await this.refreshAccessToken();
}
return this.accessToken;
}
async refreshAccessToken(): Promise<void> {
try {
// POST to OAuth endpoint with refresh_token
const response = await fetch('https://oauth2.googleapis.com/token', {
method: 'POST',
body: new URLSearchParams({
grant_type: 'refresh_token',
refresh_token: this.refreshToken,
client_id: process.env.YOUTUBE_CLIENT_ID!,
client_secret: process.env.YOUTUBE_CLIENT_SECRET!,
}),
});
const data = await response.json();
this.accessToken = data.access_token;
this.expiresAt = Date.now() + data.expires_in * 1000;
} catch (error) {
// L1 CRITICAL: YouTube auth completely broken
ErrorReporter.report('L1', 'youtube_auth_refresh_failed', error);
}
}
}
2.2 TikTok Live Connector (WebSocket)
障害パターン / Failure Patterns
| Pattern | Cause | Impact |
|---|---|---|
| WebSocket disconnect | Network instability | Gift/comment events lost |
| Connection rejected | Stream not active, rate limit | Cannot connect |
| Message parse error | Protocol change, malformed data | Individual events lost |
| Stream ended | TikTok stream stopped | All events lost |
再接続戦略 / Reconnection Strategy
TikTok WebSocket
│
▼
┌──────────────────┐
│ Connected │◀───────────────────────┐
│ Receiving events │ │
└────────┬─────────┘ │
│ Disconnect │
▼ │
┌──────────────────┐ │
│ Attempt reconnect│ │
│ wait: 2s │ Success │
│ max retries: ∞ │────────────────────────┘
└────────┬─────────┘
│ Failure
▼
┌──────────────────┐
│ Exponential │ Success
│ backoff │────────────────────────┘
│ 2s→5s→10s→30s │
│ →60s→120s │
│ cap: 120s │
└────────┬─────────┘
│ Failed > 5 min
▼
┌──────────────────┐
│ LOG WARNING │
│ TikTok module │
│ enters standby │
│ Retry every 2min │
└──────────────────┘
2.3 プラットフォーム障害時の動作 / Behavior During Platform Outage
| Scenario | YouTube Status | TikTok Status | System Behavior |
|---|---|---|---|
| Normal | Active | Active | Full dual-platform |
| YT down | Down | Active | TikTok tips only. Characters continue normally |
| TT down | Active | Down | YouTube tips only. Characters continue normally |
| Both down | Down | Down | Autonomous mode: fallback cycles, no tip reactions |
| Recovery | Reconnecting | Reconnecting | Queue events during reconnection |
Key principle: ゲーム自体は投げ銭・コメントなしでも自律的に動き続ける。プラットフォーム障害は「誰もコメントしていない配信」と同じ状態になるだけ。
The game continues autonomously without tips/comments. A platform outage simply looks like "a stream with no viewer interaction."
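再接続中のイベントキューイングの最小スケッチ。The "queue events during reconnection" row can be sketched as a bounded relay buffer — illustrative only; the 500-event capacity and the prefer-tips eviction rule are assumptions, not part of the design above:

```typescript
// Illustrative buffer for platform events while a connection is down.
// The capacity bound prevents unbounded growth during long outages.
interface PlatformEvent { type: 'tip' | 'comment'; payload: unknown; }

class ReconnectBuffer {
  private buffer: PlatformEvent[] = [];
  constructor(private capacity = 500) {}

  enqueue(event: PlatformEvent): void {
    if (this.buffer.length >= this.capacity) {
      // Prefer keeping tips: drop the oldest comment first, else the oldest event.
      const idx = this.buffer.findIndex(e => e.type === 'comment');
      this.buffer.splice(idx >= 0 ? idx : 0, 1);
    }
    this.buffer.push(event);
  }

  // Called once the platform connection is restored.
  drain(): PlatformEvent[] {
    const queued = this.buffer;
    this.buffer = [];
    return queued;
  }
}
```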
3. データベース障害対応 / Database Error Handling
3.1 SQLite 障害パターン / SQLite Failure Patterns
| Pattern | Cause | Severity | Impact |
|---|---|---|---|
| Write failure | Disk full, permission error | L2 | State not persisted |
| Read failure | Corrupted index, I/O error | L1 | Cannot load game state |
| DB locked | Concurrent access (rare with better-sqlite3) | L2 | Delayed write |
| DB corruption | Power loss during write, disk failure | L0 | Full data loss risk |
| WAL overflow | WAL file grows too large | L2 | Performance degradation |
3.2 書き込み障害対応 / Write Failure Handling
State Update (every 10-min cycle)
│
▼
┌──────────────────┐
│ Write to SQLite │
└────────┬─────────┘
│ Success → Done
│ Failure
▼
┌──────────────────┐
│ Retry write │
│ (3 attempts, │
│ 1s interval) │
└────────┬─────────┘
│ All retries failed
▼
┌──────────────────┐
│ Hold state in │
│ memory │
│ Write to backup │
│ JSON file │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Continue game │
│ with in-memory │
│ state │
│ Retry DB write │
│ next cycle │
└──────────────────┘
インメモリフォールバック / In-Memory Fallback
class StateManager {
private memoryState: GameState;
private dbAvailable: boolean = true;
private pendingWrites: GameState[] = [];
async saveState(state: GameState): Promise<void> {
this.memoryState = state; // Always keep in memory
if (this.dbAvailable) {
try {
await this.db.saveGameState(state);
// Flush any pending writes
await this.flushPendingWrites();
} catch (error) {
this.dbAvailable = false;
this.pendingWrites.push(state);
await this.saveToBackupJson(state);
ErrorReporter.report('L2', 'sqlite_write_failed', error);
}
} else {
this.pendingWrites.push(state);
// Try to reconnect every 5 cycles
if (this.pendingWrites.length % 5 === 0) {
await this.attemptDbReconnect();
}
}
}
private async saveToBackupJson(state: GameState): Promise<void> {
const backupPath = `./data/backup/state_${Date.now()}.json`;
await fs.writeFile(backupPath, JSON.stringify(state, null, 2));
}
}
3.3 DB破損対策 / DB Corruption Prevention
定期バックアップ / Periodic Backup
| Interval | Backup Type | Retention |
|---|---|---|
| Every 10 min | WAL checkpoint | Current only |
| Every 1 hour | Full SQLite copy | Last 24 copies |
| Every 24 hours | Compressed archive | Last 7 archives |
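上の保持ポリシーは小さな削除候補選定ヘルパーで実装できる。The retention policy above can be enforced by a small pruning helper; a sketch over file names, written as a pure function so it is easy to test — the real `cleanOldBackups()` would wrap it with filesystem calls, and the filename pattern is an assumption:

```typescript
// Given backup filenames with sortable timestamps (e.g. game_2026-03-11T14-00.db),
// return the ones to delete so that only the newest `keep` copies remain.
// Illustrative helper — actual deletion would unlink these files.
function backupsToDelete(filenames: string[], keep: number): string[] {
  const sorted = [...filenames].sort(); // ISO-like timestamps sort lexicographically
  return sorted.slice(0, Math.max(0, sorted.length - keep));
}
```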
class DatabaseBackup {
// SQLite online backup API via better-sqlite3
performBackup(): void {
const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
const backupPath = `./data/backup/game_${timestamp}.db`;
this.db.backup(backupPath)
.then(() => this.cleanOldBackups())
.catch((err) => ErrorReporter.report('L2', 'backup_failed', err));
}
// WAL checkpoint to prevent WAL overflow
checkpoint(): void {
this.db.pragma('wal_checkpoint(TRUNCATE)');
}
// Integrity check (run daily during low-activity hours)
integrityCheck(): boolean {
const result = this.db.pragma('integrity_check');
return result[0].integrity_check === 'ok';
}
}
破損時の復旧手順 / Corruption Recovery
DB Integrity Check Failed
│
▼
┌──────────────────┐
│ 1. Try VACUUM │── Success → Continue
└────────┬─────────┘
│ Failure
▼
┌──────────────────┐
│ 2. Try .recover │── Success → Rebuilt DB
│ (SQLite recovery │
│ mode) │
└────────┬─────────┘
│ Failure
▼
┌──────────────────┐
│ 3. Restore from │── Success → Resume with
│ latest hourly │ some data loss
│ backup │
└────────┬─────────┘
│ Failure
▼
┌──────────────────┐
│ 4. Initialize │── Fresh start
│ fresh DB │ All memory lost
│ LOG L0 FATAL │
│ Notify admin │
└──────────────────┘
4. メモリ管理 / Memory Management
4.1 メモリリーク対策 / Memory Leak Prevention
24時間連続稼働ではメモリリークが蓄積して致命的になる。
In 24-hour continuous operation, memory leaks accumulate and become fatal.
メモリリスクポイント / Memory Risk Points
| Component | Risk | Mitigation |
|---|---|---|
| Event queue | Unbounded growth if processing is slow | Max queue size (1,000). Drop oldest COMMENT events |
| LLM prompt history | Context grows with conversation | Keep only last 5 actions in prompt. Summarize beyond |
| WebSocket connections | Dangling connections accumulate | Connection timeout (60s inactivity). Periodic cleanup |
| TikTok event buffer | High-traffic streams flood buffer | Rate limit to 100 events/min. Drop duplicates |
| Episodic memory cache | In-memory cache grows | LRU cache with 200 entry limit |
| Tip effect queue | Many simultaneous tips | Max 5 concurrent effects. Queue excess |
| Log buffers | Console/file log accumulation | Rotate logs daily. Max 100MB per file |
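テーブル中の LRU キャッシュ(エピソード記憶、200件上限)は Map の挿入順を利用して実装できる。The 200-entry LRU cache in the table above can be sketched with a `Map`, whose insertion order gives least-recently-used eviction almost for free; illustrative only:

```typescript
// Minimal LRU cache sketch for the episodic memory cache (200-entry limit).
// Map preserves insertion order, so the first key is the least recently used.
class LruCache<K, V> {
  private map = new Map<K, V>();
  constructor(private limit: number) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    this.map.delete(key);    // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.limit) {
      // Evict the least recently used entry (first key in insertion order)
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }

  get size(): number { return this.map.size; }
}
```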
定期クリーンアップ / Periodic Cleanup
class MemoryManager {
private readonly CLEANUP_INTERVAL = 30 * 60 * 1000; // 30 minutes
private readonly MEMORY_WARNING_THRESHOLD = 0.8; // 80% of max
private readonly MEMORY_CRITICAL_THRESHOLD = 0.9; // 90% of max
private readonly MAX_HEAP_MB = 512;
startMonitoring(): void {
setInterval(() => this.performCleanup(), this.CLEANUP_INTERVAL);
setInterval(() => this.checkMemoryUsage(), 60 * 1000); // every 1 min
}
private performCleanup(): void {
// 1. Clear resolved promises and stale references
this.eventQueue.pruneProcessed();
// 2. Clear old WebSocket message buffers
this.wsManager.clearMessageHistory();
// 3. Trim episodic memory cache
this.memoryCache.trimToLimit(200);
// 4. Force garbage collection if available
if (global.gc) {
global.gc();
}
// 5. Log memory status
const usage = process.memoryUsage();
Logger.info('memory_cleanup', {
heapUsed: Math.round(usage.heapUsed / 1024 / 1024),
heapTotal: Math.round(usage.heapTotal / 1024 / 1024),
rss: Math.round(usage.rss / 1024 / 1024),
external: Math.round(usage.external / 1024 / 1024),
});
}
private checkMemoryUsage(): void {
const usage = process.memoryUsage();
const heapUsedMB = usage.heapUsed / 1024 / 1024;
const ratio = heapUsedMB / this.MAX_HEAP_MB;
if (ratio > this.MEMORY_CRITICAL_THRESHOLD) {
ErrorReporter.report('L1', 'memory_critical', {
heapUsedMB,
ratio,
});
this.emergencyCleanup();
} else if (ratio > this.MEMORY_WARNING_THRESHOLD) {
ErrorReporter.report('L2', 'memory_warning', {
heapUsedMB,
ratio,
});
this.performCleanup();
}
}
private emergencyCleanup(): void {
// Aggressive cleanup to avoid OOM
this.eventQueue.clear(); // Drop all pending events
this.memoryCache.clear(); // Clear all cached memories
this.wsManager.dropAllBuffers(); // Clear WS buffers
if (global.gc) global.gc();
Logger.warn('emergency_cleanup_executed');
}
}
4.2 Node.js プロセス設定 / Node.js Process Configuration
// pm2 ecosystem.config.js
module.exports = {
apps: [{
name: 'cohabitation-life',
script: './dist/server.js',
node_args: '--max-old-space-size=512 --expose-gc',
max_memory_restart: '480M', // Restart before hitting 512M limit
exp_backoff_restart_delay: 100,
max_restarts: 100,
min_uptime: '10s',
kill_timeout: 5000,
listen_timeout: 10000,
autorestart: true,
watch: false,
}]
};
5. WebSocket 障害対応 / WebSocket Disconnection Handling
5.1 Backend → Frontend WebSocket
障害パターン / Failure Patterns
| Pattern | Cause | Impact |
|---|---|---|
| Client disconnect | Browser tab closed, OBS restart | Frontend stops updating |
| Server WebSocket crash | Unhandled error in WS handler | All clients disconnect |
| Network interruption | Local network issue | Temporary disconnect |
| Message serialization error | Invalid state object | Single message lost |
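クライアント側の再接続は上限付きバックオフで行う。A minimal sketch of the client's capped reconnect delay schedule (1s → 2s → 4s → 8s, then every 15s); the function name is illustrative:

```typescript
// Capped reconnect backoff schedule for the frontend overlay (illustrative).
function reconnectDelayMs(attempt: number): number {
  const schedule = [1000, 2000, 4000, 8000];
  return attempt < schedule.length ? schedule[attempt] : 15000; // cap at 15s
}
```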
再接続プロトコル / Reconnection Protocol
Frontend (Browser / OBS Browser Source)
│
▼
┌──────────────────┐
│ WebSocket │
│ Connected │◀──────────────────────────┐
└────────┬─────────┘ │
│ onclose / onerror │
▼ │
┌──────────────────┐ │
│ Show "Reconnect │ │
│ ing..." overlay │ │
│ (non-intrusive) │ │
└────────┬─────────┘ │
│ │
▼ │
┌──────────────────┐ Connected │
│ Reconnect │───────────────▶ Request │
│ attempt │ full state │
│ 1s→2s→4s→8s→15s │ sync │
│ cap: 15s │ ─────────────┘
└────────┬─────────┘
│ All attempts (indefinite retry)
▼
┌──────────────────┐
│ Keep retrying │
│ every 15s │
│ Show "Offline" │
│ static animation │
└──────────────────┘
サーバー側 Heartbeat / Server-Side Heartbeat
class WebSocketServer {
private readonly HEARTBEAT_INTERVAL = 30_000; // 30s
private readonly CLIENT_TIMEOUT = 90_000; // 90s without pong
setupHeartbeat(ws: WebSocket): void {
ws.isAlive = true;
ws.on('pong', () => {
ws.isAlive = true;
});
const interval = setInterval(() => {
if (!ws.isAlive) {
ws.terminate(); // Dead connection
clearInterval(interval);
return;
}
ws.isAlive = false;
ws.ping();
}, this.HEARTBEAT_INTERVAL);
ws.on('close', () => clearInterval(interval));
}
}
メッセージタイプ: error / Message Type: error
WebSocket で送信するエラーメッセージの型定義。
Error message type definition sent over WebSocket.
interface WSErrorMessage {
type: 'error';
payload: {
code: string; // e.g., 'LLM_TIMEOUT', 'PLATFORM_DISCONNECT'
severity: 'L0' | 'L1' | 'L2' | 'L3';
message: string; // Human-readable message (EN)
message_ja: string; // Human-readable message (JP)
timestamp: string; // ISO 8601
recoverable: boolean; // true if auto-recovery is expected
retryIn?: number; // milliseconds until next retry (optional)
};
}
// Example error messages
const errorMessages = {
LLM_TIMEOUT: {
code: 'LLM_TIMEOUT',
severity: 'L2' as const,
message: 'AI response delayed. Using fallback action.',
message_ja: 'AI応答遅延。フォールバック行動を使用します。',
recoverable: true,
},
PLATFORM_DISCONNECT: {
code: 'PLATFORM_DISCONNECT',
severity: 'L2' as const,
message: 'Platform connection lost. Reconnecting...',
message_ja: 'プラットフォーム接続断。再接続中...',
recoverable: true,
},
DB_WRITE_FAILED: {
code: 'DB_WRITE_FAILED',
severity: 'L2' as const,
message: 'Database write failed. State held in memory.',
message_ja: 'DB書き込み失敗。メモリ上で状態保持中。',
recoverable: true,
},
MEMORY_CRITICAL: {
code: 'MEMORY_CRITICAL',
severity: 'L1' as const,
message: 'Memory usage critical. Emergency cleanup in progress.',
message_ja: 'メモリ使用率危険。緊急クリーンアップ実行中。',
recoverable: true,
},
};
5.2 フロントエンド側の障害表現 / Frontend Error Display
┌──────────────────────────────────────────────────────────┐
│ │
│ Normal: No indication. Game runs smoothly. │
│ │
│ Reconnecting: Small pulsing dot in corner │
│ 🟡 "Reconnecting..." │
│ Characters continue last animation loop │
│ │
│ Offline: Subtle overlay │
│ 🔴 "Connection lost. Retrying..." │
│ Characters in idle animation loop │
│ Status bar frozen at last known values │
│ │
│ Recovered: Brief green flash │
│ 🟢 "Connected" (fades after 3s) │
│ Full state sync from server │
│ │
└──────────────────────────────────────────────────────────┘
Design note: エラー表示は視聴者を不安にさせない。「配信の一部」に見えるレベルに留める。
Error display must not alarm viewers. Keep it subtle enough to look like part of the stream.
6. 自動復旧シナリオ / Auto-Recovery Scenarios
6.1 pm2 統合 / pm2 Integration
┌──────────────────────────────────────────────────────────┐
│ pm2 Process Manager │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ cohabitation-life (Node.js process) │ │
│ │ │ │
│ │ Monitored: │ │
│ │ ・Process alive/dead │ │
│ │ ・Memory usage (restart if > 480MB) │ │
│ │ ・CPU usage │ │
│ │ ・Restart count │ │
│ │ ・Uptime │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Auto-restart conditions: │
│ ・Process exit (any code) │
│ ・Memory exceeds 480MB │
│ ・Unresponsive (no heartbeat for 30s) │
│ │
│ Restart behavior: │
│ ・Exponential backoff: 100ms → 200ms → 400ms → ... │
│ ・Max 100 restarts before stopping │
│ ・Min uptime: 10s (avoids restart loop) │
│ │
└──────────────────────────────────────────────────────────┘
6.2 起動時の状態復旧 / State Recovery on Startup
Process Start (after crash/restart)
│
▼
┌──────────────────┐
│ 1. Load config │
│ 2. Init logger │
│ 3. Check DB │
└────────┬─────────┘
│
▼
┌──────────────────┐ DB OK?
│ Open SQLite DB │──── Yes ──▶ Load last saved state
└────────┬─────────┘ │
│ No │
▼ │
┌──────────────────┐ │
│ Check backup │ │
│ JSON files │ │
└────────┬─────────┘ │
│ Found? │
▼ │
┌──────────────────┐ │
│ Restore from │ │
│ latest backup │──────────────┤
└────────┬─────────┘ │
│ Not found │
▼ │
┌──────────────────┐ │
│ Initialize fresh │ │
│ game state │──────────────┤
└──────────────────┘ │
│
▼
┌──────────────────┐
│ Resume game loop │
│ ・Reconnect WS │
│ ・Reconnect YT │
│ ・Reconnect TT │
│ ・Start scheduler│
└──────────────────┘
Graceful Shutdown / グレースフルシャットダウン
class GracefulShutdown {
async shutdown(signal: string): Promise<void> {
Logger.info(`Shutdown signal received: ${signal}`);
// 1. Stop accepting new events
this.eventQueue.pause();
// 2. Save current state to DB
try {
await this.stateManager.saveState(this.currentState);
Logger.info('State saved to DB successfully');
} catch (error) {
// Fallback: save to JSON
await this.stateManager.saveToBackupJson(this.currentState);
Logger.warn('State saved to backup JSON (DB unavailable)');
}
// 3. Close WebSocket connections gracefully
this.wsServer.clients.forEach((client) => {
client.close(1001, 'Server shutting down');
});
// 4. Close platform connections
await this.youtubeModule.disconnect();
await this.tiktokModule.disconnect();
// 5. Close DB
this.db.close();
Logger.info('Graceful shutdown complete');
process.exit(0);
}
}
// Register signal handlers
process.on('SIGTERM', () => gracefulShutdown.shutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown.shutdown('SIGINT'));
// Last-resort crash handler
process.on('uncaughtException', (error) => {
Logger.error('Uncaught exception', error);
ErrorReporter.report('L0', 'uncaught_exception', error);
// pm2 will restart the process
process.exit(1);
});
process.on('unhandledRejection', (reason) => {
Logger.error('Unhandled rejection', reason);
ErrorReporter.report('L2', 'unhandled_rejection', reason);
// Log but don't crash — let the process continue
});
6.3 復旧シナリオ一覧 / Recovery Scenario Summary
| Scenario | Detection | Recovery | Time to Recover | Data Loss |
|---|---|---|---|---|
| Process crash | pm2 monitors | Auto-restart + load last state | 5-15s | Last cycle (max 10 min) |
| OOM kill | pm2 memory limit | Auto-restart + load last state | 5-15s | Last cycle |
| LLM API down | 3 consecutive failures | Switch to fallback mode | Immediate | None |
| YouTube disconnect | API error response | Backoff reconnect | 5s-2min | Comments during downtime |
| TikTok disconnect | WebSocket onclose | Backoff reconnect | 2s-2min | Events during downtime |
| DB corruption | Integrity check | Restore from backup | 10-30s | Up to 1 hour |
| WebSocket drop | Heartbeat timeout | Client auto-reconnect | 1-15s | Visual updates only |
| Network outage | All connections fail | Wait and retry all | Depends on network | Events during downtime |
| Disk full | Write failure | Alert + cleanup old logs/backups | Manual intervention | None if caught early |
7. ヘルスチェック / Health Check Endpoints
7.1 エンドポイント定義 / Endpoint Definition
// GET /health — Basic liveness check
// Returns 200 if process is running
interface HealthResponse {
status: 'ok' | 'degraded' | 'error';
uptime: number; // seconds
timestamp: string; // ISO 8601
}
// GET /health/detailed — Full system status
interface DetailedHealthResponse {
status: 'ok' | 'degraded' | 'error';
uptime: number;
timestamp: string;
components: {
gameLoop: {
status: 'ok' | 'error';
lastCycleAt: string;
cycleCount: number;
currentMode: 'NORMAL' | 'DEGRADED' | 'RECOVERY';
};
llm: {
status: 'ok' | 'degraded' | 'error';
lastCallAt: string;
consecutiveFailures: number;
avgResponseMs: number;
mode: 'NORMAL' | 'DEGRADED' | 'RECOVERY';
};
youtube: {
status: 'ok' | 'disconnected' | 'error';
lastPollAt: string;
pollInterval: number;
quotaUsed: number;
quotaLimit: number;
};
tiktok: {
status: 'ok' | 'disconnected' | 'error';
lastEventAt: string;
reconnectCount: number;
};
database: {
status: 'ok' | 'readonly' | 'error';
lastWriteAt: string;
pendingWrites: number;
sizeBytes: number;
};
memory: {
heapUsedMB: number;
heapTotalMB: number;
rssMB: number;
heapUsagePercent: number;
};
websocket: {
status: 'ok' | 'error';
connectedClients: number;
lastMessageAt: string;
};
};
}
7.2 ステータス判定ロジック / Status Determination Logic
function determineOverallStatus(components: Components): 'ok' | 'degraded' | 'error' {
  // memory has no status field, so collect statuses only where present
  const statuses = Object.values(components)
    .map((c: any) => c.status)
    .filter((s) => s !== undefined);
  // Any component in 'error' state → overall 'error'
  if (statuses.includes('error')) return 'error';
  // Game loop or database in non-ok state → 'degraded'
  if (components.gameLoop.status !== 'ok') return 'degraded';
  if (components.database.status !== 'ok') return 'degraded';
  // LLM in degraded mode → 'degraded' (but not 'error')
  if (components.llm.mode === 'DEGRADED') return 'degraded';
  // Both platforms disconnected → 'degraded' (but not 'error')
  if (components.youtube.status === 'disconnected' &&
      components.tiktok.status === 'disconnected') {
    return 'degraded';
  }
  return 'ok';
}
8. 監視・アラート / Monitoring & Alerting
8.1 監視項目 / Monitored Metrics
| Category | Metric | Check Interval | Warning Threshold | Critical Threshold |
|---|---|---|---|---|
| Process | Uptime | 1 min | Restart count > 3/hour | Restart count > 10/hour |
| Process | Memory (heap) | 1 min | > 400MB (80%) | > 450MB (90%) |
| Process | CPU | 1 min | > 80% sustained 5min | > 95% sustained 5min |
| Game Loop | Cycle execution | 10 min | Missed 1 cycle | Missed 3 cycles |
| LLM | Response time | Per call | > 5s avg | > 10s avg |
| LLM | Failure rate | Per call | > 10% in 10min | > 50% in 10min |
| LLM | Consecutive failures | Per call | 3 consecutive | 10 consecutive |
| YouTube | Poll success rate | 5s | > 5 failures/min | > 20 failures/min |
| TikTok | Connection status | 10s | Disconnect > 1min | Disconnect > 5min |
| Database | Write latency | Per write | > 100ms | > 500ms |
| Database | DB file size | 1 hour | > 500MB | > 1GB |
| WebSocket | Connected clients | 30s | 0 clients > 5min | 0 clients > 15min |
| Disk | Free space | 1 hour | < 1GB | < 500MB |
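監視しきい値の判定は純関数化できる。Threshold evaluation can be written as a pure function; a sketch for the heap-memory row above (warning > 400MB / 80%, critical > 450MB / 90%) — the function name is illustrative:

```typescript
// Sketch of threshold evaluation for the heap-memory metric (illustrative).
type AlertLevel = 'ok' | 'warning' | 'critical';

function classifyHeapUsage(heapUsedMB: number): AlertLevel {
  if (heapUsedMB > 450) return 'critical'; // 90% of the 512MB heap limit
  if (heapUsedMB > 400) return 'warning';  // 80% of the 512MB heap limit
  return 'ok';
}
```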
8.2 アラート通知 / Alert Notification
┌──────────────────────────────────────────────────────────┐
│ Alert Pipeline │
│ │
│ Metric exceeds threshold │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Dedup check │ Same alert within 5 min? → Skip │
│ └──────┬───────┘ │
│ │ New alert │
│ ▼ │
│ ┌──────────────┐ │
│ │ Severity │ │
│ │ routing │ │
│ └──────┬───────┘ │
│ │ │
│ ┌────┼────┬─────────┐ │
│ ▼ ▼ ▼ ▼ │
│ L0 L1 L2 L3 │
│ │ │ │ │ │
│ │ │ │ Log only │
│ │ │ │ │
│ │ │ Slack │
│ │ │ (batched, │
│ │ │ every 5min) │
│ │ │ │
│ │ Slack │
│ │ (immediate) │
│ │ │
│ Slack + SMS │
│ (immediate) │
│ │
└──────────────────────────────────────────────────────────┘
Slack 通知フォーマット / Slack Notification Format
🔴 [L0 FATAL] cohabitation-life
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Error: Process crashed - uncaught exception
Time: 2026-03-11 14:30:45 JST
Server: production-01
Uptime before crash: 18h 42m
Restart attempt: #1
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Action: pm2 auto-restart initiated
🟡 [L2 WARNING] cohabitation-life
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Warning: LLM API timeout (3 consecutive)
Time: 2026-03-11 14:30:45 JST
Mode: Switched to DEGRADED
Last successful call: 2 min ago
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Action: Using fallback actions
8.3 ログ設計 / Logging Design
ログレベル / Log Levels
| Level | Usage | Example |
|---|---|---|
| error | 要対応エラー / Actionable errors | DB write failed, LLM 5xx |
| warn | 注意事項 / Attention needed | Memory > 80%, TikTok reconnect |
| info | 通常稼働ログ / Normal operation | Cycle completed, state saved |
| debug | 開発用詳細 / Development detail | LLM prompt/response, WS messages |
構造化ログ / Structured Logging
interface LogEntry {
level: 'error' | 'warn' | 'info' | 'debug';
timestamp: string;
component: string; // 'game_loop' | 'llm' | 'youtube' | 'tiktok' | 'db' | 'ws' | 'memory'
event: string; // 'cycle_complete' | 'llm_timeout' | 'platform_reconnect' | ...
data?: Record<string, unknown>;
error?: {
message: string;
stack?: string;
code?: string;
};
}
// Example log output (JSON Lines format)
// {"level":"warn","timestamp":"2026-03-11T14:30:45.123Z","component":"llm","event":"api_timeout","data":{"attempt":2,"timeoutMs":8000}}
// {"level":"info","timestamp":"2026-03-11T14:30:46.456Z","component":"llm","event":"fallback_used","data":{"character":"john","action":"work_at_desk"}}
ログローテーション / Log Rotation
| Log File | Max Size | Retention | Rotation |
|---|---|---|---|
| app.log | 100MB | 7 days | Daily rotation |
| error.log | 50MB | 30 days | Daily rotation |
| access.log | 100MB | 3 days | Daily rotation |
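ローテーション判定の最小スケッチ。The rotation decision for the table above — rotate when a file exceeds its size cap or the calendar day has changed — can be sketched as a pure function; names and the `YYYY-MM-DD` convention are illustrative:

```typescript
// Sketch of the log-rotation decision (illustrative only).
interface LogFileState {
  sizeBytes: number;
  openedOn: string; // 'YYYY-MM-DD' the file was opened
}

function shouldRotate(file: LogFileState, maxSizeBytes: number, today: string): boolean {
  return file.sizeBytes >= maxSizeBytes || file.openedOn !== today;
}
```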
9. 障害対応フロー全体図 / Complete Error Handling Flow
┌──────────────────────────────────────────────────────────────────────┐
│ Error Handling Architecture │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Error Interceptor │ │
│ │ │ │
│ │ try/catch wrapper around: │ │
│ │ ・Game loop cycle │ │
│ │ ・LLM API calls │ │
│ │ ・Platform API calls │ │
│ │ ・DB operations │ │
│ │ ・WebSocket message handling │ │
│ └────────────────────────┬───────────────────────────────────┘ │
│ │ Error caught │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Error Classifier │ │
│ │ │ │
│ │ Input: Error object │ │
│ │ Output: { severity, component, recoverable, action } │ │
│ └────────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────┼─────────────┐ │
│ ▼ ▼ ▼ │
│ Recoverable Degradable Fatal │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Retry with │ │ Switch to │ │ Save state │ │
│ │ backoff │ │ fallback │ │ Log & exit │ │
│ │ │ │ mode │ │ pm2 restarts │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Error Reporter │ │
│ │ │ │
│ │ ・Structured log entry │ │
│ │ ・Metric update (for health check) │ │
│ │ ・Alert notification (if threshold exceeded) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
10. テスト戦略 / Testing Strategy
障害シミュレーション / Failure Simulation
| Test | Method | Validates |
|---|---|---|
| LLM timeout | Mock API with delayed response | Fallback action triggers |
| LLM 429 | Mock API returning 429 | Rate limit handling, backoff |
| LLM outage | Mock API returning 503 for N minutes | DEGRADED mode transition |
| YT disconnect | Kill YouTube polling | Graceful degradation |
| TT disconnect | Close TikTok WebSocket | Reconnection logic |
| DB write fail | Make DB file read-only | In-memory fallback |
| DB corruption | Corrupt SQLite file | Backup restoration |
| OOM | Allocate memory until limit | pm2 restart + state recovery |
| Process kill | kill -9 the process | pm2 restart + state recovery |
| WS disconnect | Close browser tab | Client reconnection |
| Network outage | Disable network interface | All reconnection logic |
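表の先頭行(LLM timeout)が示す一般的なパターン:失敗するモックを注入し、フォールバック経路が選ばれることを検証する。A minimal sketch of that pattern — `chooseAction` and the mock are illustrative, not the real action selector:

```typescript
// Sketch of the "LLM timeout" failure-simulation test (illustrative names).
// A mock that reports failure should force the fallback path.
interface LlmResult { ok: boolean; text?: string; }

function chooseAction(callLlm: () => LlmResult): string {
  const result = callLlm();
  // On any failure, fall back to a pre-defined action instead of retrying here.
  return result.ok ? result.text! : 'fallback:work_at_desk';
}

const timeoutMock = (): LlmResult => ({ ok: false }); // simulated timeout
```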
耐久テスト / Endurance Test
24-hour stress test checklist:
□ Memory usage stays stable (no monotonic increase)
□ No uncaught exceptions in error.log
□ Game loop cycles never skip more than 1
□ LLM fallback mode activates/deactivates correctly
□ Platform reconnection works after simulated outages
□ DB backup files are created on schedule
□ Log rotation works correctly
□ Health endpoint returns accurate status
関連ドキュメント / Related Documents
- アーキテクチャ / Architecture — システム全体構成
- リスク & 対策 / Risks & Mitigations — リスク一覧と対策
- MVP スコープ / MVP Scope — 開発スコープと優先度
- 配信プラットフォーム / Streaming Platforms — プラットフォーム統合詳細
- 記憶システム / Memory System — メモリ管理の詳細