Analyzing trends in the causes and triggers of IT system failures from research papers

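I looked at several papers that investigate the root causes of failures. The proportions differ from dataset to dataset, but some causes, such as application code bugs and misconfiguration, recur across them. The papers are introduced below.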

How to fight production incidents? an empirical study on a large-scale cloud service.

The authors analyzed the causes of incidents at Microsoft.

Ghosh, Supriyo, et al. “How to fight production incidents? an empirical study on a large-scale cloud service.” Proceedings of the 13th Symposium on Cloud Computing. 2022.

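Looking at the root causes of the incidents, code bugs accounted for 27.0%, dependency failures for 16.4%, infrastructure failures for 15.8%, configuration mistakes for 12.5%, and database/network issues for 10.5% (excerpt).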

Code Bug – 27.0 %
Dependency Failure – 16.4 %
Infrastructure – 15.8 %
Deployment Error – 13.2 %
Config Bug – 12.5 %
Database/Network – 10.5 %
Auth Failure – 4.6 %

Ref) Ghosh, Supriyo, et al. “How to fight production incidents? an empirical study on a large-scale cloud service.”, 2022
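To see how quickly the top categories add up, the breakdown above can be ranked with a running total. This is a minimal sketch: the percentages are copied from the list above, and the names `root_causes` and `cumulative_shares` are made up for illustration.

```python
# Root-cause shares in percent, copied from the list above (Ghosh et al., 2022).
root_causes = {
    "Code Bug": 27.0,
    "Dependency Failure": 16.4,
    "Infrastructure": 15.8,
    "Deployment Error": 13.2,
    "Config Bug": 12.5,
    "Database/Network": 10.5,
    "Auth Failure": 4.6,
}


def cumulative_shares(shares):
    """Return (category, share, running total) rows, largest share first."""
    rows = []
    total = 0.0
    for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        total += pct
        rows.append((name, pct, round(total, 1)))
    return rows


rows = cumulative_shares(root_causes)
for name, pct, running in rows:
    print(f"{name:<20}{pct:>6.1f}%{running:>8.1f}%")
```

With these numbers, the top three categories (code bugs, dependency failures, infrastructure) together cover a little under 60% of the incidents.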

Identifying bad software changes via multimodal anomaly detection for online service systems

This paper analyzes incidents at China Guangfa Bank.

Zhao, Nengwen, et al. “Identifying bad software changes via multimodal anomaly detection for online service systems.” Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2021.

50.4% of the incidents were caused by bad changes (code defects, configuration errors, resource contention, and software versions).

Our quantitative analyses indicate that about 50.4% of incidents are caused by bad changes, mainly because of code defect, configuration error, resource contention, and software version.

Clearly, we observe that change is an important factor leading to incidents, accounting for from 39% to 64% (50.4% on average).

Ref) Zhao, Nengwen, et al. “Identifying bad software changes via multimodal anomaly detection for online service systems.”, 2021

Figure 4 gives the following breakdown: code defects account for 38%, configuration errors for 31%, resource contention for 12%, and incompatible software versions for 11%.

Code defect 38%
Configuration error 31%
Incompatible software version 11%
Resource contention 12%
Others 9%
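The two figures can be combined into a rough estimate of how much of all incidents each kind of bad change represents. The sketch below assumes that the Figure 4 percentages are shares of change-induced incidents and that the 50.4% average applies uniformly. Both are my assumptions, not statements from the paper, and the variable and function names are hypothetical.

```python
# Average share of incidents caused by bad changes (from the quote above).
BAD_CHANGE_SHARE = 0.504

# Figure 4 breakdown in percent, as listed above.
# The listed values sum to 101, presumably because of rounding.
figure4 = {
    "Code defect": 38,
    "Configuration error": 31,
    "Incompatible software version": 11,
    "Resource contention": 12,
    "Others": 9,
}


def share_of_all_incidents(breakdown, change_share):
    """Scale each within-change percentage by the overall change share."""
    return {name: pct * change_share for name, pct in breakdown.items()}


estimated = share_of_all_incidents(figure4, BAD_CHANGE_SHARE)
for name, pct in estimated.items():
    print(f"{name:<32}{pct:>5.1f}% of all incidents")
```

Under these assumptions, code defects introduced by changes would account for roughly 19% of all incidents and configuration errors for roughly 16%.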

Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code

The company behind DeepFlow analyzed failures in its customers' environments.

Shen, Junxian, et al. “Network-centric distributed tracing with deepflow: Troubleshooting your microservices in zero code.” Proceedings of the ACM SIGCOMM 2023 Conference. 2023.

According to Figure 2(a), network infrastructure accounts for 47.3% of the failure causes, application logic for 32.7%, and computing infrastructure for 12.7%.

Network Infra 47.3%
Computing Infra 12.7%
Application Logic 32.7%
Others 7.3%

Ref) Shen, Junxian, et al. “Network-centric distributed tracing with deepflow: Troubleshooting your microservices in zero code.”, 2023.

Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space

The authors categorize the incidents at eBay.

Xie, Zhe, et al. “Microservice root cause analysis with limited observability through intervention recognition in the latent space.” Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024.

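Table 1 shows the breakdown of incidents at eBay. Third-party services (for example, third-party API errors) are the most common cause at 63.59%, followed by internal services at 8.76%, software changes at 7.83%, and databases at 5.53%.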

Category | Percentage | Typical Related Metrics
Third-Party Services | 63.59% | Third-Party API Error*
Internal Services | 8.76% | Runtime Error
Software Change | 7.83% | Change Process, Runtime Error*
Database | 5.53% | Markdown Error*

Ref) Xie, Zhe, et al. “Microservice root cause analysis with limited observability through intervention recognition in the latent space.” , 2024.

Postmortem analysis

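I analyzed the triggers of the incidents in the postmortems that I collect personally. The most common trigger was an increase in requests, which led to resource exhaustion or errors under high load. The next most common was hardware failure, and many of those cases were failures in data center power systems.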

Web系のシステム障害 (web system failures) – Google Sheets

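Request increase – 11 incidents
- e.g. access to a web service increased
Hardware fault – 8 incidents
- e.g. data center hardware failed
Job failure – 6 incidents
- e.g. an SSL certificate renewal failed, PostgreSQL autovacuum failed
Misconfiguration – 4 incidents
Application bugs – 3 incidents
Maintenance – 3 incidents
Misoperation – 3 incidents
Infrastructure failure – 2 incidents
- e.g. a failure in infrastructure provided by a cloud provider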
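To compare these counts with the percentages reported in the papers, they can be converted into shares of the total. A minimal sketch, using only the counts listed above; the variable names are illustrative.

```python
# Trigger counts copied from the list above.
triggers = {
    "Request increase": 11,
    "Hardware fault": 8,
    "Job failure": 6,
    "Misconfiguration": 4,
    "Application bugs": 3,
    "Maintenance": 3,
    "Misoperation": 3,
    "Infrastructure failure": 2,
}

total = sum(triggers.values())

# Share of each trigger as a percentage of all collected postmortems.
shares = {name: 100 * count / total for name, count in triggers.items()}

for name, pct in shares.items():
    print(f"{name:<24}{triggers[name]:>3} / {total}  {pct:>5.1f}%")
```

The list covers 40 postmortems in total, so request increases make up 27.5% and hardware faults 20.0%. With a sample this small, the shares should be read as a rough indication rather than a stable distribution.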

Conclusion

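By analyzing the causes of failures, I clarified which system components incidents occur in, how they break, and what triggers them. The proportion of each failure type differs from paper to paper, but some failure types are common across them.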

Aside

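If you know of any postmortems or interesting incidents, please let me know. Last month's GitHub Availability Report described an incident caused by a DROP COLUMN on a database, which I found interesting.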

GitHub Availability Report: August 2025 – The GitHub Blog

koyama's blog