The WikiLeaks Afghanistan war logs contain more than 76,000 reports about fatalities and their circumstances in the US-led war in Afghanistan, covering the period from January 2004 to December 2009. In this paper we use those reports to build statistical models that help us understand the mortality rates associated with specific circumstances. Our approach combines Latent Dirichlet Allocation (LDA) with negative-binomial-based recursive partitioning. LDA is used to process the natural language information contained in each report summary: we estimate latent topics and assign each report to one of them. These topics, together with other variables in the data set, subsequently serve as explanatory variables for modeling the number of fatalities among the civilian population, ISAF forces, anti-coalition forces, and the Afghan National Police or military, as well as the combined number of fatalities. Modeling is carried out with manifest mixtures of negative binomial distributions estimated with model-based recursive partitioning. For each group of fatalities, we identify segments with different mortality rates that correspond to a small number of topics and other explanatory variables as well as their interactions. Furthermore, we work out the similarities between segments and connect them to stories that have been covered in the media. This provides an unprecedented description of the war in Afghanistan as covered by the war logs. Additionally, our approach can serve as an example of how modern statistical methods may lead to extra insight when applied to problems of data journalism.
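The idea of segment-specific negative binomial models can be illustrated with a minimal sketch. It is not the model-based recursive partitioning used in the paper, which selects splits through statistical tests; it only shows why giving each topic its own mean can fit count data better than one pooled mean. The function names, the fixed dispersion `theta`, and the toy records are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def nb_logpmf(y, mu, theta):
    """Log-probability of count y under a negative binomial with
    mean mu and dispersion theta (variance mu + mu**2 / theta)."""
    return (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
            + theta * math.log(theta / (theta + mu))
            + y * math.log(mu / (theta + mu)))


def segment_means(records):
    """Average fatality count per segment label."""
    totals = defaultdict(lambda: [0, 0])
    for label, count in records:
        totals[label][0] += count
        totals[label][1] += 1
    return {label: s / n for label, (s, n) in totals.items()}


def log_likelihood(records, means, theta):
    """Total log-likelihood when each record uses its segment's mean."""
    return sum(nb_logpmf(count, means[label], theta)
               for label, count in records)


# Invented toy data: (topic label, number of fatalities in the report).
toy = [("A", 0), ("A", 1), ("A", 0), ("B", 4), ("B", 7), ("B", 5)]
theta = 2.0  # assumed fixed dispersion, shared by all segments

pooled_mean = sum(count for _, count in toy) / len(toy)
pooled = log_likelihood(toy, {"A": pooled_mean, "B": pooled_mean}, theta)
split = log_likelihood(toy, segment_means(toy), theta)

print(f"pooled log-likelihood: {pooled:.3f}")
print(f"split log-likelihood:  {split:.3f}")
```

For a fixed dispersion, the sample mean is the maximum likelihood estimate of the negative binomial mean, so the split model can never fit worse than the pooled one. A full partitioning procedure would additionally decide whether the improvement is large enough to justify the split.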
|Publication status||Published - 1 Sep 2011|
|Name||Research Report Series / Department of Statistics and Mathematics|