A statistical analysis of WikiLeaks publications

Purpose: Examining WikiLeaks’ publications on a statistical level points to patterns in the group’s sources, methods, risks and editorial decisions. This analysis is meant to provide context for future discussion and questions, such as why some things but not others are published, or why a source might not come forward.

Sampling: Approximately 15,060,000 files were considered in this analysis, covering 2007 through 2017 (WikiLeaks’ most recent counted release is from November 2017). This analysis does not consider WikiLeaks’ republishing of scraped information in their ICWATCH or ICE Patrol releases, nor WikiLeaks’ republishing of the hacked identities of FBI and DHS employees that later supplemented WikiLeaks’ version of ICWATCH.

Where possible, numbers are taken directly from WikiLeaks. Where multiple numbers are provided, the largest of WikiLeaks’ estimates is used. For example, at the time of this writing WikiLeaks’ main Leaks page lists the Saudi Cables as numbering over half a million, but the actual Saudi Cables page says that 122,619 have been published so far. The more generous number is used so as to incorporate as many of WikiLeaks’ files as possible, including ones that are potentially not yet released but are part of a known set. In some instances, numbers are derived from WikiLeaks’ search function. For Vault7 and Vault8, totals were calculated by downloading the files and counting the number of items within them.

Precise numbers are avoided to help protect sources who consulted on things such as technical attribution, as well as to avoid risking spoiling an ongoing investigation.

Findings and implications:

1. 33.8% of WikiLeaks’ releases are republications of information released elsewhere. As a result, WikiLeaks’ statement that “WikiLeaks rejects submissions that have already been published elsewhere or which are likely to be considered insignificant” is incorrect and fails as a categorical deflection. This re-raises the question of why WikiLeaks republishes some information but not other information. For instance, WikiLeaks was happy to republish already public Clinton emails, but not to publish documents about Trump’s business that WikiLeaks received during the campaign.

The 33.8% includes the AKP emails due to their near simultaneous publication and WikiLeaks’ violating the sources’ instructions by announcing and releasing the emails when they did.

2. The majority of WikiLeaks’ releases aren’t intentional leaks. 69% of their releases are the result of hacks. Once confirmed hacks, FOIA and E.O. 13526 releases are excluded, only 9.5% of WikiLeaks’ releases are the result of confirmed or potential leaks.

3. An overwhelming majority of released WikiLeaks files were simple to automate and import, requiring little human effort. 66.3% of WikiLeaks releases consist of emails which WikiLeaks received in their native format, making their import and processing simple. Another 20.6% of the releases are previously released PlusD documents, which were already text searchable and followed a standard format that allowed automated importing and indexing. Between native format emails and PlusD, 86.9% of WikiLeaks’ publications could be automatically processed in a relatively short period of time and at a low cost.

This matches the writer’s discussions with WikiLeaks about their processing system and how metadata is processed.

Clinton emails are not part of the 66.3%, as WikiLeaks received them in PDF format and not as native emails. However, the PDFs appear to have been simply imported into a search interface with the OCR already performed by the State Department. If included, this would raise the total by another 0.2%, a negligible difference that would barely raise the total to 87%.
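The totals in finding 3 are simple sums of the quoted shares. A minimal sketch of that arithmetic follows, assuming the three categories do not overlap; the constant and function names are illustrative and not part of the original analysis.

```python
# Shares of WikiLeaks' releases, in percent, as quoted in finding 3.
NATIVE_EMAIL_SHARE = 66.3
PLUSD_SHARE = 20.6
CLINTON_PDF_SHARE = 0.2  # Clinton emails, received as PDFs


def combined_share(*shares):
    """Add percentage shares and round to one decimal place.

    Assumes the categories are disjoint, so shares can simply be summed.
    """
    return round(sum(shares), 1)


automated = combined_share(NATIVE_EMAIL_SHARE, PLUSD_SHARE)
with_clinton = combined_share(NATIVE_EMAIL_SHARE, PLUSD_SHARE, CLINTON_PDF_SHARE)

print(f"Native emails + PlusD: {automated}%")
print(f"Including Clinton PDFs: {with_clinton}%")
```

Under these inputs the two totals come to 86.9 and 87.1 percent, which matches the figures quoted above.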

4. It’s dangerous to be a source for WikiLeaks. Hackers (both state sponsored and independent) and leakers behind more than 54.6% of WikiLeaks’ releases and re-releases have been charged, arrested or prosecuted in connection with hacking or unauthorized disclosure. The percentage increases to just over 69% when FOIA and E.O. 13526 releases (for which arrest and prosecution aren’t applicable) are excluded. The sources for slightly under 10% of WikiLeaks’ publications were endangered by WikiLeaks’ ignoring their requests, for an approximate total of 79% charged, arrested or endangered as a result of their relationship with WikiLeaks.

This indicates that the government’s goal of “the identification, exposure, termination of employment, criminal prosecution, [and] legal action against current or former insiders, leakers, or whistleblowers” has been largely successful. Whether or not the purpose behind this – ‘damaging or destroying’ the public’s trust in WikiLeaks to “deter others considering similar actions” – has been successful requires an analysis that examines change over time, along with other variables.
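The jump from 54.6% to roughly 69% in finding 4 comes from shrinking the base: once FOIA and E.O. 13526 releases are set aside, the same charged-or-prosecuted releases are measured against fewer files. The sketch below works backward from the two rounded percentages to the excluded share they imply. The function names are hypothetical, and the implied share is an inference from rounded figures rather than a number taken from the analysis.

```python
def share_after_exclusion(share_of_all, excluded_share):
    """Re-express a share of all releases (percent) as a share of the
    releases that remain after excluded_share percent is removed."""
    return share_of_all / (100.0 - excluded_share) * 100.0


def implied_excluded_share(share_of_all, share_after):
    """Excluded share (percent) implied by a before/after pair of shares."""
    return 100.0 * (1.0 - share_of_all / share_after)


CHARGED_OF_ALL = 54.6         # percent of all releases, as quoted
CHARGED_AFTER_EXCLUSION = 69.0  # "just over 69%", treated as 69 here
ENDANGERED = 10.0             # "slightly under 10%", treated as 10 here

excluded = implied_excluded_share(CHARGED_OF_ALL, CHARGED_AFTER_EXCLUSION)
recomputed = share_after_exclusion(CHARGED_OF_ALL, excluded)
charged_or_endangered = CHARGED_AFTER_EXCLUSION + ENDANGERED

print(f"Implied excluded share: {excluded:.1f}%")
print(f"Charged share after exclusion: {recomputed:.1f}%")
print(f"Charged, arrested or endangered: {charged_or_endangered:.0f}%")
```

With these rounded inputs the implied excluded share is about 21 percent, and adding the endangered share to the post-exclusion figure reproduces the approximate total of 79%.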

5. Files obtained through state sponsored hacks make up a significant share, if not a majority, of WikiLeaks’ releases. Sources have identified over 900,000 files (6%) that are forensically tied to state sponsored hackers. Approximately an additional 7,935,000 files (52.6%) are the result of hacks that are debatably state sponsored, for an approximate sum of 8,839,000 files (58.7%).

While the sources are well placed, consistent, and reliable, no other information can be provided without putting their identities (and thus their livelihoods and in some instances, potentially their lives) at risk.

The latter figures depend on one’s definition of state sponsored and include examples where WikiLeaks was certainly unaware of the state’s role.
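The percentages in finding 5 can be checked against the approximate total of 15,060,000 files given in the sampling note. The sketch below is an illustration with hypothetical names, and it treats "over 900,000" as exactly 900,000. Because the published counts are deliberately rounded, a recomputed share can differ from the quoted one by about a tenth of a percentage point.

```python
TOTAL_FILES = 15_060_000  # approximate total from the sampling note

# File counts as quoted in finding 5.
FORENSICALLY_TIED = 900_000          # "over 900,000", used as a lower bound
DEBATABLY_STATE_SPONSORED = 7_935_000
COMBINED = 8_839_000                 # approximate sum as quoted


def percent_of_total(count, total=TOTAL_FILES):
    """Return count as a percentage of total, unrounded."""
    return 100.0 * count / total


for label, count in [
    ("Forensically tied", FORENSICALLY_TIED),
    ("Debatably state sponsored", DEBATABLY_STATE_SPONSORED),
    ("Combined", COMBINED),
]:
    print(f"{label}: {count:,} files = {percent_of_total(count):.1f}%")

# Gap between the quoted sum and the two rounded components.
gap = COMBINED - (FORENSICALLY_TIED + DEBATABLY_STATE_SPONSORED)
print(f"Quoted sum exceeds rounded components by {gap:,} files")
```

The quoted sum is 4,000 files larger than the two rounded components added together, which is consistent with the first figure being stated as "over 900,000" rather than as an exact count.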

Conclusion: While it’s undeniably effective, WikiLeaks is not primarily a leaking platform. Its sources are not immune from arrest and prosecution, and WikiLeaks is not immune to being used or manipulated by state actors. The majority of WikiLeaks’ releases are emails and other documents that can be imported with minimal effort.