Why is WikiLeaks’ “entire raw dataset” of Podesta emails incomplete, and exactly how many are there?

How many items a WikiLeaks release contains should be a straightforward question. When it comes to the number of Podesta emails, however, the answer depends on who you ask, or what resource you consult. WikiLeaks’ own data provides five different numbers, ranging from a little over 50,000 to a little under 60,000.

Unlike the issues with the DNC emails, this isn’t recent, but it has been neglected and ignored – both by the press and by WikiLeaks. During the first week of releases, I first noticed that the number of emails displayed through a blank search did not match the total number of emails. On October 12, 2016, I privately raised the issue with WikiLeaks and was told that it was a temporary indexing issue that WikiLeaks was aware of and that the search interface would catch up.

Assured that the issue was temporary, I forgot about it until a recent project had me attempt to gather all of the Podesta emails into one place. The obvious starting point for this was what WikiLeaks referred to as the “entire raw dataset for all published Podesta Emails.” The link, shown on WikiLeaks’ search pages for the Podesta Emails, leads to a compressed .mbox file named podesta-emails.mbox-2016-11-06.gz. A search of Reddit, Twitter and the Wayback Machine indicates that November 6, 2016 was, in fact, the first day the file was available. Imported into Thunderbird, the .mbox appears to contain 50,866 emails.
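For readers who want to check the .mbox count without Thunderbird, here is a minimal sketch using Python's standard mailbox module. The function name count_mbox_messages is my own, and the result is not guaranteed to match Thunderbird's: Python's parser starts a new message at every line beginning with "From ", so unescaped body lines could change the count.

```python
import gzip
import mailbox
import shutil
import tempfile
from pathlib import Path


def count_mbox_messages(gz_path):
    """Decompress a gzipped mbox file and return the number of messages in it."""
    with tempfile.TemporaryDirectory() as tmp:
        mbox_path = Path(tmp) / "archive.mbox"
        # mailbox.mbox reads a plain file on disk, so decompress first.
        with gzip.open(gz_path, "rb") as src, open(mbox_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        box = mailbox.mbox(str(mbox_path), create=False)
        try:
            # len() scans the file for "From " separator lines.
            return len(box)
        finally:
            box.close()
```

It would be called as count_mbox_messages("podesta-emails.mbox-2016-11-06.gz") on a local copy of the file.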

The .mbox provides the lowest count of emails, though this is explained by WikiLeaks’ tweet saying it contained all the emails released until November 6th. It’s unclear why WikiLeaks never updated the file or changed the description of it.

If the only discrepancy were the .mbox, which was published before all of the Podesta emails were out, the issue would be irrelevant and uninteresting. However, the directory on the WikiLeaks file server that hosts the .mbox file also lists individual email files. This list shows 57,153 .eml files for the Podesta emails, with each file name corresponding to the WikiLeaks ID.

A blank search of the Podesta emails shows that there are 58,660 emails.

The range of Email IDs assigned by WikiLeaks, however, extends up to 59,258. A scrape of the .eml files served through the “download raw source” link for each of these IDs turns up 70 duplicates (not counting the originals), for a total of 59,188 individual emails. For these numbers, duplicates were identified by duplicate .eml file names. A check of the contents confirms that they are identical. Unlike the .eml files listed on the file server, the file names for these do not correspond to the email IDs and begin at 00000001.eml rather than 1.eml.
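The duplicate check and the arithmetic behind 59,188 can be sketched as follows. This is an illustration, not the script used for the scrape: the function names and the layout of one folder per Email ID (for example scrape/<email_id>/00000001.eml) are assumptions of mine.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_names(paths):
    """Group paths by file name; return only names that occur more than once."""
    by_name = defaultdict(list)
    for p in paths:
        by_name[Path(p).name].append(Path(p))
    return {name: ps for name, ps in by_name.items() if len(ps) > 1}


def contents_identical(paths):
    """True if every file in the group has the same SHA-256 digest."""
    digests = {hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}
    return len(digests) == 1


def extra_copies(duplicates):
    """Count duplicates beyond the first ("original") copy of each name."""
    return sum(len(ps) - 1 for ps in duplicates.values())


def unique_total(highest_id, duplicate_count):
    """Highest Email ID minus the extra copies, assuming IDs start at 1 with no gaps."""
    return highest_id - duplicate_count


print(unique_total(59_258, 70))  # 59188
```

With the assumed layout, find_duplicate_names(Path("scrape").glob("*/*.eml")) would return the repeated file names, and contents_identical could then be run on each group.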

At present, WikiLeaks’ own data gives us five different totals for the number of Podesta emails:

  1. 50,866
  2. 57,153
  3. 58,660
  4. 59,258
  5. 59,188

The two most authoritative answers to the question come from WikiLeaks and the Special Counsel’s office, and both indicate that the total exceeded 50,000. While WikiLeaks stated there were “well over 50,000” emails, the Special Counsel’s indictment simply said that “over 50,000 stolen documents were released.” Since “documents” can be construed to include both the emails and their various attachments, the Special Counsel’s total is even vaguer and less definitive than WikiLeaks’.

Ultimately, the best answer to the question of how many Podesta emails there are appears to be 59,188. However, this assumes that there are no other duplicate emails that weren’t identified by duplicate file names. Even assuming there are no other duplicates, several questions remain:

  • Why has neither the “entire raw dataset” nor the language describing it ever been updated?
  • Why does the directory on their file server show only 57,153 emails?
  • Why does the search interface show only 58,660 emails?
  • Why did WikiLeaks say the search interface hadn’t caught up?
  • Why has WikiLeaks never addressed the search interface issue, despite having been aware of it for over two years?
  • Why are there 70 duplicate emails in one cache?
  • Why do the file server .eml file names correspond to their Email IDs, but the names of the .eml files generated by using the “download raw source” links don’t?
  • Is 59,188 the total number of emails they received?
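The assumption that no duplicates exist beyond those sharing a file name could be probed by grouping the scraped .eml files on their Message-ID header instead. The sketch below is one possible approach, not something that was run for this article. The function name is hypothetical, and it assumes most messages carry a Message-ID header; files without one are reported separately rather than guessed at.

```python
from collections import defaultdict
from email.parser import BytesParser
from pathlib import Path


def group_by_message_id(eml_paths):
    """Return (ids shared by more than one file, files lacking a Message-ID)."""
    parser = BytesParser()  # default compat32 policy returns raw header values
    groups = defaultdict(list)
    missing = []
    for path in eml_paths:
        with open(path, "rb") as fh:
            # Only the headers are needed, so skip parsing the body.
            msg = parser.parse(fh, headersonly=True)
        message_id = msg["Message-ID"]
        if message_id is None:
            missing.append(Path(path))
        else:
            groups[str(message_id).strip()].append(Path(path))
    shared = {mid: ps for mid, ps in groups.items() if len(ps) > 1}
    return shared, missing
```

A shared Message-ID is only a hint that two files are copies of the same email. The contents would still need to be compared before either file is dropped from the count.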

Update: An inconsistency has been addressed and speculation has been removed.