Mirroring Wikimedia project XML dumps
This page coordinates efforts to mirror Wikimedia project XML dumps around the globe on independent servers, similar to the GNU/Linux .iso mirror sites.
In addition to XML dumps, Wikimedia and its mirrors offer "other" files. These include RDF structured data, analytics data (page views etc.), wiki dumps in other formats like HTML and OpenZim, search indexes, archival dumps from retired services, and other assorted datasets.
We encourage anyone with the resources to host their own mirror.
Current mirrors
Organization | XML | Other | Location | Access | Notes |
---|---|---|---|---|---|
Your.org | All | All | Illinois, United States | HTTPS | The media files in this mirror may be outdated; use with care and check the last-modified date. Media tarballs were last updated in March 2013 (as of March 2022). |
Internet Archive | All (updated semi-manually) | All (updated semi-manually) | California, United States | See #Internet Archive | See #Internet Archive |
C3SL | Last 5 | | Paraná, Brazil | HTTP, rsync[5] | |
BringYour | Last 5 | | California, United States | HTTPS, rsync[6] | |
Individual hoster | Last 5 | | United States | HTTPS | |
PDApps | Last 4 | Wikidata JSON and RDF entity dumps | Moscow, Russia | HTTPS, rsync[7] | |
Academic Computer Club, Umeå University | Last 2 | All | Umeå, Sweden | HTTPS, rsync[8] | |
Scatter | Last 2 (English Wikipedia only) | Wikidata entity JSON and RDF; Commons Structured Data entity JSON and RDF; Categories RDF; Clickstream; Commons Impact; Mediacounts; Pagetitles; Pageviews | Bend, Oregon, United States | HTTPS | |
Center for Research Computing, University of Notre Dame | | Wikidata entity dumps, pageview and other stats, Picture of the Year tarballs, Kiwix openzim files, other. | Indiana, United States | Internet2 | Access to this mirror is restricted to institutions with access to Internet2/ESnet/Geant. Those with access will have high-bandwidth downloads. |
Internet Archive
The Internet Archive hosts a mirror of Wikimedia dumps and datasets.
- All Wikimedia downloads
- Instructions for finding old Wikidata entity dumps (RDF and JSON) can be found on Wikidata:Database download.
Wikimedia Commons
All Commons uploads (and their description pages in XML export format) for each day since 2004 are available, with one zip file per day and one item per month. A text file listing various errors is available for each month, as well as a CSV file with metadata about every file of each day.
The archives are made by WikiTeam and are meant to be static. An embargo of about 6 months is observed, so that a month is uploaded only once it has been mostly cleaned up. Archives up to early 2013 were uploaded in August-October 2013, so they reflect the state of the files at that time. After logging in, you can see a table with details about all items.
See Downloading in bulk using wget for official HTTP download instructions.
Individual images can be downloaded as well thanks to the on-the-fly unzipper, by looking for the specific filename in the specific zip file, e.g. [1] for File:Quail1.PNG.
BitTorrent
For an unofficial listing of torrents, see data dump torrents.
To download Wikimedia Commons files from the Internet Archive with BitTorrent, you need a client that supports webseeding, so that it can download from archive.org's 3 webseeds. There is one torrent per item and an (outdated) torrent file to download all torrent files at once. Please join our distributed effort: download and reseed one torrent.
Potential mirrors
If you are a hosting organization and want to volunteer, please send an email to ops-dumps@wikimedia.org with "XML dumps mirror" somewhere in the subject line.
Requirements
Space
Mirroring the 5 most recent dumps (the most desired option) requires 25.1 TB, based on estimates from December 2020. This covers 3 sets of full dumps and 2 sets of partial dumps.
Alternative options:
- "most recent good dumps": 8 TB (July 2022 estimate). This would be one set of full dumps.
- "last 2 good dumps": 11 TB (July 2022 estimate). This would be one set of full dumps and one set of partial dumps.
- "All dumps and other data": ~ 75 TB and growing (as of July 2022).
Additional options:
- "Historical archives": 1.6 TB (as of October 2017). This consists of 2 dumps per year from 2002 through 2010. Not expected to change or grow.
- "Other": 31 TB (Dec 2020). Pageview analytics, CirrusSearch indexes, Wikidata entity dumps and other datasets.
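As a rough aid for choosing among the options above, the following sketch checks which of them fit in a given amount of free disk space. The size figures are copied from the estimates above; the option names, the 20% headroom for growth, the decimal-terabyte conversion, and the volume path are illustrative assumptions, not part of any Wikimedia tooling.

```python
import shutil

# Space estimates in TB, copied from the lists above.
# The dictionary keys are hypothetical labels chosen for this sketch.
REQUIRED_TB = {
    "last_1": 8.0,       # "most recent good dumps" (July 2022 estimate)
    "last_2": 11.0,      # "last 2 good dumps" (July 2022 estimate)
    "last_5": 25.1,      # 5 most recent dumps (December 2020 estimate)
    "everything": 75.0,  # "All dumps and other data" (July 2022, growing)
}

TB = 10**12  # assumption: decimal terabytes


def options_that_fit(free_bytes: int, headroom: float = 0.2) -> list[str]:
    """Return the options whose estimate plus headroom fits in free_bytes.

    headroom is an assumed safety margin (20% by default), since the
    estimates are dated and the datasets keep growing.
    """
    return [
        name
        for name, size_tb in REQUIRED_TB.items()
        if size_tb * (1 + headroom) * TB <= free_bytes
    ]


if __name__ == "__main__":
    # Replace "/" with the volume that would hold the mirror.
    free = shutil.disk_usage("/").free
    print(options_that_fit(free))
```

Because the estimates date from 2020 to 2022, treat the result as a lower bound and confirm current sizes before committing hardware.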
Bandwidth
Wikimedia provides about 4-5 MB/s via dumps.wikimedia.org for XML dumps, as of January 2023.
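To get a feel for what this rate means for an initial sync, the sketch below estimates transfer time for a given dataset size. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a sustained rate. The throughput a mirror actually gets may differ from the figure quoted above, so this is only a back-of-the-envelope calculation.

```python
def transfer_days(size_tb: float, rate_mb_per_s: float) -> float:
    """Days needed to move size_tb terabytes at rate_mb_per_s megabytes/second."""
    size_bytes = size_tb * 10**12      # assumption: decimal terabytes
    rate_bytes = rate_mb_per_s * 10**6  # assumption: decimal megabytes
    seconds = size_bytes / rate_bytes
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    for rate in (4, 5):
        days = transfer_days(25.1, rate)
        print(f"25.1 TB at {rate} MB/s: {days:.1f} days")
```

Under these assumptions, the 25.1 TB "last 5" set would take roughly 58 to 73 days for the first full copy; later daily runs only transfer new and changed files.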
Setup
Based on your space and bandwidth constraints, decide how many dumps you want to mirror, and whether you also (or instead) want to mirror the archives (pre-2009 dumps) and/or the "other" datasets. Tell us your choice in the email. We will also need the hostname for our rsync config; the name for the IPv6 address if it has a separate name, or a note that there is no IPv6 connectivity; and a contact email address.
Once your information is added to our rsync config, you'll be able to pick up the desired directories and files from the appropriate rsync module:
- dumpslastone -- last complete good dump for each wiki as well as completed files from any run that is in progress
- dumpslasttwo -- last two complete runs etc
- dumpslastthree -- last three complete runs etc
- dumpslastfour -- last four complete runs etc
- dumpslastfive -- last five complete runs etc
- dumpmirrorsother -- 'other' datasets (as seen at [2])
- dumpmirrorsalldumps -- all dumps but no archives and no 'other' datasets
- dumpmirrorseverything -- absolutely everything
- dumpmirrorseverything/archives -- just the archival dumps of historical interest
We recommend a daily cron job for this.
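A minimal sketch of how such a job could be assembled is shown below. It builds an rsync command line for one of the modules listed above and formats a daily crontab entry for it. The module names come from the list above; the rsync host (`dumps.wikimedia.org`), the destination path, the choice of `-a` and `--delete`, and the 03:00 schedule are assumptions for illustration, so confirm the actual host and preferred options with the dumps operators.

```python
import shlex

# Module names taken from the list above.
MODULES = {
    "dumpslastone",
    "dumpslasttwo",
    "dumpslastthree",
    "dumpslastfour",
    "dumpslastfive",
    "dumpmirrorsother",
    "dumpmirrorsalldumps",
    "dumpmirrorseverything",
    "dumpmirrorseverything/archives",
}


def rsync_command(module, dest, host="dumps.wikimedia.org", bwlimit=None):
    """Build an rsync argument list for pulling one module into dest.

    host is an assumed default; use the host you were given.
    bwlimit, if set, is passed to rsync's --bwlimit option
    (rsync interprets a bare number as KiB per second).
    """
    if module not in MODULES:
        raise ValueError(f"unknown rsync module: {module}")
    # -a preserves the tree; --delete drops local files that the source
    # no longer offers (for example, dumps that have rotated out).
    cmd = ["rsync", "-a", "--delete"]
    if bwlimit is not None:
        cmd.append(f"--bwlimit={bwlimit}")
    cmd += [f"{host}::{module}/", dest]
    return cmd


def cron_line(cmd, hour=3, minute=0):
    """Format a crontab entry that runs cmd once a day at hour:minute."""
    return f"{minute} {hour} * * * {shlex.join(cmd)}"


if __name__ == "__main__":
    # "/srv/dumps/" is a hypothetical destination directory.
    print(cron_line(rsync_command("dumpslasttwo", "/srv/dumps/")))
```

The script only prints the crontab entry; it does not run rsync. For a first sync of a large module, running the command by hand before installing the cron job makes it easier to spot access or path problems.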
If you are brainstorming organizations that might be interested, see the discussion page.
See also
- Wikipedia:Database download
- Data dumps
- wikitech:Backup procedures
- wikitech:Hurricanes
- wikitech:Dumps
- en:User:Sj/wikiserve
- en:User:Emijrp/Wikipedia Archive
- WikiTeam
External links
- dumps.wikimedia.org
- IPFS Zim mirror
- ↑ http://dumps.wikimedia.your.org/
- ↑ http://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/
- ↑ rsync://ftpmirror.your.org/wikimedia-dumps/
- ↑ rsync://ftpmirror.your.org/wikimedia-images/
- ↑ rsync://wikipedia.c3sl.ufpr.br/wikipedia/
- ↑ rsync://wikimedia.bringyour.com
- ↑ rsync://wikipedia.mirror.pdapps.org/wikimedia-dumps/
- ↑ rsync://ftp.acc.umu.se/mirror/wikimedia.org/