Dumps/OtherMisc
This page documents various dumpsets that are produced daily or weekly and are not part of the generation of the xml/sql dumps.
All of these dumps query database servers in the 'vslow, dumps' group, and they run on a snapshot host dedicated to 'misc' dump generation (everything other than the xml/sql dumps).
The dump scripts are in our git puppet repo.
If a cron job encounters errors when it runs, its output is sent to ops-dumps@wikimedia.org.
- Global block table:
- dumped weekly
- contains an sql-format dump of information in the global block table
- managed by mw:Extension:GlobalBlocking (code)
- Issues: this job should run without intervention unless the database server goes away during the run or the database credentials change
- Cirrus search dumps:
- dumped weekly
- contains text indices, the file index (for commons) and the metadata index (for the entire cirrus cluster) in json format
- run by a maintenance script in mw:Extension:CirrusSearch (code)
- Issues: it's been quite reliable so far
- Content Translation dumps:
- dumped weekly
- contains parallel corpora that can be used by developers working on machine translation.
- run by a maintenance script in mw:Extension:ContentTranslation (code)
- Issues: it has run out of memory when the language files being dumped have too much data; these can be split apart in order to resolve the problem. Example: see this phab task.
- Media info:
- dumped weekly
- two files for each wiki: one lists the titles of media files stored locally, the other lists the titles of media files used on the project but stored remotely (on Commons).
- run by a shell wrapper around the onallwikis.py script in the operations/dumps repo (code)
- Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
- Page titles:
- dumped daily
- contains a list of all page titles in the main namespace (NS 0) per project
- run by the onallwikis.py script in the operations/dumps repo (code)
- Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
- Media titles:
- dumped daily
- contains a list of all titles in the File namespace (NS 6) per project
- run by the onallwikis.py script in the operations/dumps repo (code)
- Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
- Short url mappings:
- dumped weekly
- each line contains an entry of the form short-url|long-url
- run by the onallwikis.py script in the operations/dumps repo (code)
- Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
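The retry behaviour described for the onallwikis.py jobs above (up to three retries, then give up) can be illustrated with a minimal sketch. This is not the actual code from the operations/dumps repo: the names `run_with_retries` and `DatabaseUnavailable`, the `wait_seconds` parameter, and the reading of "three retries" as one initial attempt plus three further attempts are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DatabaseUnavailable(Exception):
    """Hypothetical error raised when the database server cannot be reached."""


def run_with_retries(
    query: Callable[[], T],
    max_retries: int = 3,
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call query(); on DatabaseUnavailable, retry up to max_retries times.

    Assumption: "three retries" means one initial attempt plus three more.
    After the last retry fails, the exception propagates (the job gives up).
    """
    retries_used = 0
    while True:
        try:
            return query()
        except DatabaseUnavailable:
            if retries_used >= max_retries:
                raise  # retries exhausted: give up
            retries_used += 1
            sleep(wait_seconds)  # pause before the next attempt


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_query() -> str:
        # Simulated query that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] <= 2:
            raise DatabaseUnavailable("server went away")
        return "titles dumped"

    print(run_with_retries(flaky_query, wait_seconds=0))
```

With this shape, a job that still fails after the final retry raises the error to its caller, which fits the note at the top of this page that output from failed cron jobs is sent by email.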