Commons:Batch uploading/Web Gallery of Art


Web Gallery of Art [1] is a very well organized website with over 26k images. We already have at least 2k of their files and possibly many more. I was thinking about a batch upload of their files, but since I have no experience with batch uploads I am coming here for advice and possibly help. WGA provides extensive metadata about the images, so there is no need to scrape the website for the info. The metadata includes the author's name, the name of the institution holding the artwork, and other info that can easily be matched with fields in {{Artwork}}. The only files that should be uploaded are 2D artworks that would fall under {{PD-Art}}. I was thinking about using the following steps:

  1. Prepare space at commons by creating Creator and Institution templates as well as categories to be used by the files.
  2. Download the files to my computer.
  3. Upload them to Commons using some specialized upload script or Commonist. If using Commonist I could initially provide minimal metadata with each file and then replace it with full-blown {{Artwork}} updated from CSV file using AutoWikiBrowser with en:User:Ganeshk/CSVLoader.

Unresolved issues:

  • Downloading files to my computer. I do have a list of all file locations. I was planning to use some download manager software or even write my own script to do so. Any suggestions for simple programs to do that?
# wga-download.csv: column 1 = source URL, column 2 = local file name
X <- read.csv("wga-download.csv", header = TRUE, stringsAsFactors = FALSE)
for (i in seq_len(nrow(X))) download.file(X[i, 1], X[i, 2], mode = "wb")
  • Upload should skip files if an identical file already exists on Commons. We could find the SHA-1 hash for all files in Category:Web Gallery of Art and exclude them from the upload, or use an upload script that does that.

--Jarekt (talk) 18:15, 14 March 2011 (UTC)[reply]
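The hash-based exclusion proposed above could be sketched roughly as follows. This is only an illustration: it assumes the SHA-1 digests of the files already in Category:Web Gallery of Art have been collected beforehand into `known_hashes` (how they are fetched is not shown), and the function names are made up for this example.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_upload(local_dir, known_hashes):
    """List local files whose SHA-1 is not in known_hashes.

    known_hashes is assumed to hold hex digests gathered beforehand for
    the files that are already on Commons.
    """
    known = {h.lower() for h in known_hashes}
    return sorted(
        p for p in Path(local_dir).iterdir()
        if p.is_file() and sha1_of_file(p) not in known
    )
```

Comparing digests only catches byte-identical files; a re-encoded or resized copy of the same painting would still pass this filter.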

Be lazy, just take this source code. Modify it a bit and upload the 24K images. It includes:
  • csv parsing and cleaning up
  • Automatic downloading of each file for upload
  • Dupe checking
  • Standard easy way to generate a title
  • Uploading of the files with correct title and description.
Do you have the csv file only somewhere? Multichill (talk) 20:01, 25 March 2011 (UTC)[reply]
Ah, found the csv files. Multichill (talk) 20:06, 25 March 2011 (UTC)[reply]
Ok. Made a quick mock up to show what's possible:
This is just a quickly made version. Improvements at first sight:
  • Clean up the date some more (use date templates)
  • Split TECHNIQUE into medium and dimensions
  • Do some smart template trick at User:Multichill/WGA to get creator templates from AUTHOR
  • Do some smart template trick at User:Multichill/WGA to get institution templates from LOCATION
  • Filter FORM to upload only free images (graphics & painting)
  • Use FORM, TYPE, SCHOOL & TIMELINE to find categories
  • Use AUTHOR to find categories
Multichill (talk) 21:28, 25 March 2011 (UTC)[reply]
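Splitting TECHNIQUE into medium and dimensions might look something like the sketch below. It assumes the field has the shape "medium, height x width cm" (for example "Oil on canvas, 77 x 53 cm"), which should be checked against the actual CSV; the function name is made up for this example, and values that do not fit the pattern are returned untouched for manual review.

```python
import re

# Assumed shape of a WGA TECHNIQUE value: "<medium>, <height> x <width> cm".
_TECHNIQUE_RE = re.compile(
    r"^(?P<medium>.*?),\s*"
    r"(?P<height>\d+(?:[.,]\d+)?)\s*x\s*"
    r"(?P<width>\d+(?:[.,]\d+)?)\s*(?P<unit>cm|mm)\s*$",
    re.IGNORECASE,
)


def split_technique(value):
    """Split a TECHNIQUE string into (medium, dimensions).

    dimensions is a dict with height, width and unit, or None when the
    value does not match the assumed pattern.
    """
    text = value.strip()
    match = _TECHNIQUE_RE.match(text)
    if match is None:
        return text, None
    dims = {
        # Normalise a decimal comma ("77,5") to a decimal point.
        "height": match.group("height").replace(",", "."),
        "width": match.group("width").replace(",", "."),
        "unit": match.group("unit").lower(),
    }
    return match.group("medium").strip(), dims
```

Returning the parts separately leaves the choice of output templates (such as {{Technique}} and {{Size}}) to a later formatting step.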
Thanks, I already downloaded and named all the files. I also downloaded their csv metadata file and I am processing it in MS Excel and MS Access to convert WGA fields into the Commons templates used with {{Artwork}}. There is still a lot of cleanup to do:
  • Separate out all the images of wall paintings in churches and murals, which are 3D, as well as the many images of paintings where a fancy frame is part of the artwork. (I already filtered out sculptures, etc.)
  • Out of 3k painters in WGA, about 1k names match existing creator names. 500 are unknown artists, for whom we will not have creator pages, and the rest (about 1.5k) either use some alternative version of the name or are artists we do not have. I have already created about 300 creator pages and will soon have another 100, for artists whose names match category names on Commons. I would like to create creator pages for most of the others or match them to existing categories.
  • I will split TECHNIQUE into medium and dimensions and cast them in the format of our templates.
  • I will try to use your script for uploading with checking for dupes. I would also like to fix the descriptions of the 5k WGA files we already have. A great many of them do not use the {{Artwork}} template.
  • I will also try to match LOCATIONS to our Institution templates or categories.
I might try to upload all the "easy" images, with matching names, institutions, etc. first and worry about the others later. Thanks for helping. --Jarekt (talk) 02:30, 27 March 2011 (UTC)[reply]
Did some small updates to the code and User:Multichill/WGA. Do you know what character set they used? Multichill (talk) 16:17, 10 April 2011 (UTC)[reply]
I am not sure, but I am working on a metadata Excel file and all characters seem to be fine. I do not see any of the garbled characters I sometimes struggle with. By the way, in the metadata file I am still working on converting various fields to a format compatible with {{Artwork}}:
  • match AUTHORs to category:creator templates, if one exists (16.3k files do)
  • match LOCATIONs to category:institution templates, if one exists (done for 12.5k files)
  • split TECHNIQUE into "technique" and "dimensions" fields. Convert the "technique" part to use {{Technique}} (done for 19.4k files) and convert the "dimensions" part to use {{Size}} (100% done; 15.5k files have a size).
  • convert DATE to use the ISO standard or an {{Other date}} template (done for 22k files)
  • convert TITLE to use templates from Category:Multilingual tags: Title, if possible
  • match WGA files with the 3442 files already on Commons using SHA-1 hashes (this step will be useful for fixing descriptions of a lot of files already uploaded)
Cleanup steps still planned:
  • Look through WGA files and mark 3D shots of rooms and church interiors which should not be uploaded (I looked through 4k images so far)
  • Create categories for each file:
Once I fix the metadata, so I do not have to be tweaking it after upload, I will try uploading. --Jarekt (talk) 03:10, 11 April 2011 (UTC)[reply]
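For the DATE conversion mentioned in the list above, a rough sketch follows. It handles only three input shapes that are assumed here rather than taken from the WGA file ("1503", "c. 1503" and "1503-06"), and whether a range should map to the "between" form of {{other date}} is a judgment call. Anything else is passed through unchanged so it can be reviewed by hand.

```python
import re


def convert_date(raw):
    """Convert a DATE string to an ISO year or an {{other date}} call.

    Unrecognised values are returned unchanged for manual review.
    """
    text = raw.strip()
    if re.fullmatch(r"\d{4}", text):
        return text
    circa = re.fullmatch(r"c\.\s*(\d{4})", text)
    if circa:
        return "{{other date|circa|%s}}" % circa.group(1)
    span = re.fullmatch(r"(\d{4})\s*-\s*(\d{2}|\d{4})", text)
    if span:
        start, end = span.groups()
        if len(end) == 2:
            # Expand an abbreviated end year using the start year's century,
            # rolling over to the next century for ranges such as "1499-01".
            end = start[:2] + end
            if int(end) < int(start):
                end = str(int(end) + 100)
        return "{{other date|between|%s|%s}}" % (start, end)
    return text
```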
Certainly others! The categories should cover: Location, artist, subject (person(s) in a portrait, Category:Landscape paintings, Category:Paintings of Nativity etc) - these are the most important. In the case of paintings, "Paintings from the Netherlands" should really be picked up via the museum (see the category - in other cases it would be via the artist, unless he is unknown). Other categories, like Category:Paintings of families may apply too. It is very time-consuming to get all the right categories, but very important also. In most cases sub-categories should be used where possible. Johnbod (talk) 19:16, 30 April 2011 (UTC)[reply]
Other categories would have to be added after the upload. --Jarekt (talk) 15:58, 17 May 2011 (UTC)[reply]

A little update:

  • Manually tagged 488 out of 22242 files as 3D and not meeting the requirements of {{PD-art}}. Those are mostly photographs of rooms with paintings, murals and frescoes where the photographer was trying to show the surroundings of the painting.
  • About 400 other files are tagged as having extensive 3D frames or some other 3D elements.
  • I adapted Multichill's code and created a new one. Got it to work - single test upload is here. I am still in the process of verifying and tweaking the code. My current issues are:
  1. Unicode characters in the csv file do not seem to work correctly. Plain CSV files with unicode characters break the csv reader, and re-saving the files as utf-8 or "unicode" in Windows Notepad (a step I sometimes have to do with other unicode CSV files) adds some junk to the column names.
  2. I am not sure whether the duplicate-testing part works correctly.

--Jarekt (talk) 15:58, 17 May 2011 (UTC)[reply]
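The junk that Notepad adds to the column names is most likely a byte order mark (BOM) written at the start of the file. A minimal sketch of reading such a file, assuming Python 3 and a file saved as UTF-8; `read_metadata` is a name made up for this example:

```python
import csv


def read_metadata(path):
    """Read a UTF-8 CSV file into a list of dicts, tolerating a BOM.

    The "utf-8-sig" codec drops a leading byte order mark if present, so
    the first column name is not prefixed with stray characters.
    """
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))
```

The same codec also reads UTF-8 files that have no BOM, so it is safe to use for both kinds of file.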

Opinions

Assigned to: User:Jarekt
Progress:
  • Generated some of the required Creator templates.
  • Uploaded a sample of about 200 files. Still waiting for the bot flag. --Jarekt (talk) 12:51, 1 June 2011 (UTC)[reply]
  • Got the bot flag for User:JarektUploadBot and started uploading in earnest. However, progress is slow since the current py upload function has no option to automatically abort the upload on any warning (or to ignore warnings). As a result, the upload needs user input each time an image with the same hash was previously deleted from Commons, which happens frequently. I will have to modify the code to bypass that. --Jarekt (talk) 14:51, 3 June 2011 (UTC)[reply]
  • Uploaded: 10k images; skipped: 6.5k duplicates, for which I will fix descriptions later; still to upload: ~5k images. --Jarekt (talk) 13:59, 10 June 2011 (UTC)[reply]
  • Attempted to upload 22k files, successfully uploaded over 15k files. The remaining 6k files were not uploaded because they are already on Commons or were deleted from Commons. The next stage will be verifying and correcting the metadata of WGA files uploaded by hand. --Jarekt (talk) 19:35, 7 July 2011 (UTC)[reply]

✓ Done The upload and the metadata synchronization of the files already on Commons are completed. --Jarekt (talk) 13:26, 6 August 2011 (UTC)[reply]

Bot name: User:JarektUploadBot
Category: Category:Images from Web Gallery of Art; Category:Web Gallery of Art maintenance
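The warning handling described in the progress notes (abort without prompting when a file is a duplicate or was deleted earlier) could be expressed as a small policy function. This is only a sketch: the warning keys are assumed to follow the ones returned by the MediaWiki upload API and should be verified against real responses, and `decide_upload` is a name made up for this example, not part of any upload library.

```python
# Warning keys are an assumption based on the MediaWiki upload API;
# check them against the responses the upload code actually receives.
SKIP_WARNINGS = {"duplicate", "exists"}                  # already on Commons
DELETED_WARNINGS = {"duplicate-archive", "was-deleted"}  # deleted earlier


def decide_upload(warnings, skip_deleted=True):
    """Return "upload" or "skip" for a list of warning keys, never prompting.

    Files with no warnings are uploaded. Duplicates and any unrecognised
    warning are skipped. Files that only match previously deleted content
    are skipped unless skip_deleted is False.
    """
    found = set(warnings)
    if not found:
        return "upload"
    if found - DELETED_WARNINGS:
        # At least one warning is a duplicate/exists hit or is unknown.
        return "skip"
    return "skip" if skip_deleted else "upload"
```

Skipping on unknown warnings is the conservative choice here; skipped files can be logged and reviewed later instead of blocking the whole run.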
title field

I am a bit late, but it seems very good on the whole. I just chanced upon a small thing: "{{Portrait of a Boy}} with a book". I don't think it is a very good idea to use a template if it only translates one part of the title. I don't know if it happened on other pages, but could it be avoided while still using templates when relevant? I would also support wrapping the title in {{En}} when no template is used, or using {{Title}}, which adds bold and italics (I think it looks nicer). {{title|en=Angels hanging around}} is basically a langSwitch, and {{title|self-portrait}} without any language parameter calls {{Self-portrait}}. --Zolo (talk) 09:31, 8 June 2011 (UTC)[reply]

Good point. I did not notice "{{Portrait of a Boy}} with a book". That is a mistake and I will make sure I either translate the whole thing or nothing. I will also look into using the title template. --Jarekt (talk) 12:44, 8 June 2011 (UTC)[reply]
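The "whole thing or nothing" rule could be sketched as below. The set `whole_title_templates` is hypothetical: it stands for whatever list of title templates is available, and `format_title` is a name made up for this example. A template is used only when it matches the entire title; otherwise the title is wrapped in {{title|en=...}} as suggested above.

```python
def format_title(title, whole_title_templates):
    """Return wikitext for the title field.

    whole_title_templates is a hypothetical set of template names that
    translate a complete title (for example "Self-portrait").
    """
    cleaned = title.strip()
    if cleaned in whole_title_templates:
        # The whole title is covered, so let {{title}} call the template.
        return "{{title|%s}}" % cleaned
    # No exact match: keep the English title intact, never translate a part.
    return "{{title|en=%s}}" % cleaned
```

With this rule, "Portrait of a Boy with a book" would stay a plain English title even if a "Portrait of a Boy" template exists. Titles containing "|" or "=" would need escaping, which this sketch does not handle.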