User:GreenC/WaybackMedic 2.1

This is an old version. The latest version is WaybackMedic 2.5.

WaybackMedic 2.1 is a bot that adds and maintains archive links from the list of known web archive services in use on the English Wikipedia.

Edits made after 2017-01-07 are by version 2.1.

The bot operator is User:GreenC. The bot account is User:GreenC bot. The bot (software) is "WaybackMedic".

WM fixes

WaybackMedic Fixes
Fix number	Function name	Example edit	Description	Notes	Date added
1	fixthespuriousone	Example	Remove spurious `\|1=` in cite templates.		August 2016
2	fixmissingprotocol	Example	1. Add https if the protocol is missing from the archive.org URL. 2. Convert an existing http protocol to https. 3. Add the web subdomain if missing (archive.org/web/ → web.archive.org/web/). 4. Add the /web/ path if missing (web.archive.org/2016/ → web.archive.org/web/2016/). For some URLs adding /web/ breaks the link, so those are tested for.	HTTPS per RFC	August 2016
3	fixemptyarchive	Example	1. If `\|archiveurl=` is empty or missing but `\|archivedate=` has content, attempt to find a working archive URL based on the archive date, otherwise add `{{dead link}}` if appropriate. 2. If `\|archivedate=` is empty or missing but `\|archiveurl=` has content, generate date value based on timestamp in the archive URL. 3. If `\|archiveurl=` and `\|archivedate=` are empty, remove both and leave a `{{dead link}}` if appropriate.		August 2016
4	fixbadstatus	Example	Check all Wayback Machine URLs for response code errors (anything but 200s). On an error code, look for a better URL via the Wayback API, first using the accessdate, then the earliest date available. If none is found there, check the WebCite API, then the Memento API, which checks a few dozen other archives. Other techniques are undocumented. If still none is found, remove `\|archiveurl=` and `\|archivedate=` and add `{{dead link}}`.		August 2016
5	Retired
6	fixemptywayback	Example	The wayback template is mangled in a certain way. Action: re-assemble. It won't delete multiple instances if they exist in the same ref (as in the Example).		August 2016
7	fixencodedurl	Example	The URL was incorrectly encoded. Fully decode URL and re-encode.		August 2016
8	fixdatemismatch	Example	1. Ensure `\|archivedate=` matches the snapshot date in the URL. 2. Ensure the date format matches dmy or mdy if set (retain ymd if in use).		August 2016
9	fixwebcitlong	Example Example	1. Convert WebCite URLs from short-form to long-form. 2. Convert Freezepage.com URLs from short-form to long-form.	WebCite Usage	January 2017
10	fixstraydt	Example	Remove stray `{{dead link}}` template when an archive exists for the link		January 2017
11	fixwam	Example	Merge `{{wayback}}` and `{{webcite}}` into `{{webarchive}}`. Merge completed February 5, 2017.	Webarchive TfM	January 2017
12	fixiats	Example	Repair malformed parameter name (archive url -> `\|archive-url`).		January 2017
13	fixswitchurl	Example	Move an archive.org URL from `\|url=` to `\|archiveurl=` and add `\|archivedate=` if missing.		January 2017
14	Retired
15	fixembway	Example Example	1. A `{{wayback}}` is embedded in a CS template. 2. A `{{dead link}}` is embedded in a CS template.		January 2017
16	<various>	Example	Timestamp and/or `\|archivedate=` is 19700101 and/or out-of-bounds.		January 2017
17	fixdoubleurl	Example	archive.org URLs are doubled, tripled, etc.		January 2017
18	fixemptywebarchive	Example	`{{webarchive}}` `\|date=` is missing or empty value.		January 2017
19	fixdoublewebarchive	Example	Remove duplicate `{{webarchive}}` instances.		January 2017
20	fixembwebarchive	Example	A `{{cite web}}` is embedded in a `{{webarchive}}`		January 2017
21	fixarchiveis	Example Example	1. Convert Archive.is URLs from short-form to long-form. 2. Fix URL encoding of broken links.	Archive.is Usage	January 2017
22	fixitems	Example	Change "/items/" URLs that are using machine IDs	BRFA	January 2017
23	encodemag	Example	Convert MediaWiki encoding to URL encoding in URLs (i.e. {{!}} and {{=}})	RFC3986	January 2017
24	decodespace	Example	Convert %20 to +, + to %20, etc. in URLs that can be repaired this way	See also	June 2017
25	waytree_trailgarb	Example Example Example	Remove typical garbage characters found at the end of URLs: .,;:-"l(%XX)('')		February 2018
26	fixcommentarchive	Example	Uncomment commented-out archives and add `\|deadurl=` with the value "yes" or "no"		February 2018
27	waytree_x2encoding	Example	Repair double URL-encoding, e.g. %253A -> %3A		February 2018
28	fixencodebug	Example	Repair missed URL-encoding of square brackets	T186417	February 2018
29	fixiats	Example Example	Restore truncated Wayback URL		February 2018
30	fixiats	Example	Convert `\|title={title}` -> `\|title=Archived copy`	T203865	September 2018
31	urlchanger	Example	Move broken URL to a new working URL and undo previous archives.	BOTREQ	November 2018
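
A few of the URL repairs in the table can be sketched in a handful of lines. The Python below is a minimal illustration of fix 2 (protocol, subdomain and /web/ path), fixes 3 and 8 (deriving the archive date from the snapshot timestamp) and fix 27 (double URL-encoding). The function names and regular expressions are assumptions made for this example and are not taken from the WaybackMedic source. The sketch also skips the checks the table mentions, such as testing whether adding /web/ breaks the link.

```python
import re
from datetime import date, datetime

# Illustrative helpers only; names and regexes are assumptions,
# not taken from the WaybackMedic source.

# Optional scheme, optional "web." subdomain, optional "/web/" path,
# then a 4-14 digit Wayback timestamp followed by the rest of the URL.
_WAYBACK = re.compile(
    r"^(?:https?:)?(?://)?(?:web\.)?archive\.org/(?:web/)?(\d{4,14}.*)$",
    re.IGNORECASE,
)


def normalize_wayback_url(url):
    """Fix 2 sketch: force https, the web. subdomain and the /web/ path."""
    match = _WAYBACK.match(url.strip())
    if match is None:
        return url  # not a Wayback snapshot URL; leave it untouched
    return "https://web.archive.org/web/" + match.group(1)


def snapshot_date(url):
    """Fixes 3 and 8 sketch: read the date from a 14-digit timestamp."""
    match = re.search(r"/web/(\d{14})", url)
    if match is None:
        return None  # partial or missing timestamp
    try:
        return datetime.strptime(match.group(1), "%Y%m%d%H%M%S").date()
    except ValueError:
        return None  # digits that do not form a real date and time


def repair_double_encoding(url):
    """Fix 27 sketch: collapse %25XX back to %XX."""
    return re.sub(r"%25([0-9A-Fa-f]{2})", r"%\1", url)
```

Note that `repair_double_encoding` would also rewrite a legitimately encoded percent sign that happens to be followed by two hex digits, so a real bot would need to verify the repaired URL before saving it.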

Technical details

Real-time operations, no link database.
Many APIs including Internet Archive, Memento, WebCite and "Timemap" APIs at individual service sites
Multiple HTTP header status code checks at the application (WaybackMedic) layer
Additional time-out & retries built-in to the web transfer libraries.
Additional operating-procedure level checks against network and other errors - semi-supervised.
Multiple redundant checks of the APIs using multiple dates to ensure a page really is unavailable
Accepts API results but then verifies by looking at page headers and/or contents
If IA returns a 404 with the page "Bummer. The machine that serves this file is down.", treat it as a code 200 and leave the link alone.
If a link is policy-blocked by robots.txt, log it but leave it alone; in the future, the robots.txt block may be lifted by the site owner or IA.
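
The decision rules in this list can be sketched as a small classifier. The Python below is a hypothetical illustration only: the names `Verdict`, `classify_snapshot` and `link_is_dead` are assumptions, as is the idea that the caller already knows whether a link is blocked by robots.txt. It shows the "Bummer" exception, the robots.txt rule, and the principle that a link counts as dead only when every redundant check agrees.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"                    # snapshot works, nothing to do
    LEAVE_ALONE = "leave alone"  # log it, but do not edit the link
    DEAD = "dead"                # candidate for replacement or removal


# Text of the Internet Archive error page quoted in the list above.
BUMMER_MARKER = "The machine that serves this file is down"


def classify_snapshot(status_code, body="", robots_blocked=False):
    """Classify one check of an archive URL (hypothetical helper)."""
    if robots_blocked:
        # The block may be lifted later by the site owner or IA.
        return Verdict.LEAVE_ALONE
    if status_code == 404 and BUMMER_MARKER in body:
        # A server on the archive side is down; treat it as a 200.
        return Verdict.OK
    if 200 <= status_code < 300:
        return Verdict.OK
    return Verdict.DEAD


def link_is_dead(verdicts):
    """Dead only if there was at least one check and all of them agree."""
    return bool(verdicts) and all(v is Verdict.DEAD for v in verdicts)
```

For example, `link_is_dead([classify_snapshot(404), classify_snapshot(200)])` is false, because one of the two checks succeeded.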

Statistics

The bot runs through a batch of articles about every 2-3 months taking a break in-between. Below are some stats from the first two runs.

Run 1: 2016-12-15 -> 2017-03-15

From December 15, 2016 to March 15, 2017, WaybackMedic processed 336,271 articles. This set represents articles edited by InternetArchiveBot from July 2016 -> February 2017 plus articles requiring a merge of {{wayback}} -> {{webarchive}}. WaybackMedic made 115,066 changes in 47,810 articles. All changes are logged and available on request, e.g. which articles had the Archive.is URL fix. Diffs of each article pre- and post-edit are also saved and searchable.

Bummer          : 507     (Wayback links that return "Bummer page not found")
Robots.txt      : 6477    (Wayback links blocked by robots.txt)
Bogusapi        : 13273   (Wayback API-returned links that don't match real status code)
API mismatch    : 17117   (Wayback API returned fewer records than sent.)
JSON mismatch   : 28972   (Wayback API returned different size JSON)
Discovered      : 47810   (Number of articles edited by WaybackMedic)
Log 404         : 9894    (Dead wayback links)
Log emptyarch   : 2001    (Empty archiveurl arguments)
Log emptyway    : 0       (Ref has an empty {{wayback}})
Log encode      : 0       (URL misencoded)
Log spurious 1  : 191     (Spurious "|1=" parameter)
Log trail       : 3       (URL has a trailing bad character)
Log dead URL    : 185     (|url= is dead even though dead-url=no, archiveurl is dead and no {{dead}})
Log skindeep    : 8466    (changes to URL are skindeep)
Log doubleurl   : 416     (Double archive.org URL error)
Log datemismatch: 27433   (Date in archive URL doesn't match archivedate argument in cite template)
Log wrong https : 895     (https and :80 conflict)
Log WAM         : 32198   (webarchive merge)
Log stray dead  : 2709    (stray {{dead link}} - straydt.awk)
Log WC|IS->IA   : 1022    (Convert WebCite|Archive.is to Wayback et al.)
Log short url   : 10552   (WebCite URL elongated - webcitlong.awk)
Log short url   : 522     (Archive.is URL elongated - archiveis.awk)
Log citeaddl    : 256     (webarchive merge - citeaddl.awk)
Log nowikiway   : 41      (Wayback mangled a certain way)
Log br bug      : 0       (br bug)
Log miss timest : 3043    (Timestamp missing from IA URL)
Log embeded way : 559     (embedded wayback template in cite template)
Log embeded wa  : 18      (embedded cite template in webarchive template)
Log switch URL  : 6051    (archive in url= field)
Log dead /items/: 281     (/items/ URL dead replacement)
Log x2 webarch  : 2311    (double webarchive template)
Log pct encode  : 15      (pct encode magic characters in URLs)
New alt archive : 1009    (Replaced with archive URL found at Mementoweb.org)
New IA link     : 509     (Added new IA link)
New IA date     : 1642    (Changed snapshot date)
Redirects       : 52      (Page was a redirect)
Zombie links    : 650     (Links needing removal by hand)
Wayback RM      : 2836    (Wayback link deleted)

Links found

Wayback All     : 1099355 (Wayback links total found)
WebCite All     : 39120   (WebCite links total found)
Archive.is All  : 1288    (Archive.is links total found)
Loc.gov All     : 410     (Loc.gov links total found)
Portugal All    : 180     (Portugal links total found)
Stanford All    : 30      (Stanford links total found)
Archive-it All  : 76      (Archive-it.org links total found)
Bibalex All     : 17      (Bibalex.org links total found)
NatArchiveUK All: 4668    (National Archives (UK) links total found)
Europa Archives : 2       (Europa Archives (Ireland) links total found)
Perma.cc All    : 0       (Perma.CC links total found)
PRONI All       : 0       (PRONI links total found)
UK Parliament   : 1       (UK Parliament links total found)
UK Web Archive  : 125     (UK Web Archive (British Library) links total found)
Canada All      : 68      (Canada links total found)
Catalonian All  : 1       (Catalonian links total found)
Singapore Archiv: 10      (Singapore Archives links total found)
Slovenian Archiv: 1       (Slovenian Archives links total found)
Freezepage.com  : 1524    (Freezepage.com links total found)
Webharvest.gov  : 4       (US Nat. Archives links total found)
NLA AU ALL      : 2610    (AU Nat. Archives links total found)
archiveorg items: 419     (Archive.org /items/ total found)

Run 2: 2017-03-19 -> 2017-04-07

From March 19, 2017 to April 7, 2017, WaybackMedic processed 149,195 articles. These were all articles on English Wikipedia containing a {{dead link}} template. WaybackMedic checked each tagged link and replaced it with a working archive if available. It also made other standard fixes. The number of links saved was 31,317.

Archive.org: 12,804
Archive.is: 16,541
Webcite: 20
Library of Congress: 413
National Archives UK: 405
NLA Australia: 6
arquivo.pt (Portugal): 284
Stanford University: 17
Archive-It.org: 584
BibAlex: 86
National Archives Iceland: 61
Europa Archives Ireland: 29
Proni Web Archives: 6
Parliament UK: 34
UK Web Archive (British Library): 55
Canada: 8

The Archive.is count is so high because most of the articles had already been checked for archive.org saves on previous runs of IABot. Archive.is has many pages unavailable anywhere else, and WaybackMedic is the only bot adding Archive.is links. Generally WaybackMedic uses Archive.is as a last resort.

