User:GreenC/WaybackMedic 2.1
This is an old version. The latest version is WaybackMedic 2.5.
WaybackMedic 2.1 is a bot that adds and maintains links from the list of known web archive services in use on the English Wikipedia.
Edits made after 2017-01-07 are by version 2.1.
The bot operator is User:GreenC. The bot account is User:GreenC bot. The bot (software) is "WaybackMedic".
- WM fixes
Fix number | Function name | Example edit | Description | Notes | Date added |
---|---|---|---|---|---|
1 | fixthespuriousone | Example | Remove spurious \|1= in cite templates. | | August 2016 |
2 | fixmissingprotocol | Example | 1. Add https if the protocol is missing from the archive.org URL. 2. Convert existing protocol http to https. 3. Add the subdomain web if missing (archive.org/web/ → web.archive.org/web/). 4. Add the /web/ path (web.archive.org/2016/ → web.archive.org/web/2016/). In some URLs adding /web/ breaks the link; test for those. | HTTPS per RFC | August 2016 |
3 | fixemptyarchive | Example | 1. If \|archiveurl= is empty or missing but \|archivedate= has content, attempt to find a working archive URL based on the archive date, otherwise add {{dead link}} if appropriate. 2. If \|archivedate= is empty or missing but \|archiveurl= has content, generate the date value from the timestamp in the archive URL. 3. If \|archiveurl= and \|archivedate= are both empty, remove both and leave a {{dead link}} if appropriate. | | August 2016 |
4 | fixbadstatus | Example | Check all Wayback Machine URLs for response code errors (anything but 200s). On an error code, try for a better URL via the Wayback API, first using the accessdate, then using the earliest date available. If none is found there, check the WebCite API, then the Memento API, which checks a few dozen other archives. Other techniques are undocumented. If still none is found, remove \|archiveurl= and \|archivedate= and add {{dead link}}. | | August 2016 |
5 | Retired | | | | |
6 | fixemptywayback | Example | The wayback template is mangled in a certain way. Action: re-assemble. It won't delete multiple instances if they exist in the same ref (as in the Example). | | August 2016 |
7 | fixencodedurl | Example | The URL was incorrectly encoded. Fully decode the URL and re-encode. | | August 2016 |
8 | fixdatemismatch | Example | 1. Ensure \|archivedate= matches the snapshot date in the URL. 2. Ensure the date format matches dmy or mdy if set (retain ymd if in use). | | August 2016 |
9 | fixwebcitlong | Example Example | Convert WebCite URLs from short-form to long-form. Convert Freezepage.com URLs from short-form to long-form. | WebCite Usage | January 2017 |
10 | fixstraydt | Example | Remove a stray {{dead link}} template when an archive exists for the link. | | January 2017 |
11 | fixwam | Example | Merge {{wayback}} and {{webcite}} --> {{webarchive}}. Merge completed February 5, 2017. | Webarchive TfM | January 2017 |
12 | fixiats | Example | archive url -> \|archive-url | | January 2017 |
13 | fixswitchurl | Example | Move an archive.org URL from \|url= to \|archiveurl= and add \|archivedate= if missing. | | January 2017 |
14 | Retired | | | | |
15 | fixembway | Example Example | 1. A {{wayback}} is embedded in a CS template. 2. A {{dead link}} is embedded in a CS template. | | January 2017 |
16 | <various> | Example | Timestamp and/or \|archivedate= is 19700101 and/or out-of-bounds. | | January 2017 |
17 | fixdoubleurl | Example | archive.org URLs are doubled, tripled, etc. | | January 2017 |
18 | fixemptywebarchive | Example | {{webarchive}} \|date= is missing or has an empty value. | | January 2017 |
19 | fixdoublewebarchive | Example | Remove duplicate {{webarchive}} instances. | | January 2017 |
20 | fixembwebarchive | Example | A {{cite web}} is embedded in a {{webarchive}}. | | January 2017 |
21 | fixarchiveis | Example Example | 1. Convert Archive.is URLs from short-form to long-form. 2. Fix URL encoding of broken links. | Archive.is Usage | January 2017 |
22 | fixitems | Example | Change "/items/" URLs that are using machine IDs. | BRFA | January 2017 |
23 | encodemag | Example | Convert MediaWiki encoding to URL encoding in URLs (i.e. {{!}} and {{=}}). | RFC3986 | January 2017 |
24 | decodespace | Example | Convert %20 to +, + to %20, etc. in URLs that can be repaired this way. | See also | June 2017 |
25 | waytree_trailgarb | Example Example Example | Remove typical garbage characters found at the end of URLs: .,;:-"l(%XX)('') | | February 2018 |
26 | fixcommentarchive | Example | Open up commented-out archives and add a \|deadurl= "yes" or "no". | | February 2018 |
27 | waytree_x2encoding | Example | Repair double URL-encoding (e.g. %3A that was re-encoded as %253A). | | February 2018 |
28 | fixencodebug | Example | Repair missed URL-encoding of square brackets. | T186417 | February 2018 |
29 | fixiats | Example Example | Restore truncated Wayback URL. | | February 2018 |
30 | fixiats | Example | Convert \|title={title } -> \|title=Archived copy | T203865 | September 2018 |
31 | urlchanger | Example | Move broken URL to a new working URL and undo previous archives. | BOTREQ | November 2018 |
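As an illustration of fixes 2, 3 and 8 in the table, here is a minimal Python sketch of the URL and date handling they describe. It is not WaybackMedic's code: the function names `normalize_wayback_url`, `snapshot_date` and `format_archivedate` are hypothetical, the regular expressions are assumptions about what the URLs look like, and the check for links that break when /web/ is added is left out.

```python
import re
from datetime import date, datetime
from typing import Optional

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def normalize_wayback_url(url: str) -> str:
    """Hypothetical helper: apply steps 1-4 of fix 2 to an archive.org URL."""
    # Steps 1-2: drop any leading scheme so https can be applied uniformly.
    rest = re.sub(r"^(?:https?:)?//", "", url.strip(), flags=re.IGNORECASE)
    # Anything that is not an archive.org or web.archive.org URL is left alone.
    if not re.match(r"(?:web\.)?archive\.org/", rest, flags=re.IGNORECASE):
        return url
    # Step 3: archive.org/web/ -> web.archive.org/web/
    rest = re.sub(r"^archive\.org/web/", "web.archive.org/web/", rest,
                  flags=re.IGNORECASE)
    # Step 4: web.archive.org/2016/ -> web.archive.org/web/2016/
    # (the table notes this can break some links; that test is omitted here).
    rest = re.sub(r"^web\.archive\.org/(\d{4,14}/)", r"web.archive.org/web/\1",
                  rest, flags=re.IGNORECASE)
    return "https://" + rest


def snapshot_date(archive_url: str) -> Optional[date]:
    """Hypothetical helper: read the snapshot date from a Wayback URL timestamp."""
    match = re.search(r"/web/(\d{8})", archive_url)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Digits that do not form a valid calendar date.
        return None


def format_archivedate(d: date, style: str = "ymd") -> str:
    """Format a date as dmy or mdy if requested, otherwise keep ymd."""
    month = MONTHS[d.month - 1]
    if style == "dmy":
        return f"{d.day} {month} {d.year}"
    if style == "mdy":
        return f"{month} {d.day}, {d.year}"
    return d.isoformat()
```

For example, `normalize_wayback_url("archive.org/web/2016/http://example.com")` yields `https://web.archive.org/web/2016/http://example.com`, and a URL outside archive.org is returned unchanged.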
- Technical details
- Real-time operations, no link database.
- Uses many APIs, including the Internet Archive, Memento and WebCite APIs and "Timemap" APIs at individual service sites
- Multiple HTTP header status code checks at the application (WaybackMedic) layer
- Additional time-out & retries built-in to the web transfer libraries.
- Additional operating-procedure level checks against network and other errors - semi-supervised.
- Multiple redundant checks of the APIs using multiple dates to ensure a page really is unavailable
- Accepts API results but then verifies by looking at page headers and/or contents
- If IA returns a 404 page reading "Bummer. The machine that serves this file is down.", treat it as a code 200 and leave the link alone.
- If a link is policy-blocked by robots.txt, log it but leave it alone, since the robots block may be lifted in the future by the site owner or IA.
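The verification rules in this list can be sketched as a small decision function. This is only an illustration under stated assumptions, not the bot's implementation: `classify_snapshot`, its parameters, and the three result strings are hypothetical names, and the caller is assumed to have already fetched the page and determined whether robots.txt blocks it.

```python
# Text of the IA error page that the bot treats as a temporary outage.
BUMMER_DOWN = "The machine that serves this file is down"


def classify_snapshot(status: int, body: str, robots_blocked: bool = False) -> str:
    """Hypothetical helper: decide what to do with one archive link.

    Returns "ok" (leave the link alone), "log_and_leave" (robots.txt block)
    or "dead" (look for a replacement archive).
    """
    if robots_blocked:
        # Policy block: log it, but the block may be lifted later.
        return "log_and_leave"
    if status == 404 and BUMMER_DOWN in body:
        # Server-side outage at IA, treated like a code 200.
        return "ok"
    if 200 <= status < 300:
        return "ok"
    # Anything else counts as an error, whatever the API claimed.
    return "dead"
```

Keeping the decision separate from the network fetch means the same rules can be applied to a result reported by an API and to the status and contents actually observed for the page.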
Statistics
The bot runs through a batch of articles about every 2-3 months, taking a break in between. Below are some statistics from the first two runs.
Run 1: 2016-12-15 -> 2017-03-15
From December 15, 2016 to March 15, 2017, WaybackMedic processed 336,271 articles. This set represents articles edited by InternetArchiveBot from July 2016 -> February 2017, plus articles requiring a merge of {{wayback}} -> {{webarchive}}. WaybackMedic made 115,066 changes in 47,810 articles. All changes are logged and available on request, e.g. which articles had the Archive.is URL fix. Diffs of each article pre- and post-edit are also saved and searchable.
- Bummer : 507 (Wayback links that return "Bummer page not found")
- Robots.txt : 6477 (Wayback links blocked by robots.txt)
- Bogusapi : 13273 (Wayback API-returned links that don't match real status code)
- API mismatch : 17117 (Wayback API returned fewer records than sent.)
- JSON mismatch : 28972 (Wayback API returned different size JSON)
- Discovered : 47810 (Number of articles edited by WaybackMedic)
- Log 404 : 9894 (Dead wayback links)
- Log emptyarch : 2001 (Empty archiveurl arguments)
- Log emptyway : 0 (Ref has an empty {{wayback}})
- Log encode : 0 (URL misencoded)
- Log spurious 1 : 191 (Spurious "|1=" parameter)
- Log trail : 3 (URL has a trailing bad character)
- Log dead URL : 185 (|url= is dead even though dead-url=no, archiveurl is dead and no {{dead}})
- Log skindeep : 8466 (changes to URL are skindeep)
- Log doubleurl : 416 (Double archive.org URL error)
- Log datemismatch: 27433 (Date in archive URL doesn't match archivedate argument in cite template)
- Log wrong https : 895 (https and :80 conflict)
- Log WAM : 32198 (webarchive merge)
- Log stray dead : 2709 (stray {{dead link}} - straydt.awk)
- Log WC|IS->IA : 1022 (Convert WebCite|Archive.is to Wayback et al.)
- Log short url : 10552 (WebCite URL elongated - webcitlong.awk)
- Log short url : 522 (Archive.is URL elongated - archiveis.awk)
- Log citeaddl : 256 (webarchive merge - citeaddl.awk)
- Log nowikiway : 41 (Wayback mangled a certain way)
- Log br bug : 0 (br bug)
- Log miss timest : 3043 (Timestamp missing from IA URL)
- Log embeded way : 559 (embedded wayback template in cite template)
- Log embeded wa : 18 (embedded cite template in webarchive template)
- Log switch URL : 6051 (archive in url= field)
- Log dead /items/: 281 (/items/ URL dead replacement)
- Log x2 webarch : 2311 (double webarchive template)
- Log pct encode : 15 (pct encode magic characters in URLs)
- New alt archive : 1009 (Replaced with archive URL found at Mementoweb.org)
- New IA link : 509 (Added new IA link)
- New IA date : 1642 (Changed snapshot date)
- Redirects : 52 (Page was a redirect)
- Zombie links : 650 (Links needing removal by hand)
- Wayback RM : 2836 (Wayback link deleted)

Links found
- Wayback All : 1099355 (Wayback links total found)
- WebCite All : 39120 (WebCite links total found)
- Archive.is All : 1288 (Archive.is links total found)
- Loc.gov All : 410 (Loc.gov links total found)
- Portugal All : 180 (Portugal links total found)
- Stanford All : 30 (Stanford links total found)
- Archive-it All : 76 (Archive-it.org links total found)
- Bibalex All : 17 (Bibalex.org links total found)
- NatArchiveUK All: 4668 (National Archives (UK) links total found)
- Europa Archives : 2 (Europa Archives (Ireland) links total found)
- Perma.cc All : 0 (Perma.CC links total found)
- PRONI All : 0 (PRONI links total found)
- UK Parliament : 1 (UK Parliament links total found)
- UK Web Archive : 125 (UK Web Archive (British Library) links total found)
- Canada All : 68 (Canada links total found)
- Catalonian All : 1 (Catalonian links total found)
- Singapore Archiv: 10 (Singapore Archives links total found)
- Slovenian Archiv: 1 (Slovenian Archives links total found)
- Freezepage.com : 1524 (Freezepage.com links total found)
- Webharvest.gov : 4 (US Nat. Archives links total found)
- NLA AU ALL : 2610 (AU Nat. Archives links total found)
- archiveorg items: 419 (Archive.org /items/ total found)
Run 2: 2017-03-19 -> 2017-04-07
From March 19, 2017 to April 7, 2017, WaybackMedic processed 149,195 articles. These were all articles on English Wikipedia containing a {{dead link}} template. WaybackMedic checked each tagged link and replaced it with a working archive if available. It also made other standard fixes. The number of links saved was 31,317.
- Archive.org: 12,804
- Archive.is: 16,541
- Webcite: 20
- Library of Congress: 413
- National Archives UK: 405
- NLA Australia: 6
- arquivo.pt (Portugal): 284
- Stanford University: 17
- Archive-It.org: 584
- BibAlex: 86
- National Archives Iceland: 61
- Europa Archives Ireland: 29
- Proni Web Archives: 6
- Parliament UK: 34
- UK Web Archive (British Library): 55
- Canada: 8
The Archive.is count is so high because most of the articles had already been checked for archive.org saves on previous runs of IABot. Archive.is has many pages unavailable anywhere else, and WaybackMedic is the only bot adding Archive.is links. Generally, WaybackMedic uses Archive.is as a last resort.