Since the effect of running Tidy on MW Parser main pass output is poorly specified, I suggest parsing the MW Parser output using the HTML 5 algorithm and then reserializing the DOM for output.
This is what Parsoid is already doing, and Gabriel reports that the behaviour is similar to Tidy.
MWTidy::tidy() would become an abstract wrapper for the following backends:
- External tidy
- Internal tidy
- New web service (Html5Depurate)
- Existing pure-PHP code in Parser.php around line 1326, labelled "bug #2702"
- Future pure-PHP code. When a compliant pure-PHP HTML 5 parser becomes available, it could be used as a low-performance backend to replace the bug 2702 code.
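The backend list above amounts to one entry point with interchangeable implementations behind it. A minimal sketch of that shape follows. All names in it (TidyBackend, PassThroughBackend, CallbackBackend, makeTidyBackend() and the 'driver' and 'callback' keys) are hypothetical and chosen for illustration; they are not taken from MediaWiki.

```php
<?php
// Illustrative sketch only. TidyBackend, PassThroughBackend, CallbackBackend
// and makeTidyBackend() are hypothetical names, as are the config keys.

interface TidyBackend {
	/** Clean up the HTML produced by the parser's main pass. */
	public function tidy( string $html ): string;
}

/** Backend that returns its input unchanged (a null object). */
class PassThroughBackend implements TidyBackend {
	public function tidy( string $html ): string {
		return $html;
	}
}

/** Backend that delegates to any callable, e.g. a shell-out or HTTP client. */
class CallbackBackend implements TidyBackend {
	/** @var callable */
	private $callback;

	public function __construct( callable $callback ) {
		$this->callback = $callback;
	}

	public function tidy( string $html ): string {
		return (string)call_user_func( $this->callback, $html );
	}
}

/** Pick a backend from a configuration array; callers only see the interface. */
function makeTidyBackend( array $config ): TidyBackend {
	$driver = $config['driver'] ?? 'passthrough';
	switch ( $driver ) {
		case 'passthrough':
			return new PassThroughBackend();
		case 'callback':
			return new CallbackBackend( $config['callback'] );
		default:
			throw new InvalidArgumentException( "Unknown driver: $driver" );
	}
}

// Usage: the stand-in callable appends a closing tag purely for demonstration.
$backend = makeTidyBackend( [
	'driver' => 'callback',
	'callback' => function ( string $html ): string {
		return $html . '</p>';
	},
] );
echo $backend->tidy( '<p>Hello' ), "\n";
```

Keeping the choice of backend inside a factory means a slower pure-PHP implementation could later be swapped in without touching callers.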
A new configuration variable, $wgTidyConfig, has been introduced to control backend selection. Existing $wgUseTidy setups can be mapped onto it along these lines:
if ( $wgUseTidy ) {
	$wgTidyConfig = array( 'cmd' => $wgTidyBin );
}
See: https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy
Update June 2016
Backend abstraction is complete. The new web service (Html5Depurate) is essentially complete, and packages are available from apt.wikimedia.org. Tim is working on a pure-PHP equivalent.
We created a testing system which renders a large sample of articles with both Tidy and Depurate, generates screenshots, and compares the results visually.
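The text does not say how the visual difference is measured. As a rough sketch, assuming a plain per-pixel comparison of two equally sized screenshots, the metric could look like the following. pixelDiffRatio() and classifyDiff() are hypothetical helpers, and images are modelled as flat arrays of integer colour values so that no image library is needed.

```php
<?php
// Illustrative sketch only: pixelDiffRatio() and classifyDiff() are
// hypothetical helpers, and the per-pixel metric is an assumption.

/**
 * Fraction of pixels that differ between two equally sized images,
 * each given as a flat array of integer colour values.
 */
function pixelDiffRatio( array $a, array $b ): float {
	$a = array_values( $a );
	$b = array_values( $b );
	if ( count( $a ) !== count( $b ) ) {
		throw new InvalidArgumentException( 'Images must have the same size' );
	}
	$total = count( $a );
	if ( $total === 0 ) {
		return 0.0;
	}
	$differing = 0;
	foreach ( $a as $i => $pixel ) {
		if ( $pixel !== $b[$i] ) {
			$differing++;
		}
	}
	return $differing / $total;
}

/** Bucket a ratio the way the results are reported: identical, under 1%, or other. */
function classifyDiff( float $ratio ): string {
	if ( $ratio === 0.0 ) {
		return 'pixel-perfect';
	}
	if ( $ratio < 0.01 ) {
		return 'under 1%';
	}
	return 'needs review';
}

// Usage: one of four pixels differs between the two renderings.
$tidyRender = [ 0xFFFFFF, 0xFFFFFF, 0x000000, 0xFFFFFF ];
$depurateRender = [ 0xFFFFFF, 0xFFFFFF, 0x000000, 0x0000FF ];
echo classifyDiff( pixelDiffRatio( $tidyRender, $depurateRender ) ), "\n";
```

A real comparison would likely need some tolerance for anti-aliasing and layout shifts, which a strict per-pixel count does not provide.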
In order to reduce the number of visible differences for an initial deployment, we added a "compatibility" endpoint to the Depurate API, which mimics Tidy's p-wrapping behaviour, and marks empty li, p and tr elements with a class so that they can be hidden with CSS.
Despite this, we still see significant differences, such as:
- Navbox lists composed of nowrap spans sometimes end up being completely nowrapped, running off the right margin, either due to editor error or a MediaWiki parser bug which generates invalid HTML.
- Active formatting element (AFE) reconstruction causes certain unclosed tags such as <i> to run on to the end of the page, instead of running on only to the end of the enclosing element.
The main question now is: what should our deployment plan be?
- Are we close enough now in visual diff testing to call that part of the project done? (96.79% of the sampled articles showed less than 1% difference, and 93.35% rendered with pixel-perfect accuracy.)
- What tools should we provide to editors to migrate the remaining broken pages? Some issues (e.g. adjacent nowrap spans) are difficult to detect automatically.