Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool
Closed, Resolved (Public)

Authored By

	tstarling
	Feb 12 2015, 6:18 AM

Description

Since the effect of running Tidy on MW Parser main pass output is poorly specified, I suggest parsing the MW Parser output using the HTML 5 algorithm and then reserializing the DOM for output.

This is what Parsoid is already doing, and Gabriel reports that the behaviour is similar to Tidy.

MWTidy::tidy() would become an abstract wrapper for the following backends:

External tidy
Internal tidy
New web service (Html5Depurate)
Existing pure-PHP code in Parser.php around line 1326, labelled "bug #2702"
Future pure-PHP code. When a compliant pure-PHP HTML 5 parser becomes available, it could be used as a low-performance backend to replace the bug 2702 code.
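
The wrapper-plus-backends design listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `TidyBackend`, `PassThroughBackend`, `CallbackBackend` and `createTidyBackend()`, as well as the `'driver'` and `'callback'` keys, are hypothetical and are not taken from MediaWiki core. The stand-in backends do no real HTML cleanup; they only show how a single entry point can dispatch to interchangeable implementations.

```php
<?php
// Hypothetical sketch: these names are illustrative, not MediaWiki's actual API.

interface TidyBackend {
	/** Return a cleaned-up version of an HTML fragment. */
	public function tidy( string $html ): string;
}

/** Stand-in backend that returns its input unchanged. */
final class PassThroughBackend implements TidyBackend {
	public function tidy( string $html ): string {
		return $html;
	}
}

/** Stand-in backend that delegates to any callable taking and returning a string. */
final class CallbackBackend implements TidyBackend {
	/** @var callable */
	private $fn;

	public function __construct( callable $fn ) {
		$this->fn = $fn;
	}

	public function tidy( string $html ): string {
		return ( $this->fn )( $html );
	}
}

/** Pick a backend from a configuration array (keys are assumptions). */
function createTidyBackend( array $config ): TidyBackend {
	$driver = $config['driver'] ?? 'none';
	switch ( $driver ) {
		case 'none':
			return new PassThroughBackend();
		case 'callback':
			return new CallbackBackend( $config['callback'] );
		default:
			throw new InvalidArgumentException( "Unknown tidy driver: $driver" );
	}
}

// Usage: callers only ever see the TidyBackend interface.
$backend = createTidyBackend( [ 'driver' => 'callback', 'callback' => 'trim' ] );
echo $backend->tidy( "  <p>text</p>\n" ), "\n";
```

A real backend (external binary, web service, or pure-PHP parser) would slot in as another class implementing the same interface, so callers would not need to change when the backend does.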

A new configuration variable has been introduced to control backend selection ($wgTidyConfig).

if ( $wgUseTidy ) {
  $wgTidyConfig = array(
    'cmd' => $wgTidyBin
  );
}
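
For comparison, a site that wants to pick a specific backend might set the variable directly in LocalSettings.php. The fragment below is a hedged sketch: it assumes the `driver` key and the `RemexHtml` driver name used by later MediaWiki releases, neither of which appears in the snippet above, so both should be checked against the DefaultSettings.php of the MediaWiki version in use.

```php
// LocalSettings.php (sketch; the 'driver' key and its value are assumptions
// to verify against your MediaWiki version)
$wgTidyConfig = [ 'driver' => 'RemexHtml' ];
```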

See: https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy

Update June 2016

Backend abstraction is complete. The new web service (Html5Depurate) is basically complete. Packages are available in apt.wikimedia.org. Tim is working on a pure PHP equivalent.

We created a testing system which renders a large sample of articles with both Tidy and Depurate, generates screenshots, and compares the results visually.

In order to reduce the number of visible differences for an initial deployment, we added a "compatibility" endpoint to the Depurate API, which mimics Tidy's p-wrapping behaviour, and marks empty li, p and tr elements with a class so that they can be hidden with CSS.

Despite this, we still see significant differences, such as:

Navbox lists composed of nowrap spans sometimes end up being completely nowrapped, running off the right margin, either due to editor error or a MediaWiki parser bug which generates invalid HTML.
Active formatting element (AFE) reconstruction causes certain unclosed tags to run on to the end of the page, instead of running on only to the end of the enclosing element.

The main question now is: what should our deployment plan be?

Are we close enough now in visual diff testing to call that part of the project done? (96.79% showed less than 1% differences, 93.35% rendered with pixel-perfect accuracy.)
What tools should we provide to editors to migrate the remaining broken pages? Some issues (e.g. adjacent nowrap spans) are difficult to detect automatically.
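
As a rough illustration of how summary figures like those above can be derived, the sketch below computes the share of pages under a difference threshold and the share with no difference at all. The function name `summarizeDiffs()` and the sample scores are made up for the example; the actual visual diff tooling is not shown in this task.

```php
<?php
/**
 * Summarize per-page visual differences (hypothetical helper).
 *
 * @param float[] $diffPercents Per-page difference in percent (0 = pixel-identical)
 * @param float $threshold Pages strictly below this count as "close enough"
 * @return array Shares in percent, keyed 'belowThreshold' and 'pixelPerfect'
 */
function summarizeDiffs( array $diffPercents, float $threshold = 1.0 ): array {
	$total = count( $diffPercents );
	if ( $total === 0 ) {
		throw new InvalidArgumentException( 'No pages to summarize' );
	}
	$below = 0;
	$perfect = 0;
	foreach ( $diffPercents as $d ) {
		if ( $d < $threshold ) {
			$below++;
		}
		if ( $d == 0.0 ) {
			$perfect++;
		}
	}
	return [
		'belowThreshold' => round( 100 * $below / $total, 2 ),
		'pixelPerfect' => round( 100 * $perfect / $total, 2 ),
	];
}

// Made-up sample: four pages, two identical, one slightly off, one badly off.
print_r( summarizeDiffs( [ 0.0, 0.0, 0.4, 12.5 ] ) );
```

Note that a page-level percentage like this says nothing about how visible a difference is to a reader, which is one reason the hard-to-detect cases mentioned above still need editor attention.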

Related Objects
Status	Assigned	Task
Resolved	Arlolra	T109897 Table parsing diffs: Parsoid adds implicit <td>s after a |- if explicit pipe is not present
Resolved	Arlolra	T109650 Minor P-wrapping diff between Parsoid & PHP Parser+Tidy combo
Resolved	Arlolra	T110004 DOM Pass for wrapping bare text found in <body> and other "block" (in html4-parlance) nodes like <blockquote>, <td>, <th>.
Declined	None	T65699 Parsoid and Tidy differ in how they deal with misnested tags
Declined	ssastry	T69452 Empty elements in DOM: PHP parser+tidy strips them; Parsoid doesn't
Resolved	ssastry	T175706 Progressively switch Wikimedia wikis from Tidy to RemexHTML
Open	None	T49544 <references/> list item must not wrap the text in <span>
Resolved	tstarling	T89331 Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool
Resolved	ssastry	T120345 Set up mass visual diff testing with a custom install of mediawiki
Resolved	bd808	T125882 Create labs project for parsing team to run experimental mediawiki installs
Resolved	ssastry	T134423 Deprecate nonstandard behavior of self-closed HTML tags in wikitext.
Resolved	cscott	T136652 Add tracking category to pages that use deprecated Tidy-specific self-closing tags (<b />, <div />, etc.)
Declined	tstarling	T136668 Puppetize HTML5Depurate
Resolved	tstarling	T136669 Merge mediawiki.raggett.css into other CSS + remove conditional enabling for Tidy
Resolved	Elitre	T145530 Support the Parsing team with the "Remove Tidy dependency from MediaWiki output" goal
Declined	None	T155634 Tidy strips whitespace after HTML tags AND adds newlines between HTML tags
Resolved	ssastry	T175099 PHP-parser + Remex combo output differs from PHP-parser + Tidy combo on some dl-dt wikitext snippets
Resolved	Jdforrester-WMF	T185753 MediaWiki should default to using RemexHtml for tidy
Resolved	ssastry	T188167 Run parser tests with RemexHtml as the tidy implementation
Resolved	Jdforrester-WMF	T191670 Determine future of MWTidy::checkErrors in a Remex world

Event Timeline

ssastry added a parent task: T69452: Empty elements in DOM: PHP parser+tidy strips them; Parsoid doesn't. May 27 2016, 11:30 PM

Danny_B added a project: Tidy. May 28 2016, 10:31 AM

Danny_B removed a parent task: T4542: [DO NOT USE] HTML Tidy issues (tracking) [superseded by the #Tidy tag]. May 28 2016, 10:50 AM

ssastry added a subtask: T136668: Puppetize HTML5Depurate. Jun 1 2016, 3:03 AM

ssastry added a subtask: T136669: Merge mediawiki.raggett.css into other CSS + remove conditional enabling for Tidy. Jun 1 2016, 3:08 AM

Mholloway unsubscribed. Jun 1 2016, 3:10 AM

brooke subscribed. Jun 1 2016, 8:27 PM

tstarling moved this task from Under discussion to Request IRC meeting on the TechCom-RFC board. Jun 1 2016, 8:28 PM

RobLa-WMF mentioned this in E203: ArchCom RFC Meeting: Replace Tidy in MW parser with HTML 5 parse/reserialize (2016-06-08, #wikimedia-office). Jun 6 2016, 11:17 PM

tstarling updated the task description. Jun 7 2016, 1:23 AM

The ArchCom-RFC office hour today (E203) was dedicated to this. Summary is captured in the description of E203, and the full transcript is captured at P3228. Much of the meeting was spent discussing alternative approaches to Html5Depurate, with the clarification that it is still the plan of record.

The plan (subject to modification based on initial meetings and experience):

Meeting with SRE about Html5Depurate instances
Meeting with Community-Relations-Support about rollout strategy
Roll out Html5Depurate instances
Roll out special page+gadget
Publicize the migration + enlist help in identifying showstoppers
Roll out full Tidy->Html5Depurate transition on first wikis
Roll out further based on initial results

@GWicke made the point that third party deployments need to be considered sooner rather than later, but we tabled that part of the conversation in this meeting.

ssastry mentioned this in T74416: Navbox rendering incorrect, all items in the same line. Jun 16 2016, 1:27 PM

RobLa-WMF moved this task from Request IRC meeting to Under discussion on the TechCom-RFC board. Jun 20 2016, 4:12 AM

Status of this RFC (from my understanding): this is not "approved" yet, but is "in progress" (see T137860 for what "in progress" means).

RobLa-WMF mentioned this in T137860: Create "In progress" or "in prototype" column on #ArchCom-RFC board. Jul 6 2016, 7:16 PM

RobLa-WMF moved this task from Under discussion to In progress on the TechCom-RFC board. Jul 12 2016, 11:11 PM

ssastry mentioned this in T134423: Deprecate nonstandard behavior of self-closed HTML tags in wikitext. Jul 14 2016, 6:50 PM

Legoktm closed subtask T136669: Merge mediawiki.raggett.css into other CSS + remove conditional enabling for Tidy as Resolved. Jul 27 2016, 12:02 AM

Liuxinyu970226 subscribed. Aug 3 2016, 11:16 PM

Snaevar subscribed. Aug 13 2016, 5:07 PM

zhuyifei1999 subscribed. Aug 16 2016, 9:56 AM

Jdforrester-WMF mentioned this in T143077: A table cell may appear inside the table while using VisualEditor, but before the table when the page is rendered. Aug 16 2016, 7:07 PM

Arlolra mentioned this in T143775: Different behaviour between desktop and mobile with templates returning an initial asterisk. Aug 24 2016, 4:28 PM

JJMC89 subscribed. Sep 22 2016, 10:06 PM

Tpt mentioned this in T146460: pages tag doesn't include the first page. Sep 24 2016, 7:55 AM

Elitre added a subtask: T145530: Support the Parsing team with the "Remove Tidy dependency from MediaWiki output" goal. Sep 27 2016, 11:17 AM

Liuxinyu970226 added a project: User-notice. Sep 30 2016, 11:37 AM

Liuxinyu970226 moved this task from To Triage to In current Tech/News draft on the User-notice board.

Johan moved this task from In current Tech/News draft to Recently announced in Tech/News on the User-notice board. Oct 6 2016, 11:56 AM

Pchelolo moved this task from Backlog to watching on the Services board. Oct 12 2016, 10:25 PM

Pchelolo edited projects, added Services (watching); removed Services.

Johan moved this task from Recently announced in Tech/News to Already announced/Archive on the User-notice board. Oct 13 2016, 1:01 PM

RandomDSdevel awarded a token. Oct 19 2016, 11:31 PM

ssastry mentioned this in T149364: Create wikipage for explaining Tidy replacement to editors. Oct 27 2016, 8:44 PM

daniel mentioned this in E198: RFC Meeting: Security is all of our jobs (2016-06-01, #wikimedia-office). Dec 9 2016, 7:43 AM

Elitre closed subtask T145530: Support the Parsing team with the "Remove Tidy dependency from MediaWiki output" goal as Resolved. Dec 30 2016, 1:46 PM

Bianjiang subscribed. Jan 10 2017, 7:57 PM

ssastry mentioned this in T155634: Tidy strips whitespace after HTML tags AND adds newlines between HTML tags. Jan 19 2017, 3:45 PM

ssastry added a subtask: T155634: Tidy strips whitespace after HTML tags AND adds newlines between HTML tags. Jan 19 2017, 3:52 PM

Dinoguy1000 subscribed. Apr 13 2017, 2:47 AM

Samuele2002 subscribed. Apr 16 2017, 8:56 PM

jrbs subscribed. Apr 25 2017, 6:16 PM

Krinkle unsubscribed. Apr 26 2017, 12:55 AM

tstarling closed subtask T136668: Puppetize HTML5Depurate as Declined. May 10 2017, 5:02 AM

AnotherLadsgroup subscribed. Jul 16 2017, 2:20 AM

Krinkle removed a parent task: T56617: Replace Tidy with a library that doesn't suck. Jul 19 2017, 8:34 PM

Krinkle merged a task: T56617: Replace Tidy with a library that doesn't suck.

Krinkle added subscribers: Jdforrester-WMF, DanielFriesen, Ltrlg.

ssastry closed subtask T134423: Deprecate nonstandard behavior of self-closed HTML tags in wikitext. as Resolved. Jul 19 2017, 9:29 PM

Izno added a subtask: T49544: <references/> list item must not wrap the text in <span>. Aug 3 2017, 3:59 PM

IKhitron subscribed. Aug 10 2017, 7:09 PM

Verdy_p reopened subtask T134423: Deprecate nonstandard behavior of self-closed HTML tags in wikitext. as Open. Aug 11 2017, 4:10 PM

matmarex closed subtask T134423: Deprecate nonstandard behavior of self-closed HTML tags in wikitext. as Resolved. Aug 11 2017, 4:26 PM

ssastry renamed this task from Replace Tidy in MW parser with HTML 5 parse/reserialize to Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool. Sep 5 2017, 11:07 PM

ssastry added a parent task: T175095: Enable RemexHTML on mediawiki.org and testwiki.

Ladsgroup subscribed. Sep 6 2017, 9:43 AM

Jdforrester-WMF created subtask T175706: Progressively switch Wikimedia wikis from Tidy to RemexHTML. Sep 12 2017, 3:19 PM

Jdforrester-WMF removed a parent task: T175095: Enable RemexHTML on mediawiki.org and testwiki.

MattFitzpatrick subscribed. Oct 8 2017, 7:07 AM

Jdforrester-WMF added a subtask: T175099: PHP-parser + Remex combo output differs from PHP-parser + Tidy combo on some dl-dt wikitext snippets. Oct 11 2017, 11:30 PM

ssastry mentioned this in T33871: Fix usage of tidy to work cleanly with html5. Oct 23 2017, 8:34 PM

cscott mentioned this in T179978: Bad render of notifications about edition of flow board description. Nov 13 2017, 10:02 PM

ssastry moved this task from In Progress to Non-Parsing-Team Tasks on the Parsoid board. Dec 18 2017, 9:35 PM

Krinkle moved this task from In progress to Under discussion on the TechCom-RFC board. Dec 22 2017, 12:47 AM

Liuxinyu970226 awarded a token. Jan 12 2018, 12:19 PM

ssastry mentioned this in T183313: Wikimedia Developer Summit 2018 Topic: Evolving the MediaWiki Architecture. Jan 23 2018, 5:41 PM

Jdforrester-WMF mentioned this in T185753: MediaWiki should default to using RemexHtml for tidy. Feb 10 2018, 12:25 AM

Jdforrester-WMF removed a subtask: T175706: Progressively switch Wikimedia wikis from Tidy to RemexHTML.

Jdforrester-WMF added a parent task: T175706: Progressively switch Wikimedia wikis from Tidy to RemexHTML.

The task summary is out of date: Html5Depurate is no longer being used, and we are instead using Tim's pure-PHP RemexHtml library. Once T185753: MediaWiki should default to using RemexHtml for tidy is completed and all Wikimedia wikis are using Remex for tidy, I think we can consider this resolved.

Jdforrester-WMF added a subtask: T185753: MediaWiki should default to using RemexHtml for tidy. Feb 26 2018, 11:53 PM

Prod subscribed. Mar 5 2018, 8:11 PM