Extracting Structured Data From HTML Pages Using CSS Selectors - MQL5 Articles
METATRADER 5 — INTEGRATION
STANISLAV KOROTKY
The MetaTrader development environment enables the integration of applications with external data, in
particular with the data obtained from the Internet using the WebRequest function. HTML is the most universal
and the most frequently used data format on the web. If a public service does not provide an open API for
requests or its protocol is difficult to implement in MQL, the desired HTML pages can be parsed. In particular,
traders often use various economic calendars. Although the task is not so relevant now, since the platform
features the built-in Calendar, some traders may need specific news from specific sites. Also, we sometimes
need to analyze deals from a trading HTML report received from third parties.
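As an illustration of the first step, here is a minimal sketch of downloading a page with WebRequest. The helper name DownloadPage, the timeout and the URL are assumptions made for this example; the address must be added to the list of allowed URLs in the terminal settings beforehand, otherwise the call fails.

```mql5
// Hypothetical helper: downloads a page and returns its text,
// or an empty string if the request failed.
string DownloadPage(const string url)
{
  char data[];            // empty request body, since this is a GET request
  char result[];          // raw bytes of the response
  string responseHeaders; // filled by WebRequest
  ResetLastError();
  int code = WebRequest("GET", url, NULL, 5000, data, result, responseHeaders);
  if(code == -1)
  {
    Print("WebRequest failed, error ", GetLastError());
    return "";
  }
  // assumes that the page is encoded in UTF-8
  return CharArrayToString(result, 0, WHOLE_ARRAY, CP_UTF8);
}

void OnStart()
{
  string html = DownloadPage("https://www.mql5.com/en/forum");
  Print("Received ", StringLen(html), " characters");
}
```

The returned string is the raw HTML text, which is the input for the parsing described in the rest of the article.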
The MQL5 ecosystem provides various solutions to the problem, which however are usually specific and have
their limitations. On the other hand, there is a "native" and universal method for searching and parsing data
in HTML: the use of CSS selectors. In this article we will consider an MQL5 implementation of this method,
as well as examples of its practical use.
To analyze HTML, we need to create a parser which can convert internal page text into a hierarchy of some
objects called Document Object Model or DOM. From this hierarchy, we will be able to find objects with
specified parameters. This approach is based on the use of service information about the document structure,
which is not available in the external page view.
For example, we can select rows of a specific table in a document, read the required columns from them and
get an array with values, which can be easily saved into a csv file, displayed on a chart or used in Expert
Advisor calculations.
HTML is a popular format which is familiar to almost everyone. Therefore, I will not describe in detail the
syntax of this hypertext markup language.
The primary source of related technical information is IETF (Internet Engineering Task Force) and its
specifications, the so-called RFC (Request For Comments). There are a lot of HTML specifications (here is an
example). Standards are also available on the website of the related organization, W3C (World Wide Web
Consortium, HTML5.2).
These organizations have developed the CSS (Cascading Style Sheets) technology and they regulate it. However
we are interested in this technology not because it describes information representation styles on web pages,
but because of CSS selectors contained therein, i.e. a special query language which enables the search of
elements inside html pages.
Both HTML and CSS are constantly evolving, and new versions keep appearing. For example, the currently
relevant versions are HTML5.2 and CSS4. However, each update and extension also inherits the features of
older versions. The web is large, heterogeneous and often inert, so new versions coexist with old ones.
As a result, when writing algorithms which rely on web technologies, you should use the specifications with
care: on the one hand, take into account the deviations which are common in practice, and on the other hand
introduce some simplifications which help to avoid issues with the multitude of variations.
https://www.mql5.com/en/articles/5706 1/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
An html document consists of tags enclosed in the characters '<' and '>'. The tag name and optional attributes
are specified inside the tag. Attributes are string pairs of the form name="value", where the '=' sign and the
value can sometimes be omitted. Here is a tag example:
<a href="..." target="_blank">HTML and CSS</a>
This is a tag named 'a' (which is interpreted by web browsers as a hyperlink), with two attributes: 'href' for
the website address the hyperlink points to and 'target' for the website opening option (in this case it is
equal to "_blank", i.e. the site should open in a new browser tab).
This first tag is the opening tag. It is followed by the text which is actually visible on the web page: "HTML and
CSS", and the matching closing tag, having the same name as the opening tag and an additional slash '/' after
the angle bracket '<' (all characters together make up the tag '</a>'). In other words, opening and closing tags
are used in pairs and may include other tags, but only whole tags, without overlapping. Here is an example of
correct nesting:
<group attribute1="value1">
  <name>text1</name>
  <name>text2</name>
</group>
The following overlapping is not allowed:
<group id="id1">
  <name>text1
</group>
</name>
However, this restriction holds only in theory. In practice, tags are often opened or closed by mistake in
the wrong place of the document. The parser should be able to handle this situation.
Some tags may have empty contents, for example an empty paragraph:
<p></p>
In accordance with the standards, some tags may (or rather must) have no content at all. For example, the tag
describing an image:
<img src="/ico20190101.jpg">
It looks like an opening tag, but it does not have a matching closing one. Such tags are called empty. Please
note that the attributes belonging to the tag are not the tag contents.
It is not always easy to determine whether a tag is empty and whether a closing tag should follow.
Although the names of valid empty tags are defined in the specifications, some other tags may remain unclosed.
Also, because the HTML and XML formats are close (and there is another variety, XHTML), some web page
designers write empty tags as follows:
<img src="/ico20190101.jpg" />
Pay attention to the slash '/' before the angle bracket '>'. This slash is considered redundant under strict
HTML5 rules. All these specific cases occur in ordinary web pages, so the parser must be able to handle
them.
Tag and attribute names which are interpreted by web browsers are standard, but HTML can also contain
custom elements. Such elements are skipped by browsers unless the developers "connect" them to the DOM
using the specialized script API. We should keep in mind that every tag may contain useful information.
A parser can be considered a finite-state machine, which advances letter by letter and changes its state in
accordance with the context. It is clear from the above tag structure description that initially the parser is
outside of any tag (let us call this state "blank"). Then, after encountering the opening angle bracket '<' we get
into an opening tag (the "insideTagOpen" state), which lasts until the closing angle bracket '>'. The combination
of characters '</' suggests that we are in a closing tag (the "insideTagClose" state), and so on. Other states will
be considered in the parser implementation section.
When switching between states, we can extract structured information from the current position in the
document, because we know the meaning of the state. For example, if the current position is inside an opening
tag, the tag name can be extracted as the substring between the last '<' and the subsequent space or '>'
(depending on whether the tag contains attributes). The parser will extract data and create objects of a
certain DomElement class. In addition to the name, attributes and contents, the hierarchy of the objects will
be preserved based on the tag nesting structure. In other words, each object will have a parent (except the
root element which describes the entire document) and an optional array of child objects.
The parser will output the full tree of objects, in which one object corresponds to one tag in the source
document.
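The tag name extraction described above can be sketched as a standalone function. This is only an illustration under simplifying assumptions: the function name ExtractTagName is hypothetical, and a self-closing tag without a space before the slash (such as '<br/>') is not handled here.

```mql5
// Hypothetical helper: returns the lowercase name of the tag which starts
// at position 'pos' (pointing at '<'), or an empty string if the markup is broken.
string ExtractTagName(const string html, const int pos)
{
  int start = pos + 1;
  int pright = StringFind(html, ">", start);
  if(pright == -1) return "";          // no closing bracket
  int pspace = StringFind(html, " ", start);
  // the name ends at the first space inside the tag or at '>'
  int end = (pspace != -1 && pspace < pright) ? pspace : pright;
  if(end == start) return "";          // empty name, e.g. "<>"
  string name = StringSubstr(html, start, end - start);
  StringToLower(name);
  return name;
}

void OnStart()
{
  Print(ExtractTagName("<a href=\"#\">text</a>", 0)); // a
  Print(ExtractTagName("<BR>", 0));                    // br
}
```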
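CSS selectors describe standard notations for the conditional selection of objects based on their parameters
and position in the hierarchy. The full list of selectors is quite extensive. We will provide support for some of
them, which are included in the CSS1, CSS2 and CSS3 standards.
The basic selector components are as follows:
* - any object (the universal selector);
.value - the object has the 'value' class;
#value - the object has the 'value' identifier;
value - the object (tag) has the name 'value';
They can be accompanied by the so-called pseudo classes which are added on the right:
:first-child - the object is the first child of its parent;
:last-child - the object is the last child of its parent;
:nth-child(n) - the object has the index n in the list of child nodes of its parent;
:nth-last-child(n) - the same, with the index counted from the end of the list;
Conditions on attributes are specified in square brackets:
[attr] - the object has the 'attr' attribute (it does not matter whether the attribute has any value or
not);
[attr=value] - the object has the 'attr' attribute with the 'value';
[attr*=text] - the object has the 'attr' attribute with the value containing the substring 'text';
[attr^=start] - the object has the 'attr' attribute with the value beginning with the 'start' string;
[attr$=end] - the object has the 'attr' attribute with the value ending with the 'end' substring;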
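The attribute conditions listed above come down to simple string comparisons. Below is a minimal sketch; the function name MatchAttribute and the way the operator is passed (as a single character) are assumptions made for this example, not the article's actual interface. The plain [attr] form is not covered, because it only needs a check that the attribute exists.

```mql5
// Hypothetical helper: checks an attribute value against a selector condition.
// op: '=' exact match, '*' contains, '^' begins with, '$' ends with.
bool MatchAttribute(const string value, const ushort op, const string pattern)
{
  if(op == '=') return value == pattern;
  if(op == '*') return StringFind(value, pattern) != -1;
  if(op == '^') return StringFind(value, pattern) == 0;
  if(op == '$')
  {
    int tail = StringLen(value) - StringLen(pattern);
    return tail >= 0 && StringSubstr(value, tail) == pattern;
  }
  return false; // unknown operator
}

void OnStart()
{
  Print(MatchAttribute("/ico20190101.jpg", '$', ".jpg")); // true
  Print(MatchAttribute("/ico20190101.jpg", '^', "ico"));  // false
  Print(MatchAttribute("/ico20190101.jpg", '*', "2019")); // true
}
```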
A simple selector is a name selector or the universal selector, which can be optionally followed by a class,
an identifier, zero or more attributes or a pseudo class, in any order. A simple selector selects an element
when all components of the selector match the element properties.
CSS selector (or full selector) is a chain of one or more simple selectors joined by combining characters (' '
(space), '>', '+', '~'):
container element — the 'element' object is nested in the 'container' object at an arbitrary level;
parent > element — the 'element' object has a direct parent 'parent' (the nesting level is equal to 1);
e1 + element — the 'element' object has a common parent with 'e1' and immediately follows it;
e1 ~ element — the 'element' object has a common parent with 'e1' and follows it at any distance;
So far, we have been studying pure theory. Let us view how the above ideas work in practice.
Any modern web browser allows viewing HTML of the currently open page. For example, in Chrome you can run
the 'View page source' command from the context menu or open the developer window (Developer tools,
Ctrl+Shift+I). The developer window has the Console tab, in which we can try to find elements using CSS
selectors. To apply a selector, simply call the document.querySelectorAll function from the console (it is
included in the software API of all browsers).
For example, in the start forum page https://www.mql5.com/en/forum, we can run the following command
(JavaScript code):
document.querySelectorAll("div.widgetHeader")
As a result, we will receive a list of 'div' elements (tags) in which the "widgetHeader" class is specified. I
decided to use this selector after viewing the page source code, from which it is clear that the forum
topics are marked up in this way.
The selector can then be extended:
document.querySelectorAll("div.widgetHeader a:first-child")
This returns the list of forum topic headers: they are available as hyperlinks 'a', which are the first child
elements in each 'div' block selected at the first stage. Here is how this might look (depending on the
browser version):
The MQL5 web page and result of selection of HTML elements using CSS selectors
You should similarly analyze the HTML code of desired sites, spot the elements of interest and pick up
appropriate CSS selectors. The developer window features the Elements (or similar) tab, in which you can
select any tag in the document (this tag will be highlighted) and find appropriate CSS selectors for this tag.
Thus you will gradually practice the use of selectors and learn to create selector chains manually. Further we
will consider how to select appropriate selectors for a specific web page.
Designing
Let us view the classes which we may need, at a global level. The initial HTML text processing will be
performed by the HtmlParser class. The class will scan the text for markup characters '<', '/', '>' and some
others, and it will create DomElement class objects according to the above described finite-state machine
rules: one object will be created for each empty tag or a pair of opening and closing tags. The opening tag may
have attributes, which we need to read and save in the current DomElement object. This will be performed by
the AttributesParser class. The class will also operate following the principle of the finite-state machine.
The parser will create DomElement objects taking into account the hierarchy, which is identical to tag nesting
order. For example, if the text contains the 'div' tag, within which several paragraphs are placed (which means
the presence of 'p' tags), such paragraphs will be converted into child objects of the object which describes
'div'.
The initial root object will contain the entire document. Similarly to the browser (which provides the
document.querySelectorAll method), we should provide in DomElement a method for requesting objects
corresponding to passed CSS selectors. The selectors should also be pre-analyzed and converted from the string
representation to objects: a single selector component will be stored in the SubSelector class and the entire
simple selector will be stored in SubSelectorArray.
Once we have the ready DOM tree as a result of the parser operation, we can request from the root
DomElement object (or any other object) all its child elements matching selector parameters. All selected
elements will be placed in the iterable DomIterator list. For simplicity, let us implement the list as a child of
DomElement, in which an array of child nodes is used for storing the found elements.
Settings with specific site or HTML files processing rules and the algorithm execution result can be conveniently
stored in a class, which combines map properties (i.e. provides access to values based on the names of
appropriate attributes) and array properties (i.e. access to elements by index). Let us call this class IndexMap.
Let us provide the possibility to nest IndexMap one into another: when collecting tabular data from a web
page, we get a list of rows, each containing a list of columns. For both data types we can save the names
of source elements. This can be especially useful in cases where some of the required elements are missing in
the source document (which may happen quite often) - in such cases simple indexing ignores important
information about which data is missing. As a bonus, let us "train" IndexMap to get serialized into a multiline
text, including CSV format. This feature is useful when converting HTML pages into tabular data. If necessary,
you can replace the IndexMap class with your own while preserving the main functionality.
The following UML diagram displays the described classes.
Implementation
HtmlParser
In the HtmlParser class, we describe the variables which are necessary to scan the source text and to generate
the object tree, as well as to arrange the finite-state machine algorithm.
The current position in the text is stored in the 'offset' variable. The resulting tree root and the current object
(scanning is performed in this object context) are represented by the 'root' and 'cursor' pointers. Their
DomElement type will be considered later. The list of tags, which may be empty according to the HTML
specification, will be loaded into the 'empties' map (which is initialized in the constructor, see below). Finally,
we provide the 'state' variable for the description of finite-state machine states. The variable is an
enumeration of the StateBit type.
enum StateBit
{
  blank,
  insideTagOpen,
  insideTagClose,
  insideComment,
  insideScript
};

class HtmlParser
{
  private:
    StateBit state;
    int offset;
    DomElement *root;
    DomElement *cursor;
    IndexMap empties;
    ...
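The StateBit enumeration contains elements describing the following parser states depending on the current
position in the text:
blank - outside of any tag;
insideTagOpen - inside an opening tag;
insideTagClose - inside a closing tag;
insideComment - inside a comment (delimited by '<!--' and '-->');
insideScript - inside a script (after the <script> tag and up to its closing tag);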
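In addition, let us describe constant strings which will be used to search for markup:
    const string TAG_OPEN_START;
    const string TAG_OPEN_STOP;
    const string TAG_OPENCLOSE_STOP;
    const string TAG_CLOSE_START;
    const string TAG_CLOSE_STOP;
    const string COMMENT_START;
    const string COMMENT_STOP;
    const string SCRIPT_STOP;
  public:
    HtmlParser():
      TAG_OPEN_START("<"),
      TAG_OPEN_STOP(">"),
      TAG_OPENCLOSE_STOP("/>"),
      TAG_CLOSE_START("</"),
      TAG_CLOSE_STOP(">"),
      COMMENT_START("<!--"),
      COMMENT_STOP("-->"),
      SCRIPT_STOP("/script>"),
      state(blank)
    {
      // register the names of tags which are allowed to have no closing tag
      for(int i = 0; i < ArraySize(empty_tags); i++)
      {
        empties.set(empty_tags[i]);
      }
    }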
An array of empty_tags strings is used here. This array is filled beforehand from an external text file:
string empty_tags[] =
{
  #include <empty_strings.h>
};
See the contents below (valid empty tags, but the list is not complete):
// header
"isindex",
"base",
"meta",
"link",
"nextid",
"range",
// body
"img",
"br",
"hr",
"frame",
"wbr",
"basefont",
"spacer",
"area",
"param",
"keygen",
"col",
"limittext"
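The DOM tree is deleted in the destructor:
    ~HtmlParser()
    {
      if(root != NULL)
      {
        delete root;
      }
    }
Text processing is started by the parse method:
    DomElement *parse(const string &html)
    {
      if(root != NULL)
      {
        delete root; // drop the tree left from the previous call
      }
      root = new DomElement("root");
      cursor = root;
      offset = 0;
      while(processText(html));
      return root;
    }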
The web page is input, an empty root DomElement is created, the cursor is set to it, while the current position
in the text (offset) is set to the very beginning. Then the processText helper method is called in a loop until the
entire text is successfully read. The finite-state machine is then executed in this method. The default state of
the machine is blank.
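    bool processText(const string &html)
    {
      int p;
      if(state == blank)
      {
        p = StringFind(html, "<", offset);
        if(p == -1) // no more tags
        {
          return(false);
        }
        else if(p > 0)
        {
          if(p > offset) // there is text between the previous position and the tag
          {
            string text = StringSubstr(html, offset, p - offset);
            StringTrimLeft(text);
            StringTrimRight(text);
            if(StringLen(text) > 0)
            {
              cursor.setText(text);
            }
          }
        }
        offset = p;
        if(IsString(html, COMMENT_START)) state = insideComment;
        else
        if(IsString(html, TAG_CLOSE_START)) state = insideTagClose;
        else
        if(IsString(html, TAG_OPEN_START)) state = insideTagOpen;
        return(true);
      }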
The algorithm searches the text for the angle bracket '<'. If it is not found, then there are no more tags so
processing should be interrupted (false is returned). If the bracket is found and there is a fragment of text
between the new found tag and the previous position (offset), the fragment is considered to be the contents of
the current tag (the object is available at the 'cursor' pointer) - so this text is added to the object using the call
of cursor.setText().
Then position in the text is moved to the beginning of the new found tag and depending on the signature which
follows '<' (COMMENT_START, TAG_CLOSE_START, TAG_OPEN_START) the parser is switched to the appropriate
new state. The IsString function is a small helper string comparison method, which uses StringSubstr.
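The signature check can be sketched as follows. The names IsStringAt and DetectState are hypothetical and the functions are written as standalone helpers for illustration, whereas the article's IsString is a method of the parser class. Note the order of the checks: '<!--' and '</' both begin with '<', so the longer signatures have to be tested first.

```mql5
// Hypothetical helper: true if 'pattern' occurs in 'text' exactly at position 'pos'.
bool IsStringAt(const string text, const int pos, const string pattern)
{
  return StringSubstr(text, pos, StringLen(pattern)) == pattern;
}

// Returns the name of the parser state implied by the signature at 'pos'.
string DetectState(const string html, const int pos)
{
  if(IsStringAt(html, pos, "<!--")) return "insideComment";
  if(IsStringAt(html, pos, "</"))   return "insideTagClose";
  if(IsStringAt(html, pos, "<"))    return "insideTagOpen";
  return "blank";
}

void OnStart()
{
  string html = "<p>text</p><!-- note -->";
  Print(DetectState(html, 0));  // insideTagOpen
  Print(DetectState(html, 7));  // insideTagClose
  Print(DetectState(html, 11)); // insideComment
}
```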
In any case true is returned from the processText method, which means that the method will be called again in
the loop, but the parser state will be different now. If the current position is in the opening tag, the following
code is executed.
      else
      if(state == insideTagOpen)
      {
        offset++;
        int pspace = StringFind(html, " ", offset);
        int pright = StringFind(html, ">", offset);
        p = (int)MathMin(pspace, pright);
        if(p == -1)
        {
          p = (int)MathMax(pspace, pright);
        }
        if(p == -1 || pright == -1)
        {
          return(false);
        }
If the text has neither space nor '>', the HTML syntax is broken, so false is returned. Further steps select the
tag name.
{
}
selfclose = true;
pright--;
}
StringToLower(name);
StringTrimRight(name);
        DomElement *e = new DomElement(cursor, name);
Here we have created a new object with the found name. The current object (cursor) is used as the object
parent.
if(pspace != -1)
{
string txt;
{
e.parseAttributes(txt);
}
}
The parseAttributes method "lives" directly in the DomElement class, which we will consider later.
If the tag is not closed, you should check if it is not the one which can be empty. If it is, it should be "closed"
implicitly.
        bool softSelfClose = false;
        if(!selfclose)
        {
          if(empties.isKeyExisting(name))
          {
            selfclose = true;
            softSelfClose = true;
          }
        }
Depending on whether the tag is closed or not, we either move deeper along the object hierarchy, setting the
newly created object as the current one (e), or we remain within the context of the previous object. In any
case, position in the text (offset) is moved to the last read character, i.e. beyond '>'.
pright++;
if(!selfclose)
{
cursor = e;
}
else
{
if(!softSelfClose) pright++;
}
offset = pright;
A special case is the script. If we meet the <script> tag, the parser switches to the insideScript state,
otherwise it switches to the blank state.
        if(name == "script")
        {
          state = insideScript;
        }
        else
        {
          state = blank;
        }
        return(true);
      }
      else
      if(state == insideTagClose)
      {
        offset += StringLen(TAG_CLOSE_START);
        p = StringFind(html, ">", offset);
        if(p == -1)
        {
          return(false);
        }
Again we search for '>', which must be present according to the HTML syntax. If the bracket is not found, the
process should be interrupted. In case of success, the tag name is extracted. This is done to check whether
the closing tag matches the opening one. If the matching is broken, it is necessary to somehow overcome this
layout error and try to continue parsing.
        string tag = StringSubstr(html, offset, p - offset);
        StringToLower(tag);
        DomElement *rewind = cursor;
        while(StringCompare(cursor.getName(), tag) != 0)
        {
          cursor = cursor.getParent();
          if(cursor == NULL)
          {
            // no matching opening tag found: skip this closing tag
            cursor = rewind;
            state = blank;
            offset = p + 1;
            return(true);
          }
        }
We are processing the closing tag, which means that the context of the current object has ended and so the
parser switches back to the parent DomElement:
cursor = cursor.getParent();
state = blank;
offset = p + 1;
return(true);
}
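When the parser is inside a comment, it searches for the end of the comment.
      else
      if(state == insideComment)
      {
        offset += StringLen(COMMENT_START);
        p = StringFind(html, COMMENT_STOP, offset);
        if(p == -1)
        {
          return(false);
        }
        offset = p + StringLen(COMMENT_STOP);
        state = blank;
        return(true);
      }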
When the parser is inside a script, it searches for the end of the script.
      else
      if(state == insideScript)
      {
        p = StringFind(html, SCRIPT_STOP, offset);
        if(p == -1)
        {
          return(false);
        }
        offset = p + StringLen(SCRIPT_STOP);
        state = blank;
        cursor = cursor.getParent();
        return(true);
      }
      return(false);
    }
This was actually the entire HtmlParser class. Now let us consider DomElement.
DomElement. Beginning
The DomElement class has variables for storing the name (mandatory), contents, attributes, links to parent and
child elements (created as 'protected' because it will be used in the derived class DomIterator).
class DomElement
{
  private:
    string name;
    string content;
    IndexMap attributes;
    DomElement *parent;
  protected:
    DomElement *children[];
  public:
    DomElement(): parent(NULL) {}
    DomElement(const string n): parent(NULL)
    {
      name = n;
    }
    DomElement(DomElement *p, const string &n)
    {
      p.addChild(&this);
      parent = p;
      name = n;
    }
Of course, the class has "setter" and "getter" field methods (they are omitted in the article), as well as a set of
methods for operations with child elements (only prototypes are shown in the article):
The parseAttributes method which was used at the parsing stage, delegates further work to the
AttributesParser helper class.
    void parseAttributes(const string &data)
    {
      AttributesParser p;
      p.parseAll(data, attributes);
    }
A plain 'data' string is input, based on which the method fills the 'attributes' map with the properties found.
The full AttributesParser code is available in the attachments below. The class is not large and operates by
the finite-state machine principle, similarly to HtmlParser. But it has only two states:
enum AttrBit
{
  name,
  value
};
Since the list of attributes is a string consisting of name="value" pairs, AttributesParser is always either at the
name or at the value. This parser could be implemented using the StringSplit function, but because of possible
formatting deviations (such as the presence or absence of quotes, the use of spaces inside the quotes, etc.),
the machine approach was chosen.
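To show the two-state idea, here is a simplified sketch. It is not the AttributesParser from the attachments: the function names are hypothetical, the results go to two parallel arrays instead of IndexMap, and only spaces are treated as separators.

```mql5
// Hypothetical helper: appends a name/value pair to the parallel output arrays.
void StoreAttribute(string &names[], string &values[], const string name, const string value)
{
  int k = ArraySize(names);
  ArrayResize(names, k + 1);
  ArrayResize(values, k + 1);
  names[k] = name;
  values[k] = value;
}

// Scans a string such as: href="a b" target=_blank async
void ParseAttributes(const string data, string &names[], string &values[])
{
  ArrayResize(names, 0);
  ArrayResize(values, 0);
  bool inValue = false; // false: reading a name, true: reading a value
  ushort quote = 0;     // quote character which opened the value, 0 if none
  string name = "", value = "";
  int n = StringLen(data);
  for(int i = 0; i < n; i++)
  {
    ushort c = StringGetCharacter(data, i);
    if(!inValue)
    {
      if(c == '=') inValue = true;
      else if(c == ' ')
      {
        // a name which is not followed by '=' is an attribute without a value
        if(StringLen(name) > 0) StoreAttribute(names, values, name, "");
        name = "";
      }
      else name += ShortToString(c);
    }
    else
    {
      if(quote == 0 && StringLen(value) == 0 && (c == '"' || c == '\''))
      {
        quote = c; // opening quote: spaces are now part of the value
      }
      else if((quote != 0 && c == quote) || (quote == 0 && c == ' '))
      {
        StoreAttribute(names, values, name, value); // the value is complete
        name = ""; value = ""; quote = 0; inValue = false;
      }
      else value += ShortToString(c);
    }
  }
  if(StringLen(name) > 0) StoreAttribute(names, values, name, value); // last pair
}

void OnStart()
{
  string names[], values[];
  ParseAttributes("href=\"a b\" target=_blank async", names, values);
  for(int i = 0; i < ArraySize(names); i++)
  {
    Print(names[i], " = '", values[i], "'");
  }
}
```

For the string in OnStart, the sketch produces three pairs: href with the value 'a b', target with the value '_blank' and async with an empty value. A quoted value with a space is exactly the kind of input which a plain StringSplit by spaces would break.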
As for the DomElement class, most of the work in it should be performed by methods which select child
elements corresponding to given CSS selectors. Before we proceed to this feature, it is necessary to describe
the selector classes.
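A simple selector consists of one or more components. For example, the simple selector
td[align=left][width=325] breaks down into:
tag name - td
align attribute condition - [align=left]
width attribute condition - [width=325]
The simple selector td:first-child breaks down into:
tag name - td
child index condition using the pseudo class - :first-child
A single component is described by the SubSelector class: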
enum PseudoClassModifier
{
  none,
  firstChild,
  lastChild,
  nthChild,
  nthLastChild
};

class SubSelector
{
  public:
    ushort type;
    string value;
    PseudoClassModifier modifier;
    string param;
};
The 'type' variable contains the first character of the selector ('.', '#', '[') or the default 0, which
corresponds to the name selector. The 'value' variable stores the substring which follows the character, i.e.
the actual searched element. If the selector string has a pseudo class, its id is written to the 'modifier'
field. In the description of the ":nth-child" and ":nth-last-child" selectors, the index of the searched
element is specified in parentheses. This is saved in the 'param' field (it can only be a number in the current
implementation, but special formulas are also allowed and therefore the field is declared as string).
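How a component with a pseudo class is split into these fields can be sketched as follows. The function name SplitPseudoClass is hypothetical, and the sketch assumes that the component contains at most one pseudo class.

```mql5
// Hypothetical helper: splits "td:nth-child(3)" into the value "td",
// the pseudo class ":nth-child" and the parameter "3".
// Returns false if the component has no pseudo class.
bool SplitPseudoClass(const string component, string &value, string &pseudo, string &param)
{
  value = component;
  pseudo = "";
  param = "";
  int colon = StringFind(component, ":");
  if(colon == -1) return false;
  value = (colon > 0) ? StringSubstr(component, 0, colon) : "";
  pseudo = StringSubstr(component, colon);
  int lb = StringFind(pseudo, "(");
  int rb = StringFind(pseudo, ")");
  if(lb != -1 && rb > lb)
  {
    // the text between the parentheses is the parameter
    if(rb > lb + 1) param = StringSubstr(pseudo, lb + 1, rb - lb - 1);
    pseudo = StringSubstr(pseudo, 0, lb);
  }
  return true;
}

void OnStart()
{
  string value, pseudo, param;
  SplitPseudoClass("td:nth-child(3)", value, pseudo, param);
  Print(value, " | ", pseudo, " | ", param); // td | :nth-child | 3
}
```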
The SubSelectorArray class holds a group of such components, therefore let us declare the 'selectors' array in it:
class SubSelectorArray
{
  private:
    SubSelector *selectors[];
SubSelectorArray is one simple selector as a whole. No class is needed for full CSS selectors since they are
processed sequentially, step by step, i.e. one selector at each hierarchy level.
Let us add the supported pseudo class selectors to the 'mod' map. This enables the immediate getting of the
appropriate modifier from PseudoClassModifier for that string:
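    IndexMap mod;
    static TypeContainer<PseudoClassModifier> first;
    static TypeContainer<PseudoClassModifier> last;
    static TypeContainer<PseudoClassModifier> nth;
    static TypeContainer<PseudoClassModifier> nthLast;
    void init()
    {
      mod.add(":first-child", &first);
      mod.add(":last-child", &last);
      mod.add(":nth-child", &nth);
      mod.add(":nth-last-child", &nthLast);
    }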
The TypeContainer class is a template wrapper for the values which are added to IndexMap.
Note that static members (in this case objects for the map) must be initialized after the class description:
TypeContainer<PseudoClassModifier> SubSelectorArray::first(PseudoClassModifier::firstChild);
TypeContainer<PseudoClassModifier> SubSelectorArray::last(PseudoClassModifier::lastChild);
TypeContainer<PseudoClassModifier> SubSelectorArray::nth(PseudoClassModifier::nthChild);
TypeContainer<PseudoClassModifier> SubSelectorArray::nthLast(PseudoClassModifier::nthLastChild);
When it is necessary to add a simple selector component to the array, the add function is called:
{
int n = ArraySize(selectors);
ArrayResize(selectors, n + 1);
PseudoClassModifier m = PseudoClassModifier::none;
string param;
{
{
{
}
else
{
param = "";
}
}
m = mod[j].get<PseudoClassModifier>();
break;
}
}
if(StringLen(param) == 0)
{
}
else
{
}
}
The function receives the first character (type) and the string which follows it. The string is split into the
searched object name and, optionally, a pseudo class and its parameter. All this is then passed to the
SubSelector constructor, and the new selector component is added to the 'selectors' array.
The add function is used indirectly from the simple selector constructor:
private:
{
ushort p = 0; // previous/pending type
int ppos = 0;
int i, n = StringLen(selector);
{
{
if(i == 0) v = "*";
}
add(p, v);
p = t;
if(p == ']') p = 0;
ppos = i + 1;
}
}
if(ppos < n)
{
}
add(p, v);
}
}
public:
{
init();
createFromString(selector);
}
The createFromString function receives a text representation of the CSS selector and views it in a loop to find
special beginning characters '.', '#' or '[', then determines where the component ends and calls the 'add' method
for the selected information. The loop continues as long as the chain of components continues.
Now it is time to get back to the DomElement class. This is the most difficult part.
DomElement. Continued
The querySelect method is used to search for elements matching the specified selectors (in the textual
representation). In this method, the full CSS selector is divided into simple selectors, each of which is then
converted to a SubSelectorArray object. The list of matching elements is searched for each simple selector.
Elements matching the next simple selector are then searched relative to the elements already found. This
continues until the last simple selector is processed or until the list of found elements becomes empty.
    DomIterator *querySelect(const string q)
    {
      DomIterator *result = new DomIterator();
Here the return value is the unfamiliar DomIterator class, which is a child of DomElement. It only adds
auxiliary functionality to DomElement (in particular, it allows "scrolling" through child elements), so we will
not analyze DomIterator in detail now. There is another, more complicated part.
The selector string is analyzed character by character. For this purpose several local variables are used. The
current character is stored in the c variable (abbr. of 'character'). The previous character is stored in
the p variable (abbr. of 'previous'). If a character is one of combinator characters (' ', '+', '>', '~'), it is saved in a
variable (a), but is not used until the next simple selector is determined.
Combinators are located between simple selectors, while the operation defined by the combinators can only be
performed after reading the entire selector on the right. Therefore, the last read combinator (a) first passes
through the "waiting" state: the a variable is not used until the next combinator appears or the string end is
reached, while both cases mean that the selector has been fully formed. Only at this moment the "old"
combinator (b) is applied and is replaced by a new one (a). The code itself is clearer than its description:
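      int cursor = 0;        // where the current simple selector starts
      int i, n = StringLen(q);
      ushort p = 0;          // previous character
      ushort a = 0;          // next/pending combinator
      ushort b = '/';        // current combinator, the 'root' notation at the start
      string selector = "*"; // current simple selector, 'any' by default
      int index = 0;         // position in the resulting array of objects
      for(i = 0; i < n; i++)
      {
        ushort c = StringGetCharacter(q, i);
        if(isCombinator(c))
        {
          a = c;
          if(!isCombinator(p))
          {
            selector = StringSubstr(q, cursor, i - cursor);
          }
          else
          {
            // suppress spaces around other combinators
            a = MathMax(c, p);
          }
          cursor = i + 1;
        }
        else
        {
          if(isCombinator(p)) // action
          {
            index = result.getChildrenCount();
            SubSelectorArray selectors(selector);
            find(b, &selectors, result);
            b = a;
            // previous results are no longer needed
            result.removeFirst(index);
          }
        }
        p = c;
      }
      if(cursor < i) // action
      {
        selector = StringSubstr(q, cursor, i - cursor);
        index = result.getChildrenCount();
        SubSelectorArray selectors(selector);
        find(b, &selectors, result);
        result.removeFirst(index);
      }
      return result;
    }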
The 'cursor' variable always points at the first character, from which the string with the simple selector begins
(i.e. at the character which immediately follows the previous combinator or at the string beginning). When
another combinator is found, copy the substring from 'cursor' to the current character (i) into the 'selector'
variable.
Sometimes there are several combinators in succession: this may usually happen when other combinator
characters surround spaces, while the space itself is also a combinator. For example, the entries "td>span" and
"td > span" are equivalent, but spaces are inserted in the second case to improve readability. Such situations
are handled by the line:
a = MathMax(c, p);
It compares the current and previous characters if both are combinators. Then, based on the fact that the
space has the smallest code, select an "older" combinator. The combinator array is obviously defined as follows:
ushort combinators[] =
{
  ' ', '+', '>', '~'
};
Check of whether the character is included into this array is performed by a simple isCombinator helper
function.
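The selection of the "older" combinator can be illustrated with a small standalone sketch. The names combinatorChars, IsCombinatorChar and CollapseCombinators are hypothetical; a plain comparison is used instead of MathMax, which gives the same result for character codes.

```mql5
ushort combinatorChars[] = { ' ', '+', '>', '~' };

// Hypothetical helper: true if the character is one of the combinators.
bool IsCombinatorChar(const ushort c)
{
  for(int i = 0; i < ArraySize(combinatorChars); i++)
  {
    if(combinatorChars[i] == c) return true;
  }
  return false;
}

// Collapses a run of combinator characters into a single one.
// The space has the code 32, which is smaller than the codes of
// '+' (43), '>' (62) and '~' (126), so any other combinator wins over it.
ushort CollapseCombinators(const string run)
{
  ushort a = 0;
  for(int i = 0; i < StringLen(run); i++)
  {
    ushort c = StringGetCharacter(run, i);
    if(IsCombinatorChar(c) && c > a) a = c;
  }
  return a;
}

void OnStart()
{
  Print(ShortToString(CollapseCombinators(" > "))); // >
  Print(CollapseCombinators(" ") == ' ');           // true
}
```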
If there are two combinators in a row, other than a space, then the selector is erroneous and the behavior is
not defined in the specifications. However, our code keeps working in this case and behaves consistently.
If the current character is not a combinator while the previous character was a combinator, the execution falls
into the branch marked with the 'action' comment. First memorize the current size of the array of DomElements
selected up to this moment, then create the simple selector object and call the 'find' method:
index = result.getChildrenCount();
SubSelectorArray selectors(selector);
find(b, &selectors, result);
Pass the combinator character into it (this should be the preceding combinator, i.e. from the b variable), as
well as the simple selector and an array to output results to.
After that move the queue of combinators forward, copy the last found combinator (which has not yet been
processed) from the a variable to b and delete from results everything which was available before the call of
'find' using:
result.removeFirst(index);
The removeFirst method is defined in DomIterator. It performs a simple task: it deletes the given number of
elements from the beginning of the array. This is done because each successive simple selector narrows the
selection conditions, so everything selected earlier is no longer valid, while the newly added elements
(which meet the narrower conditions) start at 'index'.
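The "remember the size, append, then trim" pattern can be illustrated as follows. This is a minimal C++ sketch on a plain vector of labels; the real DomIterator stores DomElement pointers, and the function below only shares its name with the method described above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Drops the first `count` entries, keeping only what was appended after them.
void removeFirst(std::vector<std::string> &items, std::size_t count)
{
    if (count > items.size()) count = items.size();
    items.erase(items.begin(), items.begin() + static_cast<std::ptrdiff_t>(count));
}
```

If the results hold two rows, 'index' is 2. After the next simple selector appends its matches, removeFirst(results, index) leaves only those new matches.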
Similar processing (marked with the 'action' comment) is also performed after reaching the end of the input
string. In this case, the last pending combinator should be processed in conjunction with the rest of the line
(from the 'cursor' position).
{
bool found = false;
int i, n;
If one of the combinators that set tag nesting conditions (' ', '>') is passed in, the checks should be called
recursively for all child elements. In this branch we also need to take into account the special combinator '/',
which the calling method uses at the beginning of the search.
{
n = ArraySize(children);
if(children[i].match(selectors))
{
if(op == '/')
{
found = true;
output.addChild(GetPointer(children[i]));
}
The 'match' method will be considered later. It returns true, if the object corresponds to passed selector or
false if otherwise. At the very beginning of search (combinator op = '/'), there are no combinations yet, so all
tags meeting selector rules are added to the result (output.addChild).
else
{
DomElement *p = &this;
while(p != NULL)
{
if(output.getChildIndex(p) != -1)
{
found = true;
output.addChild(GetPointer(children[i]));
break;
}
p = p.parent;
}
}
For the combinator ' ', we check whether the current DomElement or any of its ancestors already exists in
'output'. If so, the new child element that satisfies the search conditions is nested inside an element
selected earlier. This is exactly the task of the combinator.
The combinator '>' operates in a similar way, but it needs to track only immediate "relatives" and thus only
check if the current DomElement is available in interim results. If it is, then it has earlier been selected to
'output' by conditions of the selector on the left of the combinator and its i-th child element has just met the
conditions of the selector to the right of the combinator.
else // op == '>'
{
if(output.getChildIndex(&this) != -1)
{
found = true;
output.addChild(GetPointer(children[i]));
}
}
}
Then similar checks need to be performed deep in the DOM tree, therefore 'find' should be recursively called
for child elements.
}
}
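The nesting checks for '/', ' ' and '>' can be tried on a tiny tree. The following is a simplified C++ sketch under several assumptions: the Node type, the findMatches function and matching by tag name only are hypothetical stand-ins for DomElement, 'find' and 'match'. Unlike the code above, which appends to the same 'output' array and trims it afterwards, the sketch keeps the previous results in a separate list.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

struct Node
{
    std::string name;
    Node *parent = nullptr;
    std::vector<std::unique_ptr<Node>> children;

    Node *add(const std::string &childName)
    {
        children.push_back(std::make_unique<Node>());
        Node *c = children.back().get();
        c->name = childName;
        c->parent = this;
        return c;
    }
};

using NodeList = std::vector<const Node *>;

bool contains(const NodeList &list, const Node *n)
{
    return std::find(list.begin(), list.end(), n) != list.end();
}

// Appends to `out` every element under `node` named `tag` whose relation
// to the previously selected elements satisfies the combinator `op`.
void findMatches(char op, const std::string &tag, const Node &node,
                 const NodeList &previous, NodeList &out)
{
    for (const auto &child : node.children)
    {
        if (child->name == tag)
        {
            bool ok = false;
            if (op == '/')
                ok = true;                          // start of search
            else if (op == '>')
                ok = contains(previous, &node);     // direct parent only
            else if (op == ' ')
                for (const Node *p = &node; p != nullptr && !ok; p = p->parent)
                    ok = contains(previous, p);     // any ancestor
            if (ok) out.push_back(child.get());
        }
        findMatches(op, tag, *child, previous, out); // go deeper into the tree
    }
}
```

With a 'div' that contains a 'span' nested in another 'span', the descendant combinator selects both spans, while the child combinator selects only the outer one.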
Combinators '+' and '~' require that the two elements share the same parent.
else
if(op == '+' || op == '~')
{
if(CheckPointer(parent) == POINTER_DYNAMIC)
{
if(output.getChildIndex(&this) != -1)
{
One of the elements must be already selected by a selector on the left. If this condition is met, check the
"siblings" for the selector on the right ("siblings" are the children of the current node parent).
int q = parent.getChildIndex(&this);
if(q != -1)
{
DomElement *m = parent.getChild(i);
if(m.match(selectors))
{
found = true;
output.addChild(m);
}
}
}
The difference between handling of '+' and '~' is as follows: with '+' elements must be immediate neighbors
while with '~' there can be any number of other "siblings" between the elements. Therefore the loop is only
performed once for '+', i.e. for the next element in the array of child elements. The 'match' function is called
again inside the loop (see details later).
}
}
{
}
}
return found;
}
After all checks, move to the next level of the DOM tree hierarchy and call 'find' for the child nodes.
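The difference between '+' and '~' comes down to how far the loop over siblings runs. Below is a minimal C++ sketch on a flat list of sibling tag names; the function siblingsAfter and the representation of siblings as strings are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the indexes of siblings located after position q whose tag equals `tag`.
// '+' inspects only the immediate neighbour, '~' scans to the end of the list.
std::vector<std::size_t> siblingsAfter(const std::vector<std::string> &siblings,
                                       std::size_t q, char op, const std::string &tag)
{
    std::vector<std::size_t> hits;
    for (std::size_t i = q + 1; i < siblings.size(); ++i)
    {
        if (siblings[i] == tag) hits.push_back(i);
        if (op == '+') break;  // adjacent sibling: a single iteration
    }
    return hits;
}
```

For the siblings h1, table, p, table, table and the first table at index 1, '~' finds the tables at indexes 3 and 4, while '+' finds nothing because the next sibling is 'p'.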
That is all about the 'find' method. Now let us view the 'match' function. This is the last point in the description
of the selector implementation.
The function checks in the current object the entire chain of components of a simple selector passed through
an input parameter. If at least one component in the loop does not match the element properties, the check
fails.
{
bool matched = true;
int i, n = u.size();
{
{
if(u[i].value == "*")
{
}
The 0 type selector is the tag name or a pseudo class. Any tag is suitable for a selector containing an asterisk.
Otherwise the selector string should be compared with the tag name:
else
if(StringCompare(name, u[i].value) != 0)
{
matched = false;
}
The currently implemented pseudo-classes set limitations on the number of the current element in the array of
a parent's child elements, so we analyze the indexes:
else
if(u[i].modifier == PseudoClassModifier::firstChild)
{
{
matched = false;
}
}
else
if(u[i].modifier == PseudoClassModifier::lastChild)
{
matched = false;
}
}
else
if(u[i].modifier == PseudoClassModifier::nthChild)
{
int x = (int)StringToInteger(u[i].param);
matched = false;
}
}
else
if(u[i].modifier == PseudoClassModifier::nthLastChild)
{
matched = false;
}
}
}
else
if(u[i].type == '.')
{
if(attributes.isKeyExisting("class"))
{
Container *c = attributes["class"];
if(c == NULL || StringFind(" " + c.get<string>() + " ", " " + u[i].value + "
{
matched = false;
}
}
else
{
matched = false;
}
}
else
if(u[i].type == '#')
{
if(attributes.isKeyExisting("id"))
{
Container *c = attributes["id"];
{
matched = false;
}
}
else
{
matched = false;
}
}
The selector '[' enables the specification of an arbitrary set of required attributes. Also, in addition to strict
comparison of values, it is possible to check the occurrence of a substring (suffix '*'), beginning ('^') and end
('$').
else
if(u[i].type == '[')
{
AttributesParser p;
IndexMap hm;
p.parseAll(u[i].value, hm);
if(hm.getSize() > 0)
{
}
else
{
suffix = 0;
}
{
if(StringLen(v) > 0)
{
if(suffix == 0)
{
if(key == "class")
{
else
{
}
}
else
if(suffix == '*')
{
}
else
if(suffix == '^')
{
}
else
if(suffix == '$')
{
string x = attributes[key].get<string>();
{
}
}
}
else
{
matched = false;
}
}
}
}
return matched;
}
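The four comparison modes of the '[' selector can be summarized in one small function. This is a standalone C++ sketch: the name attributeMatches and the convention of passing 0 for a strict comparison are assumptions that only loosely follow the 'suffix' variable in the listing above.

```cpp
#include <string>

// op: 0 for exact match, '*' for substring, '^' for prefix, '$' for suffix.
bool attributeMatches(const std::string &actual, char op, const std::string &expected)
{
    switch (op)
    {
    case 0:
        return actual == expected;
    case '*':
        return actual.find(expected) != std::string::npos;
    case '^':
        return actual.compare(0, expected.size(), expected) == 0;
    case '$':
        return actual.size() >= expected.size() &&
               actual.compare(actual.size() - expected.size(), expected.size(), expected) == 0;
    default:
        return false;
    }
}
```

For example, a bgcolor value of "#F7F7F7" passes the '^' check against "#F", while "#E5F0FC" does not.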
Please note that the "class" attribute is also supported and processed here. Moreover, as with '.', the check
is not for a strict match but for the presence of the class among a set of other classes. HTML elements are
often assigned multiple classes. In this case the classes are listed in the 'class' attribute, separated by
spaces.
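The class check relies on a common trick: both the attribute value and the class name are padded with spaces, so that a plain substring search matches whole class names only. A minimal C++ sketch, with the hypothetical function name hasClass:

```cpp
#include <string>

// True if `cls` is one of the space-separated tokens in `classAttr`.
bool hasClass(const std::string &classAttr, const std::string &cls)
{
    const std::string padded = " " + classAttr + " ";
    return padded.find(" " + cls + " ") != std::string::npos;
}
```

Without the padding, a search for "date" would wrongly succeed on class="msdate right".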
Let's sum up the intermediate results. We have implemented in the DomElement class the querySelect
method, which accepts a string with the full CSS selector as a parameter and returns the DomIterator
object, i.e. an array of found matching elements. Inside querySelect, the CSS selector string is divided into
a sequence of simple selectors and combinator characters between them. For each simple selector, the
'find' method with the specified combinator is called. This method updates the list of results, while
recursively calling itself for child elements. Comparison of simple selector components with the properties
of a particular element is performed in the 'match' method.
For example, using the querySelect method we can select rows from a table using one CSS selector and then
we can call querySelect for each row with another CSS selector to isolate specific cells. Since operations with
tables are required very often, let us create the tableSelect method in the DomElement class, which will
implement the above described approach. Its code is provided in a simplified form.
The row selector is specified in the rowSelector parameter, while cell selectors are specified in the
columSelectors array.
Once all the elements are selected, we will need to take some information from them, such as text or attribute
value. Let us use the dataSelectors to determine the position of the required information within an element,
while an individual data extraction method can be used for each table column.
If dataSelectors[i] is an empty string, read the textual contents of the tag (between the opening and closing
parts, for example "100%" from tag "<p>100%</p>"). If dataSelectors[i] is a non-empty string, treat it as an
attribute name and use the value of that attribute.
DomIterator *r = querySelect(rowSelector);
int counter = 0;
r.rewind();
Here we create an empty map to which table data will be added, and prepare for a loop through row objects.
Here is the loop:
while(r.hasNext())
{
DomElement *e = r.next();
string id = IntegerToString(counter);
Thus we get the next row (e), create a container map for it (row), to which cells will be added, and run a loop
through the columns:
{
DomIterator *d = e.querySelect(columSelectors[i]);
In each row object, select the list of cell objects (d) using the appropriate selector. Select data from each
found cell and save it to the 'row' map:
string value;
if(d.getChildrenCount() > 0)
{
if(dataSelectors[i] == "")
{
value = d[0].getText();
}
else
{
value = d[0].getAttribute(dataSelectors[i]);
}
StringTrimLeft(value);
StringTrimRight(value);
row.setValue(IntegerToString(i), value);
}
Integer keys are used here for code simplicity, while the full source code supports the use of element
identifiers for the keys.
{
row.set(IntegerToString(i));
}
delete d;
}
if(row.getSize() > 0)
{
data.set(id, row);
counter++;
}
else
{
delete row;
}
}
delete r;
return data;
}
Thus, at the output we get a map of maps, i.e. a table with row numbers along the first dimension and column
numbers along the second. If necessary, the tableSelect function can be adjusted to other data containers.
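To make the "map of maps" layout more tangible, here is a small C++ sketch that stores cells by row and column number and serializes them to CSV. The types Row and Table and the function toCsv are hypothetical. The article's own containers use string keys and provide the asCSVString method instead.

```cpp
#include <map>
#include <string>

using Row = std::map<int, std::string>;   // column number -> cell text
using Table = std::map<int, Row>;         // row number -> row

// Writes one line per row; missing cells become empty fields.
std::string toCsv(const Table &table, int columns)
{
    std::string out;
    for (const auto &entry : table)
    {
        for (int c = 0; c < columns; ++c)
        {
            if (c > 0) out += ',';
            auto cell = entry.second.find(c);
            if (cell != entry.second.end()) out += cell->second;
        }
        out += '\n';
    }
    return out;
}
```

The sketch does not quote values that contain commas, which a production CSV writer would have to do.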
A non-trading utility Expert Advisor was created to apply all the above classes.
The Expert Advisor receives the following input parameters: a link to the source data (a local file or a web
page which can be downloaded using WebRequest), row and column selectors and the CSV file name. The main
input parameters are shown below:
In URL, specify the web page address (beginning with http:// or https://) or the local html file name.
In SaveName, the name of the CSV file with results is specified in normal mode. But it can also be used for
another purpose: to save the downloaded page for subsequent debugging of selectors. In this mode the next
parameter, RowSelector, in which the CSS row selector is usually specified, should be left empty.
Since there are several column selectors, they are set in a separate CSV settings file, whose name is specified
in the ColumnSettingsFile parameter. The file format is as follows.
The first line is the header, each subsequent line describes a separate field (a data column in the table row).
The file should have three columns: name, CSS selector, data locator:
The TestQuery and TestSubQuery parameters allow testing selectors for a row and for one column. In this mode
the result is output to the log, without saving to CSV and without using the settings file for all columns.
Here is the main operating function of the Expert Advisor in a brief form.
int process()
string xml;
{
xml = ReadWebPageWR(URL);
}
else
{
{
Print("Error reading file '", URL, "': ", GetLastError());
return -1;
return 1;
}
StringInit(xml, (int)FileSize(h));
while(!FileIsEnding(h))
{
xml += FileReadString(h) + "\n";
}
// xml = FileReadString(h, (int)FileSize(h)); - has 4095 bytes limit in binary files
FileClose(h);
}
...
Thus we have read an HTML page from a file or downloaded from the Internet. Now, in order to convert the
document to the hierarchy of DOM objects, let us create the HtmlParser object and start parsing:
HtmlParser p;
if(TestQuery != "")
{
DomIterator *r = document.querySelect(TestQuery);
r.printAll();
if(TestSubQuery != "")
{
r.rewind();
while(r.hasNext())
{
DomElement *e = r.next();
DomIterator *d = e.querySelect(TestSubQuery);
d.printAll();
delete d;
}
}
delete r;
return(0);
}
In a normal operation mode, read the column setting file and call the tableSelect function:
string columnSelectors[];
string dataSelectors[];
string headers[];
If a CSV file for saving results is specified, let the 'data' map perform this task.
if(StringLen(SaveName) > 0)
{
if(h == INVALID_HANDLE)
{
Print("Error writing ", data.getSize() ," rows to file '", SaveName, "': ", GetLas
}
else
{
FileWriteString(h, StringImplodeExt(headers, ",") + "\n");
FileWriteString(h, data.asCSVString());
FileClose(h);
}
}
else
{
Print("\n" + data.asCSVString());
}
delete data;
return(0);
Practical use
Traders often deal with some standard HTML files, such as testing reports and trading reports generated by
MetaTrader. We sometimes receive such files from other traders or download them from the Internet and want to
visualize the data on a chart for further analysis. For this purpose, the data from HTML should be converted to
a tabular view (to the CSV format in the simplest case).
Let us have a look inside the HTML files. Below is the appearance and part of HTML code of the MetaTrader 5
trading report (the ReportHistory.html file is attached below).
And now here is the appearance and part of HTML code of the MetaTrader 5 testing report (the Tester.html file
is attached below).
According to the appearance in the above figure, the trading report has 2 tables: Orders and Deals. However,
from the internal layout we can see that this is a single table. All visible headers and the dividing line are
formed by the styles of table cells. We need to learn to distinguish between orders and deals and save each of
the sub-tables to a separate CSV file.
The difference between the first part and the second is in the number of columns: 11 columns for orders and
13 columns for deals. Unfortunately, the CSS standard does not allow setting conditions for selecting parent
elements (in our case, the table rows, 'tr' tag) based on the number or content of children (in our case, table
cells, 'td' tag). So, in some cases, required elements cannot be selected using standard means. But we are
developing our own implementation of selectors and thus we can add a special non-standard selector for the
number of child elements. This will be a new pseudo class. Let us set it as ":has-n-children(n)", by analogy with
":nth-child(n)".
The following selector can be used for selecting order rows:
tr:has-n-children(11)
However, this is not the entire solution to the problem, because this selector selects the table header in
addition to the data rows. Let us remove it. Pay attention to coloring of data rows - the bgcolor attribute is set
for them, and the color value alternates for even and odd rows (#FFFFFF and #F7F7F7). A color, i.e. the bgcolor
attribute is also used for the header, but its value is equal to #E5F0FC. Thus, the data rows have light colors
with bgcolor starting with "#F". Let us add this condition to the selector:
tr:has-n-children(11)[bgcolor^="#F"]
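The effect of this combined selector can be checked on a few sample rows. The C++ sketch below works on pre-extracted facts about each row instead of a real DOM; the RowInfo structure and the selectRows function are hypothetical, and the sample values are taken from the colors mentioned above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct RowInfo
{
    int cellCount;        // number of 'td' children of the row
    std::string bgcolor;  // value of the bgcolor attribute, empty if absent
};

// Mirrors tr:has-n-children(n)[bgcolor^="prefix"]: both conditions must hold.
std::vector<std::size_t> selectRows(const std::vector<RowInfo> &rows, int n,
                                    const std::string &prefix)
{
    std::vector<std::size_t> picked;
    for (std::size_t i = 0; i < rows.size(); ++i)
    {
        const bool countOk = rows[i].cellCount == n;
        const bool colorOk = rows[i].bgcolor.compare(0, prefix.size(), prefix) == 0;
        if (countOk && colorOk) picked.push_back(i);
    }
    return picked;
}
```

A header row with 11 cells and bgcolor "#E5F0FC" is rejected by the color condition, while data rows with "#FFFFFF" or "#F7F7F7" pass.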
Parameters of each order can be read from the row cells. To do this, let us write the configuration file
ReportHistoryOrders.cfg.csv:
Name,Selector,Data
Time,td:nth-child(1),
Order,td:nth-child(2),
Symbol,td:nth-child(3),
Type,td:nth-child(4),
Volume,td:nth-child(5),
Price,td:nth-child(6),
S/L,td:nth-child(7),
T/P,td:nth-child(8),
Time,td:nth-child(9),
State,td:nth-child(10),
Comment,td:nth-child(11),
All fields in this file are simply identified by the sequence number. In other cases you may need smarter
selectors with attributes and classes.
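Reading such a settings file amounts to skipping the header and splitting every line into three fields. Here is a minimal C++ sketch of that step; the ColumnSetting structure and the parseColumnSettings function are hypothetical and are not the Expert Advisor's actual reader. Each line is split at its first and last comma, so the selector in the middle may itself contain commas.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct ColumnSetting
{
    std::string name, selector, data;
};

std::vector<ColumnSetting> parseColumnSettings(const std::string &text)
{
    std::vector<ColumnSetting> columns;
    std::istringstream in(text);
    std::string line;
    bool header = true;
    while (std::getline(in, line))
    {
        if (header) { header = false; continue; }  // skip "Name,Selector,Data"
        if (line.empty()) continue;
        const std::size_t first = line.find(',');
        const std::size_t last = line.rfind(',');
        if (first == std::string::npos || first == last) continue;  // two commas required
        columns.push_back({line.substr(0, first),
                           line.substr(first + 1, last - first - 1),
                           line.substr(last + 1)});
    }
    return columns;
}
```

An empty third field means "take the tag text", while a non-empty one would name the attribute to read.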
To get a table of deals, simply replace the number of child elements with 13 in the row selector:
tr:has-n-children(13)[bgcolor^="#F"]
Now, by launching WebDataExtractor with the following input parameters (the webdataex-report1.set file is
attached):
URL=ReportHistory.html
SaveName=ReportOrders.csv
RowSelector=tr:has-n-children(11)[bgcolor^="#F"]
ColumnSettingsFile=ReportHistoryOrders.cfg.csv
we will receive the resulting ReportOrders.csv file which corresponds to the source HTML report:
CSV file resulting from the application of CSS selectors to a trading report
To get the table of deals, use the attached settings from webdataex-report2.set.
The selectors which we created are also suitable for tester reports. The attached webdataex-tester1.set and
webdataex-tester2.set allow you to convert a sample HTML report Tester.html into CSV files.
Important! The layout of many web pages, as well as of the HTML files generated by MetaTrader, can change
from time to time. Because of this, some of the selectors may no longer be applicable, even if the
external presentation is almost the same. In this case you should re-analyze the HTML code and modify the
CSS selectors accordingly.
Now let us view the conversion for the MetaTrader 4 tester report; this allows demonstration of some
interesting techniques in choosing CSS selectors. For the check we will use the attached
StrategyTester-ecn-1.htm.
These files have two tables: one with the testing results and the other one with trading operations. To
select the second table we will use the selector "table ~ table". Omit the first row in the operations table,
because it contains a header. This can be done using the selector "tr + tr".
table ~ table tr + tr
This actually means the following: select a table that follows another table (i.e. the second one), and inside
it select each row that has a preceding row (i.e. all rows except the first one).
Settings for extracting deal parameters from cells are available in the file test-report-mt4.cfg.csv. The date
field is processed by the class selector:
DateTime,td.msdate,
Additional CSS selector use and setup examples are provided in the WebDataExtractor discussion page.
automatically substitute strings (for example, change country names to currency symbols or replace
verbal descriptions of news priority with a number)
output the DOM tree to log and find suitable selectors without a browser;
download and convert web pages by timer or by request from a global variable;
If you need help setting up CSS selectors for a specific web page, you can purchase WebDataExtractor
(for MetaTrader 4, for MetaTrader 5) and receive recommendations as part of product support. However, the
availability of source codes allows you to use the entire functionality and expand it if necessary. This is
absolutely free.
Conclusions
We have considered the technology of CSS selectors, which is one of the main standards in the interpretation of
web documents. The implementation of the most commonly used CSS selectors in MQL allows the flexible setup
and conversion of any HTML page, including standard MetaTrader documents, into structured data without
using third-party software.
We have not considered some other technologies which can also provide versatile tools for processing web
documents. Such tools can be useful, because MetaTrader uses not only HTML, but also the XML format. Traders
may be especially interested in XPath and XSLT. These technologies can serve as further steps in developing
the idea of automating trading systems based on web standards. Support for CSS selectors in MQL is just the
first step towards this goal.
Translated from Russian by MetaQuotes Software Corp.
Original article: https://www.mql5.com/ru/articles/5706
Warning: All rights to these materials are reserved by MetaQuotes Ltd. Copying or reprinting of these materials in whole or in part is prohibited.