0% found this document useful (0 votes)

85 views

Extracting Structured Data From HTML Pages Using CSS Selectors - MQL5 Articles

The document discusses extracting structured data from HTML pages using CSS selectors in MQL5. It provides an overview of HTML/CSS and the DOM technology. It explains how to create an HTML parser that converts a page into a DOM hierarchy of objects. CSS selectors can then be used to search and select specific elements from the DOM tree to extract desired data like table rows/columns for use in MQL5 applications.

Uploaded by

MFS

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

85 views

Extracting Structured Data From HTML Pages Using CSS Selectors - MQL5 Articles

Uploaded by

MFS

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 35

8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

METATRADER 5 — INTEGRATION

EXTRACTING STRUCTURED DATA FROM HTML

PAGES USING CSS SELECTORS
6 May 2019, 10:11

0 15 501
STANISLAV KOROTKY

The MetaTrader development environment enables the integration of applications with external data, in
particular with the data obtained from the Internet using the WebRequest function. HTML is the most universal
and the most frequently used data format on the web. If a public service does not provide an open API for
requests or its protocol is difficult to implement in MQL, the desired HTML pages can be parsed. In particular,
traders often use various economic calendars. Although the task is not so relevant now, since the platform
features the built-in Calendar, some traders may need specific news from specific sites. Also, we sometimes
need to analyze deals from a trading HTML report received from third parties.
The MQL5 ecosystem provides various solutions to the problem, which however are usually specific and have
their limitations. On the other hand, there is kind of "native" and universal method to search and parse data
from HTML. This method is connected with the use of CSS selectors. In this article we will consider the MQL5
implementation of this method, as well as examples of their practical use.
To analyze HTML, we need to create a parser which can convert internal page text into a hierarchy of some
objects called Document Object Model or DOM. From this hierarchy, we will be able to find objects with
specified parameters. This approach is based on the use of service information about the document structure,
which is not available in the external page view.
For example, we can select rows of a specific table in a document, read the required columns from them and
get an array with values, which can be easily saved into a csv file, displayed on a chart or used in Expert
Advisor calculations.

Overview of HTML/CSS and DOM technology

HTML is a popular format which is familiar to almost everyone. Therefore, I will not describe in detail the
syntax of this hypertext markup language.

The primary source of related technical information is IETF (Internet Engineering Task Force) and its
specifications, the so-called RFC (Request For Comments). There are a lot of HTML specifications (here is an
example). Standards are also available on the website of the related organization, W3C (World Wide Web
Consortium, HTML5.2).

These organizations have developed the CSS (Cascading Style Sheets) technology and they regulate it. However
we are interested in this technology not because it describes information representation styles on web pages,
but because of CSS selectors contained therein, i.e. a special query language which enables the search of
elements inside html pages.
Both HTML and CSS keep constantly evolving, while new versions are being created. For example, the currently
relevant versions are HTML5.2 and CSS4. However the update and expansion is always accompanied with the
inheritance of old version features. The web is so large, heterogeneous and is often inert, and thus new
versions exist along old ones. As a result, when writing algorithms which imply the use of web technologies,
you should carefully use the specifications: on the one hand, you should take into account possible traditional
deviations and on the other hand you should add some simplifications which will help in avoiding issues with
multiple variations.

https://www.mql5.com/en/articles/5706 1/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

In this project, we will consider the simplified HTML syntax.

An html document consists of tags inside characters '<' and '>'. The tag name and optional attributes are
specified inside the tag. Optional attributes are string pairs of name="value", while the sign '=' can sometimes
be omitted. Here is a tag example:

<a href="https://www.w3.org/standards/webdesign/htmlcss" target="_blank">HTML and CSS</a>

— this is a tag named 'a' (which is interpreted by web browsers as a hyperlink), with two parameters: 'href' for
the website address at the specified hyperlink and 'target' for the website opening option (in this case it is
equal to "_blank", i.e. the site should open in a new browser tab).
This first tag is the opening tag. It is followed by the text which is actually visible on the web page: "HTML and
CSS", and the matching closing tag, having the same name as the opening tag and an additional slash '/' after
the angle bracket '<' (all characters together make up the tag '</a>'). In other words, opening and closing tags
are used in pairs and may include other tags, but only whole tags, without overlapping. Here is an example of a
correct nesting:

</group>

the following "overlapping" is not allowed:

However, the use is not allowed only in theory. In practice, tags may often be opened or closed by mistake in
the wrong place of the document. The parser should be able to handle this situation.
Some tags may be empty, i.e. this can be an empty line:

<p></p>

In accordance with the standards, some tags may (or rather must) have no content at all. For example, the tag
describing an image:

It looks like an opening tag, but it does not have a matching closing one. Such tags are called empty. Please
note that the attributes belonging to the tag are not the tag contents.
It is not always easy to determine whether a tag is empty and whether there should be a closing tag further.
Although the names of valid empty tags are defined in specifications, some other tags may remain unclosed.
Also because HTML and XML formats are close (and there is another variety XHTML), some web page designers
create empty tags as follows:

<img src="/ico20190101.jpg" />

https://www.mql5.com/en/articles/5706 2/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

Pay attention to the slash '/' before the angle bracket '>'. This slash is considered excessive in terms of strict
HTML5 rules. All these specific cases can be met in normal web pages, so the parser must be able to handle
them.
Tag and attribute names which are interpreted by web browsers are standard, but HTML can contain
customized elements. Such elements are skipped by browsers unless the developers "connects" them to DOM
using the specialized script API. We should keep in mind that every tag may contain useful information.
A parser can be considered a finite-state machine, which advances letter by letter and changes its state in
accordance with the context. It is clear from the above tag structure description that initially the parser is
outside of any tag (let us call this state "blank"). Then, after encountering the opening angle bracket '<' we get
into an opening tag (the "insideTagOpen" state), which lasts until the closing angle bracket '>'. The combination
of characters '</' suggests that we are in a closing tag (the "insideTagClose" state), and so on. Other states will
be considered in the parser implementation section.
When switching between states, we can select structured information from the current position in the
document, because we know the meaning of the state. For example, if the current position is inside an opening
tag, the tag name can be selected as a line between the last '<' and the subsequent space or '>' (depending on
whether the tag contains attributes). The parser will extract data and create objects of a certain DomElement
class. In addition to the name, attributes and contents, the hierarchy of the objects will be preserved based on
the tags nesting structure. In other words, each object will have a parent (except the root element which
describes the entire document) and an optional array of child objects.
The parser will output the full tree of objects, in which one object will corresponds to one tag in the source
document.
CSS selectors describe standard notations for the conditional selection of objects based on their parameters
and position in the hierarchy. The full list of selectors is quite extensive. We will provide support for some of
them, which are included in the CSS1, CSS2 and CSS3 standards.

Here is a the list of the main selector components:

* - any object (universal selector);

.value — an object with the 'class' attribute and a "value"; example: <div class="example"></div>;
appropriate selector: .example;
#id — an object with the 'id' attribute and a "value"; for the tag <div id="unique"></div> it is the selector
#unique;
tag — an object with the 'tag' name; to find all 'div's as the above one or <div>text</div>, use selector:
div;

They can be accompanied by the so-called pseudo classes which are added on the right:

:first-child — the object is the first child class inside a parent;

:last-child — the object is the last child class i9nside the parent;
:nth-child(n) — the object has the specified position number in the list of child nods of its parent;
:nth-last-child(n) — the object has the specified position number in the list of child nods of its parent
with reverse numbering;

A single selector can be supplemented by the condition related to attributes:

[attr] — the object has the 'attr' attribute (it does not matter whether the attribute has any value or
not);
[attr=value] — the object has the 'attr' attribute with the 'value';
[attr*=text] — the object has the 'attr' attribute with the value containing the substring 'text';
[attr^=start] — the object has the 'attr' attribute with the value beginning with the 'start' string;
[attr$=end] — the object has the 'attr' attribute with the value ending with the 'end' substring;

If needed, it is possible to specify several pairs of brackets with different attributes.

Simple selector is the name selector or a universal selector which can be optionally followed by a class, an
identifier, zero or more attributes or a pseudo class, in any order. A simple selector selects an element when all
components of the selector match the element properties.

https://www.mql5.com/en/articles/5706 3/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

CSS selector (or full selector) is a chain of one or more simple selectors joined by combining characters (' '
(space), '>', '+', '~'):

container element — the 'element' object is nested in the 'container' object at an arbitrary level;
parent > element — the 'element' object has a direct parent 'parent' (the nesting level is equal to 1);
e1 + element — the 'element' object has a common parent with 'e1' and immediately follows it;
e1 ~ element — the 'element' object has a common parent with 'e1' and follows it at any distance;

So far, we have been studying pure theory. Let us view how the above ideas work in practice.

Any modern web browser allows viewing HTML of the currently open page. For example, in Chrome you can run
the 'View page source' command from the context menu or open the developer window (Developer tools,
Ctrl+Shift+I). The developer window has the Console tab, in which we can try to find elements using CSS
selectors. To apply a selector, simply call the document.querySelectorAll function from the console (it is
included in the software API of all browsers).
For example, in the start forum page https://www.mql5.com/en/forum, we can run the following command
(JavaScript code):

document.querySelectorAll("div.widgetHeader")

As a result of this we will receive a list of 'div' elements (tags), in which the "widgetHeader" class is specified. I
decided to use this selector after viewing the source page code, based on which it is clear that the forum
topics are designed in this way.

The selector can be expanded as follows:

document.querySelectorAll("div.widgetHeader a:first-child")

to receive the list of forum topic discussion headers: they are available as hyperlinks 'a', which are first child
elements in each 'div' block selected at the first stage. Here is how this might look (depends on the browser
version):

https://www.mql5.com/en/articles/5706 4/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

The MQL5 web page and result of selection of HTML elements using CSS selectors

You should similarly analyze the HTML code of desired sites, spot the elements of interest and pick up
appropriate CSS selectors. The developer window features the Elements (or similar) tab, in which you can
select any tag in the document (this tag will be highlighted) and find appropriate CSS selectors for this tag.
Thus you will gradually practice the use of selectors and learn to create selector chains manually. Further we
will consider how to select appropriate selectors for a specific web page.

Designing

Let us view the classes which we may need, at a global level. The initial HTML text processing will be
performed by the the HtmlParser class. The class will scan the text for markup characters '<', '/', '>' and some
others, and it will create DomElement class objects according to the above described finite-state machine
rules: one object will be created for each empty tag or a pair of opening and closing tags. The opening tag may
have attributes, which we need to read and save in the current DomElement object. This will be performed by
the AttributesParser class. The class will also operate following the principle of the finite-state machine.

The parser will create DomElement objects taking into account the hierarchy, which is identical to tag nesting
order. For example, if the text contains the 'div' tag, within which several paragraphs are placed (which means
the presence of 'p' tags), such paragraphs will be converted into child objects of the object which describes
'div'.
The initial root object will contain the entire document. Similarly to the browser (which provides the
document.querySelectorAll method), we should provide in DomElement a method for requesting objects
corresponding to passed CSS selectors. The selectors should also be pre-analyzed and converted from the string
representation to objects: a single selector component will be stored in the SubSelector class and the entire
simple selector will be stored in SubSelectorArray.

Once we have the ready DOM tree as a result of the parser operation, we can request from the root
DomElement object (or any other object) all its child elements matching selector parameters. All selected

https://www.mql5.com/en/articles/5706 5/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

elements will be placed in the iterable DomIterator list. For simplicity, let us implement the list as a child of
DomElement, in which an array of child nodes is used for storing the found elements.

Settings with specific site or HTML files processing rules and the algorithm execution result can be conveniently
stored in a class, which combines map properties (i.e. provides access to values based on the names of
appropriate attributes) and array properties (i.e. access to elements by index). Let us call this class IndexMap.
Let us provide the possibility to nest IndexMap one into another: when collecting tabular data from a web
page, we get a list of rows each containing the list of columns. For both of data types we can save the names
of source elements. This can be especially useful in cases where some of the required elements are missing in
the source document (which may happen quite often) - in such cases simple indexing ignores important
information about which data is missing. As a bonus, let us "train" IndexMap to get serialized into a multiline
text, including CSV format. This feature is useful when converting HTML pages into tabular data. If necessary,
you can replace the IndexMap class with your own while preserving the main functionality.
The following UML diagram displays the described classes.

UML diagram of classes implementing CSS selectors in MQL

Implementation
HtmlParser
In the HtmlParser class, we describe the variables which are necessary to scan the source text and to generate
the object tree, as well as to arrange the finite-state machine algorithm.

The current position in the text is stored in the 'offset' variable. The resulting tree root and the current object
(scanning is performed in this object context) are represented by the 'root' and 'cursor' pointers. Their
DomElement type will be considered later. The list of tags, which may be empty according to the HTML
specification, will be loaded into the 'empties' map (which is initialized in the constructor, see below). Finally,
we provide the 'state' variable for the description of finite-state machine states. The variable is an
enumeration of the StateBit type.

enum StateBit

blank,

insideTagOpen,

insideTagClose,

https://www.mql5.com/en/articles/5706 6/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
insideComment,

insideScript

};

class HtmlParser

private:

StateBit state;

int offset;
DomElement *root;

DomElement *cursor;

IndexMap empties;

...

The StateBit enumeration contains elements describing the following parser states depending on the current
position in the text:

blank — outside of a tag;

insideTagOpen — inside an opening tag;
insideTagClose — inside a closing tag;
insideComment — inside a comment (comments in HTML code are written in tags ); no
objects are generated as long as the parser is inside a comment, no matter which tags are contained in
the comment;
insideScript — inside a script; this state should be highlighted because javascript code often contains
substrings which can be interpreted as HTML tags, although they are not DOM elements but are parts of a
script);

In addition, let us describe constant strings which will be used to search for markup:

const string TAG_OPEN_START;

const string TAG_OPEN_STOP;

const string TAG_OPENCLOSE_STOP;

const string TAG_CLOSE_START;

const string TAG_CLOSE_STOP;

const string COMMENT_START;

const string COMMENT_STOP;

const string SCRIPT_STOP;

The parser constructor initializes all these variables:

public:

HtmlParser():

TAG_OPEN_START("<"),

TAG_OPEN_STOP(">"),
TAG_OPENCLOSE_STOP("/>"),

TAG_CLOSE_START("</"),

TAG_CLOSE_STOP(">"),

COMMENT_START("<!--"),

COMMENT_STOP("-->"),

SCRIPT_STOP("/script>"),

state(blank)

{
for(int i = 0; i < ArraySize(empty_tags); i++)

{

empties.set(empty_tags[i]);

}

https://www.mql5.com/en/articles/5706 7/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

An array of empty_tags strings is used here. This array is preliminary connected from an external text file:

string empty_tags[] =

#include <empty_strings.h>

};

See the contents below (valid empty tags, but the list is not complete):

// header
"isindex",
"base",

"meta",

"link",

"nextid",

"range",

// body

"img",

"br",
"hr",
"frame",

"wbr",

"basefont",

"spacer",

"area",

"param",

"keygen",

"col",

"limittext"

Do not forget to delete the DOM tree:

~HtmlParser()

{
if(root != NULL)

{

delete root;
}

}

The main operations are performed by the parse method:

DomElement *parse(const string &html)

{
if(root != NULL)

{

delete root;
}

root = new DomElement("root");

cursor = root;

offset = 0;

while(processText(html));

return root;

}

https://www.mql5.com/en/articles/5706 8/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

The web page is input, an empty root DomElement is created, the cursor is set to it, while the current position
in the text (offset) is set to the very beginning. Then the processText helper method is called in a loop until the
entire text is successfully read. The finite-state machine is then executed in this method. The default state of
the machine is blank.

bool processText(const string &html)

{
int p;

if(state == blank)

{

p = StringFind(html, "<", offset);

if(p == -1) // no more tags

{

return(false);

}

else if(p > 0)

{

if(p > offset)

{

string text = StringSubstr(html, offset, p - offset);

StringTrimLeft(text);

StringTrimRight(text);

StringReplace(text, " ", "");

if(StringLen(text) > 0)
{

cursor.setText(text);

}

offset = p;

if(IsString(html, COMMENT_START)) state = insideComment;

else

if(IsString(html, TAG_CLOSE_START)) state = insideTagClose;

else

if(IsString(html, TAG_OPEN_START)) state = insideTagOpen;

return(true);

}

The algorithm searches the text for the angle bracket '<'. If it is not found, then there are no more tags so
processing should be interrupted (false is returned). If the bracket is found and there is a fragment of text
between the new found tag and the previous position (offset), the fragment is considered to be the contents of
the current tag (the object is available at the 'cursor' pointer) - so this text is added to the object using the call
of cursor.setText().
Then position in the text is moved to the beginning of the new found tag and depending on the signature which
follows '<' (COMMENT_START, TAG_CLOSE_START, TAG_OPEN_START) the parser is switched to the appropriate
new state. The IsString function is a small helper string comparison method, which uses StringSubstr.
In any case true is returned from the processText method, which means that the method will be called again in
the loop, but the parser state will be different now. If the current position is in the opening tag, the following
code is executed.

else
if(state == insideTagOpen)

{

offset++;

int pspace = StringFind(html, " ", offset);

int pright = StringFind(html, ">", offset);

p = MathMin(pspace, pright);

https://www.mql5.com/en/articles/5706 9/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
if(p == -1)

{

p = MathMax(pspace, pright);

}

if(p == -1 || pright == -1) // no tag closing

{

return(false);

}

If the text has neither space nor '>', the HTML syntax is broken, so false is returned. Further steps select the
tag name.

if(pspace > pright)

{

pspace = -1; // outer space, disregard

}

bool selfclose = false;

if(IsString(html, TAG_OPENCLOSE_STOP, pright - StringLen(TAG_OPENCLOSE_STOP) + 1

{

selfclose = true;

if(p == pright) p--;

pright--;

}

string name = StringSubstr(html, offset, p - offset);

StringToLower(name);

StringTrimRight(name);

DomElement *e = new DomElement(cursor, name);

Here we have created a new object with the found name. The current object (cursor) is used as the object
parent.

Now we need to process the attributes, if there are any.

if(pspace != -1)

{

string txt;

if(pright - pspace > 1)

{

txt = StringSubstr(html, pspace + 1, pright - (pspace + 1));

e.parseAttributes(txt);

}

The parseAttributes method "lives" directly in the DomElement class, which we will consider later.
If the tag is not closed, you should check if it is not the one which can be empty. If it is, it should be "closed"
implicitly.

bool softSelfClose = false;

if(!selfclose)

{

if(empties.isKeyExisting(name))

{

selfclose = true;

https://www.mql5.com/en/articles/5706 10/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
softSelfClose = true;

}

Depending on whether the tag is closed or not, we either move deeper along the object hierarchy, setting the
newly created object as the current one (e), or we remain within the context of the previous object. In any
case, position in the text (offset) is moved to the last read character, i.e. beyond '>'.

pright++;

if(!selfclose)

{

cursor = e;

}

else

{

if(!softSelfClose) pright++;

}

offset = pright;

A special case is the script. If we meed the <script> tag, the parser switches to the insideScript state,
otherwise it switches to the blank state.

if((name == "script") && !selfclose)

{

state = insideScript;

}

else

{

state = blank;

}

return(true);

}

The following code is executed in the closing tag state.

else
if(state == insideTagClose)

{

offset += StringLen(TAG_CLOSE_START);

p = StringFind(html, ">", offset);

if(p == -1)

{

return(false);

}

Again search for '>',which must be available according to the HTML syntax. Iа the bracket is not found, the
process should be interrupted. The tag name is highlighted in case of success. This is done to check if the
closing tag matches the opening one. And if the matching is broken, it is necessary to somehow overcome this
layout error and try to continue parsing.

string tag = StringSubstr(html, offset, p - offset);

StringToLower(tag);

DomElement *rewind = cursor;

https://www.mql5.com/en/articles/5706 11/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

while(StringCompare(cursor.getName(), tag) != 0)

{

string previous = cursor.getName();

cursor = cursor.getParent();

if(cursor == NULL)

{

// orphan closing tag

cursor = rewind;

state = blank;

offset = p + 1;

return(true);
}

}

We are processing the closing tag, which means that the context of the current object has ended and so the
parser switches back to the parent DomElement:

cursor = cursor.getParent();

if(cursor == NULL) return(false);

state = blank;

offset = p + 1;

return(true);

}

If successful, the parser state again becomes 'blank'.

When the parser is inside a comment, it is obviously looking for the end of the comment.

else
if(state == insideComment)

{

offset += StringLen(COMMENT_START);

p = StringFind(html, COMMENT_STOP, offset);

if(p == -1)

{

return(false);

}

offset = p + StringLen(COMMENT_STOP);

state = blank;

return(true);

}

When the parser is inside a script, it searches for the end of the script.

else
if(state == insideScript)

{

p = StringFind(html, SCRIPT_STOP, offset);

if(p == -1)

{

return(false);

}

offset = p + StringLen(SCRIPT_STOP);

https://www.mql5.com/en/articles/5706 12/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
state = blank;

cursor = cursor.getParent();

if(cursor == NULL) return(false);

return(true);

}

return(false);

}

This was actually the entire HtmlParser class. Now let us consider DomElement.

DomElement. Beginning

The DomElement class has variables for storing the name (mandatory), contents, attributes, links to parent and
child elements (created as 'protected' because it will be used in the derived class DomIterator).

class DomElement

private:

string name;

string content;

IndexMap attributes;

DomElement *parent;

protected:

DomElement *children[];

A set of constructors does not require explanations:

public:

DomElement(): parent(NULL) {}

DomElement(const string n): parent(NULL)

{
name = n;

}

DomElement(DomElement *p, const string &n, const string text = "")

{
p.addChild(&this);

parent = p;

name = n;

if(text != "") content = text;

}

Of course, the class has "setter" and "getter" field methods (they are omitted in the article), as well as a set of
methods for operations with child elements (only prototypes are shown in the article):

void addChild(DomElement *child)

int getChildrenCount() const;

DomElement *getChild(const int i) const;

void addChildren(DomElement *p)

int getChildIndex(DomElement *e) const;

The parseAttributes method which was used at the parsing stage, delegates further work to the
AttributesParser helper class.

https://www.mql5.com/en/articles/5706 13/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

void parseAttributes(const string &data)

{
AttributesParser p;

p.parseAll(data, attributes);

}

A simple 'data' string is output, based on which the method fills the 'attributes' map with the properties found.

The full AttributesParser code is available in attachments below. The class is not large and operates by finite-
state machine principle, similarly to HtmlParser. But it has only two states:

enum AttrBit

name,

value

};

Since the list of attributes is a string consisting of name="value" pairs, AttributesParser is always either at the
name or at the value. This parser could be implemented using the StringSplit function, but because of possible
formatting deviations (such as the presence or absence of quotes, the use of spaces inside the quotes, etc.),
the machine approach was chosen.
As for the DomElement class, most of the work in it should be performed by methods which select child
elements corresponding to given CSS selectors. Before we proceed to this feature, it is necessary to describe
the selector classes.

SubSelector and SubSelectorArray

The SubSelector class describes one component of a simple selector. For example the simple selector
"td[align=left][width=325]" has three components:

tag name - td
align attribute condition - [align=left]
width attribute condition - [width=325]

The simple selector "td:first-child" has two components:

tag name - td
child index condition using the pseudo class - :first-child

The simple selector "span.main[id^=calendarTip]" again has three components:

tag name — span

class — main
the id attribute must start with the calendarTip string

Here is the class:

class SubSelector

enum PseudoClassModifier

{

none,

firstChild,

lastChild,

nthChild,

nthLastChild

};

public:

https://www.mql5.com/en/articles/5706 14/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
ushort type;

string value;

PseudoClassModifier modifier;

string param;

};

The 'type' variable contains the first character of the selector ('.', '#', '[') or the default 0, which corresponds to
the name selector. The value variable stores the substring which follows the character. i.e. the actual searched
element. If the selector string has a pseudo class, its id is written to the 'modifier' field. In the description of
selectors ":nth-child" and ":nth-last-child", the index of the searched element is specified in brackets. This will
be saved in the 'param' field (it can only be a number in the current implementation, but special formulas are
also allowed and therefore the field is declared as string).

The SubSelectorArray class provides a bunch of components, therefore let us declare the 'selectors' array in it:

class SubSelectorArray

private:

SubSelector *selectors[];

SubSelectorArray is one simple selector as a whole. No class is needed for full CSS selectors since they are
processed sequentially, step by step, i.e. one selector at each hierarchy level.

Let us add the supported pseudo class selectors to the 'mod' map. This enables the immediate getting of the
appropriate modifier from PseudoClassModifier for that string:

IndexMap mod;

static TypeContainer<PseudoClassModifier> first;

static TypeContainer<PseudoClassModifier> last;

static TypeContainer<PseudoClassModifier> nth;

static TypeContainer<PseudoClassModifier> nthLast;

void init()

{
mod.add(":first-child", &first);

mod.add(":last-child", &last);

mod.add(":nth-child", &nth);

mod.add(":nth-last-child", &nthLast);

}

The TypeContainer class is a template wrapper for the values which are added to IndexMap.

Note that static members (in this case objects for the map) must be initialized after the class description:

TypeContainer<PseudoClassModifier> SubSelectorArray::first(PseudoClassModifier::firstChi
TypeContainer<PseudoClassModifier> SubSelectorArray::last(PseudoClassModifier::lastChild
TypeContainer<PseudoClassModifier> SubSelectorArray::nth(PseudoClassModifier::nthChild);
TypeContainer<PseudoClassModifier> SubSelectorArray::nthLast(PseudoClassModifier::nthLas

Let us get back to the SubSelectorArray class.

When it is necessary to add a simple selector component to the array, the add function is called:

void add(const ushort t, string v)

    {
      int n = ArraySize(selectors);
https://www.mql5.com/en/articles/5706 15/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
y ( );
      ArrayResize(selectors, n + 1);

PseudoClassModifier m = PseudoClassModifier::none;

string param;

for(int j = 0; j < mod.getSize(); j++)

{

int p = StringFind(v, mod.getKey(j));

if(p > -1)

{

if(p + StringLen(mod.getKey(j)) < StringLen(v))

{

param = StringSubstr(v, p + StringLen(mod.getKey(j)));

if(StringGetCharacter(param, 0) == '(' && StringGetCharacter(param, StringLe

{

param = StringSubstr(param, 1, StringLen(param) - 2);

}

else

{

param = "";

}

m = mod[j].get<PseudoClassModifier>();

v = StringSubstr(v, 0, p);

break;

}

if(StringLen(param) == 0)

{

selectors[n] = new SubSelector(t, v, m);

}

else
{

selectors[n] = new SubSelector(t, v, m, param);

}

The first character (type) and the next string is passed to it. The string is parsed to the searched object name,
optionally a pseudo class and a parameter. All this is then passed to the SubSelector constructor, while a new
selector component is added to the 'selectors' array.

The add function is used indirectly from the simple selector constructor:

private:

void createFromString(const string &selector)

{
ushort p = 0; // previous/pending type

int ppos = 0;

int i, n = StringLen(selector);

for(i = 0; i < n; i++)

{

ushort t = StringGetCharacter(selector, i);

if(t == '.' || t == '#' || t == '[' || t == ']')

{

string v = StringSubstr(selector, ppos, i - ppos);

if(i == 0) v = "*";

if(p == '[' && StringLen(v) > 0 && StringGetCharacter(v, StringLen(v) - 1) ==

{

v = StringSubstr(v, 0, StringLen(v) - 1);

https://www.mql5.com/en/articles/5706 16/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
v StringSubstr(v, 0, StringLen(v) 1);

}

add(p, v);

p = t;

if(p == ']') p = 0;

ppos = i + 1;

}

if(ppos < n)

{

string v = StringSubstr(selector, ppos, n - ppos);

if(p == '[' && StringLen(v) > 0 && StringGetCharacter(v, StringLen(v) - 1) == ']

{

v = StringSubstr(v, 0, StringLen(v) - 1);

}

add(p, v);

}

public:

SubSelectorArray(const string selector)

{
init();

createFromString(selector);

}

The createFromString function receives a text representation of the CSS selector and views it in a loop to find
special beginning characters '.', '#' or '[', then determines where the component ends and calls the 'add' method
for the selected information. The loop continues as long as the chain of components continues.

The full code of the SubSelectorArray is attached below.

Now it is time to get back to the DomElement class. This is the most difficult part.

DomElement. Continued
The querySelect method is used to search for elements matching the specified selectors (in the textual
representation). In this method, the full CSS selector is divided into simple selectors, which are then converted
to the SubSelectorArray object. The list of matching elements us searched for each simple selector. Other
elements matching the next simple selector are searched relative to the found elements. This is continued
until the last simple selector is met or until the list of found elements becomes empty.

DomIterator *querySelect(const string q)

{
DomIterator *result = new DomIterator();

Here the return value is the unfamiliar DomIterator class, which is the child of DomElement. It provides
auxiliary functionality in addition to DomElement (in particular, it allows "scrolling" child elements), so we will
not analyze DomIterator in details now. There is another complicated part.

The selector string is analyzed character by character. For this purpose several local variables are used. The
current character is stored in the c variable (abbr. of 'character'). The previous character is stored in
the p variable (abbr. of 'previous'). If a character is one of combinator characters (' ', '+', '>', '~'), it is saved in a
variable (a), but is not used until the next simple selector is determined.

Combinators are located between simple selectors, while the operation defined by the combinators can only be
performed after reading the entire selector on the right. Therefore, the last read combinator (a) first passes
through the "waiting" state: the a variable is not used until the next combinator appears or the string end is

https://www.mql5.com/en/articles/5706 17/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

reached, while both cases mean that the selector has been fully formed. Only at this moment the "old"
combinator (b) is applied and is replaced by a new one (a). The code itself is clearer than its description:

int cursor = 0; // where selector string started

int i, n = StringLen(q);
ushort p = 0; // previous character

ushort a = 0; // next/pending operator

ushort b = '/'; // current operator, 'root' notation from the start

string selector = "*"; // current simple selector, 'any' by default

int index = 0; // position in the resulting array of objects

for(i = 0; i < n; i++)

{

ushort c = StringGetCharacter(q, i);

if(isCombinator(c))

{

a = c;

if(!isCombinator(p))

{

selector = StringSubstr(q, cursor, i - cursor);

}

else

{

// suppress blanks around other combinators

a = MathMax(c, p);

}

cursor = i + 1;

}

else

{

if(isCombinator(p)) // action

{

index = result.getChildrenCount();

SubSelectorArray selectors(selector);

find(b, &selectors, result);

b = a;

// now we can delete outdated results in positions up to 'index'

result.removeFirst(index);

}

p = c;

}

if(cursor < i) // action

{

selector = StringSubstr(q, cursor, i - cursor);

index = result.getChildrenCount();

SubSelectorArray selectors(selector);

find(b, &selectors, result);

result.removeFirst(index);

}

return result;
}

The 'cursor' variable always points at the first character, from which the string with the simple selector begins
(i.e. at the character which immediately follows the previous combinator or at the string beginning). When
another combinator is found, copy the substring from 'cursor' to the current character (i) into the 'selector'
variable.
https://www.mql5.com/en/articles/5706 18/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

Sometimes there are several combinators in succession: this may usually happen when other combinator
characters surround spaces, while the space itself is also a combinator. For example, the entries "td>span" and
"td > span" are equivalent, but spaces are inserted in the second case to improve readability. Such situations
are handled by the line:

a = MathMax(c, p);

It compares the current and previous characters if both are combinators. Then, based on the fact that the
space has the smallest code, select an "older" combinator. The combinator array is obviously defined as follows:

ushort combinators[] =

' ', '+', '>', '~'

};

Check of whether the character is included into this array is performed by a simple isCombinator helper
function.

If there are two combinators in a row, other than a space, then the selector is erroneous and behavior is not
defined in specifications. However, our code does not lose performance and suggests consistent behavior.

If the current character is not a combinator while the previous character was a combinator, the execution falls
into a branch marked with an 'action' comment. Now memorize the current size of the array of DomElements
selected to this moment by calling:

index = result.getChildrenCount();

The array is initially empty and index = 0.

Create an array of selector objects corresponding to the current simple selector, i.e. the 'selector' string:

SubSelectorArray selectors(selector);

Then call the 'find' method, which will be considered further.

find(b, &selectors, result);

Pass the combinator character into it (this should be the preceding combinator, i.e. from the b variable), as
well as the simple selector and an array to output results to.
After that move the queue of combinators forward, copy the last found combinator (which has not yet been
processed) from the a variable to b and delete from results everything which was available before the call of
'find' using:

result.removeFirst(index);

The removeFirst method is defined in DomIterator. It performs a simple task: it deletes from an array all first
elements up to the specified number. This is done because in the process of each successive simple selector
processing, we narrow the element selection conditions and everything selected earlier is no longer valid,
while the newly added elements (which meet these narrow conditions) start with 'index'.

Similar processing (marked with the 'action' comment) is also performed after reaching the end of the input
string. In this case, the last pending combinator should be processed in conjunction with the rest of the line
(from the 'cursor' position).

Now let us consider the 'find' method.

https://www.mql5.com/en/articles/5706 19/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

bool find(const ushort op, const SubSelectorArray selectors, DomIterator output)

    {
      bool found = false;
      int i, n;

If one of the combinators setting tag nesting conditions (' ', '>') is input, checks should be recursively called for
all child elements. In this branch we also need to take into account the special combinator '/', which is used at
search beginning in the calling method.

if(op == ' ' || op == '>' || op == '/')

{

n = ArraySize(children);

for(i = 0; i < n; i++)

{

if(children[i].match(selectors))

{

if(op == '/')

{

found = true;

output.addChild(GetPointer(children[i]));

}

The 'match' method will be considered later. It returns true, if the object corresponds to passed selector or
false if otherwise. At the very beginning of search (combinator op = '/'), there are no combinations yet, so all
tags meeting selector rules are added to the result (output.addChild).

else

if(op == ' ')

{

DomElement *p = &this;

while(p != NULL)
{

if(output.getChildIndex(p) != -1)

{

found = true;

output.addChild(GetPointer(children[i]));

break;

}

p = p.parent;

}

For the combinator ' ', a check is performed of whether the current DomElement or any its parent already
exists in 'output'. This means that the new child elements which satisfies search conditions is already nested
into the parent. This is exactly the task of the combinator.

The combinator '>' operates in a similar way, but it needs to track only immediate "relatives" and thus only
check if the current DomElement is available in interim results. If it is, then it has earlier been selected to
'output' by conditions of the selector on the left of the combinator and its i-th child element has just met the
conditions of the selector to the right of the combinator.

else // op == '>'

{

if(output.getChildIndex(&this) != -1)

{

found = true;

output.addChild(GetPointer(children[i]));

}

https://www.mql5.com/en/articles/5706 20/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
}

}

Then similar checks need to be performed deep in the DOM tree, therefore 'find' should be recursively called
for child elements.

children[i].find(op, selectors, output);

}

Combinators '+' and '~' set conditions of whether two elements refer to the same parent.

else
if(op == '+' || op == '~')

{

if(CheckPointer(parent) == POINTER_DYNAMIC)

{

if(output.getChildIndex(&this) != -1)

{

One of the elements must be already selected by a selector on the left. If this condition is met, check the
"siblings" for the selector on the right ("siblings" are the children of the current node parent).

int q = parent.getChildIndex(&this);

if(q != -1)

{

n = (op == '+') ? (q + 2) : parent.getChildrenCount();

if(n > parent.getChildrenCount()) n = parent.getChildrenCount();

for(i = q + 1; i < n; i++)

{

DomElement *m = parent.getChild(i);

if(m.match(selectors))

{

found = true;

output.addChild(m);

}

The difference between handling of '+' and '~' is as follows: with '+' elements must be immediate neighbors
while with '~' there can be any number of other "siblings" between the elements. Therefore the loop is only
performed once for '+', i.e. for the next element in the array of child elements. The 'match' function is called
again inside the loop (see details later).

}

for(i = 0; i < ArraySize(children); i++)

{

found = children[i].find(op, selectors, output) || found;

}

return found;

}

After all checks, move to the next DOM element tree hierarchy level and call 'find' for child nods.

https://www.mql5.com/en/articles/5706 21/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

That is all about the 'find' method. Now let us view the 'match' function. This is the last point in the description
of the selector implementation.

The function checks in the current object the entire chain of components of a simple selector passed through
an input parameter. If at least one component in the loop does not match the element properties, the check
fails.

bool match(const SubSelectorArray *u)

{
bool matched = true;

int i, n = u.size();

for(i = 0; i < n && matched; i++)

{

if(u[i].type == 0) // tag name and pseudo-classes

{

if(u[i].value == "*")

{

// any tag

}

The 0 type selector is the tag name or a pseudo class. Any tag is suitable for a selector containing an asterisk.
Otherwise the selector string should be compared with the tag name:

else

if(StringCompare(name, u[i].value) != 0)

{

matched = false;

}

The currently implemented pseudo-classes set limitations on the number of the current element in the array of
a parent's child elements, so we analyze the indexes:

else

if(u[i].modifier == PseudoClassModifier::firstChild)

{

if(parent != NULL && parent.getChildIndex(&this) != 0)

{

matched = false;

}

else

if(u[i].modifier == PseudoClassModifier::lastChild)

{

if(parent != NULL && parent.getChildIndex(&this) != parent.getChildrenCount(

{

matched = false;

}

else

if(u[i].modifier == PseudoClassModifier::nthChild)

{

int x = (int)StringToInteger(u[i].param);

if(parent != NULL && parent.getChildIndex(&this) != x - 1) // children are c

{

matched = false;

}

else

if(u[i].modifier == PseudoClassModifier::nthLastChild)

{

int x = (int)StringToInteger(u[i] param);

https://www.mql5.com/en/articles/5706 22/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
int x = (int)StringToInteger(u[i].param);

if(parent != NULL && parent.getChildrenCount() - parent.getChildIndex(&this)

{

matched = false;

}

Selector '.' imposes a restriction on the "class" attribute:

else

if(u[i].type == '.')

{

if(attributes.isKeyExisting("class"))

{

Container *c = attributes["class"];

if(c == NULL || StringFind(" " + c.get<string>() + " ", " " + u[i].value + "
{

matched = false;

}

else

{

matched = false;

}

Selector '#' imposes a restriction on the "id" attribute:

else

if(u[i].type == '#')

{

if(attributes.isKeyExisting("id"))

{

Container *c = attributes["id"];

if(c == NULL || StringCompare(c.get<string>(), u[i].value) != 0)

{

matched = false;

}

else

{

matched = false;

}

The selector '[' enables the specification of an arbitrary set of required attributes. Also, in addition to strict
comparison of values, it is possible to check the occurrence of a substring (suffix '*'), beginning ('^') and end
('$').

else

if(u[i].type == '[')

{

AttributesParser p;

IndexMap hm;

p.parseAll(u[i].value, hm);

// attributes are selected one by one: element[attr1=value][attr2=value]

// the map should contain only 1 valid pair at a time

https://www.mql5.com/en/articles/5706 23/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
p y p
if(hm.getSize() > 0)
{

string key = hm.getKey(0);

ushort suffix = StringGetCharacter(key, StringLen(key) - 1);

if(suffix == '*' || suffix == '^' || suffix == '$') // contains, starts with

{

key = StringSubstr(key, 0, StringLen(key) - 1);

}

else

{

suffix = 0;

}

if(hasAttribute(key) && attributes[key] != NULL)

{

string v = hm[0] != NULL ? hm[0].get<string>() : "";

if(StringLen(v) > 0)

{

if(suffix == 0)

{

if(key == "class")

{

matched &= (StringFind(" " + attributes[key].get<string>() + " ", "

}

else

{

matched &= (StringCompare(v, attributes[key].get<string>()) == 0);

}

else

if(suffix == '*')

{

matched &= (StringFind(attributes[key].get<string>(), v) != -1);

}

else

if(suffix == '^')

{

matched &= (StringFind(attributes[key].get<string>(), v) == 0);

}

else

if(suffix == '$')

{

string x = attributes[key].get<string>();

if(StringLen(x) > StringLen(v))

{

matched &= (StringFind(x, v, StringLen(x) - StringLen(v)) == StringL

}

else

{

matched = false;

}

return matched;

}

https://www.mql5.com/en/articles/5706 24/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

Please note that the "class" attribute is also supported and processed here. Moreover, similarly to '.', not strict
matching is checked, but the availability of the class among a set of other classes. Often in HTML multiple
classes are assigned to one element. In this case classes are specified in the 'class' attributed separated with a
space.

Let's sum up the intermediate results. We have implemented in the DomElement class the querySelect
method, which accepts a string with the full CSS selector as a parameter and returns the DomIterator
object, i.e. an array of found matching elements. Inside querySelect, the CSS selector string is divided into
a sequence of simple selectors and combinator characters between them. For each simple selector, the
'find' method with the specified combinator is called. This method updates the list of results, while
recursively calling itself for child elements. Comparison of simple selector components with the properties
of a particular element is performed in the 'match' method.

For example, using the querySelect method we can select rows from a table using one CSS selector and then
we can call querySelect for each row with another CSS selector to isolate specific cells. Since operations with
tables are required very often, let us create the tableSelect method in the DomElement class, which will
implement the above described approach. Its code is provided in a simplified form.

IndexMap *tableSelect(const string rowSelector, const string &columSelectors[], cons

{

The row selector is specified in the rowSelector parameter, while cell selectors are specified in the
columSelectors array.

Once all the elements are selected, we will need to take some information from them, such as text or attribute
value. Let us use the dataSelectors to determine the position of the required information within an element,
while an individual data extraction method can be used for each table column.

If dataSelectors[i] is an empty row, read the textual contents of the tag (between the opening and closing
parts, for example "100%" from tag "<p>100%</p>"). If dataSelectors[i] is a row, consider this the attribute
name and use this value.

Let us view the full implementation in detail:

DomIterator *r = querySelect(rowSelector);

Here we get the resulting list of elements by row selector.

IndexMap *data = new IndexMap('\n');

int counter = 0;

r.rewind();

Here we create an empty map to which table data will be added, and prepare for a loop through row objects.
Here is the loop:

while(r.hasNext())

{

DomElement *e = r.next();

string id = IntegerToString(counter);

IndexMap *row = new IndexMap();

https://www.mql5.com/en/articles/5706 25/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

Thus we get the next row, (e), create a container map for it (row), to which cells will be added, and run loop
through columns:

for(int i = 0; i < ArraySize(columSelectors); i++)

{

DomIterator *d = e.querySelect(columSelectors[i]);

In each row object, select the list of cell objects (d) using the appropriate selector. Select data from each
found cell and save it to the 'row' map:

string value;

if(d.getChildrenCount() > 0)

{

if(dataSelectors[i] == "")

{

value = d[0].getText();

}

else

{

value = d[0].getAttribute(dataSelectors[i]);

}

StringTrimLeft(value);

StringTrimRight(value);

row.setValue(IntegerToString(i), value);

}

Integer keys are used here for code simplicity, while the full source code supports the use of element
identifiers for the keys.

If a matching cell is not found, mark it as empty.

else // field not found

{

row.set(IntegerToString(i));

}

delete d;

}

Add the field 'row' to the 'data' table.

if(row.getSize() > 0)

{

data.set(id, row);

counter++;

}

else

{

delete row;

}

delete r;

return data;

}

https://www.mql5.com/en/articles/5706 26/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

Thus, at the output we get a map of maps, i.e. a table with row numbers along the first dimension and column
numbers along the second. If necessary, the tableSelect function can be adjusted to other data containers.

A non-trading utility Expert Advisor was created to apply all the above classes.

The WebDataExtractor utility Expert Advisor

The Expert Advisor is used to convert data from web pages into a tabular structure with the possibility to save
the result to a CSV file.

The Expert Advisor received the following input parameters: a link to the source data (a local file or a web
page which can be downloaded using WebRequest), row and column selectors and the CSV file name. The main
input parameters are shown below:

input string URL = "";

input string SaveName = "";

input string RowSelector = "";

input string ColumnSettingsFile = "";

input string TestQuery = "";

input string TestSubQuery = "";

In URL, specify the web page address (beginning with http:// or https://) or the local html file name.

In SaveName, the name of the CSV file with results is specified in normal mode. But it can also be used for
other purpose: to save the downloaded page for subsequent debugging of selectors. In this mode the next
parameter should be left empty: RowSelector, in which the CSS row selector is usually specified.
Since there are several column selectors, they are set in a separate CSV set file, which name is specified in the
ColumnSettingsFile parameter. The file format is as follows.

The first line is the header, each subsequent line describes a separate field (a data column in the table row).

The file should have three columns: name, CSS selector, data locator:

name — the i-th column in the output CSV file;

CSS selector — for selecting an element, from which data for the i-th column of the output CSV file will
be use. This selector is applied inside each DOM-element, previously selected from a web page using
RowSelector. To directly select a row element specify here '.';
data "locator" — determines which element part data will be used from; the attribute name can be
specified or it can be left empty to take the text contents of a tag.

TestQuery and TestSubQuery parameters allow testing selectors for a row and one column, while outputting the
result to log but not saving to CSV and not using settings files for all columns.

Here is the main operating function of the Expert Advisor in a brief form.

int process()

string xml;

if(StringFind(URL, "http://") == 0 || StringFind(URL, "https://") == 0)

{

xml = ReadWebPageWR(URL);

}

else

{

Print("Reading html-file ", URL);

int h = FileOpen(URL, FILE_READ|FILE_TXT|FILE_SHARE_WRITE|FILE_SHARE_READ|FILE_ANSI,

if(h == INVALID_HANDLE)

{
Print("Error reading file '", URL, "': ", GetLastError());

return -1;
https://www.mql5.com/en/articles/5706 27/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
return 1;

}
StringInit(xml, (int)FileSize(h));

while(!FileIsEnding(h))

{
xml += FileReadString(h) + "\n";

    }
    // xml = FileReadString(h, (int)FileSize(h)); - has 4095 bytes limit in binary files
    FileClose(h);

}

...

Thus we have read an HTML page from a file or downloaded from the Internet. Now, in order to convert the
document to the hierarchy of DOM objects, let us create the HtmlParser object and start parsing:

HtmlParser p;

DomElement *document = p.parse(xml);

If testing selectors are specified, handle them by querySelect calls:

if(TestQuery != "")

{

Print("Testing query, subquery: '", TestQuery, "', '", TestSubQuery, "'");

DomIterator *r = document.querySelect(TestQuery);

r.printAll();

if(TestSubQuery != "")

{
r.rewind();

while(r.hasNext())

{

DomElement *e = r.next();

DomIterator *d = e.querySelect(TestSubQuery);

d.printAll();

delete d;

}

delete r;

return(0);

}

In a normal operation mode, read the column setting file and call the tableSelect function:

string columnSelectors[];

string dataSelectors[];
string headers[];

if(!loadColumnConfig(columnSelectors, dataSelectors, headers)) return(-1);

IndexMap *data = document.tableSelect(RowSelector, columnSelectors, dataSelectors);

If a CSV file for saving results is specified, let the 'data' map perform this task.

if(StringLen(SaveName) > 0)

{

Print("Saving data as CSV to ", SaveName);

i t h Fil O (S N FILE WRITE|FILE CSV|FILE ANSI '\t' CP UTF8)

https://www.mql5.com/en/articles/5706 28/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles
int h = FileOpen(SaveName, FILE_WRITE|FILE_CSV|FILE_ANSI, '\t', CP_UTF8);

if(h == INVALID_HANDLE)

    {
      Print("Error writing ", data.getSize() ," rows to file '", SaveName, "': ", GetLas
    }
    else

{
FileWriteString(h, StringImplodeExt(headers, ",") + "\n");

FileWriteString(h, data.asCSVString());

FileClose(h);

Print((string)data.getSize() + " rows written");

}
}

else

{

Print("\n" + data.asCSVString());

}

delete data;

return(0);

Let us proceed to practical application of the Expert Advisor.

Practical use

with some standard HTML files, such as testing reports and trading reports generated by
Traders often deal
MetaTrader. We sometimes receive such files from other traders or download from the Internet and want to
visualize the data on a chart for further analysis. For this purpose data from HTML should be converted to a
tabular view (to the CSV format in a simple case).

CSS selector in our utility can automate this process.

Let us have a look inside the HTML files. Below is the appearance and part of HTML code of the MetaTrader 5
trading report (the ReportHistory.html file is attached below).

https://www.mql5.com/en/articles/5706 29/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

Trading report appearance and part of HTML code

And now here is the appearance and part of HTML code of the MetaTrader 5 testing report (the Tester.html file
is attached below).

https://www.mql5.com/en/articles/5706 30/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

Tester report appearance and part of HTML code

According to the appearance in the above figure, the trading report has 2 tables: Orders and Deals. However,
from the internal layout we can see that this is a single table. All visible headers and the dividing line are
formed by the styles of table cells. We need to learn to distinguish between orders and deals and save each of
the sub-tables to a separate CSV file.

The difference between the first part and the second is in the number of columns: 11 columns for orders and
13 columns for deals. Unfortunately, the CSS standard does not allow setting conditions for selecting parent
elements (in our case, the table rows, 'tr' tag) based on the number or content of children (in our case, table
cells, 'td' tag). So, in some cases, required elements cannot be selected using standard means. But we are
developing our own implementation of selectors and thus we can add a special non-standard selector for the
number of child elements. This will be a new pseudo class. Let us set it as ":has-n-children(n)", by analogy with
":nth-child(n)".
The following selector can be used for selecting order rows:

tr:has-n-children(11)

However, this is not the entire solution to the problem, because this selector selects the table header in
addition to the data rows. Let us remove it. Pay attention to coloring of data rows - the bgcolor attribute is set
for them, and the color value alternates for even and odd rows (#FFFFFF and #F7F7F7). A color, i.e. the bgcolor
attribute is also used for the header, but its value is equal to #E5F0FC. Thus, the data rows have light colors
with bgcolor starting with "#F". Let us add this condition to the selector:

https://www.mql5.com/en/articles/5706 31/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

tr:has-n-children(11)[bgcolor^="#F"]

The selector correctly determines all rows with orders.

Parameters of each order can be read from the row cells. To do this, let us write the configuration file
ReportHistoryOrders.cfg.csv:

Name,Selector,Data
Time,td:nth-child(1),
Order,td:nth-child(2),
Symbol,td:nth-child(3),
Type,td:nth-child(4),
Volume,td:nth-child(5),
Price,td:nth-child(6),
S/L,td:nth-child(7),
T/P,td:nth-child(8),
Time,td:nth-child(9),
State,td:nth-child(10),
Comment,td:nth-child(11),

All fields in this file are simply identified by the sequence number. In other cases you may need smarter
selectors with attributes and classes.

To get a table of deals, simply replace the number of child elements to 13 in the row selector:

tr:has-n-children(13)[bgcolor^="#F"]

The configuration file ReportHistoryDeals.cfg.csv is attached below.

Now, by launching WebDataExtractor with the following input parameters (the webdataex-report1.set file is
attached):

URL=ReportHistory.html
SaveName=ReportOrders.csv
RowSelector=tr:has-n-children(11)[bgcolor^="#F"]
ColumnSettingsFile=ReportHistoryOrders.cfg.csv

we will receive the resulting ReportOrders.csv file which corresponds to the source HTML report:

https://www.mql5.com/en/articles/5706 32/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

CSV file resulting from the application of CSS selectors to a trading report

To get the table of deals, use the attached settings from webdataex-report2.set.

The selectors which we created are also suitable for tester reports. The attached webdataex-tester1.set and
webdataex-tester2.set allow you to convert a sample HTML report Tester.html into CSV files.

Important! The layout of many web pages as well as of the the generated HTML files in MetaTrader can be
changed from time to time. Due to this the some of selectors will no longer be applicable, even if the
external presentation is almost the same. In this case you should re-analyze the HTML code and modify the
CSS selectors accordingly.

Now let us view the conversion for the MetaTrader 4 tester report; this allows demonstration of some
interesting techniques in selecting CSS selectors. For the check we will use the attached StrategyTester-ecn-
1.htm.

These files have two tables: one with the testing results and the other one with trading trading operations. To
select the second table we will use the selector "table ~ table". Omit the first row in the operations table,
because it contains a header. This can be done using the selector "tr + tr".

Having combined them, we obtain a selector for selecting working rows:

table ~ table tr + tr

This actually means the following: select a table after the table (i.e. the second one, inside the table select
each line having a previous row, i.e. all except the first one).
Settings for extracting deal parameters from cells are available in the file test-report-mt4.cfg.csv. The date
field is processed by the class selector:

DateTime,td.msdate,

i.e. it searches for td tags with the class="msdate" attribute.

The full settings file for the utility is webdataex-tester-mt4.set.

Additional CSS selector use and setup examples are provided in the WebDataExtractor discussion page.
https://www.mql5.com/en/articles/5706 33/35
8/27/22, 12:30 PM Extracting structured data from HTML pages using CSS selectors - MQL5 Articles

The utility can do much more:

automatically substitute strings (for example, change country names to currency symbols or replace
verbal descriptions of news priority with a number)
output the DOM tree to log and find suitable selectors without a browser;
download and convert web pages by timer or by request from a global variable;

If you need help in the setting up of CSS selectors for a specific web page, you can purchase WebDataExtractor
(for MetaTrader 4, for MetaTrader 5) and receive recommendations as part of product support. However, the
availability of source codes allows you to use the entire functionality and expand it if necessary. This is
absolutely free.

Conclusions
We have considered the technology of CSS selectors, which is one of the main standards in the interpretation of
web documents. The implementation of the most commonly used CSS selectors in MQL allows the flexible setup
and conversion of any HTML page, including standard MetaTrader documents, into structured data without
using third-party software.

We have not considered some other technologies which can also provide versatile tools for processing web
documents. Such tools can be useful, because MetaTrader uses not only HTML, but also the XML format. Trader
can be especially interested in XPath and XSLT. These formats can serve as further steps in developing the idea
of automating trading systems based on web standards. Support for CSS selectors in MQL is just the first step
towards this goal.
Translated from Russian by MetaQuotes Software Corp.
Original article: https://www.mql5.com/ru/articles/5706

Attached files | Download ZIP

html2css.zip (35.94 KB)

Warning: All rights to these materials are reserved by MetaQuotes Ltd. Copying or reprinting of these materials in whole or in part is prohibited.

Library for easy and quick development of

MetaTrader programs (part III). Collection of
market orders and positions, search and sorting
In the first part, we started creating a large cross-platform
library simplifying the development of programs for
MetaTrader 5 and MetaTrader 4 platforms. Further on, we
implemented the collection of history orders and deals. Our
next step is creating a class for a convenient selection and
sorting of orders, deals and positions in collection lists. We
are going to implement the base library object called Engine
and add collection of market orders and positions to the
library.

Library for easy and quick development of

MetaTrader programs (part II). Collection of
historical orders and deals
In the first part, we started creating a large cross-platform
library simplifying the development of programs for
MetaTrader 5 and MetaTrader 4 platforms. We created the
COrder abstract object which is a base object for storing
data on history orders and deals, as well as on market orders
and positions. Now we will develop all the necessary objects
for storing account history data in collections.

MTF indicators as the technical analysis tool

Most of traders agree that the current market state analysis
starts with the evaluation of higher chart timeframes. The
analysis is performed downwards to lower timeframes until
the one, at which deals are performed. This analysis method
seems to be a mandatory part of professional approach for
successful trading. In this article, we will discuss multi-
timeframe indicators and their creation ways, as well as we
will provide MQL5 code examples. In addition to the general
evaluation of advantages and disadvantages, we will propose
a new indicator approach using the MTF mode.

Studying candlestick analysis techniques (part III):

Library for pattern operations
The purpose of this article is to create a custom tool, which
would enable users to receive and use the entire array of
information about patterns discussed earlier. We will create
a library of pattern related functions which you will be able
to use in your own indicators, trading panels, Expert
Advisors, etc.

https://www.mql5.com/en/articles/5706 35/35

HTML & CSS For Beginners: Your Step by Step Guide to Easily HTML & CSS Programming in 7 Days
From Everand
HTML & CSS For Beginners: Your Step by Step Guide to Easily HTML & CSS Programming in 7 Days
i Code Academy
4/5 (7)
Hypertext Markup Language (HTML) Fundamentals: How to Master HTML with Ease
From Everand
Hypertext Markup Language (HTML) Fundamentals: How to Master HTML with Ease
Steven Bright
No ratings yet
Ultra HTML Reference
From Everand
Ultra HTML Reference
Mike Abelar
2/5 (1)
Web Viva Questions & Answers
90% (93)
Web Viva Questions & Answers
8 pages
TDL Reference Manual1
No ratings yet
TDL Reference Manual1
551 pages
A Mobile Application For Dyslexia, Dysgraphia and Dyscalculia in Sinhala
No ratings yet
A Mobile Application For Dyslexia, Dysgraphia and Dyscalculia in Sinhala
49 pages
Learn complete HTML and CSS in 7 days | "HTML & CSS Masterclass: Unleash Your Web Design Skills"
From Everand
Learn complete HTML and CSS in 7 days | "HTML & CSS Masterclass: Unleash Your Web Design Skills"
Prajwal
No ratings yet
HTML in 30 Pages
From Everand
HTML in 30 Pages
U.Q. Magnusson
4.5/5 (14)
Html5: QuickStudy Laminated Reference Guide
From Everand
Html5: QuickStudy Laminated Reference Guide
Robin Nixon
No ratings yet
HTML5 & CSS3 For Beginners: Your Guide To Easily Learn HTML5 & CSS3 Programming in 7 Days
From Everand
HTML5 & CSS3 For Beginners: Your Guide To Easily Learn HTML5 & CSS3 Programming in 7 Days
i Code Academy
4/5 (11)
The Basics of HTML (Hypertext Markup Language) Coding For Beginners: Learn Foundational HTML Programming Concepts
From Everand
The Basics of HTML (Hypertext Markup Language) Coding For Beginners: Learn Foundational HTML Programming Concepts
Roggie Clark
No ratings yet
CSS Selectors and Specificity
From Everand
CSS Selectors and Specificity
Abdelfattah Ragab
No ratings yet
1.1 Web Scraping
No ratings yet
1.1 Web Scraping
34 pages
Html5 for Beginners: A Step-By-Step Guide
From Everand
Html5 for Beginners: A Step-By-Step Guide
Zack Mark Lakeman
No ratings yet
Web Devlopment
From Everand
Web Devlopment
Netra
No ratings yet
James Learning Javascript Programming
From Everand
James Learning Javascript Programming
James Lombard
No ratings yet
PW-03 HTML DOM and Javascript
No ratings yet
PW-03 HTML DOM and Javascript
51 pages
Easy html and css
From Everand
Easy html and css
S VASIST
No ratings yet
04 DataMunging PDF
No ratings yet
04 DataMunging PDF
36 pages
04 DataMunging PDF
No ratings yet
04 DataMunging PDF
36 pages
04 DataMunging PDF
No ratings yet
04 DataMunging PDF
36 pages
Concepts
No ratings yet
Concepts
9 pages
Web Development
No ratings yet
Web Development
28 pages
XHTML
From Everand
XHTML
Jitendra Patel
No ratings yet
Mastering JavaScript: The Complete Guide to JavaScript Mastery
From Everand
Mastering JavaScript: The Complete Guide to JavaScript Mastery
Tim Robards
5/5 (1)
Beginning HTML and CSS
From Everand
Beginning HTML and CSS
Rob Larsen
No ratings yet
Two Marks
No ratings yet
Two Marks
8 pages
23vmt0c205-Web Technologies
No ratings yet
23vmt0c205-Web Technologies
4 pages
Web Viva Questions & Answers
No ratings yet
Web Viva Questions & Answers
8 pages
WP IMP PDF
No ratings yet
WP IMP PDF
28 pages
Unit IV Final
No ratings yet
Unit IV Final
20 pages
7 Oral Questions Oral Question Bank
No ratings yet
7 Oral Questions Oral Question Bank
25 pages
7 Oral Questions Oral Question Bank
No ratings yet
7 Oral Questions Oral Question Bank
25 pages
Session - 5 Introduction To Html/Css Assignment
No ratings yet
Session - 5 Introduction To Html/Css Assignment
17 pages
The Basics of CSS (Cascading Style Sheets) Coding For Beginners: Learn Basic CSS Programming Concepts
From Everand
The Basics of CSS (Cascading Style Sheets) Coding For Beginners: Learn Basic CSS Programming Concepts
Roggie Clark
No ratings yet
Lecture-4 Internet and Web Programming Architecture – Part 1
No ratings yet
Lecture-4 Internet and Web Programming Architecture – Part 1
54 pages
WT Oral Questions
No ratings yet
WT Oral Questions
28 pages
Topics-_For_Study_Final of Web Technologies
No ratings yet
Topics-_For_Study_Final of Web Technologies
10 pages
CSS Grid Layout: 5 Practical Projects
From Everand
CSS Grid Layout: 5 Practical Projects
Craig Buckler
No ratings yet
XML Data Format
From Everand
XML Data Format
Lucas Lee
No ratings yet
7 Oral Questions
No ratings yet
7 Oral Questions
24 pages
An Introduction To Web Technologies: Ankit Jain
No ratings yet
An Introduction To Web Technologies: Ankit Jain
25 pages
VTU Exam Question Paper With Solution of 20MCA23 Web Technologies July-2022-Ashwini
No ratings yet
VTU Exam Question Paper With Solution of 20MCA23 Web Technologies July-2022-Ashwini
41 pages
The Basics of Front-End Web Development (HTML, CSS, and JavaScript): Learn How To Design and Build Websites As A Beginner
From Everand
The Basics of Front-End Web Development (HTML, CSS, and JavaScript): Learn How To Design and Build Websites As A Beginner
Roggie Clark
No ratings yet
Lab 7
No ratings yet
Lab 7
12 pages
WT Oral Questions
No ratings yet
WT Oral Questions
24 pages
Web
No ratings yet
Web
5 pages
JavaScript for Kids: Start Your Coding Adventure
From Everand
JavaScript for Kids: Start Your Coding Adventure
Abdelfattah Ragab
No ratings yet
FDSWeb Scraping
No ratings yet
FDSWeb Scraping
31 pages
Html5 Interview Questions
No ratings yet
Html5 Interview Questions
14 pages
Mastering Node.js Web Development: Go on a comprehensive journey from the fundamentals to advanced web development with Node.js
From Everand
Mastering Node.js Web Development: Go on a comprehensive journey from the fundamentals to advanced web development with Node.js
Adam Freeman
No ratings yet
The CSS Guide: The Complete Guide to Modern CSS
From Everand
The CSS Guide: The Complete Guide to Modern CSS
Tim Robards
5/5 (2)
Presentation Technologies
No ratings yet
Presentation Technologies
9 pages
8.Question Bank
No ratings yet
8.Question Bank
18 pages
ITT 05103 - 2023mimi Internet Programming-1
No ratings yet
ITT 05103 - 2023mimi Internet Programming-1
142 pages
E-Book - Full Stack Python
No ratings yet
E-Book - Full Stack Python
13 pages
Wa Html5 PDF
No ratings yet
Wa Html5 PDF
33 pages
Wa Html5 PDF
No ratings yet
Wa Html5 PDF
33 pages
Wa Html5 PDF
No ratings yet
Wa Html5 PDF
33 pages
Wa Html5 PDF
No ratings yet
Wa Html5 PDF
33 pages
Wa Html5 PDF
No ratings yet
Wa Html5 PDF
33 pages
Final Abhishek
No ratings yet
Final Abhishek
16 pages
Os Notes
No ratings yet
Os Notes
67 pages
MN MPL024 e
No ratings yet
MN MPL024 e
45 pages
Haskell Language Examples
No ratings yet
Haskell Language Examples
3 pages
Case Study Autodesk
No ratings yet
Case Study Autodesk
8 pages
Definition - What Does Mean?: Artificial Intelligence (AI)
No ratings yet
Definition - What Does Mean?: Artificial Intelligence (AI)
3 pages
SAP OM-Matrix STR
No ratings yet
SAP OM-Matrix STR
9 pages
Fuzzy Logic Toolbox™ Release Notes
No ratings yet
Fuzzy Logic Toolbox™ Release Notes
94 pages
CPS-UNIT - 1-Compressed
No ratings yet
CPS-UNIT - 1-Compressed
183 pages
Issue Tracking Log
No ratings yet
Issue Tracking Log
4 pages
License
No ratings yet
License
2 pages
Install LCT 1678
No ratings yet
Install LCT 1678
81 pages
Chapter 11 Intro To DeviceNet
No ratings yet
Chapter 11 Intro To DeviceNet
85 pages
5069 td002 - en P PDF
No ratings yet
5069 td002 - en P PDF
24 pages
microFMS FESTO PDF
No ratings yet
microFMS FESTO PDF
16 pages
Get Integrated Science in Digital Age 2020 1st Edition Tatiana Antipova PDF ebook with Full Chapters Now
100% (1)
Get Integrated Science in Digital Age 2020 1st Edition Tatiana Antipova PDF ebook with Full Chapters Now
55 pages
OpenText Communications Center Enterprise 16.0 - Supervisor User Guide English (CCMWEBRETR160000-UGD-EN-03)
100% (1)
OpenText Communications Center Enterprise 16.0 - Supervisor User Guide English (CCMWEBRETR160000-UGD-EN-03)
44 pages
PS-300B Eng
No ratings yet
PS-300B Eng
164 pages
Free Web Hosting With PHP and MySQL - InfinityFree
No ratings yet
Free Web Hosting With PHP and MySQL - InfinityFree
1 page
OS Lab Manual
No ratings yet
OS Lab Manual
76 pages
Full-Stack Engineer - Viewer Page _ Infosys Springboard
No ratings yet
Full-Stack Engineer - Viewer Page _ Infosys Springboard
6 pages
IP Practical File Part I - Class XII
No ratings yet
IP Practical File Part I - Class XII
6 pages
Object Oriented Software Engineering: Assignment # 02
No ratings yet
Object Oriented Software Engineering: Assignment # 02
5 pages
Realflow Maya Connectivity
No ratings yet
Realflow Maya Connectivity
17 pages
faq_1
No ratings yet
faq_1
48 pages
LTRSPG-2212
No ratings yet
LTRSPG-2212
62 pages
Excel 2010 User Interface
No ratings yet
Excel 2010 User Interface
3 pages
Experiment 6 (EE381A DCMP Lab. 2021-2022/II) : Objectives
No ratings yet
Experiment 6 (EE381A DCMP Lab. 2021-2022/II) : Objectives
2 pages
Ids of Am
No ratings yet
Ids of Am
8 pages