Add tests to explain what the namespace declarations in HTMLParsingHelper are used for (#10)

This PR adds tests that explain what the following lines do:


https://github.com/webfactory/dom/blob/17f6c52d64424830c86f3e28ecdda9c6d8351cf5/src/Webfactory/Dom/HTMLParsingHelper.php#L72-L73

They define _implicit_ namespace mappings, i.e. namespace
prefix-to-URI mappings that various methods in this library use
when no _explicit_ mappings are given. Those methods are:

* `BaseParsingHelper::createXPath()`, to create an XPath query with
namespace bindings
* `BaseParsingHelper::dump()`, to know which namespaces are in effect at
the place where the dumped XML string shall be used
* `BaseParsingHelper::parseFragment()`, to provide context on which
namespace declarations and which default namespace are active at the
place where the XML fragment string was taken from.

`BaseParsingHelper::parseDocument()` does not need any explicit
namespace declarations. After all, those are part of the XML document
given.

The "default namespace" mapping (for the empty `''` prefix) is relevant
for `BaseParsingHelper::parseFragment()`. It defines that code like...

```php
$parser = new PolyglotHTML5ParsingHelper();
$fragment = $parser->parseFragment('<p>Hello XML</p>');
```

... will associate the `<p>` element with the
`http://www.w3.org/1999/xhtml` namespace. This is the _native_ namespace
for HTML5 elements that does not need to be declared.[^1] The following
code achieves the same, but parses a full HTML5 document:

```php
$parser = new PolyglotHTML5ParsingHelper();
$document = $parser->parseDocument('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>test</p></body></html>');
```

Now, in both examples, we have a `<p>` element from that namespace.
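
The namespace association can be checked with PHP's DOM extension alone. The following is a minimal sketch, not the library's implementation: it uses plain `DOMDocument` instead of the parsing helper, and the `<wrapper>` element is a hypothetical stand-in for the wrapping container that `parseFragment()` is described as using.

```php
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Full document: the default namespace is declared on the root element.
$document = new DOMDocument();
$document->loadXML('<html xmlns="' . XHTML_NS . '"><body><p>test</p></body></html>');
$pFromDocument = $document->getElementsByTagName('p')->item(0);

// Fragment: a hypothetical wrapper element (an assumption for illustration)
// supplies the default namespace that the fragment itself does not declare.
$wrapped = new DOMDocument();
$wrapped->loadXML('<wrapper xmlns="' . XHTML_NS . '"><p>Hello XML</p></wrapper>');
$pFromFragment = $wrapped->getElementsByTagName('p')->item(0);

// Same markup without any declaration: the element is in no namespace.
$plain = new DOMDocument();
$plain->loadXML('<html><body><p>test</p></body></html>');
$pPlain = $plain->getElementsByTagName('p')->item(0);

var_dump($pFromDocument->namespaceURI);
var_dump($pFromFragment->namespaceURI);
var_dump($pPlain->namespaceURI);
```

The first two elements report the XHTML namespace URI, while the undeclared one reports no namespace at all, which is the distinction the XPath discussion below relies on.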

In order to match this element with an XPath expression, one needs to be
aware that an XPath expression like `//p` queries for a `<p>` element
_not connected to a namespace_.[^2] But, as explained above, for XHTML
and Polyglot HTML5 documents, nodes are connected to
`http://www.w3.org/1999/xhtml`.

To make using XPath more convenient, `BaseParsingHelper::createXPath()`
registers the implicit defaults when no explicit mapping is given. The
`''` prefix is ignored in this case (it's not a valid XPath prefix, after
all), but `html` is what you're probably after.

So, the XPath expression to match the `<p>` node from both preceding
examples is `//html:p`.
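
As a rough illustration with plain `DOMXPath` (a sketch that binds the `html` prefix by hand, which is what `createXPath()` is described as doing from the implicit defaults; it does not call the library itself):

```php
$xhtml = 'http://www.w3.org/1999/xhtml';

// Document using a default namespace declaration.
$defaultNs = new DOMDocument();
$defaultNs->loadXML('<html xmlns="' . $xhtml . '"><body><p>test</p></body></html>');

// Document using an explicit "foo" prefix for the same namespace URI.
$explicitNs = new DOMDocument();
$explicitNs->loadXML('<html xmlns:foo="' . $xhtml . '"><foo:body><foo:p>test</foo:p></foo:body></html>');

$xpathDefault = new DOMXPath($defaultNs);
$xpathDefault->registerNamespace('html', $xhtml);

$xpathExplicit = new DOMXPath($explicitNs);
$xpathExplicit->registerNamespace('html', $xhtml);

// An unprefixed name only matches elements that are in no namespace.
$unprefixed = $xpathDefault->query('//p');

// A prefixed name matches by namespace URI; the prefix used inside the
// document ("foo" or none) does not matter.
$prefixedDefault = $xpathDefault->query('//html:p');
$prefixedExplicit = $xpathExplicit->query('//html:p');

printf(
    "//p: %d, //html:p (default ns): %d, //html:p (foo prefix): %d\n",
    $unprefixed->length,
    $prefixedDefault->length,
    $prefixedExplicit->length
);
```

The prefix in the XPath expression is local to the `DOMXPath` object; only the URI it is bound to has to agree with the document.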

[^1]: When parsing a full HTML5 document with a parser that is aware of
XML only, but not HTML5, this needs to be explicitly specified as the
default namespace on the root element, see
https://www.w3.org/TR/html-polyglot/#h4_element-level-namespaces. When
parsing an HTML5 fragment only, the `BaseParsingHelper::parseFragment()`
method will use a wrapping container to provide this default
declaration.

[^2]: There is no such thing as a "default" namespace in XPath. The
default namespace at some point in an XML document is the namespace URI
that elements will be connected to when no namespace prefix is given. It
can be different at different places in the XML document. In XPath, an
unprefixed name such as `p` matches only elements that are not in any
namespace; a prefixed name such as `html:p` matches elements whose
namespace URI is the one bound to that prefix.
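
To illustrate how the default namespace can change within one document, here is a small sketch using plain `DOMXPath`. The `svg` prefix binding is chosen here for illustration only; it is not one of the implicit defaults listed in this PR.

```php
$document = new DOMDocument();
$document->loadXML(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    . '<svg xmlns="http://www.w3.org/2000/svg"><path/></svg>'
    . '</body></html>'
);

$xpath = new DOMXPath($document);
$xpath->registerNamespace('html', 'http://www.w3.org/1999/xhtml');
$xpath->registerNamespace('svg', 'http://www.w3.org/2000/svg'); // illustrative binding

// <body> inherits the default namespace declared on <html> ...
$body = $xpath->query('//html:body')->item(0);
// ... while <path> inherits the one redeclared on <svg>.
$path = $xpath->query('//svg:path')->item(0);
// Asking for <path> in the XHTML namespace therefore finds nothing.
$wrongNamespace = $xpath->query('//html:path');

echo $body->namespaceURI, "\n";
echo $path->namespaceURI, "\n";
echo $wrongNamespace->length, "\n";
```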

---------

Co-authored-by: mpdude <[email protected]>
mpdude and mpdude authored Nov 16, 2023
1 parent 17f6c52 commit dcb1022
Showing 3 changed files with 100 additions and 6 deletions.
1 change: 1 addition & 0 deletions composer.json
@@ -12,6 +12,7 @@
     ],
     "require": {
         "php": "^7.2|8.0.*|8.1.*|8.2.*",
+        "ext-dom": "*",
         "ext-xml": "*"
     },
     "require-dev": {
10 changes: 5 additions & 5 deletions src/Webfactory/Dom/HTMLParsingHelper.php
@@ -62,15 +62,15 @@ protected function defineImplicitNamespaces(): array
          */
         if ((phpversion('xml') >= '8.1.21') && (phpversion('xml') < '8.1.25')) {
             return [
-                'html' => 'http://www.w3.org/1999/xhtml', // for XPath
-                '' => 'http://www.w3.org/1999/xhtml', // default ns
-                'hx' => 'http://purl.org/NET/hinclude', // for HInclude http://mnot.github.io/hinclude/; a way to embed e.g. controllers in Symfony via Ajax
+                'html' => 'http://www.w3.org/1999/xhtml',
+                '' => 'http://www.w3.org/1999/xhtml',
+                'hx' => 'http://purl.org/NET/hinclude',
             ];
         }

         return [
-            '' => 'http://www.w3.org/1999/xhtml', // default ns
-            'html' => 'http://www.w3.org/1999/xhtml', // for XPath
+            '' => 'http://www.w3.org/1999/xhtml', // ignored by BaseParsingHelper::createXPath(), but defining the default namespace that will be assumed to be active when BaseParsingHelper::parseFragment() is called and no explicit namespace declarations are given
+            'html' => 'http://www.w3.org/1999/xhtml', // so XPath expressions can use the "html" prefix to match the current HTML variant (unless an explicit mapping is given to BaseParsingHelper::createXPath())
             'hx' => 'http://purl.org/NET/hinclude', // for HInclude http://mnot.github.io/hinclude/; a way to embed e.g. controllers in Symfony via Ajax
         ];
     }
95 changes: 94 additions & 1 deletion tests/Webfactory/Dom/Test/PolyglotHTML5ParsingHelperTest.php
@@ -8,11 +8,13 @@

 namespace Webfactory\Dom\Test;

+use Webfactory\Dom\PolyglotHTML5ParsingHelper;
+
 class PolyglotHTML5ParsingHelperTest extends HTMLParsingHelperTest
 {
     protected function createParsingHelper()
     {
-        return new \Webfactory\Dom\PolyglotHTML5ParsingHelper();
+        return new PolyglotHTML5ParsingHelper();
     }

     /**
@@ -88,4 +90,95 @@ public function svgNamespaceIsNotReconciled()
'<div><svg xmlns="http://www.w3.org/2000/svg" class="x" width="300" height="150" viewBox="0 0 300 150"><path fill="#FF7949" d="M300 5.49c0-2.944-1.057-4.84-2.72-5.49h-2.92c-.79.247-1.632.67-2.505 1.293L158.145 96.56c-4.48 3.19-11.81 3.19-16.29 0L8.146 1.292C7.27.67 6.43.247 5.64 0H2.72C1.056.65 0 2.546 0 5.49V150h300V5.49z"></path></svg></div>'
         );
     }
+
+    /**
+     * @test
+     * @dataProvider provideXpathForDocuments
+     */
+    public function xpathParseDocument($xml, $xpathExpression)
+    {
+        $document = $this->parser->parseDocument($xml);
+        $xpath = $this->parser->createXPath($document);
+
+        $domNodeList = $xpath->query($xpathExpression);
+
+        self::assertCount(1, $domNodeList);
+        self::assertSame('test', $domNodeList[0]->textContent);
+    }
+
+    public function provideXpathForDocuments()
+    {
+        yield 'HTML document that does not use a default namespace' => [
+            /*
+                In this document, nodes are not in a namespace at all. Thus, we have to use the
+                XPath expression "//p" which searches for an item _not associated with a namespace_.
+                Note that this _should not be done in practice_, since HTML5 has a built-in, undeclared "native"
+                default namespace for the <html> element.
+                The libxml XML parser, however, does not know about HTML5 - only about XML. This is why the
+                Polyglot spec (https://www.w3.org/TR/html-polyglot/#h4_element-level-namespaces) states that
+                    <html xmlns="http://www.w3.org/1999/xhtml">
+                ... should be used to achieve the same semantics for HTML5-aware and XML-only parsers.
+            */
+            '<html><body><p>test</p></body></html>',
+            '//p',
+        ];
+
+        yield 'HTML document that uses a default namespace' => [
+            /*
+                In this document, a default namespace is used. All nodes are associated with this namespace.
+                The XPath expression has to match namespaced nodes, and "//p" would be a node _without_ a
+                namespace. -> We have to register a namespace on the Xpath expression, and use its prefix.
+                If we don't give an explicit namespace mapping when creating the xpath expression, the
+                \Webfactory\Dom\BaseParsingHelper::createXPath() will register the ParsingHelper's implicit
+                namespaces for us as convenience. That includes the "html" namespace prefix for the URI
+                according to the current HTML variant (XHTML vs HTML5) in use.
+            */
+            '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>test</p></body></html>',
+            '//html:p',
+        ];
+
+        yield 'HTML document with explicit namespace' => [
+            /*
+                Basically, as before, this time using an explicit namespace prefix.
+            */
+            '<html xmlns:foo="http://www.w3.org/1999/xhtml"><foo:body><foo:p>test</foo:p></foo:body></html>',
+            '//html:p',
+        ];
+    }
+
+    /**
+     * @test
+     * @dataProvider provideXpathForFragments
+     */
+    public function xpathParseFragment($xmlFragment, $xpathExpression)
+    {
+        $fragment = $this->parser->parseFragment($xmlFragment);
+        $xpath = $this->parser->createXPath($fragment);
+
+        $domNodeList = $xpath->query($xpathExpression);
+
+        self::assertCount(1, $domNodeList);
+        self::assertSame('test', $domNodeList[0]->textContent);
+    }
+
+    public function provideXpathForFragments()
+    {
+        yield 'default namespace assumed for fragments' => [
+            /*
+                When BaseParsingHelper::parseFragment() is used without passing a mapping of
+                namespaces, a 'default' assumption is made depending on the ParsingHelper instance.
+                For HTML5, this assumes fragment elements without namespace declarations live in the
+                http://www.w3.org/1999/xhtml namespace URI; this corresponds to the 'html' convenience
+                prefix set up in xpath expressions.
+            */
+            '<p>test</p>',
+            '//html:p',
+        ];
+    }
 }
