Add tests to explain what the namespace declarations in HTMLParsingHelper are used for (#10)

This PR adds tests to explain what the following lines are doing:

https://github.com/webfactory/dom/blob/17f6c52d64424830c86f3e28ecdda9c6d8351cf5/src/Webfactory/Dom/HTMLParsingHelper.php#L72-L73

They define _implicit_ namespace mappings, i.e. namespace prefix-to-URI mappings that various methods in this library use when no _explicit_ mappings are given.

Those methods are:

* `BaseParsingHelper::createXPath()`, to create an XPath query with namespace bindings
* `BaseParsingHelper::dump()`, to know which namespaces are in effect at the place where the dumped XML string shall be used
* `BaseParsingHelper::parseFragment()`, to provide context about which namespace declarations and which default namespace are active at the place where the XML fragment string was taken from.

`BaseParsingHelper::parseDocument()` does not need any explicit namespace declarations. After all, those are part of the XML document given.

The "default namespace" mapping (for the empty `''` prefix) is relevant for `BaseParsingHelper::parseFragment()`. It defines that code like...

```php
$parser = new PolyglotHTML5ParsingHelper();
$fragment = $parser->parseFragment('<p>Hello XML</p>');
```

... will associate the `<p>` element with the `http://www.w3.org/1999/xhtml` namespace. This is the _native_ namespace for HTML5 elements, which does not need to be declared.[^1]

The following code achieves the same, but parses a full HTML5 document:

```php
$parser = new PolyglotHTML5ParsingHelper();
$document = $parser->parseDocument('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>test</p></body></html>');
```

Now, in both examples, we have a `<p>` element from that namespace. In order to match this element with an XPath expression, one needs to be aware that an XPath expression like `//p` queries for a `<p>` element _not connected to a namespace_.[^2] But, as explained above, for XHTML and Polyglot HTML5 documents, nodes are connected to `http://www.w3.org/1999/xhtml`.

To make using XPath more convenient, `BaseParsingHelper::createXPath()` registers the implicit defaults when no explicit mappings are given. The `''` prefix is ignored in this case (it's not a valid prefix, after all), but `html` is what you're probably after. So, the XPath expression to match the `<p>` node from both preceding examples is `//html:p`.

[^1]: When parsing a full HTML5 document with a parser that is aware of XML only, but not HTML5, this needs to be explicitly specified as the default namespace on the root element, see https://www.w3.org/TR/html-polyglot/#h4_element-level-namespaces. When parsing an HTML5 fragment only, the `BaseParsingHelper::parseFragment()` method will use a wrapping container to provide this default declaration.

[^2]: There is no such thing as a "default" namespace in XPath. The default namespace at some point in an XML document is the namespace URI that elements will be connected to when no other namespace prefix is given. It can be different at different places in the XML document. In XPath, a name test with a prefix matches only elements in the namespace bound to that prefix, and a name test without a prefix matches only elements that are in no namespace.

---------

Co-authored-by: mpdude <[email protected]>
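
The XPath matching rules described in the commit message above can be reproduced without this library. The following is a minimal sketch that assumes only PHP's `ext-dom` (`DOMDocument` and `DOMXPath`); the helper `countMatches()` is a hypothetical name introduced for illustration and is not part of webfactory/dom.

```php
<?php

/**
 * Hypothetical helper for illustration only (not part of webfactory/dom):
 * parses $xml, registers the given prefix => URI mappings on a DOMXPath
 * instance and returns the number of nodes matched by $expression.
 */
function countMatches(string $xml, string $expression, array $namespaces = []): int
{
    $document = new DOMDocument();
    $document->loadXML($xml);

    $xpath = new DOMXPath($document);
    foreach ($namespaces as $prefix => $uri) {
        $xpath->registerNamespace($prefix, $uri);
    }

    return $xpath->query($expression)->length;
}

$xhtml = 'http://www.w3.org/1999/xhtml';
$withDefaultNs = '<html xmlns="' . $xhtml . '"><body><p>test</p></body></html>';
$withFooPrefix = '<html xmlns:foo="' . $xhtml . '"><foo:body><foo:p>test</foo:p></foo:body></html>';

// "//p" only matches <p> elements that are in no namespace, so the
// namespaced <p> is not found.
$unprefixedCount = countMatches($withDefaultNs, '//p');

// With the "html" prefix bound to the XHTML URI, the same <p> is found.
$prefixedCount = countMatches($withDefaultNs, '//html:p', ['html' => $xhtml]);

// The prefix used inside the document is irrelevant; only the URI counts.
$otherPrefixCount = countMatches($withFooPrefix, '//html:p', ['html' => $xhtml]);

var_dump($unprefixedCount, $prefixedCount, $otherPrefixCount);
```

Binding the `html` prefix by hand here corresponds to what the implicit mapping is meant to spare callers from doing when they use `createXPath()` without explicit mappings.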
webfactory · Nov 16, 2023 · dcb1022
1 parent 17f6c52
commit dcb1022

Showing 3 changed files with 100 additions and 6 deletions.
diff --git a/composer.json b/composer.json
@@ -12,6 +12,7 @@
     ],
     "require": {
         "php": "^7.2|8.0.*|8.1.*|8.2.*",
+        "ext-dom": "*",
         "ext-xml": "*"
     },
     "require-dev": {

diff --git a/src/Webfactory/Dom/HTMLParsingHelper.php b/src/Webfactory/Dom/HTMLParsingHelper.php
@@ -62,15 +62,15 @@ protected function defineImplicitNamespaces(): array
          */
         if ((phpversion('xml') >= '8.1.21') && (phpversion('xml') < '8.1.25')) {
             return [
-                'html' => 'http://www.w3.org/1999/xhtml', // for XPath
-                '' => 'http://www.w3.org/1999/xhtml', // default ns
-                'hx' => 'http://purl.org/NET/hinclude', // for HInclude http://mnot.github.io/hinclude/; a way to embed e.g. Symfony controllers via Ajax
+                'html' => 'http://www.w3.org/1999/xhtml',
+                '' => 'http://www.w3.org/1999/xhtml',
+                'hx' => 'http://purl.org/NET/hinclude',
             ];
         }
 
         return [
-            '' => 'http://www.w3.org/1999/xhtml', // default ns
-            'html' => 'http://www.w3.org/1999/xhtml', // for XPath
+            '' => 'http://www.w3.org/1999/xhtml', // ignored by BaseParsingHelper::createXPath(), but defining the default namespace that will be assumed to be active when BaseParsingHelper::parseFragment() is called and no explicit namespace declarations are given
+            'html' => 'http://www.w3.org/1999/xhtml', // so XPath expressions can use the "html" prefix to match the current HTML variant (unless an explicit mapping is given to BaseParsingHelper::createXPath())
             'hx' => 'http://purl.org/NET/hinclude', // for HInclude http://mnot.github.io/hinclude/; a way to embed e.g. Symfony controllers via Ajax
         ];
     }

diff --git a/tests/Webfactory/Dom/Test/PolyglotHTML5ParsingHelperTest.php b/tests/Webfactory/Dom/Test/PolyglotHTML5ParsingHelperTest.php
@@ -8,11 +8,13 @@
 
 namespace Webfactory\Dom\Test;
 
+use Webfactory\Dom\PolyglotHTML5ParsingHelper;
+
 class PolyglotHTML5ParsingHelperTest extends HTMLParsingHelperTest
 {
     protected function createParsingHelper()
     {
-        return new \Webfactory\Dom\PolyglotHTML5ParsingHelper();
+        return new PolyglotHTML5ParsingHelper();
     }
 
     /**
@@ -88,4 +90,95 @@ public function svgNamespaceIsNotReconciled()
             '<div><svg xmlns="http://www.w3.org/2000/svg" class="x" width="300" height="150" viewBox="0 0 300 150"><path fill="#FF7949" d="M300 5.49c0-2.944-1.057-4.84-2.72-5.49h-2.92c-.79.247-1.632.67-2.505 1.293L158.145 96.56c-4.48 3.19-11.81 3.19-16.29 0L8.146 1.292C7.27.67 6.43.247 5.64 0H2.72C1.056.65 0 2.546 0 5.49V150h300V5.49z"></path></svg></div>'
         );
     }
+
+    /**
+     * @test
+     * @dataProvider provideXpathForDocuments
+     */
+    public function xpathParseDocument($xml, $xpathExpression)
+    {
+        $document = $this->parser->parseDocument($xml);
+        $xpath = $this->parser->createXPath($document);
+
+        $domNodeList = $xpath->query($xpathExpression);
+
+        self::assertCount(1, $domNodeList);
+        self::assertSame('test', $domNodeList[0]->textContent);
+    }
+
+    public function provideXpathForDocuments()
+    {
+        yield 'HTML document that does not use a default namespace' => [
+            /*
+                In this document, nodes are not in a namespace at all. Thus, we have to use the
+                XPath expression "//p" which searches for an item _not associated with a namespace_.
+
+                Note that this _should not be done in practice_, since HTML5 has a built-in, undeclared "native"
+                default namespace for the <html> element.
+
+                The libxml XML parser, however, does not know about HTML5 - only about XML. This is why the
+                Polyglot spec (https://www.w3.org/TR/html-polyglot/#h4_element-level-namespaces) states that
+
+                <html xmlns="http://www.w3.org/1999/xhtml">
+
+                ... should be used to achieve the same semantics for HTML5-aware and XML-only parsers.
+            */
+            '<html><body><p>test</p></body></html>',
+            '//p',
+        ];
+
+        yield 'HTML document that uses a default namespace' => [
+            /*
+                In this document, a default namespace is used. All nodes are associated with this namespace.
+                The XPath expression has to match namespaced nodes, and "//p" would be a node _without_ a
+                namespace. -> We have to register a namespace on the XPath expression, and use its prefix.
+
+                If we don't give an explicit namespace mapping when creating the XPath expression, the
+                \Webfactory\Dom\BaseParsingHelper::createXPath() will register the ParsingHelper's implicit
+                namespaces for us as convenience. That includes the "html" namespace prefix for the URI
+                according to the current HTML variant (XHTML vs HTML5) in use.
+            */
+            '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>test</p></body></html>',
+            '//html:p',
+        ];
+
+        yield 'HTML document with explicit namespace' => [
+            /*
+                Basically, as before, this time using an explicit namespace prefix.
+            */
+            '<html xmlns:foo="http://www.w3.org/1999/xhtml"><foo:body><foo:p>test</foo:p></foo:body></html>',
+            '//html:p',
+        ];
+    }
+
+    /**
+     * @test
+     * @dataProvider provideXpathForFragments
+     */
+    public function xpathParseFragment($xmlFragment, $xpathExpression)
+    {
+        $fragment = $this->parser->parseFragment($xmlFragment);
+        $xpath = $this->parser->createXPath($fragment);
+
+        $domNodeList = $xpath->query($xpathExpression);
+
+        self::assertCount(1, $domNodeList);
+        self::assertSame('test', $domNodeList[0]->textContent);
+    }
+
+    public function provideXpathForFragments()
+    {
+        yield 'default namespace assumed for fragments' => [
+            /*
+                When BaseParsingHelper::parseFragment() is used without passing a mapping of
+                namespaces, a 'default' assumption is made depending on the ParsingHelper instance.
+
+                For HTML5, this assumes fragment elements without namespace declarations live in the
+                http://www.w3.org/1999/xhtml namespace URI; this corresponds to the 'html' convenience
+                prefix set up in XPath expressions.
+            */
+            '<p>test</p>',
+            '//html:p',
+        ];
+    }
 }
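
The commit message notes that `parseFragment()` uses a wrapping container to supply the default namespace declaration for fragments. The sketch below illustrates that general idea with plain `ext-dom`. It is not the library's actual implementation: the function name `parseFragmentWithNamespaces()` and the `<container>` element name are assumptions made for this example.

```php
<?php

/**
 * Hypothetical illustration (not the library's actual implementation):
 * wraps $fragment in a container element that carries the given namespace
 * declarations, parses the result and returns the container's child nodes.
 *
 * @param array<string, string> $namespaces prefix => URI; '' is the default namespace
 * @return DOMNode[]
 */
function parseFragmentWithNamespaces(string $fragment, array $namespaces): array
{
    $declarations = '';
    foreach ($namespaces as $prefix => $uri) {
        // The empty prefix becomes xmlns="...", any other prefix xmlns:prefix="...".
        $attribute = $prefix === '' ? 'xmlns' : 'xmlns:' . $prefix;
        $declarations .= sprintf(' %s="%s"', $attribute, htmlspecialchars($uri, ENT_QUOTES | ENT_XML1));
    }

    $document = new DOMDocument();
    $document->loadXML('<container' . $declarations . '>' . $fragment . '</container>');

    return iterator_to_array($document->documentElement->childNodes, false);
}

$nodes = parseFragmentWithNamespaces('<p>test</p>', [
    '' => 'http://www.w3.org/1999/xhtml',
    'hx' => 'http://purl.org/NET/hinclude',
]);

// The <p> element inherits the default namespace declared on the container.
$namespaceUri = $nodes[0]->namespaceURI;
$localName = $nodes[0]->localName;

var_dump($localName, $namespaceUri);
```

Because the fragment's `<p>` ends up in the XHTML namespace, it is matched by `//html:p` rather than `//p`, which is what the `default namespace assumed for fragments` test case checks.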