Add tests to explain what the namespace declarations in HTMLParsingHelper are used for #10

Merged on Nov 16, 2023 (2 commits)

@mpdude mpdude commented Nov 13, 2023

This PR adds tests that explain what the following lines do:

'' => 'http://www.w3.org/1999/xhtml', // default ns
'html' => 'http://www.w3.org/1999/xhtml', // for XPath

They define implicit namespace mappings, i.e. namespace prefix-to-URI mappings that various methods in this library use when no explicit mappings are given. Those methods are:

  • BaseParsingHelper::createXPath(), to create an XPath query with namespace bindings
  • BaseParsingHelper::dump(), to know which namespaces are in effect at the place where the dumped XML string shall be used
  • BaseParsingHelper::parseFragment(), to provide context: which namespace declarations and which default namespace are active at the place where the XML fragment string was taken from.

BaseParsingHelper::parseDocument() does not need any explicit namespace declarations. After all, those are part of the XML document given.

The "default namespace" mapping (for the empty '' prefix) is relevant for BaseParsingHelper::parseFragment(). It defines that code like...

$parser = new PolyglotHTML5ParsingHelper();
$fragment = $parser->parseFragment('<p>Hello XML</p>');

... will associate the <p> element with the http://www.w3.org/1999/xhtml namespace. This is the native namespace for HTML5 elements and does not need to be declared. [1] The following code achieves the same, but parses a full HTML5 document:

$parser = new PolyglotHTML5ParsingHelper();
$document = $parser->parseDocument('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>test</p></body></html>');

Now, in both examples, we have a <p> element from that namespace.
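To see this independently of the library, here is a minimal sketch that uses only PHP's built-in DOM extension. It is not the library's implementation: the `<wrapper>` element name is a hypothetical stand-in for whatever container parseFragment() uses internally, and the variable names are made up for illustration.

```php
<?php
// Sketch with plain ext-dom; names below are illustrative assumptions.
$xhtmlNs = 'http://www.w3.org/1999/xhtml';

// Case 1: a full document declares the default namespace on its root element.
$document = new DOMDocument();
$document->loadXML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>test</p></body></html>');
$pFromDocument = $document->getElementsByTagName('p')->item(0);

// Case 2: a fragment carries no declaration of its own, so a wrapping
// element supplies the default namespace (hypothetical <wrapper> name).
$fragment = '<p>Hello XML</p>';
$wrapped = new DOMDocument();
$wrapped->loadXML('<wrapper xmlns="' . $xhtmlNs . '">' . $fragment . '</wrapper>');
$pFromFragment = $wrapped->getElementsByTagName('p')->item(0);

// Both <p> elements end up in the XHTML namespace.
var_dump($pFromDocument->namespaceURI === $xhtmlNs); // bool(true)
var_dump($pFromFragment->namespaceURI === $xhtmlNs); // bool(true)
```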

To match this element with an XPath expression, one needs to be aware that an XPath expression like //p queries for a <p> element that is not in any namespace. [2] But, as explained above, in XHTML and Polyglot HTML5 documents the nodes are in the http://www.w3.org/1999/xhtml namespace.

To make using XPath more convenient, we also register the implicit defaults for XPath queries when no explicit declarations are given. The '' prefix is ignored in this case (it's not a valid prefix, after all), but html is the one you're probably after.

So, the XPath expression to match the <p> node from both preceding examples is //html:p.
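The difference between //p and //html:p can be reproduced with PHP's built-in DOMXPath. This is a sketch under the assumption that createXPath() ends up with an equivalent prefix binding; it does not use the library's own classes.

```php
<?php
$document = new DOMDocument();
$document->loadXML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>test</p></body></html>');

$xpath = new DOMXPath($document);
// Bind the "html" prefix to the XHTML namespace URI. The prefix name is
// arbitrary; only the URI has to match the one used in the document.
$xpath->registerNamespace('html', 'http://www.w3.org/1999/xhtml');

// An unprefixed name test only matches elements that are in no namespace.
$unprefixed = $xpath->query('//p');
// A prefixed name test matches elements in the namespace bound to the prefix.
$prefixed = $xpath->query('//html:p');

printf("//p matched %d, //html:p matched %d\n", $unprefixed->length, $prefixed->length);
// prints: //p matched 0, //html:p matched 1
```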

Footnotes

  1. When parsing a full HTML5 document with a parser that is aware of XML only, but not of HTML5, this namespace must be declared explicitly as the default namespace on the root element; see https://www.w3.org/TR/html-polyglot/#h4_element-level-namespaces. When parsing only an HTML5 fragment, the BaseParsingHelper::parseFragment() method uses a wrapping container to provide this default declaration.

  2. There is no such thing as a "default" namespace in XPath. The default namespace at some point in an XML document is the namespace URI that elements are connected to when they carry no namespace prefix. It can differ between places in the same XML document. In XPath, an unprefixed name test such as p matches only elements that are not in any namespace; a prefixed name test such as html:p matches only elements whose namespace URI equals the URI bound to that prefix.
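As a small illustration of the second footnote, the following sketch shows the default namespace changing within one document. The element names and the urn:example:* URIs are made up for this example.

```php
<?php
$document = new DOMDocument();
$document->loadXML(
    '<outer xmlns="urn:example:a">'
    . '<first/>'
    . '<second xmlns="urn:example:b"><nested/></second>'
    . '</outer>'
);

// <first> inherits the default namespace declared on <outer>.
$first = $document->getElementsByTagName('first')->item(0);
// <nested> inherits the default namespace redeclared on <second>.
$nested = $document->getElementsByTagName('nested')->item(0);

echo $first->namespaceURI, "\n";  // urn:example:a
echo $nested->namespaceURI, "\n"; // urn:example:b
```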

mpdude commented Nov 13, 2023

@relthyg Does this make sense to you?

If yes, could you squash-merge this PR? We don't need to cut a new release for the change, it's just tests added.

@mpdude mpdude requested a review from relthyg November 13, 2023 14:46
@relthyg relthyg merged commit dcb1022 into master Nov 16, 2023
@relthyg relthyg deleted the explain-namespace-decls branch November 16, 2023 12:10