Using XPath to find a node (or something close to it) in malformed HTML

Posted 2024-07-10 18:57:13

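I am using XPath to locate a node (or something close to it) in a template whose malformed HTML is nested about 10 levels deep. (No, I didn't write this HTML... but I have been tasked with digging through it.)

I seem to be able to retrieve the XPath of the element in question using Firefox's XPather plugin; however, it only gives me the location in the live site, not the location in the template I was given. (The template comes from a non-standard server-side scripting language; read: a language built in-house.)

Do you know of any XPath tools that are particularly good at handling malformed HTML?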

雅心素梦 2024-07-17 18:57:13

XPath expressions cannot be evaluated against a document that is not well-formed XML, which is exactly the case described.

It is possible to do this in two chained steps: first convert the HTML to well-formed XML, then apply the XPath expression to the result.

Therefore, the question could be more precisely stated as "How to convert HTML to XML so that XPath expressions can be evaluated against it".
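
The two chained steps can be sketched with Python's standard library alone. This is only an illustration of the idea and not any of the tools named in this answer: `TreeRepairer` and `html_to_tree` are hypothetical names, the repair rules are far cruder than what a dedicated parser does (for example, a second `<p>` is nested inside the first rather than closing it), and ElementTree supports only a subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class TreeRepairer(HTMLParser):
    """Step 1 (hypothetical helper): build a well-formed tree from tag soup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")   # synthetic wrapper element
        self.stack = [self.root]         # currently open elements

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased tag and attribute names.
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add the element, leave it closed.
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # a stray end tag with no match is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):                  # text following a child element
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_tree(html):
    parser = TreeRepairer()
    parser.feed(html)
    parser.close()
    return parser.root


# Step 2: evaluate a path expression against the repaired tree.
broken = '<DIV id=main><p>first<p class="x">second <b>bold</DIV><br>'
tree = html_to_tree(broken)
for hit in tree.findall(".//p[@class='x']"):
    print(ET.tostring(hit, encoding="unicode"))
```

Because the repaired tree may not match the nesting a browser would produce, paths copied from a live page can still need adjusting, which is the situation the question describes.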

Here are three good tools:

TagSoup, an open-source program, is a Java- and SAX-based tool developed by John Cowan. It is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML. Taggle is a commercial C++ port of TagSoup.

SgmlReader is a tool developed by Microsoft's Chris Lovett. It is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided, which outputs the well-formed XML result. The download is a zip file containing the standalone executable and the full source code: SgmlReader.zip

The pure XSLT 2.0 parser of HTML written by David Carlisle. Reading its code would be a great learning exercise for every one of us.

From the description:

"d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)

The one argument form is equivalent to
d:htmlparse(string,'http://www.w3.org/1999/xhtml',true())

Parses the string as HTML and/or XML using some inbuilt heuristics to
control implied opening and closing of elements.

It doesn't have full knowledge of HTML DTD but does have full list of
empty elements and full list of entity definitions. HTML entities, and
decimal and hex character references are all accepted. Note html-entities
are recognised even if html-mode=false().

Element names are lowercased (if html-mode is true()) and placed into the
namespace specified by the namespace parameter (which may be "" to denote
no-namespace) unless the input has explicit namespace declarations, in
which case these will be honoured.

Attribute names are lowercased if html-mode=true()"

Read a more detailed description here.
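
One practical consequence of the naming rules quoted above is that, once elements are placed in the XHTML namespace, an unprefixed XPath step no longer matches them. The sketch below imitates those rules in Python to show the effect. It is not the XSLT parser itself: `qualify` is a hypothetical helper, and the namespace URI used is the standard XHTML one.

```python
import html
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def qualify(name, namespace=XHTML_NS, html_mode=True):
    """Hypothetical helper mimicking the quoted naming rules:
    lowercase the name in HTML mode, then place it in the namespace
    ("" means no namespace)."""
    local = name.lower() if html_mode else name
    return f"{{{namespace}}}{local}" if namespace else local


body = ET.Element(qualify("BODY"))
para = ET.SubElement(body, qualify("P"))

# Named entities plus decimal and hex character references are all decoded.
para.text = html.unescape("caf&eacute; &#169; &#xA9;")

# An unprefixed step looks for <p> in no namespace and finds nothing.
print(body.findall("p"))

# Binding a prefix to the XHTML namespace makes the step match.
print(body.findall("h:p", {"h": XHTML_NS}))
```

Passing `""` as the namespace, as the description allows, avoids the need for a prefix in later XPath expressions.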
