XHTML manipulation library in Java

Posted on 2024-11-29 07:21:22

I am looking for an XML/XHTML Java library or framework that can perform the following two tasks for me.

Before going on, a few definitions:

  • NodeOffset(Node node, int offset) marks a point within a text node in the XML tree.
  • nodeB, nodeI, and nodeP are the Node instances corresponding to the XHTML tree shown below, and nodeSpan is a newly created node (where Node is not necessarily org.w3c.dom.Node and may be any other abstraction).

Flattening XHTML into plain text

The library should be able to produce plain-text output (e.g. by implementing CharSequence or similar) from the given XHTML and provide a one-to-one mapping between characters in the output and the original XHTML node tree (e.g. via the function NodeOffset getNodeOffset(int plainTextOffset)).

Example: Suppose we have the following XHTML:

<p><b>GeForce</b> 9300M GS provides powerful <i>visual computing features</i> to thin and light notebooks.</p>

Then the plaintext representation will obviously be:

GeForce 9300M GS provides powerful visual computing features to thin and light notebooks.

Then, for example:

  • getNodeOffset(0) should return NodeOffset(nodeB, 0)
  • getNodeOffset(40) should return NodeOffset(nodeI, 5)
  • getNodeOffset(80) should return NodeOffset(nodeP, 49).

I might have the numbers wrong, but I hope you get the idea. I repeat the example, now with pseudo-markers inserted:

|GeForce 9300M GS provides powerful visua|l computing features to thin and light n|otebooks.

and

<p><b>|GeForce</b> 9300M GS provides powerful <i>visua|l computing features</i> to thin and light n|otebooks.</p>
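The mapping described above can be sketched with the JDK's built-in DOM parser alone. This is only an illustration under stated assumptions, not an existing library: the class FlatText and its nested NodeOffset type are hypothetical names, and the offset reported for an element is counted over that element's direct text children, which is the convention the NodeOffset(nodeP, ...) examples appear to use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical helper, not from any library: flattens a DOM tree and maps offsets back to nodes. */
final class FlatText {
    private final StringBuilder text = new StringBuilder();
    private final List<Text> nodes = new ArrayList<>();      // text nodes in document order
    private final List<Integer> starts = new ArrayList<>();  // plain-text offset where each node begins

    /** A position inside one text node. */
    static final class NodeOffset {
        final Text node;
        final int offset; // index within this text node
        NodeOffset(Text node, int offset) { this.node = node; this.offset = offset; }

        /** Offset counted over all direct text children of the parent element. */
        int elementOffset() {
            int n = offset;
            for (Node s = node.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s instanceof Text) n += ((Text) s).getLength();
            }
            return n;
        }

        @Override public String toString() {
            return node.getParentNode().getNodeName() + "@" + elementOffset();
        }
    }

    FlatText(Node root) { collect(root); }

    /** Parses well-formed XHTML and flattens its document element. */
    static FlatText parse(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return new FlatText(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    private void collect(Node node) {
        if (node instanceof Text) { // comments are not Text, so they are skipped
            starts.add(text.length());
            nodes.add((Text) node);
            text.append(((Text) node).getData());
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) collect(c);
    }

    String text() { return text.toString(); }

    /** Finds the text node that holds the character at the given plain-text offset. */
    NodeOffset getNodeOffset(int plainTextOffset) {
        if (plainTextOffset < 0 || plainTextOffset >= text.length()) {
            throw new IndexOutOfBoundsException("offset " + plainTextOffset);
        }
        int i = nodes.size() - 1;
        while (starts.get(i) > plainTextOffset) i--; // last node starting at or before the offset
        return new NodeOffset(nodes.get(i), plainTextOffset - starts.get(i));
    }

    public static void main(String[] args) {
        FlatText flat = FlatText.parse("<p><b>GeForce</b> 9300M GS provides powerful "
                + "<i>visual computing features</i> to thin and light notebooks.</p>");
        System.out.println(flat.text());
        for (int off : new int[] {0, 40, 80}) {
            System.out.println(off + " -> " + flat.getNodeOffset(off));
        }
    }
}
```

For the example document this reports b@0, i@5 and p@48 for offsets 0, 40 and 80. The linear scan keeps the sketch short; a binary search over starts would be the obvious refinement for large documents.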

Node manipulation

The library should make it possible to inject nodes into the XHTML that span part of the tree, possibly crossing node boundaries, e.g. via the operation NodeSet insert(Node nodeToInsert, NodeOffset start, NodeOffset end, int mode). The function works in two modes:

  • mode1: Split the node to be inserted if necessary. In this case, the nodes split off from nodeToInsert are returned as the result of the operation.
  • mode2: Close the parent nodes. nodeToInsert is returned as is.

For example, the insert(nodeSpan, NodeOffset(nodeB, 2), NodeOffset(nodeP, 9), mode1) operation should produce:

<p><b>Ge<span>Force</span></b><span> 9300M GS</span> provides powerful <i>visual computing features</i> to thin and light notebooks.</p>
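A rough sketch of how mode1 could be built on the standard org.w3c.dom API, in case no library offers it directly. Everything named here (SplittingWrapper, wrap, wrapRange) is hypothetical, and the range is given as offsets into the flattened text rather than as NodeOffset pairs: NodeOffset(nodeB, 2) and NodeOffset(nodeP, 9) correspond to flattened offsets 2 and 16. The sketch simplifies by wrapping each affected text node separately, so a fully covered element gets the wrapper inside it rather than around it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical helper, not a library API: a "mode 1" style insert that splits the wrapper. */
final class SplittingWrapper {

    /** Wraps characters [start, end) of the flattened text; returns one wrapper clone per text node. */
    static List<Element> wrap(Document doc, Element wrapper, int start, int end) {
        List<Text> texts = new ArrayList<>();
        collectText(doc.getDocumentElement(), texts);
        List<Element> created = new ArrayList<>();
        int pos = 0; // flattened offset of the current text node's first character
        for (Text t : texts) {
            int len = t.getLength();
            int from = Math.max(start, pos) - pos; // range start, relative to this node
            int to = Math.min(end, pos + len) - pos; // range end, relative to this node
            pos += len;
            if (from >= to) continue; // this node does not overlap the range
            Text piece = t;
            if (to < len) piece.splitText(to);           // the tail stays outside the wrapper
            if (from > 0) piece = piece.splitText(from); // the head stays outside the wrapper
            Element clone = (Element) wrapper.cloneNode(false);
            piece.getParentNode().replaceChild(clone, piece);
            clone.appendChild(piece);
            created.add(clone);
        }
        return created;
    }

    private static void collectText(Node node, List<Text> out) {
        if (node instanceof Text) { out.add((Text) node); return; }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) collectText(c, out);
    }

    /** Convenience wrapper: parse, wrap with a new element named tag, serialize. */
    static String wrapRange(String xhtml, String tag, int start, int end) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            wrap(doc, doc.createElement(tag), start, end);
            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc.getDocumentElement()), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<p><b>GeForce</b> 9300M GS provides powerful "
                + "<i>visual computing features</i> to thin and light notebooks.</p>";
        System.out.println(wrapRange(xhtml, "span", 2, 16));
    }
}
```

The text nodes are collected before any mutation, and each one is split using its original length, so the splits performed on earlier nodes do not disturb the offsets used for later ones.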

The insert(nodeSpan, NodeOffset(nodeB, 2), NodeOffset(nodeP, 9), mode2) operation should produce:

<p><b>Ge</b><span><b>Force</b> 9300M GS</span> provides powerful <i>visual computing features</i> to thin and light notebooks.</p>
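Mode2 can be sketched in a similar way with plain org.w3c.dom calls. Again this is an assumption-laden illustration rather than any library's API: ClosingWrapper, wrap, cut and wrapRange are hypothetical names, and the range is given as flattened text offsets (2 and 16 for the example above). The idea is to split every element between the range boundary and the lowest common ancestor, then move the resulting run of siblings into one new wrapper.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical helper, not a library API: a "mode 2" style insert that closes parent elements. */
final class ClosingWrapper {
    /** Wraps characters [start, end) of the flattened text in a single new element. */
    static Element wrap(Document doc, String tag, int start, int end) {
        List<Text> texts = new ArrayList<>();
        collectText(doc.getDocumentElement(), texts);
        Text startText = null, endText = null;
        int startOff = 0, endOff = 0, pos = 0;
        for (Text t : texts) { // locate the text nodes holding the first and last character
            int len = t.getLength();
            if (start >= pos && start < pos + len) { startText = t; startOff = start - pos; }
            if (end > pos && end <= pos + len) { endText = t; endOff = end - pos; }
            pos += len;
        }
        if (start >= end || startText == null || endText == null)
            throw new IllegalArgumentException("range is empty or outside the text");
        Node ancestor = startText.getParentNode(); // climb to the lowest node containing both ends
        while (!contains(ancestor, endText)) ancestor = ancestor.getParentNode();
        Node stop = cut(endText, endOff, ancestor); // cut the end first so the start offset stays valid
        Node first = cut(startText, startOff, ancestor);
        Element wrapper = doc.createElement(tag);
        ancestor.insertBefore(wrapper, first);
        for (Node n = first; n != stop; ) { // move the run of siblings into the wrapper
            Node next = n.getNextSibling();
            wrapper.appendChild(n);
            n = next;
        }
        return wrapper;
    }

    /** Splits the tree at (text, offset) up to ancestor; returns the child of ancestor starting there, or null. */
    private static Node cut(Text text, int offset, Node ancestor) {
        Node right = offset == 0 ? text
                : offset == text.getLength() ? text.getNextSibling() : text.splitText(offset);
        for (Node parent = text.getParentNode(); parent != ancestor; parent = parent.getParentNode()) {
            if (right == null) {
                right = parent.getNextSibling();      // boundary is at the end of parent
            } else if (right == parent.getFirstChild()) {
                right = parent;                       // boundary is at the start of parent
            } else {
                Node clone = parent.cloneNode(false); // copies attributes too, including any id
                while (right != null) {
                    Node next = right.getNextSibling();
                    clone.appendChild(right);
                    right = next;
                }
                parent.getParentNode().insertBefore(clone, parent.getNextSibling());
                right = clone;
            }
        }
        return right;
    }

    private static boolean contains(Node ancestor, Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) if (n == ancestor) return true;
        return false;
    }

    private static void collectText(Node node, List<Text> out) {
        if (node instanceof Text) { out.add((Text) node); return; }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) collectText(c, out);
    }

    /** Convenience wrapper: parse, wrap, serialize. */
    static String wrapRange(String xhtml, String tag, int start, int end) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            wrap(doc, tag, start, end);
            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc.getDocumentElement()), new StreamResult(out));
            return out.toString();
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    public static void main(String[] args) {
        System.out.println(wrapRange("<p><b>GeForce</b> 9300M GS provides powerful "
                + "<i>visual computing features</i> to thin and light notebooks.</p>", "span", 2, 16));
    }
}
```

Because closed elements are reopened with a shallow clone, attributes such as id end up duplicated on both halves; a real implementation would need a policy for that.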

This is analogous to what users do in a rich-text editor:

GeForce 9300M GS

I wonder if there is anything like this in the open-source world, as I really don't want to reinvent the wheel... I quickly checked Open Source HTML Parsers in Java without success.

When you post an answer:

  • Make sure the above-mentioned functions are available in the library API (provide a link to the JavaDoc).
  • The library must be Java-native (no JNI) and open source.

Comments (3)

挥剑断情 2024-12-06 07:21:22

I wrapped code that I already had, with modifications to match your requests (WIP), in an open-source project: ShtutXML. It's fairly well documented, so I doubt you'll have a problem using it.

The first request (finding a node and offset from a global position) is already built in, as is splitting of text nodes in the XML (so you can easily wrap them in new nodes as you wish). Therefore, adding the logic for marking areas with an element is rather trivial. I'll try to do it later, but this is my best effort on this request for now.

On your XML, using my example program, this is my output:

************* BASE DOCUMENT *****************
DOCUMENT ROOT
|-<p >
| |-<b >
| | |-Text: GeForce
| |-Text:  9300M GS provides powerful 
| |-<i >
| | |-Text: visual computing features
| |-Text:  to thin and light notebooks.

*** Text ***
"GeForce 9300M GS provides powerful visual computing features to thin and light notebooks."

*** Node of each text segment ***
[b: null]: GeForce
[p: null]:  9300M GS provides powerful 
[i: null]: visual computing features
[p: null]:  to thin and light notebooks.


*** Offset testing ***
offset 0 is at [b: null] at 0
offset 40 is at [i: null] at 5
offset 80 is at [p: null] at 48

Asking it to split the element at global position 4 will produce:

*********** Split(4) DOCUMENT *****************
DOCUMENT ROOT
|-<p >
| |-<b >
| | |-Text: GeFo
| | |-Text: rce
| |-Text:  9300M GS provides powerful 
| |-<i >
| | |-Text: visual computing features
| |-Text:  to thin and light notebooks.

*** Node of each text segment ***
[b: null]: GeFo
[b: null]: rce
[p: null]:  9300M GS provides powerful 
[i: null]: visual computing features
[p: null]:  to thin and light notebooks.

Of course, this syntactic split means nothing for the actual XML code that matches the document, but it allows wrapping one text part at a time in any other node you wish.

Edit: The first insertion mode is now supported.

Edit 2: The second insertion mode is now supported.

Notes:

  • Any document modification you make will invalidate all offsets. Using them afterwards will corrupt the entire document. So after each modification you must call GetOffset to retrieve the offsets again.
  • I know some of the functions are not documented. Basically, the only functions that should be used outside the package are the ones you requested, from the StrXML class. More documentation will be added later, and you can contact me by email (see my profile page) with questions.

安人多梦 2024-12-06 07:21:22

Maybe you could try jsoup - http://jsoup.org.

It is an open source Java library distributed under the MIT license. Its source code is available at GitHub.

From the home page:

jsoup is a Java library for working with real-world HTML.
It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

With jsoup you can:

  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text

Here is its Javadoc: http://jsoup.org/apidocs/

灼痛 2024-12-06 07:21:22

I tried Jericho a couple of years back, and it was deceptively simple to use its API for parsing. I used it for logging into Yahoo Mail and fetching the contacts from the address book. I'm sure it can do much more than that. The home page mentions one of your requirements, "Flattening XHTML into plain text", as one of its features. Some of the features that might be relevant to your question are:

  • Built-in functionality to extract all text from HTML markup
  • The begin and end positions in the source document of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a tree.

And it's free and open source. (Quoting the site: "You are therefore free to use it in commercial applications subject to the terms detailed in either one of these licence documents.")
