XHTML manipulation library in Java

Posted 2024-11-29 07:21:22

I am looking for an XML/XHTML Java library or framework that can perform the following two tasks for me.

Before going on, a few definitions:

NodeOffset(Node node, int offset) marks a point within the text of a node in the XML tree.
nodeB, nodeI and nodeP are the Node instances corresponding to the elements of the XHTML tree shown below, and nodeSpan is a newly created node (Node is not necessarily org.w3c.dom.Node; it may be any other abstraction).

Flattening XHTML into plain text

The library should be able to produce plain-text output (e.g. by implementing CharSequence or similar) from a given XHTML document and provide a one-to-one mapping between the characters in the output and the original XHTML node tree (e.g. via a function NodeOffset getNodeOffset(int plainTextOffset)).

Example: Suppose we have the following XHTML:

<p><b>GeForce</b> 9300M GS provides powerful <i>visual computing features</i> to thin and light notebooks.</p>

Then the plaintext representation will obviously be:

GeForce 9300M GS provides powerful visual computing features to thin and light notebooks.

Then, for example:

getNodeOffset(0) should return NodeOffset(nodeB, 0)
getNodeOffset(40) should return NodeOffset(nodeI, 5)
getNodeOffset(80) should return NodeOffset(nodeP, 49).

I may have the exact numbers wrong, but I hope you get the idea. Here is the example again, now with pseudo-markers inserted at those three offsets:

|GeForce 9300M GS provides powerful visua|l computing features to thin and light n|otebooks.

and

<p><b>|GeForce</b> 9300M GS provides powerful <i>visua|l computing features</i> to thin and light n|otebooks.</p>
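
To make the requested mapping concrete, here is a minimal sketch built only on the JDK's org.w3c.dom API. The class FlatText and its nested NodeOffset type are hypothetical names invented for illustration, not part of any existing library. It assumes the input is well-formed XML without a DOCTYPE, and it assumes the convention that an offset counts only the text directly owned by the element, skipping the text of child elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical sketch: flattens a DOM tree and maps plain-text offsets back to elements. */
class FlatText {
    /** A point in the tree: the owning element plus an offset into its direct text. */
    static final class NodeOffset {
        final Element node;
        final int offset;
        NodeOffset(Element node, int offset) { this.node = node; this.offset = offset; }
    }

    /** One text node, remembered with its position in both coordinate systems. */
    private static final class Segment {
        final Element parent;
        final int globalStart, parentStart, length;
        Segment(Element parent, int globalStart, int parentStart, int length) {
            this.parent = parent; this.globalStart = globalStart;
            this.parentStart = parentStart; this.length = length;
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final List<Segment> segments = new ArrayList<>();

    FlatText(Element root) { collect(root); }

    static FlatText parse(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return new FlatText(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Depth-first walk that appends every text node to the flat text. */
    private void collect(Element element) {
        int ownTextSoFar = 0; // characters of this element's direct text seen so far
        for (Node c = element.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE || c.getNodeType() == Node.CDATA_SECTION_NODE) {
                String value = c.getNodeValue();
                segments.add(new Segment(element, text.length(), ownTextSoFar, value.length()));
                text.append(value);
                ownTextSoFar += value.length();
            } else if (c.getNodeType() == Node.ELEMENT_NODE) {
                collect((Element) c);
            }
        }
    }

    String plainText() { return text.toString(); }

    /** Linear scan over the segments; a binary search would do for large documents. */
    NodeOffset getNodeOffset(int plainTextOffset) {
        for (Segment s : segments) {
            if (plainTextOffset >= s.globalStart && plainTextOffset < s.globalStart + s.length) {
                return new NodeOffset(s.parent, s.parentStart + plainTextOffset - s.globalStart);
            }
        }
        throw new IndexOutOfBoundsException("offset " + plainTextOffset);
    }

    public static void main(String[] args) {
        FlatText flat = parse("<p><b>GeForce</b> 9300M GS provides powerful "
                + "<i>visual computing features</i> to thin and light notebooks.</p>");
        for (int offset : new int[] {0, 40, 80}) {
            NodeOffset hit = flat.getNodeOffset(offset);
            System.out.println(offset + " -> " + hit.node.getTagName() + " @ " + hit.offset);
        }
    }
}
```

Under the stated convention, offsets 0 and 40 map to (b, 0) and (i, 5) as in the example, while offset 80 maps to (p, 48): the 28 characters of the first text run of the paragraph plus 20 characters into the second.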

Node manipulation

The library should make it possible to inject nodes into the XHTML over a range that may cross node boundaries in the tree, e.g. via an operation NodeSet insert(Node nodeToInsert, NodeOffset start, NodeOffset end, int mode). The function works in two modes:

mode1: Split the node to be inserted if necessary. In this case the nodes split off from nodeToInsert are returned as the operation's result.
mode2: Close the parent nodes instead. nodeToInsert is returned as is.

For example, insert(nodeSpan, NodeOffset(nodeB, 2), NodeOffset(nodeP, 9), mode1) should produce:

<p><b>Ge<span>Force</span></b><span> 9300M GS</span> provides powerful <i>visual computing features</i> to thin and light notebooks.</p>
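
The mode1 behaviour can be sketched with the standard DOM method Text.splitText. This is only an illustration under assumptions: the class SplitWrap and its wrap method are hypothetical names, the range is given as plain-text offsets [start, end) rather than NodeOffset pairs (NodeOffset(nodeB, 2) to NodeOffset(nodeP, 9) corresponds to 2 to 16 here), and a fresh wrapper element is created per text run instead of splitting an existing node.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical sketch of "mode1": one wrapper element per text run inside the range. */
class SplitWrap {
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Collects all text nodes below the given node in document order. */
    static void collectText(Node node, List<Text> out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Text) out.add((Text) c); else collectText(c, out);
        }
    }

    /** Wraps the plain-text range [start, end) and returns the created wrappers. */
    static List<Element> wrap(Document doc, String tagName, int start, int end) {
        List<Text> texts = new ArrayList<>();
        collectText(doc.getDocumentElement(), texts); // snapshot taken before any split
        List<Element> wrappers = new ArrayList<>();
        int pos = 0; // plain-text offset at which the current text node starts
        for (Text t : texts) {
            int len = t.getLength();
            int from = Math.max(start, pos) - pos; // range start, local to this node
            int to = Math.min(end, pos + len) - pos; // range end, local to this node
            pos += len;
            if (from >= to) continue; // this text node does not overlap the range
            Text target = t;
            if (to < len) target.splitText(to); // cut off the tail, which stays outside
            if (from > 0) target = target.splitText(from); // keep only the middle part
            Element wrapper = doc.createElement(tagName);
            target.getParentNode().replaceChild(wrapper, target);
            wrapper.appendChild(target);
            wrappers.add(wrapper);
        }
        return wrappers;
    }

    public static void main(String[] args) {
        Document doc = parse("<p><b>GeForce</b> 9300M GS provides powerful "
                + "<i>visual computing features</i> to thin and light notebooks.</p>");
        List<Element> spans = wrap(doc, "span", 2, 16);
        System.out.println(spans.size() + " wrappers created");
        System.out.println(serialize(doc));
    }
}
```

With the range 2 to 16 this creates two span elements, one inside b around "Force" and one directly inside p around " 9300M GS", so the serialized document should match the mode1 markup above.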

while insert(nodeSpan, NodeOffset(nodeB, 2), NodeOffset(nodeP, 9), mode2) should produce:

<p><b>Ge</b><span><b>Force</b> 9300M GS</span> provides powerful <i>visual computing features</i> to thin and light notebooks.</p>
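
One possible way to get the mode2 behaviour with plain org.w3c.dom is to split the tree at both end points up to a common ancestor, cloning each element on the way, and then move everything in between into the wrapper. The sketch below is an assumption-laden illustration, not an existing library API: CloseParentsWrap, splitUpTo and wrap are hypothetical names, the caller must pass text nodes and a node that really is an ancestor of both, and a split at offset 0 or at the end of a text node leaves an empty node behind.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical sketch of "mode2": close and reopen parent elements around the wrapper. */
class CloseParentsWrap {
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /**
     * Splits the tree at (text, offset) all the way up to ancestor and returns the
     * child of ancestor that begins right after the split point.
     */
    static Node splitUpTo(Text text, int offset, Node ancestor) {
        Node right = text.splitText(offset); // tail of the text node
        Node parent = right.getParentNode();
        while (parent != ancestor) {
            Node clone = parent.cloneNode(false); // same tag and attributes, no children
            for (Node n = right; n != null; ) {
                Node next = n.getNextSibling();
                clone.appendChild(n); // appendChild moves the node out of parent
                n = next;
            }
            parent.getParentNode().insertBefore(clone, parent.getNextSibling());
            right = clone;
            parent = clone.getParentNode();
        }
        return right;
    }

    /** Wraps everything between the two points in a new child of commonAncestor. */
    static Element wrap(Text startText, int startOffset, Text endText, int endOffset,
                        Node commonAncestor, String tagName) {
        // Split at the end first so the start offset stays valid if both share a node.
        Node stop = splitUpTo(endText, endOffset, commonAncestor);
        Node first = splitUpTo(startText, startOffset, commonAncestor);
        Element wrapper = commonAncestor.getOwnerDocument().createElement(tagName);
        commonAncestor.insertBefore(wrapper, first);
        for (Node n = first; n != stop; ) {
            Node next = n.getNextSibling();
            wrapper.appendChild(n);
            n = next;
        }
        return wrapper;
    }

    public static void main(String[] args) {
        Document doc = parse("<p><b>GeForce</b> 9300M GS provides powerful "
                + "<i>visual computing features</i> to thin and light notebooks.</p>");
        Element p = doc.getDocumentElement();
        Text textB = (Text) p.getFirstChild().getFirstChild(); // "GeForce"
        Text textP = (Text) p.getFirstChild().getNextSibling(); // " 9300M GS provides powerful "
        Element span = wrap(textB, 2, textP, 9, p, "span");
        System.out.println(span.getFirstChild().getNodeName()); // prints "b"
        System.out.println(span.getTextContent());              // prints "Force 9300M GS"
    }
}
```

Here the b element is closed after "Ge" and a clone of it is reopened inside the span, which is the structure shown in the mode2 markup above.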

This is analogous to what users do when they apply formatting to a selection in a rich-text editor:

GeForce 9300M GS

I wonder whether anything like this exists in the open-source world, as I really don't want to reinvent the wheel. I had a quick look at the open-source HTML parsers for Java, without success.

When you post an answer:

Make sure the functions described above are available in the library's API (provide a link to the JavaDoc).
Make sure the library is pure Java (no JNI) and open source.

挥剑断情 2024-12-06 07:21:22

I took code that I already had, modified it to match your requests (work in progress), and wrapped it in an open-source project: ShtutXML. It is fairly well documented, so I doubt you'll have a problem using it.

The first request (finding a node and its offset from a global position) is already built in, and so is splitting text nodes in the XML (so you can easily wrap them in new nodes as you wish). Adding the logic for marking a region with an element is therefore rather trivial. I'll try to do it later, but this is my best effort on this request for now.

Running my example program on your XML gives this output:

************* BASE DOCUMENT *****************
DOCUMENT ROOT
|-<p >
| |-<b >
| | |-Text: GeForce
| |-Text:  9300M GS provides powerful 
| |-<i >
| | |-Text: visual computing features
| |-Text:  to thin and light notebooks.

*** Text ***
"GeForce 9300M GS provides powerful visual computing features to thin and light notebooks."

*** Node of each text segment ***
[b: null]: GeForce
[p: null]:  9300M GS provides powerful 
[i: null]: visual computing features
[p: null]:  to thin and light notebooks.


*** Offset testing ***
offset 0 is at [b: null] at 0
offset 40 is at [i: null] at 5
offset 80 is at [p: null] at 48

Asking it to split the element at the global position 4 will produce

*********** Split(4) DOCUMENT *****************
DOCUMENT ROOT
|-<p >
| |-<b >
| | |-Text: GeFo
| | |-Text: rce
| |-Text:  9300M GS provides powerful 
| |-<i >
| | |-Text: visual computing features
| |-Text:  to thin and light notebooks.

*** Node of each text segment ***
[b: null]: GeFo
[b: null]: rce
[p: null]:  9300M GS provides powerful 
[i: null]: visual computing features
[p: null]:  to thin and light notebooks.

Of course, this purely syntactic split changes nothing in the XML markup that the document serializes to, but it makes it possible to wrap one piece of text at a time in any other node you wish.

Edit: The first insertion mode is already supported

Edit 2: The second insertion mode is already supported

Notes:

Any modification you make to the document invalidates all offsets. Using stale offsets later will corrupt the entire document, so after each modification you must call GetOffset to retrieve the offsets again.
I know some of the functions are not documented. Basically, the only functions that should be used outside of the package are the ones you requested, from the StrXML class. More documentation will be added later, and you can contact me by email (see my profile page) with questions.
