XHTML manipulation library in Java
I am looking for an XML/XHTML Java library or framework that can perform the following two tasks for me.
A few definitions before going on:
NodeOffset(Node node, int offset) marks a point in a text node in the XML tree. nodeB, nodeI and nodeP are the Node instances corresponding to the XHTML tree shown below, and nodeSpan is a newly created node (where Node is not necessarily org.w3c.dom.Node and may be any other abstraction).
Flattening XHTML into plain text
The library should be able to produce plain-text output (e.g. by implementing CharSequence or similar) from a given XHTML document and provide a one-to-one mapping between characters in the output and the original XHTML node tree (e.g. via the function NodeOffset getNodeOffset(int plainTextOffset)).
Example: Suppose we have the following XHTML:
<p><b>GeForce</b> 9300M GS provides powerful <i>visual computing features</i> to thin and light notebooks.</p>
Then the plaintext representation will obviously be:
GeForce 9300M GS provides powerful visual computing features to thin and light notebooks.
Then, for example:
getNodeOffset(0) should return NodeOffset(nodeB, 0)
getNodeOffset(40) should return NodeOffset(nodeI, 5)
getNodeOffset(80) should return NodeOffset(nodeP, 49)
I might have the exact numbers wrong, but I hope you get the idea. Here is the example again, now with pseudo-markers inserted:
|GeForce 9300M GS provides powerful visua|l computing features to thin and light n|otebooks.
and
<p><b>|GeForce</b> 9300M GS provides powerful <i>visua|l computing features</i> to thin and light n|otebooks.</p>
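To make the requested mapping concrete, here is a minimal sketch built only on the JDK's org.w3c.dom API. PlainTextView and its nested NodeOffset are hypothetical names invented for this illustration, not part of any existing library. One assumption differs from the notation above: offsets are reported relative to the DOM Text node that holds the character, not relative to the parent element, so position 80 comes back as offset 20 in the last text child of the p element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical helper: a flattened, read-only text view over a DOM subtree. */
final class PlainTextView implements CharSequence {

    /** A point inside a DOM text node (the Text node itself, not its parent element). */
    static final class NodeOffset {
        final Text node;
        final int offset;
        NodeOffset(Text node, int offset) { this.node = node; this.offset = offset; }
    }

    private final StringBuilder text = new StringBuilder();
    private final List<Text> nodes = new ArrayList<>();      // text nodes in document order
    private final List<Integer> starts = new ArrayList<>();  // global start offset of each node

    PlainTextView(Node root) { collect(root); }

    /** Parses a well-formed XHTML fragment with the JDK's built-in XML parser. */
    static PlainTextView parse(String xhtml) {
        try {
            return new PlainTextView(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml))).getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Depth-first walk that appends every non-empty text node to the flat buffer. */
    private void collect(Node node) {
        if (node instanceof Text) {
            Text t = (Text) node;
            if (t.getLength() > 0) {
                starts.add(text.length());
                nodes.add(t);
                text.append(t.getData());
            }
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c);
        }
    }

    /** Maps a position in the flattened text back to (text node, offset within that node). */
    NodeOffset getNodeOffset(int plainTextOffset) {
        if (plainTextOffset < 0 || plainTextOffset >= text.length()) {
            throw new IndexOutOfBoundsException("offset " + plainTextOffset);
        }
        // Binary search for the last text node whose start is <= plainTextOffset.
        int lo = 0, hi = starts.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts.get(mid) <= plainTextOffset) lo = mid; else hi = mid - 1;
        }
        return new NodeOffset(nodes.get(lo), plainTextOffset - starts.get(lo));
    }

    @Override public int length() { return text.length(); }
    @Override public char charAt(int index) { return text.charAt(index); }
    @Override public CharSequence subSequence(int s, int e) { return text.subSequence(s, e); }
    @Override public String toString() { return text.toString(); }
}
```

With the example document, PlainTextView.parse(xhtml).getNodeOffset(40) yields the text node inside the i element with offset 5. The view is a snapshot: if the tree is modified afterwards, it has to be rebuilt.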
Node manipulation
The library should make it possible to inject nodes into the XHTML that span part of the tree, possibly crossing node boundaries, e.g. via the operation NodeSet insert(Node nodeToInsert, NodeOffset start, NodeOffset end, int mode). The function works in two modes:
- mode1: Split the node to be inserted if necessary. In this case the nodes split off from nodeToInsert are returned as the result of the operation.
- mode2: Close the parent nodes. nodeToInsert is returned as is.
For example, the insert(nodeSpan, NodeOffset(nodeB, 2), NodeOffset(nodeP, 9), mode1) operation should produce:
<p><b>Ge<span>Force</span></b><span> 9300M GS</span> provides powerful <i>visual computing features</i> to thin and light notebooks.</p>
The insert(nodeSpan, NodeOffset(nodeB, 2), NodeOffset(nodeP, 9), mode2) operation should produce:
<p><b>Ge</b><span><b>Force</b> 9300M GS</span> provides powerful <i>visual computing features</i> to thin and light notebooks.</p>
It is analogous to what users do in a rich-text editor:
GeForce 9300M GS
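For illustration, the mode1 behaviour can be approximated with plain org.w3c.dom calls: split the text nodes at the range boundaries with Text.splitText, then wrap each covered text run in its own shallow clone of the node to insert. This is only a sketch under assumptions. RangeWrapper and wrapRange are hypothetical names, the range is given as global plain-text offsets (2 and 16 correspond to NodeOffset(nodeB, 2) and NodeOffset(nodeP, 9) in the example), and mode2, which has to close and reopen parent elements, is not covered.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical helper sketching the "mode1" insert on a standard DOM tree. */
final class RangeWrapper {

    /** Parses a well-formed XHTML fragment with the JDK's built-in XML parser. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /**
     * Wraps every text run that falls inside the global text range [start, end)
     * in its own shallow clone of wrapper and returns the clones in document order.
     * The wrapper element must belong to the same document as root.
     */
    static List<Element> wrapRange(Node root, int start, int end, Element wrapper) {
        List<Text> texts = new ArrayList<>();
        collectText(root, texts);            // snapshot taken before the tree is modified
        List<Element> created = new ArrayList<>();
        int pos = 0;                          // global offset of the current text node
        for (Text t : texts) {
            int len = t.getLength();
            int from = Math.max(start, pos) - pos;   // range start, local to this node
            int to = Math.min(end, pos + len) - pos; // range end, local to this node
            pos += len;
            if (from >= to) continue;         // node lies outside the range
            Text target = t;
            if (to < len) target.splitText(to);              // cut off the tail after the range
            if (from > 0) target = target.splitText(from);   // keep only the covered part
            Element clone = (Element) wrapper.cloneNode(false);
            target.getParentNode().replaceChild(clone, target);
            clone.appendChild(target);
            created.add(clone);
        }
        return created;
    }

    private static void collectText(Node node, List<Text> out) {
        if (node instanceof Text) {
            out.add((Text) node);
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }
}
```

Calling RangeWrapper.wrapRange(doc.getDocumentElement(), 2, 16, doc.createElement("span")) on the example document creates two span elements, one around "Force" inside the b element and one around " 9300M GS" directly under p, which matches the mode1 result shown above.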
I wonder whether there is anything like this in the open-source world, as I really don't want to reinvent the wheel... I quickly checked Open Source HTML Parsers in Java, without success.
When you post an answer:
- Make sure the above-mentioned functions are available in the library API (provide a link to the JavaDoc).
- The library must be Java-native (no JNI) and open source.
Comments (3)
I wrapped code that I already had, modified to match your requests (WIP), in an open-source project: ShtutXML. It is fairly well documented, so I doubt you'll have a problem using it.
The first request (Finding a node and offsets from a global position) is already built in, and splitting of text nodes in the XML is already built in (so you can easily wrap them in new nodes as you wish). Therefore, adding the logic for marking areas with an element is rather trivial. I'll try to do it later, but this is my best effort on this request for now.
On your XML, using my example program this is my output:
Asking it to split the element at the global position 4 will produce
Of course this syntactical split means nothing for the actual XML code that matches that document, but it will allow wrapping one text part at a time with any other node you wish.
Edit: The first insertion mode is already supported
Edit 2: The second insertion mode is already supported
Notes:
The requested functions are in the StrXML class. More documentation will be added later, and you can contact me by email (see my profile page) with questions.
Maybe you could try jsoup - http://jsoup.org.
It is an open-source Java library distributed under the MIT license. Its source code is available on GitHub.
From the home page:
With jsoup you can:
Here is its Javadoc: http://jsoup.org/apidocs/
I tried Jericho a couple of years back, and it was deceptively simple to use its API for parsing. I used it for logging into Yahoo Mail and fetching the contacts from the address book. I am sure it can do much more than that. The home page mentions one of your requirements, "Flattening XHTML into plain text", as one of its features. Some of the features that might be relevant to your question are:
The begin and end positions in the source document of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a tree.
And it's free and open source. (Quoting the site: "You are therefore free to use it in commercial applications subject to the terms detailed in either one of these licence documents.")