Grabbing the entire document tree using the OpenOffice API
I would like to grab the entire tree of a Writer document in OpenOffice 3.1. I need to collect data on all the elements in the tree, not only the Text elements.
Loading the XTextDocument and calling getText() gives the XText element. More specifically, using an XEnumerationAccess from the XText will only iterate over the TextRange.
From the OpenOffice documentation /DevGuide/Text/Iterating_over_Text:
The second interface of com.sun.star.text.Text is XEnumerationAccess. A Text service enumerates all paragraphs in a text and returns objects which support com.sun.star.text.Paragraph. This includes tables, because writer sees tables as specialized paragraphs that support the com.sun.star.text.TextTable service.
Some additional documentation here:
The text portion enumeration of a paragraph does not supply contents which do belong to the paragraph, but do not fuse together with the text flow. These could be text frames, graphic objects, embedded objects or drawing shapes anchored at the paragraph, characters or as character. The TextPortionType "TextContent" indicate if there is a content anchored at a character or as a character. If you have a TextContent portion type, you know that there are shape objects anchored at a character or as a character.
My test documents indicate that I do get an XTextContent, and its XTextRange can be collected via getAnchor(). But how can I determine the type of content that I am collecting? The only method is getString(). If the object is an embedded image, how do I collect its data?
I am using C++, but I believe a solution in Java would be portable.
Migrated From Answer
Due to poor formatting, this comment is posted as an answer.
Thanks for your response.
I intend to use the API.
I am trying the example of collecting GraphicObjects from the document. By using an XGraphicObjectsSupplier I can get a collection via getGraphicObjects(). Each object from the collection is an Any, and printing its type via getValueTypeName() gives XTextContent.
The API describes the collection as holding a TextGraphicObject "service". How do I grab an instance of it?
Comments (2)
Answering your question is complicated, but I'll try to make myself clear.
Exporting the document to XML would make it easier to process using SAX. If you go the XML way, you would have to implement XDocumentHandler and read the document (optionally filtering out what you don't need). The rest of the work would be either XSLT transformations or, for big documents, SAX.
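To give an idea of what the XML route can look like, here is a minimal sketch. It does not implement XDocumentHandler; instead it assumes the document has already been saved as .odt, relies on an .odt file being a ZIP archive whose content.xml entry holds the document body, and reads that entry with the JDK's own SAX parser. The class name OdtContentScan and its fields are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: walks content.xml and records what it sees.
final class OdtContentScan extends DefaultHandler {
    // Namespace URIs used by OpenDocument content.xml.
    private static final String DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
    private static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    // How often each element (by qualified name, e.g. text:p) occurs.
    final Map<String, Integer> elementCounts = new TreeMap<>();
    // xlink:href values of draw:image elements, in document order.
    final List<String> imageHrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        elementCounts.merge(qName, 1, Integer::sum);
        if (DRAW_NS.equals(uri) && "image".equals(localName)) {
            String href = attrs.getValue(XLINK_NS, "href");
            if (href != null) {
                imageHrefs.add(href);
            }
        }
    }

    // Parses a content.xml stream and returns the filled-in handler.
    static OdtContentScan scan(InputStream contentXml)
            throws IOException, SAXException, ParserConfigurationException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on namespace URIs
        OdtContentScan handler = new OdtContentScan();
        factory.newSAXParser().parse(contentXml, handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java OdtContentScan <file.odt>");
            return;
        }
        try (ZipFile odt = new ZipFile(args[0])) {
            ZipEntry entry = odt.getEntry("content.xml");
            if (entry == null) {
                System.err.println("no content.xml in " + args[0]);
                return;
            }
            try (InputStream in = odt.getInputStream(entry)) {
                OdtContentScan result = scan(in);
                result.elementCounts.forEach((name, n) -> System.out.println(name + " x" + n));
                result.imageHrefs.forEach(href -> System.out.println("image: " + href));
            }
        }
    }
}
```

For embedded images, the collected href usually names another entry in the same archive (for example something under Pictures/), so the image bytes could be read with the same ZipFile. Treat that as an assumption to check against your own documents.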
If you prefer using only the API, you'll have to play a lot with XServiceInfo and UnoRuntime.queryInterface.
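As a rough illustration of the API route, the sketch below shows only the dispatch step: deciding what kind of content you hold by asking which service it supports. To keep it compilable without the UNO jars, the UNO object is abstracted as a Predicate<String>; in real code that predicate would be the supportsService method of the XServiceInfo obtained via UnoRuntime.queryInterface. The class name WriterContentKind and the kind labels are made up, and the list of service names is a small selection, not an exhaustive one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper: maps a UNO object's supported services to a label.
final class WriterContentKind {
    // Service names from the com.sun.star.text module, checked in this order.
    private static final Map<String, String> KIND_BY_SERVICE = new LinkedHashMap<>();
    static {
        KIND_BY_SERVICE.put("com.sun.star.text.TextTable", "table");
        KIND_BY_SERVICE.put("com.sun.star.text.TextGraphicObject", "graphic");
        KIND_BY_SERVICE.put("com.sun.star.text.TextEmbeddedObject", "embedded object");
        KIND_BY_SERVICE.put("com.sun.star.text.TextFrame", "text frame");
        KIND_BY_SERVICE.put("com.sun.star.text.Paragraph", "paragraph");
    }

    // supportsService stands in for XServiceInfo.supportsService. With the UNO
    // jars on the classpath the call site would look roughly like:
    //   XServiceInfo info = (XServiceInfo)
    //       UnoRuntime.queryInterface(XServiceInfo.class, element);
    //   String kind = WriterContentKind.classify(info::supportsService);
    static String classify(Predicate<String> supportsService) {
        for (Map.Entry<String, String> entry : KIND_BY_SERVICE.entrySet()) {
            if (supportsService.test(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // Fake service list standing in for a real UNO object (assumption).
        Set<String> fakeServices = new HashSet<>(Arrays.asList(
                "com.sun.star.text.TextContent",
                "com.sun.star.text.TextGraphicObject"));
        System.out.println(classify(fakeServices::contains));
    }
}
```

Once the service is known, the same UnoRuntime.queryInterface call can be used to ask the object for whichever interface exposes the data you need, such as XPropertySet for its properties.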
in java: