How can I grab the text of each page of a Word document separately (using .NET)?
I need to determine which pages of a Word document a keyword occurs on. I have some tools that can get me the text of the document, but nothing that tells me which pages the text occurs on. Does anyone have a good starting place for me? I'm using .NET.
Thanks!
Edit: Additional constraint: I can't use any of the Interop stuff.
Edit 2: If anybody knows of stable libraries that can do this, that'd also be helpful. I use Aspose, but as far as I know it doesn't have anything for this.
Comments (4)
This is how I get the text out. I believe you can set the selection range to a page and then test that text. It might be a little backwards from what you need, but it could be a place to start.
How are you defining a page?
If you only count section and hard page breaks, it is complex but doable. If you want to count soft page breaks, the task becomes very, very difficult and somewhat meaningless. Consider that the position of each soft page break is computed dynamically at run time and is not stored in the file itself. It depends on a huge number of factors, including the active printer driver (yes, it can change for the same file on a different computer), fonts, kerning, line spacing, margins, and so on.
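For the hard-page-break case, a minimal sketch could look like the following. It is written in Python only to keep it short and standard-library-only; the same steps should port to .NET without Interop using `System.IO.Compression.ZipArchive` and `System.Xml.Linq`. It assumes a .docx (Open XML) file, looks only at explicit `<w:br w:type="page"/>` elements in `word/document.xml`, and ignores section breaks, "page break before" paragraph settings, headers, footers, and all soft breaks. The function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TAG_T = f"{{{W_NS}}}t"        # text run content
TAG_BR = f"{{{W_NS}}}br"      # break element (line, column, or page)
TAG_P = f"{{{W_NS}}}p"        # paragraph
ATTR_TYPE = f"{{{W_NS}}}type"


def split_on_hard_breaks(document_xml: bytes) -> list:
    """Split the document text at explicit (hard) page breaks only."""
    root = ET.fromstring(document_xml)
    pages = [[]]
    # iter() walks elements in document order, so text and breaks
    # are seen in the order they appear in the file.
    for node in root.iter():
        if node.tag == TAG_T:
            pages[-1].append(node.text or "")
        elif node.tag == TAG_BR and node.get(ATTR_TYPE) == "page":
            pages.append([])          # start collecting the next "page"
        elif node.tag == TAG_P:
            pages[-1].append("\n")    # keep paragraphs separated
    return ["".join(chunks).strip() for chunks in pages]


def read_docx_pages(path_or_file) -> list:
    """Open a .docx (a zip archive) and split its main document part."""
    with zipfile.ZipFile(path_or_file) as package:
        return split_on_hard_breaks(package.read("word/document.xml"))
```

Each returned string is the text between two hard page breaks, so a keyword found in item `i` sits in the (i+1)-th hard-break-delimited chunk. That matches printed page numbers only if the document has no soft page breaks in between.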
One crappy way to do this with Aspose is to convert the Word file to a PDF and then grab the text on each page.
I don't know anything about the Aspose internals or how they place soft page breaks when converting, but this is the best I've got so far.
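Once the per-page text has been pulled out of the PDF, the keyword lookup itself is simple. The sketch below is in Python for brevity and assumes the page texts are already available as a list of strings in page order; how they are extracted is left to whatever PDF tool is in use. The function name `pages_containing` is hypothetical.

```python
def pages_containing(page_texts, keywords):
    """Map each keyword to the 1-based page numbers whose text contains it.

    Matching is a case-insensitive substring test; a keyword that is split
    across a page boundary will not be found.
    """
    folded_pages = [text.casefold() for text in page_texts]
    return {
        keyword: [
            number
            for number, text in enumerate(folded_pages, start=1)
            if keyword.casefold() in text
        ]
        for keyword in keywords
    }


if __name__ == "__main__":
    pages = ["Intro to layout", "Soft page breaks", "Layout again"]
    print(pages_containing(pages, ["layout", "break"]))
```

The page numbers reported this way are those of the converted PDF, which may differ from what Word shows if the converter lays out the document differently.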
Thank you for using Aspose.Words.
In the public API we currently expose only the "flow document" information, e.g. paragraphs, tables, lists, etc. Internally, we build a page layout model that has classes such as page, block of text, line of text, and so on. There are, of course, internal links between the document model and the layout model, so it is possible to find out which page ends where and so on. Making this information available via the public API is (well, still) high on our priority list.
Have you logged your request in the Aspose.Words support forums? We use this info to maintain a voting system and will work first on the features that get the most votes.