Searching documents
I am currently working on a project where I need to search for phrases and words in a Word document through code.
Basically, a Word document will be uploaded and then searched for some words.
What would be the most efficient way to do this?
Edit: I'm more interested in what to use to read the document (i.e., is MS Interop the best way?) and in whether it would be very advantageous to index it before searching (and if so, how?).
Edit: The search could potentially be for thousands of phrases.
Comments (3)
Open a Word document in C#.
After that, it's just a matter of using the Contains method, or something like it. It's really not that difficult. You might want to watch out for differences between lower- and uppercase letters. Then simply check the extracted text for each search term.
I don't really like this solution, though. It also depends somewhat on what the Word document object looks like.
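The Contains idea above can be sketched roughly as follows. This is only an illustration, not the original poster's pseudocode: the class and method names (ContainsSearch, FindTerms) are made up, and the document text is assumed to have already been extracted into a plain string. IndexOf with StringComparison.OrdinalIgnoreCase is used here instead of Contains so that no lowercased copy of the text is needed.

```csharp
using System;
using System.Collections.Generic;

public static class ContainsSearch
{
    // Returns the search terms that occur anywhere in documentText, ignoring case.
    // Hypothetical helper: documentText is assumed to be the already-extracted text.
    public static List<string> FindTerms(string documentText, IEnumerable<string> terms)
    {
        var found = new List<string>();
        foreach (string term in terms)
        {
            // A case-insensitive substring check; one pass over the text per term.
            if (documentText.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                found.Add(term);
            }
        }
        return found;
    }
}
```

Each term scans the whole text once, so with thousands of phrases and large documents this gets slow, which is one reason to consider indexing.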
Regex is a good way to find patterns. You can find more information about it here:
REGEX
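As a rough sketch of the regex approach, many phrases can be combined into one case-insensitive pattern so the text is scanned only once. The names RegexSearch, BuildPattern and CountMatches are hypothetical, and the sketch assumes a non-empty phrase list in which every phrase starts and ends with a letter or digit (otherwise the \b word boundaries will not behave as intended).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSearch
{
    // Builds one case-insensitive pattern that matches any of the phrases as whole words.
    public static Regex BuildPattern(IEnumerable<string> phrases)
    {
        // Longer phrases come first so they win over their own prefixes at the same position.
        IEnumerable<string> escaped = phrases
            .OrderByDescending(p => p.Length)
            .Select(p => Regex.Escape(p));
        string pattern = @"\b(?:" + string.Join("|", escaped) + @")\b";
        return new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    // Counts how often each phrase matches in the text (keys compare case-insensitively).
    public static Dictionary<string, int> CountMatches(string text, IEnumerable<string> phrases)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in BuildPattern(phrases).Matches(text))
        {
            counts.TryGetValue(m.Value, out int n);
            counts[m.Value] = n + 1;
        }
        return counts;
    }
}
```

Regex.Escape keeps characters such as "." or "+" in a phrase from being interpreted as pattern syntax.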
Basically, you would use a free library from MS called OpenXML SDK 2 to open the Word document (works with Word 2007 and up). This library works without Word being installed. You can then extract the text and search it any way you like, for example with System.Text.RegularExpressions.Regex.
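A minimal sketch of the extraction step with the OpenXML SDK might look like this. The class and method names (DocxText, ReadParagraphs) are made up for illustration. It reads only the main document body, so text in headers, footers, footnotes and similar parts is not included, and it assumes the file is a valid .docx.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class DocxText
{
    // Returns the text of each paragraph in the main document body.
    public static List<string> ReadParagraphs(string path)
    {
        // false = open read-only; Word itself does not need to be installed.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;
            return body.Descendants<Paragraph>()
                       .Select(p => p.InnerText)
                       .ToList();
        }
    }
}
```

Reading paragraph by paragraph keeps some positional context (the paragraph number), which is useful if the text is indexed afterwards.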
When you extract the text, you can index it by storing all words/phrases with context information (position etc.) in a DB, so that you only need to SELECT from the DB when a user gives you a phrase to search for. The design of the index is up to you: will you need case-insensitive search, for example? Another option would be to use Solr/Lucene for indexing and to access that index via its API, providing your own UI for the search.
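To illustrate the indexing idea, here is a small in-memory sketch that stores each word with its positions and answers phrase queries from that index. It stands in for the database described above; the WordIndex class, its tokenization rule (\w+ on lowercased text) and its method names are assumptions made for the example, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// In-memory stand-in for a DB index: maps each lowercased word to its word positions.
public sealed class WordIndex
{
    private readonly Dictionary<string, List<int>> _positions =
        new Dictionary<string, List<int>>(StringComparer.Ordinal);

    // Splits text into lowercased words; punctuation and whitespace are dropped.
    private static IEnumerable<string> Tokenize(string text)
    {
        foreach (Match m in Regex.Matches(text.ToLowerInvariant(), @"\w+"))
            yield return m.Value;
    }

    public WordIndex(string text)
    {
        int position = 0;
        foreach (string word in Tokenize(text))
        {
            if (!_positions.TryGetValue(word, out List<int> list))
                _positions[word] = list = new List<int>();
            list.Add(position++); // positions are appended in ascending order
        }
    }

    // Returns the word positions at which the phrase starts (case-insensitive).
    public List<int> Find(string phrase)
    {
        var words = new List<string>(Tokenize(phrase));
        var hits = new List<int>();
        if (words.Count == 0 || !_positions.TryGetValue(words[0], out List<int> starts))
            return hits;

        foreach (int start in starts)
        {
            bool match = true;
            for (int i = 1; i < words.Count && match; i++)
            {
                // Word i of the phrase must sit exactly i positions after the start.
                match = _positions.TryGetValue(words[i], out List<int> list)
                        && list.BinarySearch(start + i) >= 0;
            }
            if (match) hits.Add(start);
        }
        return hits;
    }
}
```

The index is built once per document, after which each phrase lookup only touches the position lists of its own words. In a database the same layout would be a table of (word, position) rows queried with SELECT.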