Searching MS Word binary files for specific content
I have some .doc binary files stored in my database, and I would now like to search them all (without converting them to .doc) to see which ones contain the word "hello", for instance.
Is there any way to do this search in the binary file?
Comments (2)
You could go down the route of using commercial tools. Aspose.Words can load a document from a stream and has all sorts of methods for finding text within the document.
If you have the stream from the DB, then your code would load the document directly from that stream and search its text.
Note: The benefit of this tool is that it does not require Word objects to be installed and it can work with streams in memory.
Not without a lot of pain, as far as I can tell. According to Wikipedia, Microsoft has within the past few years finally released the .doc specification. So you could create a parser based on the spec if you have the time, assuming all of your documents are in the same version of the .doc format.
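Before writing a parser, the "same version" assumption can be partly checked by looking at the first bytes of each stored blob. The sketch below only identifies the container: binary .doc files are OLE2 compound files, while .docx files are ZIP archives. It does not read the Word version inside the container, and other OLE2 files (an .xls, for example) share the same signature. The function name `sniff_word_container` is made up for this illustration.

```python
# Signature of an OLE2 compound file, the container used by binary .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP local file header, the container used by .docx files.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_word_container(data: bytes) -> str:
    """Classify a blob by its leading bytes: 'ole2', 'zip', or 'unknown'.

    This says nothing about which Word version wrote the file; it only
    tells a binary compound file apart from a ZIP-based one.
    """
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

Running this over every row first would show whether the database holds a mix of formats before any parsing effort is spent.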
Of course you could just search for the text you're looking for amid all the binary data, on the assumption that the actual text is stored as plain text. But even if that assumption were true, how could you be sure that the plain text you found was the actual document text, and not some of the document metadata that's also stored in plain text? And there's always the off chance that unrelated binary data will happen to match your text pattern.
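For what it's worth, here is a minimal sketch of that brute-force search, with all the caveats above still applying. It assumes the text is stored either as single-byte Windows-1252 or as UTF-16LE, so it looks for the word in both encodings; it is case-sensitive, cannot tell body text from metadata, and will miss a word if the file happens to store it in non-contiguous pieces. The function name `find_word_in_doc_bytes` is invented for this example.

```python
def find_word_in_doc_bytes(data: bytes, word: str) -> list[tuple[str, int]]:
    """Return (encoding, byte offset) pairs for each raw occurrence of word.

    Assumption: the text sits in the blob as cp1252 or UTF-16LE bytes.
    A hit is only a hint, not proof that the word is in the document body.
    """
    hits = []
    for encoding in ("cp1252", "utf-16-le"):
        try:
            needle = word.encode(encoding)
        except UnicodeEncodeError:
            # The word cannot be represented in this encoding; skip it.
            continue
        start = data.find(needle)
        while start != -1:
            hits.append((encoding, start))
            start = data.find(needle, start + 1)
    return hits


if __name__ == "__main__":
    # Stand-in for a blob read from the database.
    blob = b"\x00\x01" + "say hello".encode("utf-16-le") + b"\xff\xfe"
    print(find_word_in_doc_bytes(blob, "hello"))
```

An empty result does not prove the word is absent, and a non-empty one does not prove it is in the visible text, so this is best treated as a cheap pre-filter.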
If the Word libraries are available to you, I would go that route. If not, a homegrown parser may be your least bad option.