Extracting text from doc and docx
I would like to know how I can read the contents of a .doc or .docx file. I'm using a Linux VPS and PHP, but if there is a simpler solution in another language, please let me know, as long as it works on a Linux web server.
Here I have added a solution to get the text from .doc and .docx Word files.
How to extract text from Word files (.doc, .docx) in PHP
For .doc
For .docx
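For the .docx case, here is a minimal sketch in Python (chosen for illustration; it is not the answerer's PHP code). It relies only on the fact that a .docx file is a ZIP archive whose main text lives in `word/document.xml`, and it uses nothing outside the standard library. The function name `docx_to_text` is made up for this example. The binary .doc format is not covered here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the plain text of a .docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper written for this example.)
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(W + "p"):
        parts = []
        for node in paragraph.iter():
            if node.tag == W + "t" and node.text:
                parts.append(node.text)   # literal text of a run
            elif node.tag == W + "tab":
                parts.append("\t")        # tab character inside a run
            elif node.tag in (W + "br", W + "cr"):
                parts.append("\n")        # manual line break
        lines.append("".join(parts))
    # Joining paragraphs with newlines keeps words from adjacent
    # paragraphs from running together.
    return "\n".join(lines)
```

Calling `docx_to_text("report.docx")` would return the document body as a single string. Headers, footers and footnotes live in other parts of the archive and are ignored by this sketch.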
This is a .docx-only solution. For .doc or .pdf you'll need something else, such as pdf2text.php for PDF.
Parse .docx, .odt, .doc and .rtf documents
I wrote a library that parses .docx, .odt and .rtf documents, based on answers here and elsewhere.
The major improvement I have made to the .docx and .odt parsing is that the library processes the XML that describes the document and attempts to map it to HTML tags, i.e. em and strong tags. This means that if you're using the library for a CMS, text formatting is not lost.
You can get it here
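To illustrate the idea of mapping document XML to HTML tags, here is a rough Python sketch. It is not the library's code: it only handles the bold (`w:b`) and italic (`w:i`) run properties of a .docx `word/document.xml` string, and the function names are invented for this example.

```python
import html
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(run_props, name):
    """True if a toggle property (e.g. w:b, w:i) is present and not switched off."""
    if run_props is None:
        return False
    prop = run_props.find(W + name)
    if prop is None:
        return False
    return prop.get(W + "val", "true") not in ("0", "false", "off")


def document_xml_to_html(document_xml):
    """Convert the paragraphs of a document.xml string to simple HTML.

    Hypothetical helper: italic runs become <em>, bold runs become <strong>.
    """
    root = ET.fromstring(document_xml)
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        pieces = []
        for run in paragraph.iter(W + "r"):
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            if not text:
                continue
            text = html.escape(text)  # keep &, < and > safe in the output
            props = run.find(W + "rPr")
            if _is_on(props, "i"):
                text = "<em>" + text + "</em>"
            if _is_on(props, "b"):
                text = "<strong>" + text + "</strong>"
            pieces.append(text)
        paragraphs.append("<p>" + "".join(pieces) + "</p>")
    return "\n".join(paragraphs)
```

Formatting inherited from paragraph or character styles is not resolved here; only properties set directly on a run are honoured.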
My solution is Antiword for .doc and docx2txt for .docx.
Assuming a Linux server that you control, download each one, extract it, then install it. I installed each one system-wide:
Antiword:
make global_install
docx2txt:
make install
Then use these tools to extract the text into a string in PHP:
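A minimal sketch of that step, written in Python rather than PHP and not the answerer's original code. It assumes both tools are on the `PATH` under the names `antiword` and `docx2txt.pl`, and that passing `-` as docx2txt's output argument sends the text to standard output; check both against your own installation. The helper names are invented for this example.

```python
import os
import subprocess

# Assumed command names; adjust to wherever `make global_install` and
# `make install` placed the tools on your system.
ANTIWORD = "antiword"
DOCX2TXT = "docx2txt.pl"


def build_command(path):
    """Choose the external converter for `path` based on its extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".doc":
        return [ANTIWORD, path]
    if extension == ".docx":
        # Assumption: a trailing "-" asks docx2txt to write to standard output.
        return [DOCX2TXT, path, "-"]
    raise ValueError("unsupported file type: " + path)


def extract_text(path):
    """Run the converter and return whatever it printed as a string."""
    result = subprocess.run(build_command(path), capture_output=True, check=True)
    # The output encoding depends on the tool and locale; undecodable
    # bytes are replaced rather than raising an error.
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the command as a list, rather than building a shell string, means an uploaded file name with spaces or shell metacharacters is handed to the tool as a single argument.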
docx2txt requires Perl.
no_freedom's solution does extract text from .docx files, but it can butcher whitespace. Most files I tested had instances where words that should be separated had no space between them. That's not good when you want to run full-text search over the documents you're processing.
Try Apache POI. It works well for Java. I suppose you won't have any difficulty installing Java on Linux.
I would suggest extracting the text with Apache Tika. It can extract content from many file types, such as .doc/.docx, PDF and many others.
I used docxtotxt to extract docx file content. My code is as follows:
I made a few small improvements to the doc-to-txt converter function.
It now preserves empty lines, so the text file keeps the original line-by-line layout.
You can use Apache Tika as a complete solution; it provides a REST API.
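As a rough sketch of the Tika route in Python, using only the standard library: it assumes a Tika server is already running locally on its default port 9998 and sends the file with an HTTP `PUT` to the `/tika` endpoint, asking for plain text. The URL constant and function names are illustrative.

```python
import urllib.request

# Assumption: a Tika server is already running on this machine
# and listening on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(document_bytes, url=TIKA_URL):
    """Prepare a PUT request that asks Tika for the document's plain text."""
    return urllib.request.Request(
        url,
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_with_tika(path, url=TIKA_URL):
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Because Tika detects the file type itself, the same call should work for .doc, .docx and PDF input, which is why it is attractive as a single solution.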
Another good library is RawText, as it can run OCR over images and extract text from any document. It's non-free, and it works over a REST API.
Sample code for extracting your file with RawText: