How to extract plain text from .doc and .docx files?

Posted on 2024-11-01 16:16:22, 1539 words, 2 views, 0 comments

Closed. This question is seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. It does not meet Stack Overflow guidelines. It is not currently accepting answers.

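We don't allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.

Closed 9 years ago.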

空名 2024-11-08 16:16:22

If you want pure plain text (my requirement), then all you need is:

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

which I found on commandlinefu.

It unzips the docx file, reads the main document part (word/document.xml), and strips all the XML tags. Obviously, all formatting is lost, including paragraph breaks.
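The one-liner above runs all paragraphs together and leaves XML entities such as &amp; undecoded. A minimal sketch of a variant that keeps one line per paragraph is shown below. The function names docx_xml_to_text and docx_to_text are made up for illustration, and the sketch assumes each closing </w:p> tag marks the end of a paragraph; tabs, line breaks inside a paragraph, headers and footnotes are not handled.

```shell
# Convert WordprocessingML read from stdin into plain text.
# Hypothetical helper, not part of any tool mentioned in this thread.
docx_xml_to_text() {
    # 1. end of paragraph (</w:p>) becomes a newline (backslash + literal newline)
    # 2. every remaining tag is removed
    # 3. the five predefined XML entities are decoded, "&amp;" last
    sed -e 's/<\/w:p>/\
/g' \
        -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'
}

# Extract the main document part of a .docx file and convert it.
docx_to_text() {
    unzip -p "$1" word/document.xml | docx_xml_to_text
}
```

Usage would be something like docx_to_text some.docx > some.txt. Tags are stripped before entities are decoded so that text like &lt;b&gt; in the document is not mistaken for markup.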

白龙吟 2024-11-08 16:16:22

LibreOffice

One option is libreoffice/openoffice in headless mode (make sure all other instances of libreoffice are closed first):

libreoffice --headless --convert-to "txt:Text (encoded):UTF8" mydocument.doc

For more details see e.g. this link: http://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/

For a list of libreoffice filters see http://cgit.freedesktop.org/libreoffice/core/tree/filter/source/config/fragments/filters

Since the openoffice command line syntax is a bit too complicated, there is a handy wrapper which can make the process easier: unoconv.
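For more than one file, the same command can be wrapped in a loop. The following is only a sketch: the function name convert_dir_to_txt and the directory layout are assumptions, and it relies on LibreOffice's --outdir option to choose where the .txt files are written.

```shell
# Convert every .doc/.docx file in a directory to UTF-8 text with LibreOffice.
# Usage: convert_dir_to_txt <input_dir> <output_dir>   (hypothetical helper)
convert_dir_to_txt() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for f in "$in_dir"/*.doc "$in_dir"/*.docx; do
        [ -e "$f" ] || continue   # skip a pattern that matched nothing
        libreoffice --headless --convert-to "txt:Text (encoded):UTF8" \
            --outdir "$out_dir" "$f"
    done
}
```

The files are converted one after another rather than in parallel, which fits the advice above about not having other LibreOffice instances running.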

Apache POI

Another option is Apache POI, a well-supported Java library which, unlike antiword, can read, create and convert .doc, .docx, .xls, .xlsx, .ppt and .pptx files.

Here is the simplest possible Java code for converting a .doc or .docx document to plain text:

import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.xmlbeans.XmlException;

public class WordToTextConverter {
    public static void main(String[] args) {
        // Both the source and the destination path are required.
        if (args.length < 2) {
            System.out.println("Usage: java WordToTextConverter <word_file> <text_file>");
            return;
        }
        convertWordToText(args[0], args[1]);
    }

    public static void convertWordToText(String src, String dest) {
        // try-with-resources closes the streams and the extractor,
        // even when extraction or writing fails part-way through.
        try (FileInputStream fs = new FileInputStream(src);
             POITextExtractor extractor = ExtractorFactory.createExtractor(fs);
             FileWriter fw = new FileWriter(dest)) {
            fw.write(extractor.getText());
        } catch (IOException | OpenXML4JException | XmlException e) {
            e.printStackTrace();
        }
    }
}


# Maven dependencies (pom.xml):

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>my.wordconv</groupId>
<artifactId>my.wordconv.converter</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>3.17</version>
    </dependency>
</dependencies>
</project>

NOTE: You will need to add the apache poi libraries to the classpath. On ubuntu/debian the libraries can be installed with sudo apt-get install libapache-poi-java — this will install them under /usr/share/java. For other systems you'll need to download the library and unpack the archive to a folder that you should use instead of /usr/share/java. If you use maven/gradle (the recommended option), then include the org.apache.poi dependencies as shown in the code snippet.

The same code will work for both .doc and .docx as the required converter implementation will be chosen by inspecting the binary stream.
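To illustrate what "inspecting the binary stream" can mean, here is a rough shell sketch that looks at the first bytes of a file. It is not POI's actual implementation, and the function name container_type is invented for this example. It relies on two well-known signatures: ZIP archives (which .docx files are) start with the bytes 50 4B 03 04, and OLE2 compound files (which legacy .doc files are) start with D0 CF 11 E0 A1 B1 1A E1.

```shell
# Print the container type of a file based on its first 8 bytes:
#   zip     - ZIP archive (.docx, but also .xlsx, .pptx and plain .zip)
#   ole2    - OLE2 compound file (.doc, but also legacy .xls and .ppt)
#   unknown - anything else
container_type() {
    magic=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
    case $magic in
        504b0304*)        echo zip ;;
        d0cf11e0a1b11ae1) echo ole2 ;;
        *)                echo unknown ;;
    esac
}
```

Note that the signature only identifies the container, so telling a .docx from an .xlsx would need a further look inside the archive.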

Compile the class above (assuming it's in the default package, and the apache poi jars are under /usr/share/java):

javac -cp '/usr/share/java/*:.' WordToTextConverter.java

Run the conversion:

java -cp '/usr/share/java/*:.' WordToTextConverter doc.docx doc.txt

A clonable gradle project which pulls all necessary dependencies and generates the wrapper shell script (with gradle installDist).

为你拒绝所有暧昧 2024-11-08 16:16:22

Try Apache Tika. It supports most document formats (every MS Office format, OpenOffice/LibreOffice formats, PDF, etc.) using Java-based libraries (among others, Apache POI). It's very simple to use:

java -jar tika-app-1.4.jar --text ./my-document.doc

下壹個目標 2024-11-08 16:16:22

Try "antiword" or "antiword-xp-rb"

My favorite is antiword:

http://www.winfield.demon.nl/

And here is a similar project which claims to support docx:

https://github.com/rainey/antiword-xp-rb/wiki

披肩女神 2024-11-08 16:16:22

I find wv to be better than catdoc or antiword. It can deal with .docx and convert to text or html. Here is a function I added to my .bashrc to temporarily view the file in the terminal. Change it as required.

# open a Word document in less (e.g. worl document.doc)
worl() {
    local doc
    doc=$(mktemp /tmp/output.XXXXXXXXXX) || return 1
    wvText "$1" "$doc"   # quoted so that paths with spaces work
    less "$doc"
    rm -f "$doc"
}

心如荒岛 2024-11-08 16:16:22

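I recently dealt with this issue and found the OpenOffice/LibreOffice command-line tools to be unreliable in production (thousands of documents processed, dozens concurrently).

Ultimately, I built a lightweight wrapper, DocRipper, which is much faster and grabs all text from .doc, .docx and .pdf files without formatting. DocRipper uses Antiword, grep and pdftotext to grab the text and return it.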
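The general idea of picking a tool per file type can be sketched in a few lines of shell. This is not DocRipper's code: the function name extract_text is hypothetical, the .docx branch borrows the unzip and sed approach from the first answer instead of grep, and the sketch assumes antiword and pdftotext are installed.

```shell
# Print the plain text of a .doc, .docx or .pdf file on standard output.
# Hypothetical dispatcher; matching is on the (lower-case) file extension only.
extract_text() {
    case $1 in
        *.doc)  antiword "$1" ;;
        *.docx) unzip -p "$1" word/document.xml | sed -e 's/<[^>]\{1,\}>//g' ;;
        *.pdf)  pdftotext "$1" - ;;   # "-" sends the text to stdout
        *)      echo "unsupported file type: $1" >&2
                return 1 ;;
    esac
}
```

For example, extract_text report.pdf > report.txt would write the text of report.pdf to report.txt.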
