How to extract plain text from .doc and .docx files?

Posted on 2024-11-01 16:16:22

Comments (6)

空名 2024-11-08 16:16:22

If you want pure plain text (my requirement), then all you need is

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

which I found at commandlinefu.

It unzips the docx file, prints the main document part (word/document.xml), then strips all the XML tags and any non-printable characters. Obviously, all formatting is lost.
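The same idea can be sketched in Java using only the JDK, for cases where unzip and sed are not available. This is a minimal illustration, not part of the answer above: the class name DocxPlainText and its methods are made up for the example, it assumes Java 9+ (for readAllBytes), and it only reads w:t text runs from word/document.xml, so headers, footnotes, tabs and line-break elements are ignored. Unlike the sed one-liner, it emits a newline after each w:p paragraph.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: pulls plain text out of a .docx without external libraries.
public class DocxPlainText {
    // Namespace of WordprocessingML elements such as w:p and w:t.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Finds word/document.xml inside the zip container and returns its text.
    public static String extract(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                if (e.getName().equals("word/document.xml")) {
                    return textOf(zip.readAllBytes()); // reads only the current entry
                }
            }
        }
        throw new IOException("word/document.xml not found; not a .docx file?");
    }

    // Concatenates the w:t runs of each w:p paragraph, one paragraph per line.
    static String textOf(byte[] xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot trigger XXE.
        f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = f.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        StringBuilder out = new StringBuilder();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList runs = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                out.append(runs.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extract(in));
        }
    }
}
```

Run it as java DocxPlainText some.docx. Paragraphs nested inside other paragraphs (for example in text boxes) would have their text emitted twice by this sketch; a real extractor such as the ones in the other answers handles those cases.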

白龙吟 2024-11-08 16:16:22

LibreOffice

One option is libreoffice/openoffice in headless mode (make sure all other instances of libreoffice are closed first):

libreoffice --headless --convert-to "txt:Text (encoded):UTF8" mydocument.doc

For more details see e.g. this link: http://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/

For a list of libreoffice filters see http://cgit.freedesktop.org/libreoffice/core/tree/filter/source/config/fragments/filters

Since the openoffice command line syntax is a bit too complicated, there is a handy wrapper which can make the process easier: unoconv.

Apache POI

Another option is Apache POI, a well-supported Java library which, unlike antiword, can read, create and convert .doc, .docx, .xls, .xlsx, .ppt and .pptx files.

Here is the simplest possible Java code for converting a .doc or .docx document to plain text:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.xmlbeans.XmlException;

public class WordToTextConverter {
    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: java WordToTextConverter <word_file> <text_file>");
            return;
        }
        convertWordToText(args[0], args[1]);
    }

    public static void convertWordToText(String src, String dest) {
        // try-with-resources closes the input, the extractor and the writer,
        // even when extraction fails part-way through.
        try (FileInputStream in = new FileInputStream(src);
             POITextExtractor extractor = ExtractorFactory.createExtractor(in);
             // Write UTF-8 explicitly instead of relying on the platform default charset.
             Writer out = Files.newBufferedWriter(Paths.get(dest), StandardCharsets.UTF_8)) {
            out.write(extractor.getText());
        } catch (IOException | OpenXML4JException | XmlException e) {
            e.printStackTrace();
        }
    }
}


# Maven dependencies (pom.xml):

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>my.wordconv</groupId>
<artifactId>my.wordconv.converter</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>3.17</version>
    </dependency>
</dependencies>
</project>

NOTE: You will need to add the apache poi libraries to the classpath. On ubuntu/debian the libraries can be installed with sudo apt-get install libapache-poi-java — this will install them under /usr/share/java. For other systems you'll need to download the library and unpack the archive to a folder that you should use instead of /usr/share/java. If you use maven/gradle (the recommended option), then include the org.apache.poi dependencies as shown in the code snippet.

The same code will work for both .doc and .docx as the required converter implementation will be chosen by inspecting the binary stream.
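To make "inspecting the binary stream" concrete, here is a minimal sketch of how the two formats can be told apart from their first bytes. It is not POI's own code, only an illustration: the class WordFormatSniffer and its Format enum are invented for this example, and Java 9+ is assumed for readNBytes. Legacy .doc files are OLE2 compound documents starting with the bytes D0 CF 11 E0 A1 B1 1A E1, while .docx files are ZIP archives starting with "PK" followed by 03 04.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: guesses the container format from the file signature.
public class WordFormatSniffer {
    public enum Format { OLE2_DOC, OOXML_ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (.doc, and also .xls/.ppt).
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of a ZIP archive (.docx, and any other zip).
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    // Peeks at up to 8 bytes and rewinds, so the caller can still read the whole stream.
    public static Format sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(8);
        byte[] head = new byte[8];
        int n = in.readNBytes(head, 0, head.length);
        in.reset();
        if (startsWith(head, n, OLE2)) return Format.OLE2_DOC;
        if (startsWith(head, n, ZIP)) return Format.OOXML_ZIP;
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, int[] signature) {
        if (length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println(sniff(in));
        }
    }
}
```

The signature only identifies the container, not the document type: a ZIP header could equally belong to an .xlsx file or an ordinary archive, so a library still has to look inside before choosing an extractor.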

Compile the class above (assuming it's in the default package, and the apache poi jars are under /usr/share/java; the classpath is quoted so that the shell passes the * wildcard through to Java unexpanded):

javac -cp '/usr/share/java/*:.' WordToTextConverter.java

Run the conversion:

java -cp '/usr/share/java/*:.' WordToTextConverter doc.docx doc.txt

There is also a clonable Gradle project which pulls all necessary dependencies and generates the wrapper shell script (with gradle installDist).

为你拒绝所有暧昧 2024-11-08 16:16:22

Try Apache Tika. It supports most document formats (every MS Office format, OpenOffice/LibreOffice formats, PDF, etc.) using Java-based libraries (among others, Apache POI). It's very simple to use:

java -jar tika-app-1.4.jar --text ./my-document.doc
下壹個目標 2024-11-08 16:16:22

Try "antiword" or "antiword-xp-rb"

My favorite is antiword:

http://www.winfield.demon.nl/

And here's a similar project which claims support for docx:

https://github.com/rainey/antiword-xp-rb/wiki

披肩女神 2024-11-08 16:16:22

I find wv to be better than catdoc or antiword. It can deal with .docx and convert to text or html. Here is a function I added to my .bashrc to temporarily view the file in the terminal. Change it as required.

# open a Word document in less (i.e. worl document.doc)
worl() {
    local doc
    doc=$(mktemp /tmp/output.XXXXXXXXXX) || return
    # quote the arguments so file names with spaces work
    wvText "$1" "$doc"
    less "$doc"
    rm -f "$doc"
}
心如荒岛 2024-11-08 16:16:22

I recently dealt with this issue and found the OpenOffice/LibreOffice command-line tools to be unreliable in production (thousands of docs processed, dozens concurrently).

Ultimately, I built a lightweight wrapper, DocRipper, that is much faster and grabs all text from .doc, .docx and .pdf files without formatting. DocRipper uses Antiword, grep and pdftotext to grab the text and return it.
