
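Extracting text from doc and docx

Posted on 2024-10-30 13:25:55

I would like to know how to read the contents of a doc or docx file. I am using a Linux VPS and PHP, but if there is a simpler solution in another language, please let me know, as long as it works on a Linux web server.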


温馨耳语 2024-11-06 13:25:55

Here is a solution for getting the text out of .doc and .docx Word files.

How to extract text from .doc and .docx Word files in PHP

For .doc

private function read_doc() {
    // Read the whole file as raw bytes; return false if it cannot be read
    $data = @file_get_contents($this->filename);
    if ($data === false) return false;

    // Split on carriage returns, which Word uses as paragraph marks
    $lines = explode(chr(0x0D), $data);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty chunks and chunks with NUL bytes (binary data or UTF-16 text)
        if (strlen($thisline) == 0 || strpos($thisline, chr(0x00)) !== false) {
            continue;
        }
        $outtext .= $thisline . " ";
    }
    // Keep only basic ASCII characters; this also drops any non-English text
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}

For .docx

private function read_docx() {
    $content = '';

    // Note: the zip_* functions are deprecated as of PHP 8.0; ZipArchive is the replacement
    $zip = zip_open($this->filename);

    // zip_open() returns an error number on failure
    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        // Only the main document part is needed; check the name before opening the entry
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }

    zip_close($zip);

    // Separate table cells with a space and paragraphs with a line break
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);

    return strip_tags($content);
}


澉约 2024-11-06 13:25:55

This is a .DOCX-only solution. For .DOC or .PDF you'll need something else, such as pdf2text.php for PDF.

function docx2text($filename) {
    return readZippedXML($filename, "word/document.xml");
}

function readZippedXML($archiveFile, $dataFile) {
    // Create new ZIP archive
    $zip = new ZipArchive;

    // Open received archive file
    if (true === $zip->open($archiveFile)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string, skipping errors and warnings.
            // Entity substitution (LIBXML_NOENT) and XInclude are left off:
            // they are not needed here and are unsafe on untrusted uploads.
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags
            return strip_tags($xml->saveXML());
        }
        $zip->close();
    }

    // In case of failure return empty string
    return "";
}

echo docx2text("test.docx"); // Print the extracted text (or save it to a file)
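Because strip_tags() removes the markup without putting anything in its place, text from adjacent paragraphs can end up glued together. A minimal sketch of one way around that is to walk the paragraphs with DOMDocument and join them with line breaks. The function name wordXmlToText and the sample XML are hypothetical, the sketch assumes the standard WordprocessingML namespace, and it does not try to handle nested paragraphs such as those inside text boxes.

```php
<?php
// Hypothetical helper: turns the XML of word/document.xml into plain text,
// one line per paragraph. Assumes the standard WordprocessingML namespace.
function wordXmlToText(string $xml): string
{
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $doc = new DOMDocument();
    // LIBXML_NOENT is deliberately not set, so entities are not substituted.
    if (!@$doc->loadXML($xml, LIBXML_NONET)) {
        return '';
    }

    $lines = [];
    foreach ($doc->getElementsByTagNameNS($ns, 'p') as $paragraph) {
        $text = '';
        // w:t holds the text of a run; w:tab and w:br stand for whitespace
        foreach ($paragraph->getElementsByTagNameNS($ns, '*') as $node) {
            switch ($node->localName) {
                case 't':
                    $text .= $node->textContent;
                    break;
                case 'tab':
                    $text .= "\t";
                    break;
                case 'br':
                    $text .= "\n";
                    break;
            }
        }
        $lines[] = $text;
    }

    return implode("\n", $lines);
}

// Made-up sample: two paragraphs, the second split across two runs
$sample = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    . '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>wor</w:t></w:r><w:r><w:t>ld</w:t></w:r></w:p></w:body></w:document>';

echo wordXmlToText($sample), "\n";
```

To use it on a real file, pass in the string read from word/document.xml, for example the value returned by ZipArchive::getFromName().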


破晓 2024-11-06 13:25:55

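Parsing .docx, .odt, .doc and .rtf documents

I wrote a library that parses docx, odt and rtf documents, based on the answers here and elsewhere.

The main improvement I made to .docx and .odt parsing is that the library processes the XML describing the document and tries to map it to HTML tags, i.e. em and strong tags. This means that if you use the library in a CMS, the text formatting is not lost.

You can get it here.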


无语# 2024-11-06 13:25:55

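My solution is Antiword for .doc and docx2txt for .docx.

Assuming a Linux server that you control, download each one, extract it, then install it. I installed each one system-wide:

Antiword: make global_install
docx2txt: make install

Then use these tools to extract the text into a string in PHP:

//for .doc
$text = shell_exec('/usr/local/bin/antiword -w 0 ' . 
    escapeshellarg($docFilePath));

//for .docx
$text = shell_exec('/usr/local/bin/docx2txt.pl ' . 
    escapeshellarg($docxFilePath) . ' -');

docx2txt requires Perl.

no_freedom's solution does extract text from docx files, but it can butcher whitespace. Most files I tested had places where words that should be separated had no space between them. That is not good when you want to run a full-text search over the documents you're processing.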
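The two commands above can be wrapped in a small dispatcher that picks the tool from the file extension. This is only a sketch: the function names buildExtractCommand and extractText are hypothetical, and the binary paths are simply the ones used in this answer, so adjust them to wherever the tools are installed.

```php
<?php
// Hypothetical helper: builds the shell command for a given file,
// or returns null when the extension is not handled.
// The binary paths are the ones used in the answer above.
function buildExtractCommand(string $path): ?string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    switch ($ext) {
        case 'doc':
            return '/usr/local/bin/antiword -w 0 ' . escapeshellarg($path);
        case 'docx':
            return '/usr/local/bin/docx2txt.pl ' . escapeshellarg($path) . ' -';
        default:
            return null;
    }
}

// Hypothetical wrapper: returns the extracted text, or null on failure.
function extractText(string $path): ?string
{
    $command = buildExtractCommand($path);
    if ($command === null || !is_readable($path)) {
        return null;
    }
    $output = shell_exec($command);
    // shell_exec() does not return a string when the command produced no output
    return is_string($output) ? $output : null;
}

// Show the command that would be run for a path containing a space
echo buildExtractCommand('/tmp/My Report.DOCX'), "\n";
```

Keeping the command construction separate from the shell_exec() call makes it possible to check the quoting without having either tool installed.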


囍孤女 2024-11-06 13:25:55

Try Apache POI. It is a Java library. I assume you won't have any trouble installing Java on Linux.


甜是你 2024-11-06 13:25:55

I suggest using Apache Tika to extract the text. It can extract content from many file types, such as .doc, .docx and PDF.
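From PHP, one way to call Tika is through its command-line app, whose --text option prints the plain-text content of a file. The sketch below only builds the command; the function name tikaTextCommand and the jar location /opt/tika/tika-app.jar are assumptions, so point the path at your own download of the tika-app jar.

```php
<?php
// Hypothetical helper: builds a Tika CLI call that prints plain text.
// $jarPath is an assumption; point it at wherever tika-app is installed.
function tikaTextCommand(string $file, string $jarPath = '/opt/tika/tika-app.jar'): string
{
    return 'java -jar ' . escapeshellarg($jarPath)
        . ' --text ' . escapeshellarg($file);
}

$command = tikaTextCommand('/tmp/report.docx');
echo $command, "\n";

// Run it once Java and the jar are in place:
// $text = shell_exec($command);
```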


黎夕旧梦 2024-11-06 13:25:55

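I used docx2txt to extract the content of docx files. My code is as follows:

if($extention == "docx")
{   
    $docxFilePath = "/var/www/vhosts/abc.com/httpdocs/writers/filename.docx";
    // Keep the command on one line: a line break inside the string would split it into two shell commands
    $content = shell_exec('/var/www/vhosts/abc.com/httpdocs/docx2txt/docx2txt.pl '
        . escapeshellarg($docxFilePath) . ' -');
}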


‖放下 2024-11-06 13:25:55

I made a few small improvements to the doc-to-txt converter function:

private function read_doc() {
    $line_array = array();
    // Read the whole file as raw bytes; return false if it cannot be read
    $data = @file_get_contents( $this->filename );
    if ( $data === false ) {
        return false;
    }
    $lines = explode( chr( 0x0D ), $data );
    foreach ( $lines as $thisline ) {
        // Skip chunks containing NUL bytes; empty chunks are kept as blank lines
        if ( strpos( $thisline, chr( 0x00 ) ) === false ) {
            $line_array[] = preg_replace( "/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $thisline );
        }
    }

    return implode( "\n", $line_array );
}

Now it preserves empty lines, and the resulting txt file reads line by line.
