How do I extract text from MS Word documents on a Linux server?

Posted on 2024-10-03 19:10:07

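I've been asked whether I can create a site where some users upload Microsoft Word documents and other users can then search for uploaded documents that contain certain keywords. The site would live on a Linux server running PHP and MySQL. I'm currently trying to find out whether, and how, I can extract the text from these documents. If anyone can suggest a good way to do this, it would be much appreciated.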


昔日梦未散 2024-10-10 19:10:07

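Extracting text from the newer docx format is very simple. The file itself is just a zip archive, and if you look inside one you will find a bunch of XML files. The text lives in word/document.xml inside that archive, and all of the text the user actually typed appears in <w:t> tags. If you extract all the text that appears in <w:t> tags, you have scraped the document.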
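
The steps above can be sketched in PHP roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the zip and dom extensions are available, the function names docx_file_to_text and docx_xml_to_text are made up for illustration, and it emits one line per paragraph (<w:p>) so that words from adjacent paragraphs do not run together.

```php
<?php
// WordprocessingML main namespace, i.e. what the "w:" prefix in document.xml refers to.
const WORD_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: return the text of all <w:t> elements, one line per top-level <w:p>.
function docx_xml_to_text(string $xml): string
{
    libxml_use_internal_errors(true);      // collect parse errors instead of emitting warnings
    $dom = new DOMDocument();
    if ($xml === '' || !$dom->loadXML($xml, LIBXML_NONET)) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', WORD_NS);

    $lines = [];
    // Skip paragraphs nested inside another paragraph (e.g. text boxes) so that
    // their text is not collected twice; it is picked up via the outer paragraph.
    foreach ($xpath->query('//w:p[not(ancestor::w:p)]') as $paragraph) {
        $text = '';
        foreach ($xpath->query('.//w:t', $paragraph) as $run) {
            $text .= $run->textContent;
        }
        $lines[] = $text;
    }
    return implode("\n", $lines);
}

// Hypothetical helper: open the .docx as a zip archive and read word/document.xml from it.
function docx_file_to_text(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;                       // not a readable zip archive
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? null : docx_xml_to_text($xml);
}
```

Calling docx_file_to_text() on an uploaded file's path would return its text, or null if the file is not a zip archive or has no word/document.xml. Tabs and line breaks (<w:tab>, <w:br>) are ignored here, and headers, footers and footnotes are stored in other XML parts of the archive, so a real indexer may want to read those as well.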


め七分饶幸 2024-10-10 19:10:07

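Here's a good example using catdoc:

function catdoc_string($str)
{
    // requires catdoc

    // write the document bytes to a temp file, since catdoc is given a file name
    $tmpfname = tempnam('/tmp', 'doc');
    if ($tmpfname === false || file_put_contents($tmpfname, $str) === false) {
        return false;
    }

    // run catdoc on the temp file
    $ret = catdoc_file($tmpfname);

    // remove temp file
    unlink($tmpfname);

    return $ret;
}

function catdoc_file($fname)
{
    // requires catdoc

    // run catdoc: -a selects ASCII output, -b also attempts broken Word files
    $ret = shell_exec('catdoc -ab ' . escapeshellarg($fname) . ' 2>&1');

    // because of 2>&1, the shell's "not found" message ends up in $ret when catdoc
    // is missing, e.g. "sh: line 1: catdoc: command not found" (bash)
    // or "sh: 1: catdoc: not found" (dash)
    if (preg_match('/^sh: (line )?\d+: catdoc: /i', (string) $ret)) {
        return false;
    }

    // shell_exec() returns null when there is no output, so cast before trimming
    return trim((string) $ret);
}

Source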
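
Since the two answers target different formats (zip-based .docx versus the legacy binary .doc that catdoc reads), a site accepting both needs some way to decide which route to take. One option, sketched below as an assumption rather than something either answer proposes, is to look at the first bytes of the upload: zip archives start with the bytes "PK\x03\x04", and legacy Office files start with the OLE2 compound-file signature. The function name guess_word_format is hypothetical.

```php
<?php
// Hypothetical helper: guess the Word format from the leading bytes of an upload.
function guess_word_format(string $bytes): string
{
    // ZIP local file header. A .docx is a zip container, but so are .xlsx, .odt, etc.,
    // so 'docx' here only means "zip-based, worth checking for word/document.xml".
    if (strncmp($bytes, "PK\x03\x04", 4) === 0) {
        return 'docx';
    }

    // OLE2 compound file signature, shared by legacy .doc, .xls and .ppt files.
    if (strncmp($bytes, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'doc';
    }

    return 'unknown';
}
```

In practice one might read the first eight bytes of the uploaded file and pass them in, then hand 'doc' results to catdoc and 'docx' results to a zip-based extractor. The check only narrows things down; it does not prove the file is a Word document.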
