Finding line breaks in a docx file using PHP

Posted on 2024-10-31 06:04:05

My PHP script successfully reads all text from a .docx file, but I cannot figure out where the line breaks should be so it makes the text bunched up and hard to read (one huge paragraph). I have manually gone over all of the XML files to try and figure it out but I cannot figure it out.

Here are the functions I use to retrieve the file data and return the plain text.

public function read($FilePath)
{
    // Save name of the file
    parent::SetDocName($FilePath);

    $Data = $this->docx2text($FilePath);

    $Data = str_replace("<", "&lt;", $Data);
    $Data = str_replace(">", "&gt;", $Data);

    $Breaks = array("\r\n", "\n", "\r");
    $Data = str_replace($Breaks, '<br />', $Data);

    $this->Content = $Data;
}

function docx2text($filename) {
    return $this->readZippedXML($filename, "word/document.xml");
}

function readZippedXML($archiveFile, $dataFile)
{
    // Create new ZIP archive
    $zip = new ZipArchive;

    // Open received archive file
    if (true === $zip->open($archiveFile))
    {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false)
        {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);

            // Close archive file
            $zip->close();

            // Load XML from a string
            // Skip errors and warnings
            // loadXML() is an instance method; calling it statically throws an Error in PHP 8
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);

            $xmldata = $xml->saveXML();
            //$xmldata = str_replace("</w:t>", "\r\n", $xmldata);
            // Return data without XML formatting tags
            return strip_tags($xmldata);
        }

        $zip->close();
    }

    // In case of failure return empty string
    return "";
}

通知家属抬走 2024-11-07 06:04:05

It is actually quite a simple answer. All you need to do is add this line in readZippedXML(), before the call to strip_tags():

$xmldata = str_replace("</w:p>", "\r\n", $xmldata);

This is because </w:p> is what Word uses to mark the end of a paragraph. For example (simplified; in a real document the text sits inside nested <w:r> and <w:t> elements):

<w:p>This is a paragraph.</w:p>
<w:p>And a second one.</w:p>
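
The same idea can also be sketched with a DOM walk instead of a string replace, which makes it possible to pick up manual line breaks (<w:br/>) and tabs inside a paragraph as well. This is only a rough sketch under some assumptions: documentXmlToText() is a hypothetical helper name that does not appear in the question, it expects the raw contents of word/document.xml as a string, and it ignores edge cases such as paragraphs nested inside text boxes.

```php
<?php
// Hypothetical helper (not part of the original post): turns the XML of
// word/document.xml into plain text with one line per <w:p> paragraph.
function documentXmlToText(string $xmlString): string
{
    $doc = new DOMDocument();
    // loadXML() rejects an empty string, so guard against it first.
    if ($xmlString === '' || !$doc->loadXML($xmlString, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return "";
    }

    $xpath = new DOMXPath($doc);
    // Main WordprocessingML namespace, bound to the usual "w" prefix.
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $lines = [];
    foreach ($xpath->query('//w:p') as $paragraph) {
        $line = '';
        // Only look at children of runs (<w:r>), so tab-stop definitions
        // in the paragraph properties are not mistaken for tab characters.
        $nodes = $xpath->query('.//w:r/w:t | .//w:r/w:br | .//w:r/w:tab', $paragraph);
        foreach ($nodes as $node) {
            switch ($node->localName) {
                case 't':
                    $line .= $node->textContent;
                    break;
                case 'br':
                    $line .= "\r\n";
                    break;
                case 'tab':
                    $line .= "\t";
                    break;
            }
        }
        $lines[] = $line;
    }

    // Same separator as the str_replace() line above.
    return implode("\r\n", $lines);
}
```

In readZippedXML() this would presumably take the place of the saveXML() and strip_tags() steps, for example return documentXmlToText($data);. The existing read() method would still convert the resulting "\r\n" sequences to <br />.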

安稳善良 2024-11-07 06:04:05

Actually, why not use OpenXML? I believe it works with PHP too. That way you do not have to dig into the details of the XML files.

Here is a link:
http://openxmldeveloper.org/articles/4606.aspx
