Finding line breaks in a .docx file with PHP

Posted 2024-10-31 06:04:05

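My PHP script successfully reads all of the text from a .docx file, but I cannot figure out where the line breaks should go, so the text comes out bunched together as one huge paragraph that is hard to read. I have gone through all of the XML files in the archive by hand and still cannot work out what marks a line break.

Here are the functions I use to retrieve the file data and return the plain text.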

public function read($FilePath)
{
    // Save name of the file
    parent::SetDocName($FilePath);

    $Data = $this->docx2text($FilePath);

    $Data = str_replace("<", "&lt;", $Data);
    $Data = str_replace(">", "&gt;", $Data);

    $Breaks = array("\r\n", "\n", "\r");
    $Data = str_replace($Breaks, '<br />', $Data);

    $this->Content = $Data;
}

function docx2text($filename) {
    return $this->readZippedXML($filename, "word/document.xml");
}

function readZippedXML($archiveFile, $dataFile)
{
    // Create new ZIP archive
    $zip = new ZipArchive;

    // Open received archive file
    if (true === $zip->open($archiveFile))
    {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false)
        {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);

            // Close archive file
            $zip->close();

            // Load XML from a string
            // Skip errors and warnings
            // loadXML() is an instance method; calling it statically fails on PHP 8
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);

            $xmldata = $xml->saveXML();
            //$xmldata = str_replace("</w:t>", "\r\n", $xmldata);
            // Return data without XML formatting tags
            return strip_tags($xmldata);
        }

        $zip->close();
    }

    // In case of failure return empty string
    return "";
} 

Comments (2)

通知家属抬走 2024-11-07 06:04:05

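It is actually quite a simple answer. All you need to do is add this line in readZippedXML(), before the call to strip_tags():

$xmldata = str_replace("</w:p>", "\r\n", $xmldata);

This works because </w:p> is the tag Word uses to mark the end of a paragraph. For example: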

<w:p>This is a paragraph.</w:p>
<w:p>And a second one.</w:p>
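
To see where that line fits, here is a minimal sketch that applies the replacement before strip_tags() on a hand-written sample string. The helper name wordXmlToPlainText and the sample markup are hypothetical, and treating <w:br/> as a line break is an added assumption that the answer does not mention.

```php
<?php
// Hypothetical helper (not in the original post): takes the XML string that
// readZippedXML() would otherwise hand straight to strip_tags().
function wordXmlToPlainText(string $xmldata): string
{
    // Mark paragraph ends and simple line breaks before the tags disappear.
    // Assumption: the tags appear exactly as "</w:p>" and "<w:br/>"; a break
    // carrying attributes (for example a page break) is not matched here.
    $xmldata = str_replace(array("</w:p>", "<w:br/>"), "\r\n", $xmldata);

    // Remove the remaining markup, leaving the text and the inserted breaks.
    return strip_tags($xmldata);
}

// Hand-written sample that mimics the shape of word/document.xml.
$sample = '<w:body>'
    . '<w:p><w:r><w:t>This is a paragraph.</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>And a second one.</w:t></w:r></w:p>'
    . '</w:body>';

// Prints the two sentences on separate lines.
echo wordXmlToPlainText($sample);
```

In the question's readZippedXML(), the same replacement would sit between saveXML() and strip_tags(). One caveat, as far as I can tell: an empty paragraph that is serialized as a self-closing <w:p/> has no closing tag, so this approach would not produce a blank line for it.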

安稳善良 2024-11-07 06:04:05

Actually, why not use OpenXML? I think it works with PHP too. Then you don't have to go down into the nitty-gritty details of the XML files.

Here is a link:
http://openxmldeveloper.org/articles/4606.aspx
