How to edit a docx with nokogiri and rubyzip

Posted on 2024-09-26 05:47:20

I'm using a combination of rubyzip and nokogiri to edit a .docx file. I use rubyzip to unzip the .docx file and nokogiri to parse and change the body of word/document.xml, but every time I close rubyzip at the end, it corrupts the file and I can't open or repair it. When I unzip the resulting .docx on my desktop and check word/document.xml, the content has been updated to what I changed it to, but all the other files are messed up. Could someone help me with this issue? Here is my code:

require 'rubygems'  
require 'zip/zip'  
require 'nokogiri'  
zip = Zip::ZipFile.open("test.docx")  
doc = zip.find_entry("word/document.xml")  
xml = Nokogiri::XML.parse(doc.get_input_stream)  
wt = xml.root.xpath("//w:t", {"w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}).first  
wt.content = "New Text"  
zip.get_output_stream("word/document.xml") {|f| f << xml.to_s}  
zip.close

Comments (3)

夜雨飘雪 2024-10-03 05:47:20

I ran into the same corruption problem with rubyzip last night. I solved it by copying everything to a new zip file, replacing files as necessary.

Here's my working proof of concept:

#!/usr/bin/env ruby

require 'rubygems'
require 'zip/zip' # rubyzip gem before 1.0; later releases use require 'zip' and Zip::File
require 'nokogiri'

class WordXmlFile
  def self.open(path, &block)
    self.new(path, &block)
  end

  def initialize(path, &block)
    @replace = {}
    if block_given?
      @zip = Zip::ZipFile.open(path)
      yield(self)
      @zip.close
    else
      @zip = Zip::ZipFile.open(path)
    end
  end

  def merge(rec)
    xml = @zip.read("word/document.xml")
    doc = Nokogiri::XML(xml) {|x| x.noent}
    (doc/"//w:fldSimple").each do |field|
      if field.attributes['instr'].value =~ /MERGEFIELD (\S+)/
        text_node = (field/".//w:t").first
        if text_node
          text_node.content = rec[$1].to_s # content= escapes & and <, unlike inner_html=
        else
          puts "No text node for #{$1}"
        end
      end
    end
    @replace["word/document.xml"] = doc.serialize :save_with => 0
  end

  def save(path)
    Zip::ZipFile.open(path, Zip::ZipFile::CREATE) do |out|
      @zip.each do |entry|
        out.get_output_stream(entry.name) do |o|
          if @replace[entry.name]
            o.write(@replace[entry.name])
          else
            o.write(@zip.read(entry.name))
          end
        end
      end
    end
    @zip.close
  end
end

if __FILE__ == $0
  file = ARGV[0]
  out_file = ARGV[1] || file.sub(/\.docx/, ' Merged.docx')
  w = WordXmlFile.open(file)
  # The original called w.force_settings here, but the class defines no such method.
  w.merge('First_Name' => 'Eric', 'Last_Name' => 'Mason')
  w.save(out_file)
end

蓝眸 2024-10-03 05:47:20

I stumbled across this post and know nothing about Ruby or nokogiri, but...

It looks like you are rezipping the new content incorrectly.
I don't know rubyzip, but you need a way to tell it to update the entry word/document.xml
and then resave/rezip the file.

It looks like you are just overwriting the entry with new data, which of course is going to be a different size and will totally screw up the rest of the zip file.

I give an example for Excel in this post: Parse text file and create an excel report

It may be of use even though I am using a different zip library and VB (I'm still doing exactly what you are trying to do; my code is about halfway down).

Here is the part that applies:

Using z As ZipFile = ZipFile.Read(xlStream.BaseStream) 
'Grab Sheet 1 out of the file parts and read it into a string. 
Dim myEntry As ZipEntry = z("xl/worksheets/sheet1.xml") 
Dim msSheet1 As New MemoryStream 
myEntry.Extract(msSheet1) 
msSheet1.Position = 0 
Dim sr As New StreamReader(msSheet1) 
Dim strXMLData As String = sr.ReadToEnd 

'Grab the data in the empty sheet and swap out the data that I want  
Dim str2 As XElement = CreateSheetData(tbl) 
Dim strReplace As String = strXMLData.Replace("<sheetData/>", str2.ToString) 
z.UpdateEntry("xl/worksheets/sheet1.xml", strReplace) 
'This just rezips the file with the new data; it doesn't save to disk
z.Save(fiRet.FullName) 
End Using

如梦亦如幻 2024-10-03 05:47:20

According to the official GitHub documentation, you should use write_buffer instead of open. There's also a code example at the link.
