Ruby's truncate fails to sanitize MS Word code

Posted 2024-09-08 12:18:08

Curious if anyone has ever noticed this: I have a WYSIWYG editor that users occasionally paste into from Word. There is a Word sanitizer, but not everyone's a genius.

If I output that text anywhere else, it comes out right. But if I truncate it, the MS Word markup appears.

Does anyone know why truncate un-sanitizes this, or how to sanitize and truncate at the same time?

UPDATE:

Here's an example of the MS Word markup that is displayed after I truncate:

<! [If Gte Mso 9]><Xml>  <Br /> <O:Office Document Settings>  <Br /> <O:Allow Png/>  <Br /> </O:Office Document Settings>  <Br /></Xml><![Endif] ><! [If Gte Mso 9]><Xml>  <Br /> <W:Word Document>  <Br /> <W:Zoom>0</W:Zoom>  <Br /> <W:Track Moves>False</W:Track Moves>  <Br /> <W:Track Formatting/>  <Br /> <W:Punctuation Kerning/>  <Br /> <W:Drawing Grid Horizontal Spacing>18 Pt</W:Drawing Grid Horizontal Spacing>  <Br /> <W:Drawing Grid Vertical Spacing>18 Pt</W:Drawing Grid Vertical Spacing>  <Br /> <W:Display Horizontal Drawing Grid Every>0</W:Display Horizontal Drawing Grid Every>  <Br /> <W:Display Vertical Drawing Grid Every>0</W:Display Vertical Drawing Grid Every>  <Br /> <W:Validate Against Schemas/>  <Br /> <W:Save If Xml Invalid>False</W:Save If Xml Invalid>  <Br /> <W:Ignore Mixed Content>False</W:Ignore Mixed Content>  <Br /> <W:Always Show Placeholder Text>False</W:Always Show Placeholder Text>  <Br /> <W:Compatibility>  <Br /> <W:Break Wrapped Tables/>  <Br /> <W:Dont Grow Autofit/>  <Br /> <W:Dont Autofit Constrained Tables/>  <Br /> <W:Dont Vert Align In Txbx/>  <Br /> </W:Compatibility>  <Br /> </W:Word Document>  <Br /></Xml><![Endif] ><! [If Gte Mso 9]><Xml>  <Br /> <W:Latent Styles Def Locked State="False" Latent Style Count="276">  <Br /> </W:Latent Styles>  <Br /></Xml><![Endif] >  <! 
{Cke Protected}%3 C!%2 D%2 D%7 Bcke Protected%7 D%253 C!%252 D%252 D%257 Bcke Protected%257 D%25253 C!%25252 D%25252 D%25257 Bcke Protected%25257 D%2525253 C!%2525252 D%2525252 D%2525257 Bcke Protected%2525257 D%252525253 C!%252525252 D%252525252 D%252525257 Bcke Protected%252525257 D%25252525253 C!%25252525252 D%25252525252 D%25252525257 Bcke Protected%25252525257 D%2525252525253 C!%2525252525252 D%2525252525252 D%2525252525250 A%25252525252520%2525252525252 F*%25252525252520 Font%25252525252520 Definitions%25252525252520*%2525252525252 F%2525252525250 A%25252525252540font Face%2525252525250 A%25252525252509%2525252525257 Bfont Family%2525252525253 A Times%2525252525253 B%2525252525250 A%25252525252509panose 1%2525252525253 A2%252525252525200%252525252525205%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%2525252525253 B%2525252525250 A%25252525252509mso Font Charset%2525252525253 A0%2525252525253 B%2525252525250 A%25252525252509mso Generic Font Family%2525252525253 Aauto%2525252525253 B%2525252525250 A%25252525252509mso Font Pitch%2525252525253 Avariable%2525252525253 B%2525252525250 A%25252525252509mso Font Signature%2525252525253 A3%252525252525200%252525252525200%252525252525200%252525252525201%252525252525200%2525252525253 B%2525252525257 D%2525252525250 A%25252525252540font Face%2525252525250 A%25252525252509%2525252525257 Bfont Family%2525252525253 A Verdana%2525252525253 B%2525252525250 A%25252525252509panose 1%2525252525253 A2%2525252525252011%252525252525206%252525252525204%25

The whole thing is about 600 characters long. This is the first 200 or so:

“Excellent” – The New York Times            

“4 Stars”  - The Star-Ledger                                                                       

“Best Romantic Restaurant” – Suburban Essex

“Best View” – OpenTable



In December 1986, the Knowles opened Highlawn after months of restoration to the former open-air “casino” which had, along with the now-prosperous park, been neglected for several years.

Here's a custom sanitizer I made with the help of Stack Overflow:

def sanitized_text(text)
  # Remove anything that looks like a tag: "<", any run of non-">" characters, ">"
  text.gsub(/<[^>]*>/, '')
end

The trouble with this sanitizer is that it returns only whitespace after I truncate to 125 characters. When I expanded the limit to 600 characters, I got a single line that is another MS Word conditional statement.

Update: this is the code that produces the MS Word content:

 = truncate(organization.about_us, 125)

Note that when I just put this:

 = organization.about_us

It comes out fine, but of course not truncated.

I should also add that this is Ruby 1.8.7 / Rails 2.3.5.

Comments (1)

沒落の蓅哖 2024-09-15 12:18:08

Truncating HTML is always a real hassle because you can end up splitting tags and entities. Without proper UTF-8 handling, you also run the risk of chopping a multi-byte character in half.
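To make the tag-splitting problem concrete, here is a minimal plain-Ruby sketch. The helper name `strip_then_truncate` is made up for illustration and is not a Rails method; it assumes Ruby 1.9 or later, where strings are character-aware (on 1.8.7 you would need something like Rails' `mb_chars` instead).

```ruby
# Hypothetical helper (not part of Rails): strip tags first, then cut by
# characters, so the cut can never land inside a tag.
def strip_then_truncate(html, length, omission = '...')
  text  = html.gsub(/<[^>]*>/, '')             # drop anything that looks like a tag
  chars = text.chars.to_a                      # character-aware on Ruby 1.9+
  return text if chars.length <= length
  keep = [length - omission.length, 0].max     # never ask for a negative count
  chars.first(keep).join + omission
end

html = '<p>Best <strong>View</strong> in town</p>'

# Cutting the raw markup can stop in the middle of a tag:
html[0, 12]                    # => "<p>Best <str"

# Stripping first keeps the excerpt clean:
strip_then_truncate(html, 12)  # => "Best View..."
```

This sketch does not decode or protect entities such as `&amp;`, so a cut can still fall inside one.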

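Another thing to watch out for is overly greedy regular expressions:

def sanitized_text(text)
  text.gsub(/<[^>]*?>/, '')
end

The *? matches as little as possible, whereas * matches as much as possible. With a negated class such as [^>] the two behave the same, because the match cannot run past the first >; the difference matters for a pattern like <.*>.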

For instance:

<A><B>

A greedy pattern such as <.*> groups this into "<", "A><B", and ">", treating the whole string as a single tag instead of two.
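A quick way to see the difference is to run the candidate patterns over that string in irb. This is only an illustration; the variable names are arbitrary.

```ruby
sample = '<A><B>'

greedy  = sample.scan(/<.*>/)     # => ["<A><B>"]      one match spanning both tags
lazy    = sample.scan(/<.*?>/)    # => ["<A>", "<B>"]  stops at the first ">"
negated = sample.scan(/<[^>]*>/)  # => ["<A>", "<B>"]  cannot cross a ">" at all

# With real markup, the greedy form also swallows the text between tags:
eaten = 'a <b>bold</b> z'.gsub(/<.*>/, '')     # => "a  z"
kept  = 'a <b>bold</b> z'.gsub(/<[^>]*>/, '')  # => "a bold z"
```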

Edit: I've tried to reproduce this and had no luck.

With this example, using your text pasted in and sanitized, everything appears to be okay.

# app/controllers/example_controller.rb
class ExampleController < ApplicationController
  def index
    @text = '<! [If Gte Mso 9]><Xml>  <Br /> <O:Office Document Settings>  <Br /> <O:Allow Png/>  <Br /> </O:Office Document Settings>  <Br /></Xml><![Endif] ><! [If Gte Mso 9]><Xml>  <Br /> <W:Word Document>  <Br /> <W:Zoom>0</W:Zoom>  <Br /> <W:Track Moves>False</W:Track Moves>  <Br /> <W:Track Formatting/>  <Br /> <W:Punctuation Kerning/>  <Br /> <W:Drawing Grid Horizontal Spacing>18 Pt</W:Drawing Grid Horizontal Spacing>  <Br /> <W:Drawing Grid Vertical Spacing>18 Pt</W:Drawing Grid Vertical Spacing>  <Br /> <W:Display Horizontal Drawing Grid Every>0</W:Display Horizontal Drawing Grid Every>  <Br /> <W:Display Vertical Drawing Grid Every>0</W:Display Vertical Drawing Grid Every>  <Br /> <W:Validate Against Schemas/>  <Br /> <W:Save If Xml Invalid>False</W:Save If Xml Invalid>  <Br /> <W:Ignore Mixed Content>False</W:Ignore Mixed Content>  <Br /> <W:Always Show Placeholder Text>False</W:Always Show Placeholder Text>  <Br /> <W:Compatibility>  <Br /> <W:Break Wrapped Tables/>  <Br /> <W:Dont Grow Autofit/>  <Br /> <W:Dont Autofit Constrained Tables/>  <Br /> <W:Dont Vert Align In Txbx/>  <Br /> </W:Compatibility>  <Br /> </W:Word Document>  <Br /></Xml><![Endif] ><! [If Gte Mso 9]><Xml>  <Br /> <W:Latent Styles Def Locked State="False" Latent Style Count="276">  <Br /> </W:Latent Styles>  <Br /></Xml><![Endif] >  <! 
{Cke Protected}%3 C!%2 D%2 D%7 Bcke Protected%7 D%253 C!%252 D%252 D%257 Bcke Protected%257 D%25253 C!%25252 D%25252 D%25257 Bcke Protected%25257 D%2525253 C!%2525252 D%2525252 D%2525257 Bcke Protected%2525257 D%252525253 C!%252525252 D%252525252 D%252525257 Bcke Protected%252525257 D%25252525253 C!%25252525252 D%25252525252 D%25252525257 Bcke Protected%25252525257 D%2525252525253 C!%2525252525252 D%2525252525252 D%2525252525250 A%25252525252520%2525252525252 F*%25252525252520 Font%25252525252520 Definitions%25252525252520*%2525252525252 F%2525252525250 A%25252525252540font Face%2525252525250 A%25252525252509%2525252525257 Bfont Family%2525252525253 A Times%2525252525253 B%2525252525250 A%25252525252509panose 1%2525252525253 A2%252525252525200%252525252525205%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%2525252525253 B%2525252525250 A%25252525252509mso Font Charset%2525252525253 A0%2525252525253 B%2525252525250 A%25252525252509mso Generic Font Family%2525252525253 Aauto%2525252525253 B%2525252525250 A%25252525252509mso Font Pitch%2525252525253 Avariable%2525252525253 B%2525252525250 A%25252525252509mso Font Signature%2525252525253 A3%252525252525200%252525252525200%252525252525200%252525252525201%252525252525200%2525252525253 B%2525252525257 D%2525252525250 A%25252525252540font Face%2525252525250 A%25252525252509%2525252525257 Bfont Family%2525252525253 A Verdana%2525252525253 B%2525252525250 A%25252525252509panose 1%2525252525253 A2%2525252525252011%252525252525206%252525252525204%2'
  end
end

# app/helpers/example_helper.rb
module ExampleHelper
  def sanitized_text(text)
    text.gsub(/<[^>]*>/, '')
  end
end

The view itself is pretty much what you have:

<!-- app/views/example/index.html.erb -->
<body>
  <strong>Original</strong>
  <div>
    <%= sanitized_text(@text) %>
  </div>
  <strong>Truncated</strong>
  <div>
    <%= truncate(sanitized_text(@text), :length => 125) %>
  </div>
  <strong>Truncated With Deprecated Option</strong>
  <div>
    <%= truncate(sanitized_text(@text), 125) %>
  </div>
</body>

This was on OS X with Ruby 1.8.7p174, Rails 2.3.5 using WEBrick to test.
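If the stored text still contains Word's conditional comments and style blocks, stripping tags alone leaves their inner text (values such as "0" and "False", font definitions, and a lot of whitespace) at the front of the string, which may be why a short excerpt looks blank. That is a guess, not something I could reproduce. A rough sketch that removes those blocks wholesale before truncating follows; the helper name `plain_excerpt` and its list of patterns are assumptions of mine, and a regex-based cleanup like this will not handle every kind of Word output.

```ruby
# Hypothetical helper: the name and the patterns below are illustrative only.
def plain_excerpt(html, length = 125, omission = '...')
  text = html.dup
  text.gsub!(/<!--.*?-->/m, '')                      # comments, incl. <!--[if gte mso 9]>...<![endif]-->
  text.gsub!(/<(style|xml)\b[^>]*>.*?<\/\1>/mi, '')  # whole <style>/<xml> blocks with their contents
  text.gsub!(/<[^>]*>/, '')                          # any remaining tags
  text = text.gsub(/\s+/, ' ').strip                 # collapse the leftover whitespace
  return text if text.length <= length
  text[0, [length - omission.length, 0].max] + omission
end

word = "<!--[if gte mso 9]><xml><w:Zoom>0</w:Zoom></xml><![endif]-->" \
       "<style>p { font-family: Times; }</style>" \
       "<p>In December 1986, the Knowles opened Highlawn.</p>"

plain_excerpt(word, 30)  # => "In December 1986, the Knowl..."
```

In a view, the result could be output in place of the truncate(...) call, since the helper already does the truncation.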
