JSoup - Quotes inside attributes

Posted 2024-12-13 04:10:23

I'm using JSoup in an attempt to build valid XML from a couple of websites. Most of the time it has worked phenomenally well, but recently I've encountered some cases of bad HTML that JSoup can't seem to fix.

<meta name="saploTags" content="Tag1,Tag2,Tag3," Tag4,Tag5,Tag6"/>

Results in

<meta name="saploTags" content="Tag1,Tag2,Tag3," tag4,tag5,tag6"="" />

This causes problems later on when I try to index the resulting XML. Does anyone have suggestions for how to handle this? Preferably, everything between the leftmost and rightmost quotation marks would be escaped or removed in some way to prevent data loss (like content="Tag1,Tag2,Tag3,Tag4,Tag5,Tag6"). Otherwise, it would be OK if JSoup cut the value off at the first "end quote" and discarded the remaining tags, like content="Tag1,Tag2,Tag3".

(A similar case I've found is e.g. <img src=".." alt="This text contains the quote "The quote" and here's some more text"/>, which breaks in the same way.)

Is it possible to get around this with jsoup, or have I reached a dead end?

/Regards, Magnus
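
One way to get the "escape everything between the leftmost and rightmost quotation marks" behaviour described in the question is to repair the raw markup before handing it to jsoup. The following is only a minimal sketch in plain Java: `AttributeQuoteRepair` and `escapeInnerQuotes` are hypothetical names, not part of jsoup, and the sketch assumes the broken attribute is the last attribute in its tag and that the tag sits on a single line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeQuoteRepair {

    // Hypothetical helper (not a jsoup API): escapes stray double quotes inside
    // the value of the named attribute.
    // ASSUMPTION: the attribute is the LAST one in its tag, so the value runs
    // from the first quote after '=' to the last quote before '>' or '/>'.
    static String escapeInnerQuotes(String html, String attributeName) {
        Pattern pattern = Pattern.compile(
                "(\\b" + Pattern.quote(attributeName) + "\\s*=\\s*\")(.*?)(\"\\s*/?>)",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Turn every quote inside the value into a character reference.
            String value = matcher.group(2).replace("\"", "&quot;");
            String repaired = matcher.group(1) + value + matcher.group(3);
            matcher.appendReplacement(out, Matcher.quoteReplacement(repaired));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String broken =
                "<meta name=\"saploTags\" content=\"Tag1,Tag2,Tag3,\" Tag4,Tag5,Tag6\"/>";
        System.out.println(escapeInnerQuotes(broken, "content"));
    }
}
```

The repaired string can then be passed to the parser as usual. If the attribute is not the last one in the tag, this pattern would swallow the following attributes into the value, so it is a targeted workaround for known-bad inputs rather than a general fix.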

℉服软 2024-12-20 04:10:23

That's quite simply neither valid XML nor valid HTML. Those double quotes should be turned into character references (&quot;) if they're to be considered part of the attribute value. Even if you could set a parser to be very lenient, it's not gonna be able to solve this, because it is no longer clear where the attribute content ends.
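
For whoever generates the markup, turning quotes into character references could look roughly like this. It is a sketch only: `AttributeEscaper` and `escapeAttribute` are hypothetical names chosen for illustration, and the example assumes attribute values are always wrapped in double quotes.

```java
class AttributeEscaper {

    // Hypothetical helper: escapes the characters that must not appear raw
    // inside a double-quoted attribute value.
    static String escapeAttribute(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String alt = "This text contains the quote \"The quote\" and here's some more text";
        // The escaped value can be embedded between double quotes safely.
        System.out.println("<img src=\"..\" alt=\"" + escapeAttribute(alt) + "\"/>");
    }
}
```

With the quotes escaped at the source, the end of the attribute value is unambiguous again and any parser can read it back.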

Trying to automatically fix this seems rather difficult. There are all sorts of corner cases that'll wreak havoc on any sort of solution. How's this supposed to be interpreted, for example:

<element attribute="this isn't "quite" the=correct way="to=" do things"" />

Look at how the SO code formatter struggles with it.

Even making sense of this yourself is difficult, let alone writing a tool that's gonna make sense of what is or isn't attribute content.

Simple approach? Just don't accept invalid HTML. It's lenient enough as it is, with most parsers allowing lower case and upper case element names, closing tags not always being mandatory etc. If people still manage to generate invalid HTML, then too bad for them.
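
If the goal is valid XML anyway, "not accepting" bad input could be as simple as running each fragment through a strict XML parser and rejecting whatever fails. A minimal sketch using only the JDK's built-in parser follows; `StrictMarkupCheck` and `isWellFormed` are hypothetical names, and the sketch assumes the input is XHTML-style markup with a single root element (ordinary HTML with unclosed tags would be rejected too).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictMarkupCheck {

    // Hypothetical helper: returns true only if the fragment is well-formed XML.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser could not make sense of the markup: reject it.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not the input's fault.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String broken =
                "<meta name=\"saploTags\" content=\"Tag1,Tag2,Tag3,\" Tag4,Tag5,Tag6\"/>";
        System.out.println(isWellFormed(broken));
    }
}
```

Rejected fragments can then be logged or sent back to whoever produced them instead of being guessed at.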
