How do I parse inside HTML tags with REBOL?
I have a web page that I've loaded with load/markup. I need to parse a bunch of stuff out of it, but some of the data is in the tags. Any ideas of how I can parse it? Here's a sample of what I've got (and tried) so far:
REBOL []

mess: {
<td>Bob Sockaway</td>
<td><a href=mailto:[email protected]>[email protected]</a></td>
<td>9999</td>
}

rules: [
    some [
        ; The expression below /will/ work, but is useless because of specificity.
        ; <td> <a href=mailto:[email protected]> s: string! </a> (print s/1) </td> |
        ; The expression below will not work, because <a> doesn't match <a href=mailto:...>
        ; <td> <a> s: string! </a> (print s/1) </td> |
        <td> s: string! (print s/1) </td> |
        tag! | string! ; Catch any leftovers.
    ]
]

parse load/markup mess rules
This produces:
Bob Sockaway
9999
I would like to see something more like:
Bob Sockaway
[email protected]
9999
Any thoughts? Thanks!
Note! For what it's worth, I came up with a good simple ruleset that will get the desired results:
rules: [
    some [
        <td> any [tag!] s: string! (print s/1) any [tag!] </td> |
        tag! | string! ; Catch any leftovers.
    ]
]
Comments (2)
When mess is processed with LOAD/MARKUP, you get a block of tag! and string! values. Your output pattern matches series of the form [<td> string! </td>] but not things of the form [<td> tag! string! tag! </td>].

Sidestepping the question posed in your title, you could solve this particular dilemma several ways. One might be to maintain a count of whether you are inside a TD tag and print any strings when the count is non-zero. This produces the output you asked for.
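A minimal sketch of the counting approach, assuming Rebol 2 block parsing. The word td-count and the address bob@example.com are placeholders chosen for this illustration, not taken from the question:

```rebol
REBOL []

; Hypothetical sample data; bob@example.com is a placeholder address.
mess: {
<td>Bob Sockaway</td>
<td><a href=mailto:bob@example.com>bob@example.com</a></td>
<td>9999</td>
}

td-count: 0  ; number of <td> tags currently open (illustrative name)

rules: [
    some [
        ; entering a cell: bump the counter
        <td> (td-count: td-count + 1)
        |
        ; leaving a cell: drop the counter
        </td> (td-count: td-count - 1)
        |
        ; s marks the parse position, so s/1 is the string just matched;
        ; print it only while we are inside a cell
        s: string! (if td-count > 0 [print s/1])
        |
        ; any other tag, such as <a ...> or </a>: match it and move on
        tag!
    ]
]

parse load/markup mess rules
```

Because the newline strings between the rows are seen while the counter is zero, only the three cell values should be printed. The rule never needs to know which tags sit between <td> and the text, which is what made the original pattern too specific.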
But you also wanted to know, essentially, whether you can transition into string parsing from block parsing in the same set of rules (without jumping into open code). I looked into it, and "mixed parsing" looks like it may be a feature addressed in Rebol 3. Still, I couldn't get it to work in practice, so I asked a question of my own:
How to mix together string parsing and block parsing in the same rule?
I think I found a pretty good solution. It may have to be generalized if you had lots of different tags whose attributes you need.
I was looking for the id attribute of the query tag!:
In the parse rule for tag!, I did this:
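One way this idea might look, as a sketch rather than the exact code used: capture the tag with set inside the block rule, then run a second, string-mode parse on it from a paren. The sample markup, the id value, and the word id-rule are assumptions made for the example; only _qid comes from the text.

```rebol
REBOL []

; Hypothetical input: the tag name and attribute values are made up.
page: load/markup {<query id="q42" lang="en">some text</query>}

_qid: none  ; receives the id attribute, if one is found

; String-mode rule, applied to the text of a single tag.
; It does not check the tag name, and thru {id="} would also
; match an attribute whose name merely ends in "id".
id-rule: [thru {id="} copy _qid to {"} to end]

rules: [
    some [
        ; t is set to the matched tag; the paren then parses inside it.
        ; When a tag has no id attribute the inner parse simply fails
        ; and _qid keeps its previous value.
        set t tag! (parse/all to string! t id-rule)
        |
        string!
    ]
]

parse page rules
print _qid
```

The inner parse runs in open code inside the paren, so this stays within what Rebol 2 block parsing supports and does not rely on any mixed string/block parsing feature.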
With more tags to look at, I'd use case. And maybe this would be a better way to set _qid.
I ended up needing to parse another tag, and this is a nice general pattern.
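A generalized version could be sketched as a small helper that takes a tag and an attribute name. The function name get-attr is hypothetical, and the sketch assumes attribute values are wrapped in double quotes:

```rebol
REBOL []

; Hypothetical helper: returns the value of a double-quoted attribute,
; or none when the attribute is missing.
get-attr: func [
    tag [tag!]
    name [string!]
    /local value rule
][
    value: none
    ; The parse dialect does not evaluate expressions, so the search
    ; string (for example {id="}) is built first and composed into the rule.
    ; The leading space keeps "id" from matching inside a longer name.
    rule: compose [
        thru (rejoin [" " name {="}]) copy value to {"} to end
    ]
    parse/all to string! tag rule
    value
]

; Usage inside a block rule: collect every href found in the page.
links: copy []
page: load/markup {<p><a href="one.html">one</a> <a href="two.html">two</a></p>}
parse page [
    some [
        set t tag! (if href: get-attr t "href" [append links href])
        |
        string!
    ]
]
probe links
```

Keeping the string parsing in one function means each new tag or attribute only needs another call from the block rule, rather than another hand-written string rule.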