How do I parse inside HTML tags with REBOL?

Posted 2024-11-15 23:37:43

I have a web page that I've loaded with load/markup. I need to parse a bunch of stuff out of it, but some of the data is in the tags. Any ideas of how I can parse it? Here's a sample of what I've got (and tried) so far:

REBOL []

mess: {
<td>Bob Sockaway</td>
<td><a href=mailto:[email protected]>[email protected]</a></td>
<td>9999</td>
}

rules: [
    some [
        ; The expression below /will/ work, but is useless because of specificity.
        ; <td> <a href=mailto:[email protected]> s: string! </a> (print s/1) </td> | 

        ; The expression below will not work, because <a> doesn't match <a href=mailto:...>
        ; <td> <a> s: string! </a> (print s/1) </td> |

        <td> s: string! (print s/1) </td> |

        tag! | string! ; Catch any leftovers.
    ]
]

parse load/markup mess rules

This produces:

Bob Sockaway
9999

I would like to see something more like:

Bob Sockaway
[email protected]
9999

Any thoughts? Thanks!

Note! For what it's worth, I came up with a good simple ruleset that will get the desired results:

rules: [
    some [
        <td> any [tag!] s: string! (print s/1) any [tag!] </td> |
        tag! | string! ; Catch any leftovers.
    ]
]

Comments (2)

娇纵 2024-11-22 23:37:44

When mess is processed with LOAD/MARKUP you get this (and I've formatted + commented with the types):

[
    ; string!
    "^/" 

    ; tag! string! tag!
    <td> "Bob Sockaway" </td>

    ; string!
    "^/"

    ; tag! tag!
    ;     string!
    ; tag! tag!
    <td> <a href=mailto:[email protected]>
        "[email protected]"
    </a> </td>

    ; (Note: you didn't put the anchor's href in quotes above...)

    ; string!
    "^/"

    ; tag! string! tag!
    <td> "9999" </td> 

    ; string!
    "^/"
]

Your rule matches sequences of the form [<td> string! </td>] but not sequences of the form [<td> tag! string! tag! </td>]. Sidestepping the question posed in your title, you could solve this particular dilemma in several ways. One is to keep a count of how many TD tags you are currently inside and print any string seen while the count is non-zero:

rules: [
    (td-count: 0)
    some [
        ; if we see an open TD tag, increment a counter
        <td> (++ td-count)
        |
        ; if we see a close TD tag, decrement a counter
        </td> (-- td-count)
        |
        ; capture parse position in s if we find a string
        ; and if counter is > 0 then print the first element at
        ; the parse position (e.g. the string we just found) 
        s: string! (if td-count > 0 [print s/1])
        |
        ; if we find any non-TD tags, match them so the
        ; parser will continue along but don't run any code
        tag!
    ]
]

This produces the output you asked for:

Bob Sockaway
[email protected]
9999

But you also wanted to know, essentially, whether you can transition from block parsing into string parsing within the same set of rules (without jumping into open code). I looked into it, and "mixed parsing" looks like it may be a feature addressed in Rebol 3. Still, I couldn't get it to work in practice, so I asked a question of my own:

How to mix together string parsing and block parsing in the same rule?
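
Until mixed parsing is available, one fallback is to do the string parse in open code from inside the block rule. The following is only a minimal sketch under stated assumptions, not a tested solution: it targets Rebol 2, the sample markup and its quoted href value are hypothetical placeholders (the address in the question is elided), and the words t and url are names introduced here for illustration.

```rebol
REBOL []

; Hypothetical sample markup; the href value is a placeholder.
mess: {<td><a href="mailto:bob@example.com">Bob</a></td>}

rules: [
    some [
        ; Capture each tag, then string-parse its text in open code.
        set t tag! (
            ; to string! drops the angle brackets; the rule accepts only
            ; tags that start with "a " and carry a double-quoted href.
            if parse/all to string! t [
                "a " thru {href="} copy url to {"} to end
            ] [print url]
        )
        |
        string!  ; skip the text between tags
    ]
]

parse load/markup mess rules
```

This assumes the href is wrapped in double quotes. An unquoted href like the one in the question would need a different end condition, for example copying up to the end of the tag.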

遗失的美好 2024-11-22 23:37:44

I think I found a pretty good solution. It may have to be generalized if you had lots of different tags whose attributes you need.

I was looking for the id attribute of the query tag!:

<query id="5">

In the parse rule for tag!, I did this:

  | set t tag! (
    p: make block! t 
    if p/1 = 'query [_qid: to-integer p/3]
  )

With more tags to look at, I'd use case. And maybe this would be a better way to set _qid:

to-integer select p 'id=

I ended up needing to parse another tag, and this is a nice general pattern:

switch p/1 [
  field [_fid: to-integer p/id= _field_type: p/field_type=]
  query [_qid: to-integer p/id=]
]
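
To check what the conversion actually yields before relying on positions such as p/3, it may help to probe the block at the console. This is a small sketch assuming Rebol 2 and the same sample tag as above. It only inspects values, and the words t and p follow the snippets in this answer.

```rebol
REBOL []

t: <query id="5">       ; the sample tag from this answer
p: make block! t        ; load the tag's content as REBOL values
probe p                 ; inspect the words and values in the block
probe select p 'id=     ; the value that follows the id= word, if any
```

Looking a value up by its attribute word rather than by position should keep working when attributes appear in a different order.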