How do I scrape data from specific tags with Enlive?

Posted on 2024-12-10 13:05:03

Could someone explain how to scrape content from the <td> tags of a row whose <th> has the content value "Row1 title" (in this case I actually need the content of the <b> tag for the matching operation), but without scraping the <th> tag (or any of its content) in the process? Here is my test HTML:

<table class="table_class">
  <tbody>
    <tr>
      <th>
        <b>
          Row1 title
        </b>
      </th>
      <td>2.660.784</td>
      <td>2.944.552</td>
      <td>Correct, has 3 td elements</td>
    </tr>
    <tr>
      <th>
        Row2 title
      </th>
      <td>2.660.784</td>
      <td>2.944.552</td>
      <td>Correct, has 3 td elements</td>
    </tr>
  </tbody>
</table>

The data I want to extract should come from these tags:

<td>2.660.784</td>
<td>2.944.552</td>
<td>Correct, has 3 td elements</td>

I have managed to create a function which returns the entire content of the table, but I would like to exclude the <th> node from the result and return only the data from the <td> nodes, whose content I can use for further parsing. Can anyone help me with this?

Comments (1)

墨落成白 2024-12-17 13:05:03

With Enlive, something like this

(ns tutorial.so-scrape
  (:require [net.cgrand.enlive-html :as html]))

;; Fetch the page at url and select every td that sits inside a table.
(defn parse-tds [url]
  (html/select (html/html-resource (java.net.URL. url)) [:table :td]))

should give you a sequence of all the td nodes, each of the form {:tag :td :attrs {...} :content (...)}. I am not aware of Enlive giving you a way to get the content of those nodes directly, but I could be wrong.

You could then extract the content of each node with something along the lines of the following, where ws-content is assumed to hold the sequence returned by parse-tds:

(for [line ws-content] (apply str (:content line)))
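
As a side note to the point above about reading node content: Enlive does ship a `text` function that returns the concatenated text of a node, which avoids joining `:content` by hand. The following is only a minimal sketch; the namespace, `sample-html`, and `td-texts` are hypothetical names made up for illustration, and the page is parsed from an inline string rather than a URL so the example is self-contained.

```clojure
(ns tutorial.td-text
  (:require [net.cgrand.enlive-html :as html]))

;; Hypothetical inline sample standing in for the fetched page.
(def sample-html
  "<table><tr><th><b>Row1 title</b></th><td>2.660.784</td><td>2.944.552</td></tr></table>")

(defn td-texts
  "Returns the text of every td found inside a table in html-string."
  [html-string]
  (let [nodes (html/html-resource (java.io.StringReader. html-string))]
    ;; html/text flattens a node (including nested tags) into one string.
    (map html/text (html/select nodes [:table :td]))))

(td-texts sample-html)
```

Because the selector only targets td nodes, the th (and its b child) never appears in the result.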

Regarding the question you posted yesterday (I am assuming you are still working with that page): the solution I gave there was a little complex, but it's also flexible. For example, if you change the tag-type function like this

(defn tag-type [node]
  (case (:tag node)
    :td ::TerminalNode   ; only td nodes are kept
    ::IgnoreNode))       ; every other tag is ignored

(that is, change the return value to ::IgnoreNode for all nodes except :td), then it just gives you a sequence of the content of the :td nodes, which is probably close to what you want. Let me know if you need more help.

EDIT (in reply to comments below)
I don't think selecting nodes based on their :content is possible with Enlive alone, but you can certainly do so with plain Clojure.

For example, something like

;; re-find needs a string, so the :content seq is joined first.
(for [line ws-content
      :when (re-find (re-pattern "WHAT YOU WANT TO MATCH")
                     (apply str (:content line)))]
  (:content line))

could work. (You might have to tweak the (:content line) form a little.)
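
To tie this back to the original question, here is one possible sketch of filtering in plain Clojure: select each tr, compare the trimmed text of its th against the wanted title, and return only the td texts of matching rows. It is an illustration under assumptions, not a definitive implementation; the namespace, `sample-html`, `page`, `row-title`, and `tds-for-title` are hypothetical names, and the HTML is an inline copy of the question's test table.

```clojure
(ns tutorial.row-match
  (:require [clojure.string :as str]
            [net.cgrand.enlive-html :as html]))

;; Inline copy of the test HTML from the question (stand-in for a fetched page).
(def sample-html
  (str "<table class='table_class'><tbody>"
       "<tr><th><b> Row1 title </b></th>"
       "<td>2.660.784</td><td>2.944.552</td><td>Correct, has 3 td elements</td></tr>"
       "<tr><th> Row2 title </th>"
       "<td>2.660.784</td><td>2.944.552</td><td>Correct, has 3 td elements</td></tr>"
       "</tbody></table>"))

(def page (html/html-resource (java.io.StringReader. sample-html)))

(defn row-title
  "Trimmed text of the th cell(s) in a tr node, including text nested in b."
  [row]
  (str/trim (apply str (map html/text (html/select row [:th])))))

(defn tds-for-title
  "Texts of the td cells in every row whose th text equals title."
  [nodes title]
  (for [row (html/select nodes [:tr])
        :when (= title (row-title row))
        td  (html/select row [:td])]
    (html/text td)))

(tds-for-title page "Row1 title")
```

The th is read only to decide whether a row matches; the returned sequence contains td text alone. Swapping the `=` test for `re-find` would give the regex-style matching shown above.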
