How do I scrape data from specific tags with Enlive?

Posted 2024-12-10 13:05:03 | 1543 words | 0 views | 0 comments


Could someone explain how to scrape the content of the <td> tags in the row whose <th> has the content "Row1 title" (in this case I actually need the content of the <b> tag for the match), without scraping the <th> tag (or any of its content) in the process? Here is my test HTML:

<table class="table_class"> 
                    <tbody> 
                       <tr> 
                         <th>
                           <b>
                              Row1 title
                           </b>
                         </th> 
                         <td>2.660.784</td> 
                         <td>2.944.552</td> 
                         <td>Correct, has 3 td elements</td> 
                       </tr> 
                       <tr> 
                         <th>                                
                              Row2 title                                
                          </th> 
                         <td>2.660.784</td> 
                         <td>2.944.552</td> 
                         <td>Correct, has 3 td elements</td> 
                       </tr> 
                    </tbody>
</table>

The data I want to extract should come from these tags:

                     <td>2.660.784</td> 
                     <td>2.944.552</td> 
                     <td>Correct, has 3 td elements</td>

I have managed to create a function that returns the entire content of the table, but I would like to exclude the <th> node from the result and return only the data from the <td> nodes, whose content I can use for further parsing. Can anyone help me with this?


墨落成白 2024-12-17 13:05:03

With Enlive, something like this

(ns tutorial.so-scrape
  (:require [net.cgrand.enlive-html :as html]))

(defn parse-tds [url] 
 (html/select (html/html-resource (java.net.URL. url)) [:table :td]))

This should give you a sequence of all the td nodes, each of the form {:tag :td :attrs {...} :content (...)}. I am not aware that Enlive lets you get the content of those nodes directly, but I could be wrong.

You could then extract the content of each node with something along the lines of the following (assuming ws-content holds the sequence of td nodes):
(for [line ws-content] (apply str (:content line)))
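
For cells that may contain nested tags, Enlive's `html/text` can stand in for `(apply str (:content line))`, since it flattens a node's text recursively. The following is a minimal sketch rather than the answerer's own code: `sample-html` and `td-texts` are names made up for illustration, and the HTML is parsed from a string with `html/html-snippet` instead of being fetched from a URL.

```clojure
(ns tutorial.so-scrape
  (:require [net.cgrand.enlive-html :as html]))

;; Hypothetical sample input; a real page would come from html/html-resource.
(def sample-html
  (str "<table class=\"table_class\"><tr>"
       "<th><b>Row1 title</b></th>"
       "<td>2.660.784</td><td>2.944.552</td>"
       "</tr></table>"))

(defn td-texts
  "Returns the text of every td under a table; th cells are never selected."
  [nodes]
  (map html/text (html/select nodes [:table :td])))

;; Yields one string per td cell.
(td-texts (html/html-snippet sample-html))
```

Because the selector only names `:td`, the `<th>` node and its `<b>` child never appear in the result.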

Regarding the question you posted yesterday (I am assuming you are still working with that page): the solution I gave there was a little complex, but it's also flexible. For example, if you change the tag-type function like this

(defn tag-type [node]
  (case (:tag node)
    :td ::TerminalNode
    ::IgnoreNode))

(that is, change the return value for all nodes except :td to ::IgnoreNode), then it just gives you a sequence of the content of the :td nodes, which is probably close to what you want. Let me know if you need more help.

EDIT (in reply to the comments below)
I don't think selecting nodes based on their :content is possible with Enlive alone, but you can certainly do so with Clojure.

For example, something like

(for [line ws-content
      :when (re-find (re-pattern "WHAT YOU WANT TO MATCH") (apply str (:content line)))]
  (:content line))

could work. Note that :content is a sequence, so it has to be joined into a string before re-find can match against it. (You might have to tweak the (:content line) form a little.)
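
The row-level match from the question (return the `<td>` cells of the row whose `<th>` reads "Row1 title") can be sketched in the same spirit: select the rows with Enlive, then filter them in plain Clojure. This is only one possible approach under stated assumptions; `sample-html`, `row-title`, and `tds-for-title` are hypothetical names, and the sample is a trimmed-down copy of the test HTML from the question.

```clojure
(ns tutorial.so-scrape
  (:require [clojure.string :as str]
            [net.cgrand.enlive-html :as html]))

;; Trimmed-down copy of the test HTML from the question.
(def sample-html
  (str "<table class=\"table_class\"><tbody>"
       "<tr><th><b> Row1 title </b></th>"
       "<td>2.660.784</td><td>2.944.552</td>"
       "<td>Correct, has 3 td elements</td></tr>"
       "<tr><th> Row2 title </th>"
       "<td>2.660.784</td><td>2.944.552</td>"
       "<td>Correct, has 3 td elements</td></tr>"
       "</tbody></table>"))

(defn row-title
  "Trimmed text of the th cell(s) of a tr node, including nested tags such as b."
  [row]
  (str/trim (apply str (map html/text (html/select row [:th])))))

(defn tds-for-title
  "Text of each td in the rows whose th text equals title; th is never returned."
  [nodes title]
  (for [row (html/select nodes [:tr])
        :when (= title (row-title row))
        td (html/select row [:td])]
    (str/trim (html/text td))))

(tds-for-title (html/html-snippet sample-html) "Row1 title")
```

The `<th>` is used only inside `row-title` for the comparison, so the returned sequence contains just the text of the matching row's `<td>` cells. Swapping `=` for a `re-find` test would give the regex-style matching described above.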


About the author: ┊风居住的梦幻卍
