Scrapy, Python, XPath: how to match the individual items in HTML

Posted on 2024-10-24 15:44:06

I am new to XPath and am trying to scrape a website with the format below:

<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
    <div class="middle"> listed_value </div>
</div>
<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
</div>
<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_value </div>
</div>

Both listed_value and listed_date are optional.

I need to group each tittle_name with its respective listed_date and listed_value (if available), then insert each record into MySQL.

I am using the scrapy shell, which gives some basic examples like:

listings = hxs.select('//div[@class=\'top\']')
for listing in listings:
    tittle_name = listing.select('/a//text()').extract()
    date_values = listing.select('//div[@class=\'middle\']')

The code above gives me a list of tittle_name values and a list of the available listed_date and listed_value entries, but how do I match them? (We cannot go by index because the format is not symmetric.)

Thanks.

孤寂小茶 2024-10-31 15:44:06

Do note that those XPath expressions are absolute:

/a//text()

//div[@class='middle']

You would need relative XPath expressions like these:

a

div[@class='middle']
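
To make the difference concrete, here is a minimal sketch that evaluates the relative expressions once per listing. It is an assumption-laden illustration, not the asker's setup: it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors so that it runs without Scrapy installed, and it wraps the question's markup in a hypothetical <root> element so it parses as XML. With Scrapy, the same relative expressions would presumably be passed to listing.select(...).

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a hypothetical <root> element
# (an assumption, needed only so ElementTree can parse it as XML).
PAGE = """
<root>
<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
    <div class="middle"> listed_value </div>
</div>
<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
</div>
<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_value </div>
</div>
</root>
"""

root = ET.fromstring(PAGE)

counts = []
for listing in root.findall(".//div[@class='top']"):
    # Relative paths: matched against the children of this listing only.
    links = listing.findall("a")
    middles = listing.findall("div[@class='middle']")
    counts.append((len(links), len(middles)))

# A document-wide search returns every "middle" div on the page,
# regardless of which listing it belongs to.
all_middles = root.findall(".//div[@class='middle']")

print(counts)
print(len(all_middles))
```

Because each lookup starts from the current listing, a title is only ever paired with the "middle" divs inside the same "top" div, so no index matching across lists is needed.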

Second, it's not a good idea to select text nodes in a mixed content model like (X)HTML. You should extract the string value with the proper DOM method or with the string() function. (In the latter case, you would need to evaluate the expression once per node, because string() applied to a node set converts only its first node.)
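
As a rough sketch of the "string value per node" idea, the snippet below builds one record per listing. Several things here are assumptions made for illustration: it again uses xml.etree.ElementTree rather than Scrapy, the helper string_value and the record keys "title" and "details" are hypothetical names, and the first link contains an inline <b> element (not in the question's markup) to show mixed content. Since the markup gives listed_date and listed_value the same class, the sketch keeps them together in one list rather than guessing which is which.

```python
import xml.etree.ElementTree as ET


def string_value(node):
    """Join all descendant text of one node, then trim surrounding whitespace.

    Hypothetical helper: similar in spirit to evaluating XPath string()
    on a single node, with trimming added for convenience.
    """
    return "".join(node.itertext()).strip()


# Hypothetical sample: the <root> wrapper and the inline <b> are assumptions.
PAGE = """
<root>
<div class="top">
    <a>tittle_<b>name</b></a>
    <div class="middle"> listed_date </div>
    <div class="middle"> listed_value </div>
</div>
<div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_value </div>
</div>
</root>
"""

records = []
for listing in ET.fromstring(PAGE).findall(".//div[@class='top']"):
    link = listing.find("a")  # first <a> child of this listing, or None
    records.append({
        "title": string_value(link) if link is not None else None,
        # One string per "middle" div that belongs to this listing.
        "details": [string_value(d) for d in listing.findall("div[@class='middle']")],
    })

print(records)
```

Each record can then be handed to whatever database layer is in use; the optional fields simply show up as a shorter "details" list.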
