Scrapy, Python, XPath: how to match the respective items in HTML
I am new to XPath and am trying to scrape a website with the format below:
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_date </div>
<div class="middle"> listed_value </div>
</div>
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_date </div>
</div>
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_value </div>
</div>
Both listed_value and listed_date are optional.
I need to group each tittle_name with its respective listed_date and listed_value (if available), then insert each record into MySQL.
I am using the Scrapy shell, which gives some basic examples like
listings = hxs.select('//div[@class=\'top\']')
for listing in listings:
    tittle_name = listing.select('/a//text()').extract()
    date_values = listing.select('//div[@class=\'middle\']')
The code above gives me a list of tittle_name values and a list of the available listed_date and listed_value entries, but how do I match them up? (We cannot go by index because the format is not symmetric.)
Thanks.
Comments (2)
Note that these XPath expressions are absolute:
/a//text()
//div[@class='middle']
You need relative XPath expressions like these:
a//text()
div[@class='middle']
Second, it is not a good idea to select text nodes in a mixed content model like (X)HTML. You should extract the string value with the proper DOM method or with the string() function. (In the latter case, you need to evaluate the expression once per node, because string() applied to a node set returns only the string value of its first node.)

Well, since the website doesn't specify whether the content of a div[@class='middle'] is a date or a value, you'll have to write your own logic to decide. I guess the dates have a specific format that you could match with some analysis, perhaps using a regular expression.
Can you be more specific about the possible values of listed_date and listed_value?
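
The two suggestions above (relative expressions evaluated per listing, and a format check to tell dates from values) can be sketched together as follows. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for a Scrapy selector so that it runs without Scrapy, the sample titles, dates, and values are made up, and the YYYY-MM-DD date format in DATE_RE is a guess that would need to match the real site. The helper names string_value and parse_listings are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Markup shaped like the question's sample, wrapped in a root element so it
# parses as XML. The concrete titles, dates, and values are invented.
HTML = """
<root>
  <div class="top">
    <a> first title </a>
    <div class="middle"> 2011-03-01 </div>
    <div class="middle"> 1500 </div>
  </div>
  <div class="top">
    <a> second title </a>
    <div class="middle"> 2011-04-15 </div>
  </div>
  <div class="top">
    <a> third title </a>
    <div class="middle"> 2300 </div>
  </div>
</root>
"""

# Assumed date format (YYYY-MM-DD); adjust to whatever the real site uses.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def string_value(element):
    """Return the stripped, concatenated text of an element and its descendants."""
    return "".join(element.itertext()).strip()


def parse_listings(markup):
    """Build one record per div.top, leaving missing fields as None."""
    root = ET.fromstring(markup)
    records = []
    for listing in root.findall(".//div[@class='top']"):
        record = {"title": None, "listed_date": None, "listed_value": None}
        # Relative path: only the <a> that is a child of this listing.
        link = listing.find("a")
        if link is not None:
            record["title"] = string_value(link)
        # Relative path: only the "middle" divs that belong to this listing.
        for middle in listing.findall("div[@class='middle']"):
            text = string_value(middle)
            if DATE_RE.match(text):
                record["listed_date"] = text
            else:
                record["listed_value"] = text
        records.append(record)
    return records


records = parse_listings(HTML)
for record in records:
    print(record)
```

Because every lookup inside the loop is relative to the current listing, a missing date or value simply stays None in that listing's record instead of shifting the alignment between titles and values. Each dictionary then corresponds to one row that could be inserted into MySQL.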