XPath: selecting the text of the current and the following node via the current node's attributes
If this is a repeat question, I apologize, but I can't find another question either on SO or elsewhere that seems to handle what I need. Here is my question:
I'm using scrapy to get some information out of this webpage. For clarity, the following is a block of the source code from that webpage which is of interest to me:
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span></p>
<span class='normaltext'>
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [<span class='Helpcourse'
onMouseover="showtip(this,event,'24 Lectures')"
onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
onMouseover="showtip(this,event,'12 Tutorials')"
onMouseout="hidetip()">12T</span>]<br>
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br>
</span><br/><br/<br/>
Almost all of the code on that page looks like the above block.
From all of this, I need to grab:
- ANT101H5 Introduction to Biological Anthropology and Archaeology
- Exclusion: ANT100Y5
- Prerequisite: ANT102H5
The problem is that Exclusion: is inside a <span class="title2"> and ANT100Y5 is inside the following <a>. I don't seem to be able to grab both of them out of this source code. Currently, I have code that attempts (and fails) to grab ANT100Y5, which looks like:
hxs = HtmlXPathSelector(response)
sites = hxs.select("//*[(name() = 'p' and @class = 'titlestyle') or (name() = 'a' and @href and preceding-sibling::'//span/@class=title2')]")
I'd appreciate any help with this, even if it's a "you're blind for not seeing this other SO question, which answers this perfectly" (in which case I will myself vote to close this question). I really am at my wits' end.
Thanks in advance.
EDIT: Complete original code after changes suggested by @Dimitre
I'm using the following code:
class regcalSpider(BaseSpider):
    name = "disc"
    allowed_domains = ['www.utm.utoronto.ca']
    start_urls = ['http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html']

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("/*/p/text()[1] | \
            (//span[@class='title2'])[1]/text() | \
            (//span[@class='title2'])[1]/following-sibling::a[1]/text() | \
            (//span[@class='title2'])[2]/text() | \
            (//span[@class='title2'])[2]/following-sibling::a[1]/text()")
        for site in sites:
            item = RegcalItem()
            item['title'] = site.select("a/text()").extract()
            item['link'] = site.select("a/@href").extract()
            item['desc'] = site.select("text()").extract()
            items.append(item)
        return items
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
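As a side note, the empty lists are reproducible outside Scrapy: the union expression selects text nodes, and a text node has no child elements for the nested select() calls to match. A minimal sketch using lxml (lxml and the tiny wrapper document are assumptions made for illustration, not part of the original spider):

```python
from lxml import etree

# Minimal stand-in for one block of the scraped page (simplified href).
doc = etree.XML(
    "<t><span class='title2'>Exclusion: </span>"
    "<a href='#'>ANT100Y5</a></t>"
)

# The spider's union expression selects *text nodes*, e.g.:
texts = doc.xpath("(//span[@class='title2'])[1]/text()")
print(texts)                       # ['Exclusion: ']

# A text node is returned as a string-like object; it has no child
# elements, so a nested query such as "a/text()" has nothing to match --
# consistent with the spider's empty 'title'/'link' extracts.
print(isinstance(texts[0], str))   # True
```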
Which gives me this result:
[{"title": [], "link": [], "desc": []},
{"title": [], "link": [], "desc": []},
{"title": [], "link": [], "desc": []}]
This is not the output that I need. What am I doing wrong? Keep in mind that I'm running this script on this, as mentioned.
It's not difficult to select the three nodes you refer to (using techniques such as those of Flack). What's difficult is (a) selecting them without also selecting other things that you don't want, and (b) making your selection robust enough that it still selects them if the input is slightly different. We have to assume that you don't know exactly what's in the input - if you did, you wouldn't need to write an XPath expression to find out.
You've told us three things that you want to grab. But what are your criteria for selecting these three things, and not selecting something else? How much is known about what you are looking for?
You've expressed your problem as an XPath problem, but I would tackle it differently. I would start by transforming the input you have shown into something with better structure, using XSLT. In particular, I would try to wrap all the sibling elements that aren't within a <p> element into <p> elements, treating each group of successive elements ending in <br> as a paragraph. That can be done without too much difficulty using the <xsl:for-each-group group-ending-with> construct in XSLT 2.0.
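To keep with the Python tooling used elsewhere in this thread, here is a rough hand-rolled equivalent of that XSLT 2.0 grouping idea, sketched with lxml (the sample fragment and its wrapper element are assumptions for illustration):

```python
from lxml import etree

# A minimal emulation of <xsl:for-each-group group-ending-with="br">:
# walk the children of the container and close a group at each <br>.
doc = etree.XML(
    "<span class='normaltext'>"
    "<span class='title2'>Exclusion: </span><a href='#'>ANT100Y5</a><br/>"
    "<span class='title2'>Prerequisite: </span><a href='#'>ANT102H5</a><br/>"
    "</span>"
)

groups, current = [], []
for child in doc:
    if child.tag == "br":   # a <br> ends the current group ("paragraph")
        groups.append(current)
        current = []
    else:
        current.append(child)
if current:                 # trailing group with no closing <br>
    groups.append(current)

# Each group now corresponds to one logical paragraph:
paragraphs = ["".join(el.xpath("string()") for el in g) for g in groups]
print(paragraphs)   # ['Exclusion: ANT100Y5', 'Prerequisite: ANT102H5']
```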
My answers are quite like those of @Flack:
Having this XML document (corrected the provided one by closing the numerous unclosed <br>s and by wrapping everything in a single top element), this XPath expression:

/*/p/text()[1]

when evaluated, produces the wanted string (the surrounding quotes are not in the result; I added them to show the exact string produced):

"ANT101H5 Introduction to Biological Anthropology and Archaeology"

This XPath expression:

(//span[@class='title2'])[1]/text() | (//span[@class='title2'])[1]/following-sibling::a[1]/text()

when evaluated, produces the following wanted result:

Exclusion: ANT100Y5

This XPath expression:

(//span[@class='title2'])[2]/text() | (//span[@class='title2'])[2]/following-sibling::a[1]/text()

when evaluated, produces the following wanted result:

Prerequisite: ANT102H5
Note: In this particular case the abbreviation // is not needed, and in fact this abbreviation should be avoided whenever possible, because it leads to slower evaluation of the expression, in many cases causing a complete (sub)tree traversal. I am using // intentionally, because the provided XML fragment doesn't give us the full structure of the XML document. Also, this demonstrates how to correctly index the results of using // (note the surrounding brackets) -- helping to prevent a very frequent mistake made in trying to do so.

UPDATE: The OP has requested a single XPath expression that selects all the required text nodes -- here it is:

/*/p/text()[1] | (//span[@class='title2'])[1]/text() | (//span[@class='title2'])[1]/following-sibling::a[1]/text() | (//span[@class='title2'])[2]/text() | (//span[@class='title2'])[2]/following-sibling::a[1]/text()

When applied on the same XML document as above, the concatenation of the selected text nodes is exactly what is required:

ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5
Prerequisite: ANT102H5
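For completeness, this union expression can be checked with lxml against a reduced version of the corrected document (the top element name <t> and the simplified href values are assumptions; the answer's exact document was not preserved):

```python
from lxml import etree

# Reduced, well-formed version of the corrected document.
doc = etree.XML("""
<t>
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class="distribution">(SCI)</span></p>
<span class="normaltext">
<span class="title2">Exclusion: </span><a href="#">ANT100Y5</a><br/>
<span class="title2">Prerequisite: </span><a href="#">ANT102H5</a><br/>
</span>
</t>
""")

# The single union expression from the answer, split for readability.
expr = (
    "/*/p/text()[1]"
    " | (//span[@class='title2'])[1]/text()"
    " | (//span[@class='title2'])[1]/following-sibling::a[1]/text()"
    " | (//span[@class='title2'])[2]/text()"
    " | (//span[@class='title2'])[2]/following-sibling::a[1]/text()"
)

# Union results come back in document order.
parts = [str(s).strip() for s in doc.xpath(expr)]
print(parts)
# ['ANT101H5 Introduction to Biological Anthropology and Archaeology',
#  'Exclusion:', 'ANT100Y5', 'Prerequisite:', 'ANT102H5']
```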
This result can be confirmed by running an XSLT transformation that simply copies the nodes selected by this expression; when that transformation is applied on the same XML document (specified previously in this answer), the wanted, correct result is produced.

Finally: a single XPath expression of the same form selects exactly all the wanted text nodes in the actual HTML page at the provided link (after tidying it to become well-formed XML).