XPath：通过当前节点属性选择当前和下一个节点的文本

发布于 2024-10-20 04:16:55 字数 3477 浏览 5 评论 0原文

如果这是一个重复的问题，我很抱歉，但我在 SO 或其他地方找不到另一个问题来处理我需要的内容。这是我的问题：

我正在使用 scrapy 从

<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
                        <span class='distribution'>(SCI)</span></p> 

<span class='normaltext'> 
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is  directed  to answering the question: What makes us human? This course is a survey of  biological  anthropology and  archaeology.  [<span class='Helpcourse'
            onMouseover="showtip(this,event,'24 Lectures')"
            onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
            onMouseover="showtip(this,event,'12 Tutorials')"
            onMouseout="hidetip()">12T</span>]<br> 

<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>

<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br> 
</span><br/><br/<br/>

该页面上的几乎所有代码都类似于上面的块。

从所有这些中，我需要抓住：

ANT101H5 生物人类学和考古学简介
排除：ANT100Y5
先决条件：ANT102H5

问题是 Exclusion: 位于和 ANT100Y5 位于以下内。

我似乎无法从源代码中获取它们。目前，我有一些代码尝试（但失败）获取 ANT100Y5 ，如下所示：

hxs = HtmlXPathSelector(response)
    sites = hxs.select("//*[(name() = 'p' and @class = 'titlestyle') or (name() = 'a' and @href and preceding-sibling::'//span/@class=title2')]")

我很感激任何有关此问题的帮助，即使它是“您因没有看到其他问题而盲目”这完美地回答了这个问题”（在这种情况下，我自己将投票结束这个问题）。我实在是无计可施了。

预先感谢

编辑：在@Dimitre建议的更改后完成原始代码

我正在使用以下代码：

class regcalSpider(BaseSpider):
    name = "disc"
    allowed_domains = ['www.utm.utoronto.ca']
    start_urls = ['http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html']

    def parse(self, response):
            items = []
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("/*/p/text()[1] | \
                              (//span[@class='title2'])[1]/text() | \
                              (//span[@class='title2'])[1]/following-sibling::a[1]/text() | \
                              (//span[@class='title2'])[2]/text() | \
                              (//span[@class='title2'])[2]/following-sibling::a[1]/text()")

            for site in sites:
                    item = RegcalItem()
                    item['title'] = site.select("a/text()").extract()
                    item['link'] = site.select("a/@href").extract()
                    item['desc'] = site.select("text()").extract()
                    items.append(item)
            return items

            filename = response.url.split("/")[-2]
            open(filename, 'wb').write(response.body)

这给了我这个结果：

[{"title": [], "link": [], "desc": []},
 {"title": [], "link": [], "desc": []},
 {"title": [], "link": [], "desc": []}]

这不是我需要的输出。我做错了什么？请记住，如上所述，我正在 this 上运行此脚本。

原文

If this is a repeat question, I apologize, but I can't find another question either on SO or elsewhere that seems to handle what I need. Here is my question:

I'm using scrapy to get some information out of this webpage. For clarity, following is a block of the source code from that webpage, which is of interest to me:

<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
                        <span class='distribution'>(SCI)</span></p> 

<span class='normaltext'> 
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is  directed  to answering the question: What makes us human? This course is a survey of  biological  anthropology and  archaeology.  [<span class='Helpcourse'
            onMouseover="showtip(this,event,'24 Lectures')"
            onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
            onMouseover="showtip(this,event,'12 Tutorials')"
            onMouseout="hidetip()">12T</span>]<br> 

<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>

<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br> 
</span><br/><br/<br/>

Almost all of the code on that page looks like the above block.

From all of this, I need to grab:

ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5
Prerequisite: ANT102H5

The problem is that Exclusion: is inside a <span class="title2"> and ANT100Y5 is inside the following <a>.

I don't seem to be able to grab both of them out of this source code. Currently, I have code that attempts (and fails) to grab ANT100Y5 which looks like:

hxs = HtmlXPathSelector(response)
    sites = hxs.select("//*[(name() = 'p' and @class = 'titlestyle') or (name() = 'a' and @href and preceding-sibling::'//span/@class=title2')]")

I'd appreciate any help with this, even if it's a "you're blind for not seeing this other SO question which answers this perfectly" (in which case, myself will vote to close this). I really am that much at my wits end.

Thanks in advance

EDIT: Complete original code after changes suggested by @Dimitre

I'm using the following code:

class regcalSpider(BaseSpider):
    name = "disc"
    allowed_domains = ['www.utm.utoronto.ca']
    start_urls = ['http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html']

    def parse(self, response):
            items = []
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("/*/p/text()[1] | \
                              (//span[@class='title2'])[1]/text() | \
                              (//span[@class='title2'])[1]/following-sibling::a[1]/text() | \
                              (//span[@class='title2'])[2]/text() | \
                              (//span[@class='title2'])[2]/following-sibling::a[1]/text()")

            for site in sites:
                    item = RegcalItem()
                    item['title'] = site.select("a/text()").extract()
                    item['link'] = site.select("a/@href").extract()
                    item['desc'] = site.select("text()").extract()
                    items.append(item)
            return items

            filename = response.url.split("/")[-2]
            open(filename, 'wb').write(response.body)

Which gives me this result:

[{"title": [], "link": [], "desc": []},
 {"title": [], "link": [], "desc": []},
 {"title": [], "link": [], "desc": []}]

This is not the output that I need. What am I doing wrong? Keep in mind that I'm running this script on this, as mentioned.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

凉城凉梦凉人心 2024-10-27 04:16:55

.1。 ANT101H5生物人类学和考古学概论

p[@class='titlestyle']/text()

.2。排除：ANT100Y5

concat(
    span/span[@class='title2'][1],
    span/span[@class='title2'][1]/following-sibling::a[1]
    )

.3。先决条件：ANT102H5

concat(
    span/span[@class='title2'][2],
    span/span[@class='title2'][2]/following-sibling::a[1]
    )

.1. ANT101H5 Introduction to Biological Anthropology and Archaeology

p[@class='titlestyle']/text()

.2. Exclusion: ANT100Y5

concat(
    span/span[@class='title2'][1],
    span/span[@class='title2'][1]/following-sibling::a[1]
    )

.3. Prerequisite: ANT102H5

concat(
    span/span[@class='title2'][2],
    span/span[@class='title2'][2]/following-sibling::a[1]
    )

回复收藏 0 原文

高跟鞋的旋律 2024-10-27 04:16:55

选择您所引用的三个节点并不困难（使用 Flack 等技术）。困难的是（a）选择它们而不同时选择您不想要的其他内容，以及（b）使您的选择足够稳健，以便在输入略有不同时仍然选择它们。我们必须假设您并不确切知道输入中的内容 - 如果您知道，则无需编写 XPath 表达式来查找。

你告诉我们你想要抓住的三件事。但是你选择这三件事而不选择其他东西的标准是什么？对您正在寻找的东西了解多少？

您已将您的问题表达为 XPath 问题，但我会以不同的方式解决它。我首先使用 XSLT 将您显示的输入转换为具有更好结构的内容。特别是，我会尝试将不在

元素内的所有同级元素包装到

元素中，处理每组连续元素以结尾作为段落。使用 XSLT 2.0 中的构造可以轻松完成此操作。

回复收藏 0 原文

┈┾☆殇 2024-10-27 04:16:55

我的答案很像 @Flack 的答案：

拥有此 XML 文档（更正了所提供的文档，以关闭大量未关闭的并将所有内容包装在一个顶部中元素）：

<body>
    <p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
        <span class='distribution'>(SCI)</span>
    </p>
    <span class='normaltext'> Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [
        <span class='Helpcourse' onMouseover="showtip(this,event,'24 Lectures')" onMouseout="hidetip()">24L</span>, 
        <span class='Helpcourse' onMouseover="showtip(this,event,'12 Tutorials')" onMouseout="hidetip()">12T</span>]
        <br/>
        <span class='title2'>Exclusion: </span>
        <a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a>
        <br/>
        <span class='title2'>Prerequisite: </span>
        <a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a>
        <br/>
    </span>
    <br/>
    <br/>
    <br/>
</body>

此 XPath 表达式：

normalize-space(/*/p/text()[1])

计算时生成所需的字符串（周围的引号不在结果中。我添加它们以显示生成的确切字符串）：

"ANT101H5 Introduction to Biological Anthropology and Archaeology"

此 XPath 表达式：

concat((//span[@class='title2'])[1],
            (//span[@class='title2'])[1]
                   /following-sibling::a[1]
            )

计算时产生以下想要的结果：

"Exclusion: ANT100Y5"

此 XPath 表达式：

concat((//span[@class='title2'])[2],
            (//span[@class='title2'])[2]
                   /following-sibling::a[1]
            )

计算时产生以下想要的结果：

"Prerequisite: ANT102H5"

注意：在这种特殊情况下，缩写 //< /code> 是不需要的，事实上这个缩写应该尽可能避免，因为它会导致表达式的计算速度变慢，在许多情况下导致完整的（子）树遍历。我有意使用“//”，因为提供的 XML 片段没有为我们提供 XML 文档的完整结构。另外，这演示了如何正确索引使用 // 的结果（注意周围的括号）——有助于防止在尝试执行此操作时出现非常频繁的错误

更新： OP 请求了一个选择所有所需文本节点的 XPath 表达式 - 这里是：

/*/p/text()[1]
   |
    (//span[@class='title2'])[1]/text()
   |
    (//span[@class='title2'])[1]/following-sibling::a[1]/text()
   |
    (//span[@class='title2'])[2]/text()
   |
    (//span[@class='title2'])[2]/following-sibling::a[1]/text()

当应用到与上面相同的 XML 文档时，文本节点的串联正是所需要的：

ANT101H5 Introduction to Biological Anthropology and Archaeology          
        Exclusion: ANT100Y5Prerequisite: ANT102H5

这个结果可以是通过运行以下 XSLT 转换来确认：

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/p/text()[1]
   |
    (//span[@class='title2'])[1]/text()
   |
    (//span[@class='title2'])[1]/following-sibling::a[1]/text()
   |
    (//span[@class='title2'])[2]/text()
   |
    (//span[@class='title2'])[2]/following-sibling::a[1]/text()
   "/>
 </xsl:template>
</xsl:stylesheet>

当此转换应用于同一个 XML 文档（在此答案中前面指定）时，会生成所需的正确结果：

ANT101H5 Introduction to Biological Anthropology and Archaeology          
        Exclusion: ANT100Y5Prerequisite: ANT102H5

最后< /strong>：以下单个 XPath 表达式精确选择 HTML 页面中所有需要的文本节点，以及提供的链接（将其整理为格式良好的 XML 后）：

  (//p[@class='titlestyle'])[2]/text()[1]
|
  (//span[@class='title2'])[2]/text()
|
  (//span[@class='title2'])[2]/following-sibling::a[1]/text()
|
  (//span[@class='title2'])[3]/text()
|
  (//span[@class='title2'])[3]/following-sibling::a[1]/text()

My answers are quite like those of @Flack:

Having this XML document (corrected the provided one in closing numerous unclosed <br>s and in wrapping everything in a single top element):

<body>
    <p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
        <span class='distribution'>(SCI)</span>
    </p>
    <span class='normaltext'> Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [
        <span class='Helpcourse' onMouseover="showtip(this,event,'24 Lectures')" onMouseout="hidetip()">24L</span>, 
        <span class='Helpcourse' onMouseover="showtip(this,event,'12 Tutorials')" onMouseout="hidetip()">12T</span>]
        <br/>
        <span class='title2'>Exclusion: </span>
        <a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a>
        <br/>
        <span class='title2'>Prerequisite: </span>
        <a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a>
        <br/>
    </span>
    <br/>
    <br/>
    <br/>
</body>

This XPath expression:

normalize-space(/*/p/text()[1])

when evaluated produces the wanted string (the surrounding quotes are not in the result. I added them to show the exact string produced):

"ANT101H5 Introduction to Biological Anthropology and Archaeology"

This XPath expression:

concat((//span[@class='title2'])[1],
            (//span[@class='title2'])[1]
                   /following-sibling::a[1]
            )

when evaluated produces the following wanted result:

"Exclusion: ANT100Y5"

This XPath expression:

concat((//span[@class='title2'])[2],
            (//span[@class='title2'])[2]
                   /following-sibling::a[1]
            )

when evaluated produces the following wanted result:

"Prerequisite: ANT102H5"

Note: In this particular case the abbreviation // is not needed and in fact this abbreviation should always when possible be avoided, because it leads to slower evaluation of the expression, causing in many cases a complete (sub) tree traversal. I am using '//' intentionally, because the provided XML fragment doesn't give us the full structure of the XML document. Also, This demonstrates how to correctly index the results of using // (note the surrounding brackets) -- helping to prevent a very frequent mistake in trying to do so

UPDATE: The OP has requested a single XPath expression that selects all the required text nodes -- here it is:

/*/p/text()[1]
   |
    (//span[@class='title2'])[1]/text()
   |
    (//span[@class='title2'])[1]/following-sibling::a[1]/text()
   |
    (//span[@class='title2'])[2]/text()
   |
    (//span[@class='title2'])[2]/following-sibling::a[1]/text()

When applied on the same XML document as above, the concatenation of the text nodes is exactly what is required:

ANT101H5 Introduction to Biological Anthropology and Archaeology          
        Exclusion: ANT100Y5Prerequisite: ANT102H5

This result can be confirmed by running the following XSLT transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/p/text()[1]
   |
    (//span[@class='title2'])[1]/text()
   |
    (//span[@class='title2'])[1]/following-sibling::a[1]/text()
   |
    (//span[@class='title2'])[2]/text()
   |
    (//span[@class='title2'])[2]/following-sibling::a[1]/text()
   "/>
 </xsl:template>
</xsl:stylesheet>

when this transformation is applied on the same XML document (specified previously in this answer), the wanted, correct result is produced:

ANT101H5 Introduction to Biological Anthropology and Archaeology          
        Exclusion: ANT100Y5Prerequisite: ANT102H5

Finally: The following single XPath expression selects exactly all wanted text node in the HTML page, with the provided link (after tidying it to become well-formed XML):

  (//p[@class='titlestyle'])[2]/text()[1]
|
  (//span[@class='title2'])[2]/text()
|
  (//span[@class='title2'])[2]/following-sibling::a[1]/text()
|
  (//span[@class='title2'])[3]/text()
|
  (//span[@class='title2'])[3]/following-sibling::a[1]/text()

回复收藏 0 原文

~没有更多了~