XPath: Select current and next node's text by current node attribute

Posted on 2024-10-20 12:44:49

First of all, this is a follow-up to my previous question. I am posting it again because the person whose answer I accepted on the original post advised me to, as he felt the question had not been properly defined. Here goes attempt 2:

I am trying to get information out of this webpage. For clarity, following is a selection of a block of the page source:

<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
                    <span class='distribution'>(SCI)</span></p> 
<span class='normaltext'> 
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is  directed  to answering the question: What makes us human? This course is a survey of  biological  anthropology and  archaeology.  [<span class='Helpcourse'
        onMouseover="showtip(this,event,'24 Lectures')"
        onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
        onMouseover="showtip(this,event,'12 Tutorials')"
        onMouseout="hidetip()">12T</span>]<br> 
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br> 
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br>

From the sample block above, I would like to extract the following information:

ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5
Prerequisite: ANT102H5

I would like to get all such information from the webpage. Keep in mind that some courses may also list a "Corequisite", and some may have no prerequisites, corequisites, or exclusions listed at all.

I have been trying to write an appropriate xpath expression for this task, but I seem to not be able to get it just right.

Thus far, with the help of Dimitre Novatchev, I have been able to use the following expression:

sites = hxs.select("(//p[@class='titlestyle'])[2]/text()[1] | (//span[@class='title2'])[2]/text() | \
                    (//span[@class='title2'])[2]/following-sibling::a[1]/text() | (//span[@class='title2'])[3]/text() | \
                    (//span[@class='title2'])[3]/following-sibling::a[1]/text()")

However, it produces the following output, which seems to get the information for only the first course on the page:

[{"desc": "ANT101H5 Introduction to Biological Anthropology and Archaeology \n                        "},
 {"desc": "Exclusion: "},
 {"desc": "ANT100Y5"},
 {"desc": "Prerequisite: "},
 {"desc": "ANT102H5"}]

Just to be absolutely clear, this output is correct only insofar as it gets the right information for the first course. I need the same information for every course listed on that webpage.

I'm so close but I don't seem to be able to figure out that last step.

I'd appreciate any help... thanks in advance

情绪少女 2024-10-27 12:44:49

The single XPath expression required to select the relevant data for all courses is quite messy, so I am taking another approach here, which can be used (if necessary at all) to produce that single XPath expression:

This simple XSLT transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="p[@class='titlestyle']">
  <xsl:text>
===================
</xsl:text>
  <xsl:value-of select="text()[1]"/>
 </xsl:template>

 <xsl:template match=
  "span/span[@class='title2'][not(position() >1)]">
   <xsl:text>
</xsl:text>
   <xsl:value-of select="."/>
   <xsl:value-of select="following-sibling::a[1]"/>

   <xsl:if test="not(following-sibling::a)">
    <xsl:value-of select="following-sibling::text()[1]"/>
   </xsl:if>
   <xsl:text>
</xsl:text>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

when applied to the page at http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html (tidied up to become a well-formed XML document), produces the wanted result:

===================
Anthropology
===================
ANT101H5 Introduction to Biological Anthropology and Archaeology

Exclusion: ANT100Y5

===================
ANT102H5 Introduction to Sociocultural and Linguistic Anthropology

Exclusion: ANT100Y5

===================
ANT200Y5 World Archaeology and Prehistory

Prerequisite: 101H5

===================
ANT203Y5 Biological Anthropology

Prerequisite: 101H5

===================
ANT204Y5 Sociocultural Anthropology

Prerequisite: 101H5

===================
ANT205H5 Introduction to Forensic Anthropology

Prerequisite: 101H5

===================
ANT206Y5 Culture and Communication: Introduction to Linguistic Anthropology

Exclusion: ANT206H5

===================
ANT241Y5 Aboriginal Peoples of North America

===================
ANT299Y5 Research Opportunity Program

===================
ANT304H5 Anthropology and Aboriginal Peoples

Exclusion: ANT304Y5

===================
ANT306H5 Forensic Anthropology Field School

Prerequisite: ANT205H5

===================
ANT308H5 Case Studies in Archaeological Botany and Zoology

Prerequisite: ANT200Y5

===================
ANT309H5 Southeast Asian Archaeology

Prerequisite: ANT200Y5

===================
ANT310H5 Complex Societies

Prerequisite: ANT200Y5

===================
ANT312H5 Archaeological Analysis

Prerequisite: ANT200Y5

===================
ANT313H5 China, Korea and Japan in Prehistory

Prerequisite: ANT200Y5

===================
ANT314H5 Archaeological Theory

Exclusion: ANT411H5

===================
ANT316H5 South Asian Archaeology

Prerequisite: ANT200Y5

===================
ANT317H5 Archaeology of Eastern North America

Prerequisite: ANT200Y5

===================
ANT318H5 Archaeological Fieldwork

Prerequisite: ANT200Y5

===================
ANT320H5 Archaeological Approaches to Technology

Prerequisite: ANT200Y5

===================
ANT322H5 Anthropology of Youth Culture

Exclusion: ANT204Y5

===================
ANT327H5 Agricultural Origins:  The Second Revolution

Prerequisite: ANT200Y5

===================
ANT331H5 The Biology of Human Sexuality

Exclusion: ANT330H5

===================
ANT332H5 Human Origins

Exclusion: ANT332Y5

===================
ANT333H5 Human Origins II

Exclusion: ANT332Y5

===================
ANT334H5 Human Osteology

Exclusion: ANT334Y5

===================
ANT335H5 Anthropology of Gender

Exclusion: ANT331Y5

===================
ANT336H5 Molecular Anthropology

Prerequisite: ANT203Y5

===================
ANT338H5 Laboratory Methods in Biological Anthropology

Prerequisite: ANT203Y5

===================
ANT339Y5 Human Adaptation through Biological and Cultural Means

Prerequisite: ANT203Y5

===================
ANT340H5 Osteological Theory

Exclusion: ANT334Y5

===================
ANT350H5 Globalization and the Changing World of Work

Prerequisite: ANT204Y5

===================
ANT351H5 Money, Markets, Gifts: Topics in Economic Anthropology

Prerequisite: ANT204Y5

===================
ANT352H5 Power, Authority, and Legitimacy: Topics in Political Anthropology

Prerequisite: ANT204Y5

===================
ANT358H5 Ethnographic Methods

Prerequisite: ANT204Y5

===================
ANT360H5 Anthropology of Religion

Exclusion: ANT209Y5

===================
ANT361H5 Anthropology of Sub-Saharan Africa

Exclusion: ANT212Y5

===================
ANT362H5 Language in Culture and Society

Prerequisite: ANT204Y5

===================
ANT363H5 Magic, Witchcraft and Science

Prerequisite: ANT360H5

===================
ANT364H5 Lab in Social Interaction

Prerequisite: ANT206H5

===================
ANT365H5 Semiotic Anthropology

Prerequisite: ANT204Y5

===================
ANT368H5 World Religions and Ecology

Exclusion: RLG311H5

===================
ANT369H5 Religious Violence and Nonviolence

Exclusion: RLG317H5

===================
ANT397H5 Independent Study

Prerequisite: Permission of Faculty Advisor


===================
ANT398Y5 Independent Reading

Prerequisite: Permission of Faculty Advisor


===================
ANT399Y5 Research Opportunity Program

Prerequisite: P.I.


===================
ANT401H5 Vocal and Visual Communication

Prerequisite: ANT102H5

===================
ANT414H5 People and Plants in Prehistory

Prerequisite: ANT200Y5

===================
ANT415H5 Faunal Archaeo-Osteology

Exclusion: ANT415Y5

===================
ANT416H5 Advanced Archaeological Analysis

Prerequisite: ANT312H5

===================
ANT418H5 Advanced Archaeological Fieldwork

Prerequisite: ANT318H5

===================
ANT430H5 Special Problems in Biological Anthropology and Archaeology

Prerequisite: P.I


===================
ANT430Y5 Special Problems in Biological Anthropology and Archaeology

Prerequisite: P.I. 


===================
ANT431Y5 Special Problems in Sociocultural or Linguistic Anthropology

Prerequisite: P.I.


===================
ANT431H5 Special Problems in Sociocultural or Linguistic Anthropology

Prerequisite: P.I.


===================
ANT432H5 Special Seminar in Anthropology

Prerequisite: P.I.


===================
ANT433H5 Genes, Language, Artifact and Mind

Prerequisite: ANT200Y5

===================
ANT434H5 Palaeopathology

Prerequisite: ANT334Y5

===================
ANT438H5 The Development of Thought in Biological Anthropology

Prerequisite: ANT203Y5

===================
ANT439Y5 Advanced Forensic Anthropology

Prerequisite: ANT205H5

===================
ANT441H5 Advanced Bioarchaeology

Prerequisite: ANT334H5

===================
ANT457H5 Anthropology and the Environment

Prerequisite: ANT102H5

===================
ANT458H5 Anthropology of Crime, Law and Order

Exclusion: ANT204Y5

===================
ANT459H5 The Ethnography of Speaking

Prerequisite: ANT206Y5

===================
ANT460H5 Theory in Sociocultural Anthropology

Prerequisite: ANT204Y5

===================
ANT461H5 Emergent Topics in Socio-Cultural &  Linguistic Anthropology

Prerequisite: ANT204Y5

===================
ANT498H5 Advanced Independent Study

Prerequisite: P.I.


===================
ANT499Y5 Advanced Independent Research

Prerequisite: P.I.
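
For comparison, the same per-course grouping can be sketched in Python with only the standard library. This is a minimal sketch, not the stylesheet author's method: it assumes the page has already been tidied into well-formed XML and that each title2 span is a direct child of the normaltext span that follows its title. The SAMPLE document and the extract_courses name are made up for illustration. Unlike the output above, which shows one label per course, it collects every label it finds.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied excerpt modelled on the page source in the question.
SAMPLE = """<body>
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
  <span class="distribution">(SCI)</span></p>
<span class="normaltext">Course description text.<br/>
<span class="title2">Exclusion: </span><a href="#">ANT100Y5</a><br/>
<span class="title2">Prerequisite: </span><a href="#">ANT102H5</a><br/>
</span>
<p class="titlestyle">ANT241Y5 Aboriginal Peoples of North America</p>
<span class="normaltext">Course description text.<br/></span>
<p class="titlestyle">ANT397H5 Independent Study</p>
<span class="normaltext">Course description text.<br/>
<span class="title2">Prerequisite: </span>Permission of Faculty Advisor<br/>
</span>
</body>"""


def extract_courses(xml_text):
    """Group each title with the labels found in the block that follows it."""
    root = ET.fromstring(xml_text)
    courses = []
    # iter() walks the tree in document order, so a title is always seen
    # before the normaltext block that belongs to it.
    for elem in root.iter():
        cls = elem.get("class")
        if elem.tag == "p" and cls == "titlestyle":
            # elem.text is the first text node, like text()[1] in the XPath.
            courses.append({"title": (elem.text or "").strip(), "requirements": []})
        elif elem.tag == "span" and cls == "normaltext" and courses:
            children = list(elem)
            for i, child in enumerate(children):
                if child.tag != "span" or child.get("class") != "title2":
                    continue
                label = (child.text or "").strip()
                nxt = children[i + 1] if i + 1 < len(children) else None
                if nxt is not None and nxt.tag == "a":
                    value = (nxt.text or "").strip()    # linked course code
                else:
                    value = (child.tail or "").strip()  # plain text after the label
                courses[-1]["requirements"].append(f"{label} {value}".strip())
    return courses


courses = extract_courses(SAMPLE)
for course in courses:
    print(course["title"], course["requirements"])
```

Only the first link after a label is read, so a requirement that lists several courses would need extra handling.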

话少心凉 2024-10-27 12:44:49

Instead of [<int>], try using something like [position() mod <offset> = <base>].

The offset is the distance between the nodes you are interested in. It may be different for @class='titlestyle' and @class='title2'.

ites = hxs.select("(//p[@class='titlestyle'])[position() mod <offset to next to match> = 2]/text()[1] | (//span[@class='title2'])[position() mod <offset to next to match> = 2]/text() | \
                    (//span[@class='title2'])[position() mod <offset to next to match> = 2]/following-sibling::a[1]/text() | (//span[@class='title2'])[position() mod <offset to next to match> = 3]/text() | \
                    (//span[@class='title2'])[position() mod <offset to next to match> = 3]/following-sibling::a[1]/text()")
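
To make the predicate concrete, here is a small Python sketch of what [position() mod <offset> = <base>] keeps. The every_nth helper and the sample list are hypothetical; XPath positions are 1-based, which the sketch mirrors with enumerate(..., start=1).

```python
def every_nth(nodes, offset, base):
    # Keep the items whose 1-based position p satisfies p % offset == base,
    # mirroring the XPath predicate [position() mod offset = base].
    return [node for pos, node in enumerate(nodes, start=1) if pos % offset == base]


# Hypothetical match list: the wanted entries sit at positions 2, 5 and 8.
matches = ["header", "course-1", "x", "y", "course-2", "x", "y", "course-3"]
picked = every_nth(matches, offset=3, base=2)
print(picked)
```

Because p mod offset always lies between 0 and offset - 1, the base has to be smaller than the offset; a wanted position that is an exact multiple of the offset corresponds to a base of 0.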

EDIT: As requested.

Perform each individual xpath one at a time, without constraining on its position.
This is a manual fact-finding exercise to determine the final values to use in the xpath.

Return all nodes matching the following xpath (this is the first one).

ites = hxs.select("(//p[@class='titlestyle'])/text()[1]")

ites will contain some nodes you want for that class and some you do not.

For this one you have already determined that the 2nd node is the first one you want. Now count the distance to the next node in ites that you want this rule to match. This is what we can refer to as <offset to next to match>.

Now repeat the above for each of the remaining xpath searches.
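
The counting step can also be automated. The sketch below is only an illustration under assumptions: find_offset is a hypothetical helper, the sample lists are invented, and the course-code pattern is guessed from codes such as ANT101H5 and RLG311H5. It reports the offset and base when the wanted entries are evenly spaced, and None when they are not, in which case a single mod predicate cannot select exactly those entries.

```python
import re

# Assumed course-code shape, e.g. "ANT101H5" or "RLG311H5".
COURSE_CODE = re.compile(r"[A-Z]{3}\d{3}[HY]\d\b")


def find_offset(texts):
    # 1-based positions of the entries that start with a course code.
    positions = [pos for pos, text in enumerate(texts, start=1)
                 if COURSE_CODE.match(text.strip())]
    # Distances between consecutive wanted entries.
    gaps = {later - earlier for earlier, later in zip(positions, positions[1:])}
    if len(gaps) != 1:
        return None  # fewer than two hits, or uneven spacing
    offset = gaps.pop()
    return offset, positions[0] % offset


# Invented result lists standing in for the text nodes returned by one xpath.
even = ["Anthropology", "ANT101H5 Intro", "note", "ANT102H5 Intro", "note", "ANT200Y5 World"]
uneven = ["ANT101H5 Intro", "ANT102H5 Intro", "note", "note", "ANT200Y5 World"]
print(find_offset(even))
print(find_offset(uneven))
```

Even spacing is necessary but not sufficient: the predicate also matches earlier or later positions with the same remainder, so the result is still worth checking by eye.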

Think of hxs.select("") as a filter: as it walks the xml, every single thing that matches your xpath is returned.

Here is an example http://zvon.org/xxl/XPathTutorial/Output/example22.html
