First of all, this is a follow-up to my previous question. I am posting it again because the person whose answer I accepted in the original post advised me to, as he felt the question had not been properly defined. Here goes attempt 2:
I am trying to extract information from this webpage. For clarity, here is a block selected from the page source:
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span></p>
<span class='normaltext'>
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [<span class='Helpcourse'
onMouseover="showtip(this,event,'24 Lectures')"
onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
onMouseover="showtip(this,event,'12 Tutorials')"
onMouseout="hidetip()">12T</span>]<br>
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br>
From the sample block above, I would like to extract the following information:
ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5
Prerequisite: ANT102H5
I would like to get all such information from the webpage (keep in mind that some courses may additionally list a "Corequisite", and some may list no prerequisites, corequisites, or exclusions at all).
I have been trying to write an appropriate XPath expression for this task, but I cannot seem to get it quite right.
Thus far, with the help of Dimitre Novatchev, I have been able to use the following expression:
sites = hxs.select("(//p[@class='titlestyle'])[2]/text()[1] | (//span[@class='title2'])[2]/text() | \
(//span[@class='title2'])[2]/following-sibling::a[1]/text() | (//span[@class='title2'])[3]/text() | \
(//span[@class='title2'])[3]/following-sibling::a[1]/text()")
However, it produces the following output, which seems to get the information for only the first course on the page:
[{"desc": "ANT101H5 Introduction to Biological Anthropology and Archaeology \n "},
{"desc": "Exclusion: "},
{"desc": "ANT100Y5"},
{"desc": "Prerequisite: "},
{"desc": "ANT102H5"}]
Just to be absolutely clear, this output is correct only insofar as it gets the right information for the first course. I need the same information for every course listed on that webpage.
I'm so close but I don't seem to be able to figure out that last step.
I'd appreciate any help... thanks in advance
Comments (2)
The required single XPath expression to select the relevant data for all courses is quite messy, so here I am taking another approach, which can be used (if necessary at all) to produce that single XPath expression:
This simple XSLT transformation:
when applied on the page at: http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html (tidied up to become a well-formed XML document), produces the wanted result:
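A per-course approach of this kind can also be sketched in Python. The snippet below is an illustration only and is not the XSLT transformation from this answer: it assumes the page has already been tidied into well-formed XML, uses a shortened stand-in for the real markup (the second course, XYZ100H5, is hypothetical), and walks the elements in document order so that each title2 label and the links after it are attached to the most recent titlestyle paragraph. The names SAMPLE and extract_courses are made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened, tidied stand-in for the page source (an assumption, not the real page).
# The second course is hypothetical and lists no requisites at all.
SAMPLE = """<body>
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class="distribution">(SCI)</span></p>
<span class="normaltext">Course description goes here.<br/>
<span class="title2">Exclusion: </span><a href="#">ANT100Y5</a><br/>
<span class="title2">Prerequisite: </span><a href="#">ANT102H5</a><br/>
</span>
<p class="titlestyle">XYZ100H5 Hypothetical Course Without Requisites
<span class="distribution">(SCI)</span></p>
<span class="normaltext">Another description.<br/></span>
</body>"""


def extract_courses(xml_text):
    """Group titles, labels and linked course codes per course."""
    root = ET.fromstring(xml_text)
    courses = []
    label = None  # most recent "Exclusion" / "Prerequisite" / ... label
    for el in root.iter():  # iter() yields elements in document order
        cls = el.get("class")
        if el.tag == "p" and cls == "titlestyle":
            # A new course starts; collapse whitespace in the title text.
            courses.append({"title": " ".join((el.text or "").split())})
            label = None
        elif el.tag == "span" and cls == "title2":
            label = (el.text or "").strip().rstrip(":")
        elif el.tag == "a" and courses and label:
            # A link following a label belongs to that label of the current course.
            courses[-1].setdefault(label, []).append((el.text or "").strip())
    return courses


courses = extract_courses(SAMPLE)
for course in courses:
    print(course)
```

Because the grouping is driven by document order rather than by fixed positions, a course with a "Corequisite" label or with no labels at all needs no special handling.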
Instead of [<int>], try something like [position() mod <offset> = <base>].
Here <offset> is the distance between consecutive nodes you are interested in, and <base> is the remainder that identifies the first of them. The offset may be different for @class='titlestyle' and @class='title2'.
EDIT: As requested.
Run each individual XPath one at a time, without constraining it by position.
This is a manual fact-finding exercise to determine the final values to use in the XPath.
Start by returning all nodes that match the first XPath.
sites will contain some nodes you want for that class and some you do not. You have already determined that, for this one, the 2nd node is the first one you want. Now count the distance to the next node in sites that you want this rule to match. This is what we can refer to as <offset to next to match>.
Now repeat the above for each of the remaining XPath searches.
Think of hxs.select("") as a filter: as it walks the XML, every single thing that matches your XPath is returned.
Here is an example http://zvon.org/xxl/XPathTutorial/Output/example22.html
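To make the arithmetic concrete, here is a small Python sketch of what a [position() mod <offset> = <base>] predicate selects. It works on a plain list standing in for the unconstrained match results; the list contents and the helper name every_nth are invented for illustration. In XPath itself the predicate would normally be applied to the parenthesised node-set, for example (//span[@class='title2'])[position() mod <offset> = <base>], since otherwise position() counts among the siblings of each parent rather than across the whole document.

```python
def every_nth(nodes, offset, base):
    """Mimic the XPath predicate [position() mod offset = base] on a list.

    XPath positions are 1-based, so enumeration starts at 1.
    """
    return [node for pos, node in enumerate(nodes, start=1) if pos % offset == base]


# Invented stand-in for the unconstrained matches: the 2nd item is the first
# wanted one, and the next wanted item is 3 positions further on.
matches = ["header", "course A", "note", "filler",
           "course B", "note", "filler", "course C"]

wanted = every_nth(matches, offset=3, base=2)
print(wanted)
```

This only works if the wanted nodes really are evenly spaced; if some courses carry extra or missing entries, the spacing breaks and a per-course grouping is the safer route.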