I currently use YQL to extract the contents of Wikipedia pages using XPath.
The XPath expression I currently use is //p.
This expression selects all the paragraph nodes, but it strips out all the child nodes such as <a>, <sup>, and <strong>.
As a result, I get output like this for the Wikipedia football page. Link here
In this output the links are stripped:
From Wikipedia, the free encyclopedia
.For other uses, see
or soccer, , , , , and .Some of the many different games known as
football. From top left to bottom right:
all involve, to varying degrees, a ball with the foot to score a . The
most popular of these sports worldwide is , more commonly known as
just "football" or "soccer". Unqualified, the word applies to
whichever form of football is the most popular in the regional context
in which the word appears, including , , , , , and other related
games. These variations of football are known as football "codes".
.....................and more
Expected output
From Wikipedia, the free encyclopedia
For other uses, see Football (disambiguation).
Some of the many different games known as football. From top left to
bottom right: Association football or soccer, Australian rules
football, International rules football, rugby union, rugby league, and
American football.
Football sports all involve, to varying degrees, kicking a ball with
the foot to score a goal. The most popular of these sports worldwide
is association football, more commonly known as just "football" or
"soccer". Unqualified, the word football applies to whichever form of
football is the most popular in the regional context in which the word
appears, including American football, Australian rules football,
Canadian football, Gaelic football, rugby league, rugby union1 and
other related games. These variations of football are known as
football "codes".
(The bold words are the ones that have links.)
So how can I extract each paragraph along with its child nodes? I am new to XPath.
Comments (1)
The right answer is
//p/descendant-or-self::*
so that both the paragraph (parent) elements and their descendant elements are selected.
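
To illustrate the difference outside YQL, here is a minimal Python sketch using only the standard library. It is not the YQL setup from the question: the HTML fragment is a made-up stand-in for a Wikipedia paragraph, and `own_text` is a hypothetical helper. Note that `xml.etree.ElementTree` does not support the `descendant-or-self` axis in its path syntax, so the sketch uses `Element.iter()`, which visits an element and all of its descendants, as a rough equivalent.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment standing in for Wikipedia paragraphs.
SAMPLE = (
    "<div>"
    "<p>For other uses, see <a href='/wiki/Football_(disambiguation)'>"
    "Football (disambiguation)</a>.</p>"
    "<p>The most popular is <a href='/wiki/Association_football'>"
    "association football</a><sup>1</sup>, known as <strong>soccer</strong>.</p>"
    "</div>"
)

root = ET.fromstring(SAMPLE)
paragraphs = root.findall(".//p")  # ElementTree's spelling of //p


def own_text(p):
    """Return only the text directly inside <p>, skipping child elements.

    This reproduces the symptom from the question: link text disappears.
    """
    parts = [p.text or ""]
    for child in p:
        parts.append(child.tail or "")  # text that follows each child element
    return "".join(parts)


# Text with child elements stripped (what the question is seeing).
stripped = [own_text(p) for p in paragraphs]

# Text of each <p> including the text of all descendant elements.
full = ["".join(p.itertext()) for p in paragraphs]

# Each <p> followed by its descendant elements, similar in spirit to
# //p/descendant-or-self::*
tags = [[el.tag for el in p.iter()] for p in paragraphs]

for s, f, t in zip(stripped, full, tags):
    print("stripped:", s)
    print("full:    ", f)
    print("nodes:   ", t)
```

The first paragraph comes out as "For other uses, see ." when only its own text is read, and as "For other uses, see Football (disambiguation)." when the descendants are included, which matches the stripped and expected outputs shown in the question.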