How to scrape all of the text when using an XPath selector

Posted on 2025-02-10 13:07:16

I ran into a problem where I could not get all of the text while using an XPath selector. This is the element as it appears in the browser's developer tools:

<address class="location-row-address" data-qa-target="provider-office-address">
230 W 13th St Ste 1b<!-- 
--> <!-- 
-->New York<!-- 
-->, <!--
-->NY<!-- 
--> <!-- 
-->10011<!--
--> 
</address>

The XPath selector that I use is

response.xpath('//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address/text()').get()

The result I am getting is

230 W 13th St Ste 1b

The result I am expecting is

230 W 13th St Ste 1b New York, NY 10011

I am using scrapy for scraping. Thank you. Your help is appreciated.

Edit:
The problem above is solved. I used the XPath string() function together with get() to retrieve the complete text of the element node.

response.xpath('string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)').get()

Comments (1)

可爱暴击 2025-02-17 13:07:16

Your XPath expression returns all the text nodes which are children of the address element. There are several text nodes, with comment nodes separating them!

Back in Python land, you are calling the get() method on the result, which returns only the first node of the node-set.

.get() always returns a single result; if there are several matches,
content of a first match is returned; if there are no matches, None is
returned. .getall() returns a list with all results.
https://docs.scrapy.org/en/latest/topics/selectors.html

If you called the getall() method you would retrieve a list of strings, and you could concatenate them to produce the text you want. But a simpler method is to use the XPath function string to get the "string value" of the address element. In the XPath 1.0 spec it defines the string value of an element node this way:

The string-value of an element node is the concatenation of the
string-values of all text node descendants of the element node in
document order.
https://www.w3.org/TR/1999/REC-xpath-19991116/#element-nodes

Applying this function to the address element will return you a single string value, which you can then access using the get() method in Scrapy:

response.xpath(
   'string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)'
).get()