How do I get the page creation date from a Wikipedia page using Python?

Posted on 2025-02-10 07:53:15

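I have a question about how to get one specific piece of text out of a table: in this example, the page creation date on a Wikipedia page. For example, at this link

[...]= info

I am using BeautifulSoup, but I am still having trouble because the rest of the text comes along with it. I only need the page creation date.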

花伊自在美 2025-02-17 07:53:15

Grep'ing returns a single line of HTML:

$ curl -s 'https://en.wikipedia.org/wiki/United_States?action=info' |
    grep --color 'Date of page creation'

In this case iterating line by line and using a regex would suffice.
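
As a rough sketch of that line-by-line idea, something like the following could work. It assumes the label and the timestamp sit on the same line of HTML, as the grep output suggests. SAMPLE_HTML is a made-up, simplified stand-in for the info page (the dates are placeholders), and creation_date_from_lines is a hypothetical helper name, not part of any library.

```python
import re

# Hypothetical, simplified stand-in for a few lines of the ?action=info page.
# The live markup may differ, and the dates are placeholders.
SAMPLE_HTML = """\
<tr id="mw-pageinfo-creator"><td>Page creator</td><td>Example</td></tr>
<tr id="mw-pageinfo-firsttime"><td>Date of page creation</td><td><a href="/w/index.php?oldid=1">12:00, 1 January 2001</a></td></tr>
<tr id="mw-pageinfo-lasttime"><td>Date of latest edit</td><td><a href="/w/index.php?oldid=2">08:30, 2 February 2025</a></td></tr>
"""

LABEL = "Date of page creation"
TAG = re.compile(r"<[^>]+>")  # matches any single HTML tag


def creation_date_from_lines(html):
    """Return the text that follows the label on the first matching line, else None."""
    for line in html.splitlines():
        if LABEL in line:
            # Strip the tags, then keep only what comes after the label.
            text = TAG.sub("", line)
            return text.split(LABEL, 1)[1].strip()
    return None


print(creation_date_from_lines(SAMPLE_HTML))
```

This only holds up as long as the row stays on one line, which is one reason to prefer a real parser.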

But let's stick with BS4, a good tool for the job.

1. iterate

Just loop over the soup.find_all('td') tags
until you find one whose td.text
matches "Date of page creation".
Then ask for the next <td> tag;
its td.text has the timestamp you want.
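
The loop described above might look roughly like this. It is a sketch that runs against a made-up, simplified SAMPLE_HTML rather than the live page (the dates are placeholders), and creation_date_by_iteration is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the info table; the live markup may differ.
SAMPLE_HTML = """
<table>
<tr id="mw-pageinfo-creator"><td>Page creator</td><td>Example</td></tr>
<tr id="mw-pageinfo-firsttime"><td>Date of page creation</td><td><a href="/w/index.php?oldid=1">12:00, 1 January 2001</a></td></tr>
<tr id="mw-pageinfo-lasttime"><td>Date of latest edit</td><td><a href="/w/index.php?oldid=2">08:30, 2 February 2025</a></td></tr>
</table>
"""


def creation_date_by_iteration(html):
    """Scan every <td>; when the label cell is found, return the next cell's text."""
    soup = BeautifulSoup(html, "html.parser")
    for td in soup.find_all("td"):
        if td.get_text(strip=True) == "Date of page creation":
            value_cell = td.find_next_sibling("td")
            return value_cell.get_text(strip=True) if value_cell else None
    return None  # label not present on this page


print(creation_date_by_iteration(SAMPLE_HTML))
```

Using find_next_sibling keeps the lookup inside the same row, so a label in the last cell of a row cannot accidentally pick up a cell from the next row.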

2. search for tag

Take advantage of the "mw-pageinfo-firsttime" id
on the <tr> row, telling BS4 to look for that.
Skip the row's first <td>, which holds the label.
Read the second <td> and return its td.text timestamp.
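
A minimal sketch of the id-based lookup, with a guard for pages where the row is missing. It assumes the row has exactly the label cell and the value cell. SAMPLE_HTML is again a made-up, simplified stand-in with placeholder dates, and creation_date_by_id is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the info table; the live markup may differ.
SAMPLE_HTML = """
<table>
<tr id="mw-pageinfo-firsttime"><td>Date of page creation</td><td><a href="/w/index.php?oldid=1">12:00, 1 January 2001</a></td></tr>
<tr id="mw-pageinfo-lasttime"><td>Date of latest edit</td><td><a href="/w/index.php?oldid=2">08:30, 2 February 2025</a></td></tr>
</table>
"""


def creation_date_by_id(html, row_id="mw-pageinfo-firsttime"):
    """Find the row by its id, discard the label cell, return the value cell's text."""
    soup = BeautifulSoup(html, "html.parser")
    row = soup.find(id=row_id)
    if row is None:
        return None  # no such row on this page
    cells = row.find_all("td")
    if len(cells) < 2:
        return None  # row does not have the expected label/value layout
    _label, value = cells[:2]
    return value.get_text(strip=True)


print(creation_date_by_id(SAMPLE_HTML))
```

Returning None instead of raising makes it easier to run the same helper over many pages and skip the ones that lack the row.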

奶气 2025-02-17 07:53:15

There are several tables on the page, but only one of them holds the date and time information. Fortunately, the row with the date has a unique id, which makes the job easy.
So find the <tr> by its id and read its content through the .text property, or first take the second cell of that row and then read its content.

from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/United_States?action=info'
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.content, 'html.parser')

# Each row holds a label cell and a value cell; index 1 is the value cell.
first_time = soup.find(id='mw-pageinfo-firsttime').find_all('td')[1]
last_time = soup.find(id='mw-pageinfo-lasttime').find_all('td')[1]

print(first_time.text)  # date of page creation
print(last_time.text)   # date of latest edit
