How can I get the page creation date from a Wikipedia page using Python?
I'm having trouble extracting a specific piece of text from a table: in this example, the date of page creation on a Wikipedia page. For example, at this link:
https://en.wikipedia.org/wiki/United_States?action=info
I'm using BeautifulSoup, but I still get the rest of the text along with it. I need only the date of page creation.
Comments (2)
Grep'ing the page source returns a single line of HTML, so in this case iterating line by line and using a regex would suffice.
But let's stick with BS4, a good tool for the job.
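For reference, the line-by-line regex route could be sketched roughly as below. The SAMPLE_HTML string is a simplified stand-in I made up for the page source (the real markup, attributes, and date will differ), and creation_date_by_regex is a hypothetical helper name, so treat this as an illustration rather than something tested against the live page.

```python
import re

# Simplified, made-up stand-in for the page source; the real markup differs.
SAMPLE_HTML = """\
<table>
<tr><td>Page creator</td><td>ExampleUser</td></tr>
<tr id="mw-pageinfo-firsttime"><td>Date of page creation</td><td><a href="#">00:00, 1 January 2001</a></td></tr>
</table>
"""

# Match the label cell, then capture the raw contents of the cell after it.
ROW_RE = re.compile(r"Date of page creation</td>\s*<td[^>]*>(.*?)</td>")
# Used to strip any inner tags (such as the link) from the captured cell.
TAG_RE = re.compile(r"<[^>]+>")


def creation_date_by_regex(html):
    """Scan line by line and return the text of the cell after the label."""
    for line in html.splitlines():
        match = ROW_RE.search(line)
        if match:
            return TAG_RE.sub("", match.group(1)).strip()
    return None


print(creation_date_by_regex(SAMPLE_HTML))
```

This only works while the label and the value sit on the same line, which is one reason to prefer a real parser.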
1. Iterate
Just loop over the soup.find_all('td') tags until you find one whose td.text matches "Date of page creation". Then ask for the next <td> tag, and its td.text has the timestamp you want.
2. Search for the tag
Take advantage of the "mw-pageinfo-firsttime" id on the <tr> row, telling BS4 to look for that. Read and discard a <td>. Read another datum and return its td.text timestamp.
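Both approaches might look something like the sketch below. It runs against a small made-up SAMPLE_HTML string instead of the live page, so the markup and the date are assumptions; only the "mw-pageinfo-firsttime" id and the "Date of page creation" label come from the page discussed here. The two function names are hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for the page source; the real markup differs.
SAMPLE_HTML = """\
<table>
<tr><td>Page creator</td><td>ExampleUser</td></tr>
<tr id="mw-pageinfo-firsttime"><td>Date of page creation</td><td><a href="#">00:00, 1 January 2001</a></td></tr>
</table>
"""


def creation_date_by_iterating(soup):
    """Approach 1: walk every <td> until the label is found, then take the next <td>."""
    for td in soup.find_all("td"):
        if td.get_text(strip=True) == "Date of page creation":
            next_td = td.find_next_sibling("td")
            return next_td.get_text(strip=True) if next_td else None
    return None


def creation_date_by_id(soup):
    """Approach 2: find the <tr> by its id, skip the label cell, read the second cell."""
    row = soup.find("tr", id="mw-pageinfo-firsttime")
    if row is None:
        return None
    cells = row.find_all("td")
    return cells[1].get_text(strip=True) if len(cells) > 1 else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(creation_date_by_iterating(soup))
print(creation_date_by_id(soup))
```

On the real page you would build the soup from the downloaded HTML instead of SAMPLE_HTML; both functions return None if the row is missing, so a layout change shows up as None instead of an exception.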
There are several tables, but only one has information about the date and time. Fortunately, the date row has a unique id, which makes the job easy.
So find the tr by its id and get its content via the .text property, or first get the second cell of this row and then get its content.
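The difference between the two options can be seen in a small sketch like this one. ROW_HTML is a made-up, simplified row (the date is a placeholder, not the real value); only the id is taken from the actual page.

```python
from bs4 import BeautifulSoup

# Made-up, simplified version of the date row; the real markup differs.
ROW_HTML = (
    '<table><tr id="mw-pageinfo-firsttime">'
    "<td>Date of page creation</td>"
    '<td><a href="#">00:00, 1 January 2001</a></td>'
    "</tr></table>"
)

soup = BeautifulSoup(ROW_HTML, "html.parser")
row = soup.find(id="mw-pageinfo-firsttime")

# Option A: text of the whole row, which still carries the label.
whole_row = row.get_text(" ", strip=True)

# Option B: text of the second cell only, which is just the date.
second_cell = row.find_all("td")[1].get_text(strip=True)

print(whole_row)
print(second_cell)
```

Taking the whole row's text brings the label along with the date, which looks like the "rest of the text" problem from the question, so the second-cell variant is probably the one to use.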