
How do I parse a specific wiki page and automate it?

Posted 2024-11-01 04:00:42

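I am trying to build a web application that needs to parse one specific Wikipedia page and extract some information that is stored on the page in table format. The extracted data then needs to be stored in a database.

I haven't really done anything like this before. What scripting language should I use for this? I have read a bit, and it looks like Python (using urllib2 and BeautifulSoup) should do the job, but is it the best way to approach the problem?

I know I could also use the WikiMedia API, but is using Python a good idea for general parsing problems?

Also, the tabular data on the Wikipedia page may change, so I need to parse it every day. How do I automate the script for this? And are there any ideas for version control without external tools like svn, so that updates can be easily reverted if need be?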


溺孤伤于心 2024-11-08 04:00:42

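What scripting language should I use for this?

Python will do, as you have tagged your question.

It looks like Python (using urllib2 and BeautifulSoup) should do the job, but is it the best way to approach the problem?

It is workable. I would personally use lxml.etree. An alternative is to fetch the page in its raw format, in which case you have a different parsing task.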
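To illustrate how the raw-format route changes the parsing task, here is a minimal sketch. It assumes the page is on English Wikipedia, that `index.php?action=raw` is used to get the wikitext, and that the data of interest sits in infobox-style `| key = value` lines; a real wikitable (`{| ... |}`) would need different handling. The function names and the User-Agent string are made up for the example.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: English Wikipedia; other wikis expose the same script under their own host.
RAW_ENDPOINT = "https://en.wikipedia.org/w/index.php"

# Matches lines such as "| owner = Stack Exchange, Inc."
FIELD = re.compile(r"^\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")


def raw_url(title):
    """Build the URL that returns a page's wikitext instead of rendered HTML."""
    return RAW_ENDPOINT + "?" + urlencode({"title": title, "action": "raw"})


def fetch_wikitext(title):
    """Download the wikitext (needs network access; nothing calls this on import)."""
    request = Request(raw_url(title), headers={"User-Agent": "wiki-table-sketch/0.1"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


def parse_template_fields(wikitext):
    """Collect non-empty `| key = value` lines into a dict."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD.match(line.strip())
        if match and match.group(2):
            fields[match.group(1)] = match.group(2)
    return fields
```

Calling `parse_template_fields(fetch_wikitext("Stack Overflow"))` would then give a dict of template parameters; the values still contain wiki markup such as `[[links]]`, which is the "different parsing task" in practice.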

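I know I could also use the WikiMedia API, but is using Python a good idea for general parsing problems?

This appears to be a statement followed by an unrelated argumentative question. Subjectively, if I were approaching the problem you are asking about, I would use Python.

Also, the tabular data on the Wikipedia page may change, so I need to parse it every day. How do I automate the script for this?

A Unix cron job.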
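As a rough sketch of what the cron setup could look like: the crontab line in the comment below is hypothetical (the schedule, interpreter path, and file locations are placeholders), and `scrape_and_store` stands in for the real fetch, parse, and database steps. Returning an exit code lets cron's mail or log output show whether a run failed.

```python
import logging

# Hypothetical crontab entry (edit with `crontab -e`); runs every day at 03:00.
# The interpreter and script paths are placeholders:
# 0 3 * * * /usr/bin/python3 /home/me/scrape_wiki.py >> /home/me/scrape_wiki.log 2>&1

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")


def run_once(job):
    """Run one scrape and return a process exit code (0 = success, 1 = failure)."""
    try:
        job()
    except Exception:
        # Log the full traceback so the cron log shows why the run failed.
        logging.exception("daily scrape failed")
        return 1
    logging.info("daily scrape finished")
    return 0


def scrape_and_store():
    """Placeholder for the real fetch, parse, and database steps."""
```

The script itself would end with `sys.exit(run_once(scrape_and_store))` under an `if __name__ == "__main__":` guard, so that cron sees a non-zero status when a run fails.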

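And are there any ideas for version control without external tools like svn, so that updates can be easily reverted if need be?

A Subversion repository can run on the same machine as the script you have written. Alternatively, you could use a distributed version control system such as git.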
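If running a version control system is not wanted at all, one alternative (not what this answer proposes, just a sketch under that assumption) is to keep each day's extracted rows as a timestamped snapshot in the database itself. The example uses the standard-library `sqlite3` module; the table layout and function names are invented for illustration.

```python
import hashlib
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the snapshot store; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " taken_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
        " digest TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    return conn


def save_snapshot(conn, rows):
    """Store rows only if they differ from the latest snapshot; return True if stored."""
    payload = json.dumps(rows, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    latest = conn.execute(
        "SELECT digest FROM snapshots ORDER BY id DESC LIMIT 1"
    ).fetchone()
    if latest and latest[0] == digest:
        return False  # nothing changed since the last run
    conn.execute("INSERT INTO snapshots (digest, payload) VALUES (?, ?)", (digest, payload))
    conn.commit()
    return True


def load_snapshot(conn, steps_back=0):
    """Return the newest snapshot, or an older one when steps_back > 0."""
    row = conn.execute(
        "SELECT payload FROM snapshots ORDER BY id DESC LIMIT 1 OFFSET ?", (steps_back,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Reverting a bad update then amounts to reading `load_snapshot(conn, 1)` instead of the newest row. A git or Subversion repository gives the same ability with diffs and history tooling included.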

Curiously, you have not mentioned what you plan to do with the data.


池予 2024-11-08 04:00:42

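Yes, Python is an excellent choice for web scraping.

Wikipedia updates its content often but its structure rarely. If the table has something unique, such as an ID, you can extract the data more confidently.

Here is a simple example that scrapes Wikipedia using the webscraping library: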

from webscraping import common, download, xpath

# Fetch the page HTML
html = download.Download().fetch('http://en.wikipedia.org/wiki/Stackoverflow')
attributes = {}
# Walk every table row and keep the ones that have a header cell
for tr in xpath.search(html, '//table//tr'):
    th = xpath.get(tr, '/th')
    if th:
        td = xpath.get(tr, '/td')
        # Map the cleaned header text to the cleaned cell text
        attributes[common.clean(th)] = common.clean(td)
print(attributes)

And here is the output:

{'Commercial?': 'Yes', 'Available language(s)': 'English', 'URL': 'stackoverflow.com', 'Current status': 'Online', 'Created by': 'Joel Spolsky and Jeff Atwood', 'Registration': 'Optional; Uses OpenID', 'Owner': 'Stack Exchange, Inc.', 'Alexa rank': '160[1]', 'Type of site': 'Question & Answer', 'Launched': 'August 2008'}
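The point about a unique ID can be sketched with the standard library alone. This is only an illustration: it uses `html.parser` instead of the library above, the id `facts` is hypothetical, and a real Wikipedia table may only be identifiable by a class or by its position on the page.

```python
from html.parser import HTMLParser


class TableById(HTMLParser):
    """Collect header/value pairs from rows of the <table> with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0      # > 0 while inside the target table (counts nested tables)
        self.cell = None    # "th" or "td" while inside such a cell
        self.row = {"th": "", "td": ""}
        self.pairs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth and tag == "tr":
            self.row = {"th": "", "td": ""}
        elif self.depth and tag in ("th", "td"):
            self.cell = tag

    def handle_data(self, data):
        if self.depth and self.cell:
            self.row[self.cell] += data

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("th", "td"):
            self.cell = None
        elif tag == "tr" and self.depth:
            key, value = self.row["th"].strip(), self.row["td"].strip()
            if key and value:  # skip rows without both a header and a value
                self.pairs[key] = value
```

After `parser = TableById("facts")` and `parser.feed(html)`, the extracted pairs are in `parser.pairs`. Rows from any other table on the page are ignored, which is what makes the extraction more robust when unrelated tables are added or reordered.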

