Mining Wikipedia for mapping relations for text mining

Posted on 2024-11-10 13:05:19

I am planning to develop a web-based application that crawls Wikipedia to find relations and stores them in a database. By relations, I mean searching for a name, say 'Bill Gates', finding his page, downloading it, extracting various pieces of information from the page, and storing them in a database. The information may include his date of birth, his company, and a few other things. But I need to know whether there is a way to identify these specific pieces of data on the page so that I can store them in a database. Any specific books or algorithms would be greatly appreciated. Pointers to good open-source libraries would also be helpful.

Thank you

Comments (3)

且行且努力 2024-11-17 13:05:19

If you haven't already, you should have a look at DBpedia. Many categories of Wikipedia articles have "infoboxes" for the kinds of information you describe, and the DBpedia project has made a database out of them:

http://en.wikipedia.org/wiki/DBpedia
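To make the DBpedia suggestion a bit more concrete, here is a minimal sketch of how a birth-date lookup might be phrased as a SPARQL query. It only builds the request URL and does not send it. The endpoint address, the `dbo:birthDate` property, and the JSON format parameter reflect my understanding of the public DBpedia service, so treat them as assumptions to check; the function names are made up for this example.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint (assumed to still be served at this address).
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_birth_date_query(resource_name: str) -> str:
    """Return a SPARQL query asking for the birth date of one DBpedia resource.

    resource_name is the last part of the resource URI, e.g. "Bill_Gates".
    Names containing characters such as parentheses would need escaping,
    which this sketch does not attempt.
    """
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dbr: <http://dbpedia.org/resource/>\n"
        f"SELECT ?birthDate WHERE {{ dbr:{resource_name} dbo:birthDate ?birthDate }}"
    )


def build_request_url(resource_name: str) -> str:
    """Encode the query as a GET URL that asks for JSON results."""
    params = {
        "query": build_birth_date_query(resource_name),
        "format": "application/sparql-results+json",
    }
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_request_url("Bill_Gates"))
```

The resulting URL could be fetched with any HTTP client; the point is that the infobox data is already structured, so no page scraping is needed for fields DBpedia covers.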

You might also leverage some of the information in Metaweb's Freebase (which overlaps with, and I believe may even integrate, the info from DBpedia). They have an API for querying their graph database, and there's a Python wrapper for it called freebase-python.

UPDATE: Freebase is no more; Metaweb was acquired by Google and Freebase was eventually folded into the Google Knowledge Graph. There is an API, but I don't think they have anything like the formal syncing Freebase had with public sources like Wikipedia. I'm personally disappointed in how this looks to have turned out. :-/

As for the natural language processing bit, if you do make headway on that problem, you might consider these databases as repositories for any information you do mine.

絕版丫頭 2024-11-17 13:05:19

You mention Python and open source, so I would investigate NLTK (the Natural Language Toolkit). Text mining and natural language processing are among those areas where you can do a lot with a dumb algorithm (e.g. pattern matching). But if you want to go a step further and do something more sophisticated, i.e. extract information that is stored in a flexible manner, or find information that might be interesting but is not known a priori, then natural language processing should be investigated.
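As an illustration of the "dumb algorithm" end of that spectrum, here is a rough sketch that pulls a birth date out of infobox wikitext with a single regular expression. The sample wikitext is a made-up excerpt, and the pattern only assumes the common `{{birth date|Y|M|D}}` and `{{birth date and age|Y|M|D}}` template forms; the function name is hypothetical.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical excerpt of infobox wikitext, made up for this example.
SAMPLE_WIKITEXT = """{{Infobox person
| name        = Bill Gates
| birth_date  = {{birth date and age|1955|10|28}}
| birth_place = Seattle, Washington, U.S.
}}"""

# Matches {{birth date|Y|M|D}} and {{birth date and age|Y|M|D}}, ignoring case.
BIRTH_DATE_RE = re.compile(
    r"\{\{\s*birth date(?: and age)?\s*\|\s*(\d{4})\s*\|\s*(\d{1,2})\s*\|\s*(\d{1,2})",
    re.IGNORECASE,
)


def extract_birth_date(wikitext: str) -> Optional[date]:
    """Return the first birth date found in the wikitext, or None."""
    match = BIRTH_DATE_RE.search(wikitext)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


if __name__ == "__main__":
    print(extract_birth_date(SAMPLE_WIKITEXT))
```

A template written with an extra argument before the year (for example `|df=yes|`) would already slip past this pattern, which is exactly the kind of brittleness that motivates going beyond plain pattern matching.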

NLTK is intended for teaching, so it is organized as a toolkit. This approach suits Python very well. There are a couple of books about it as well. The O'Reilly book is also published online under an open license. See NLTK.org

Bonjour°[大白 2024-11-17 13:05:19

Jvc, there are existing Python modules that can do everything you mentioned above.

For pulling information from web pages, I like to use Selenium, http://seleniumhq.org/projects/ide/. Basically, you can locate and retrieve information on any web page using a number of identifiers (id, XPath, etc.).

However, like winwaed said, it can be inflexible if you are simply "pattern matching", especially since some websites use dynamic code, meaning the identifiers can change with each reload of the page. But this problem can be solved by adding regular expressions, e.g. (.*), to your code. Check out this YouTube video, http://www.youtube.com/watch?v=Ap_DlSrT-iE. Even though he is using BeautifulSoup to scrape the website, you can see how he uses regular expressions to pull the information from the page.
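Here is a small sketch of the regular-expression idea for identifiers that change between reloads. To stay dependency-free it uses the standard library's `html.parser` rather than Selenium or BeautifulSoup, so it is a stand-in for those tools, not how either of them works. The markup, the id naming scheme, and the class and function names are all invented for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: the numeric suffix of each id changes on every reload.
SAMPLE_HTML = (
    '<div><span id="bday-48213">1955-10-28</span>'
    '<span id="org-7">Microsoft</span></div>'
)


class DynamicIdExtractor(HTMLParser):
    """Collect the text of elements whose id attribute matches a regex.

    Simplification: void elements such as <br> inside a matching element
    are not handled, because they have no closing tag to balance the depth.
    """

    def __init__(self, id_pattern: str):
        super().__init__()
        self._id_re = re.compile(id_pattern)
        self._depth = 0  # greater than 0 while inside a matching element
        self.values = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1  # nested tag inside a matching element
        elif self._id_re.fullmatch(dict(attrs).get("id") or ""):
            self._depth = 1
            self.values.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.values[-1] += data


def extract_by_id(html: str, id_pattern: str) -> list:
    """Return the text content of every element whose id matches id_pattern."""
    parser = DynamicIdExtractor(id_pattern)
    parser.feed(html)
    parser.close()
    return parser.values


if __name__ == "__main__":
    print(extract_by_id(SAMPLE_HTML, r"bday-\d+"))
```

Matching on the stable prefix (`bday-`) and letting `\d+` absorb the changing part is usually safer than a bare `(.*)`, which can match far more than intended.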

Also, I'm not sure what type of database you are working with, but pyodbc, http://code.google.com/p/pyodbc/, can work with SQL databases as well as mainstream products like Microsoft Access.
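For the storage side, a rough sketch of what saving extracted relations could look like. It uses the standard library's `sqlite3` as a stand-in so that it runs without a configured ODBC data source; pyodbc follows the same Python DB-API style with `?` parameter markers, but the table layout, the function names, and the SQLite-specific `INSERT OR IGNORE` are assumptions of this example, not something pyodbc prescribes.

```python
import sqlite3


def create_store(conn):
    """Create a simple subject/relation/value table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts ("
        " subject TEXT NOT NULL,"
        " relation TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (subject, relation, value))"
    )


def save_fact(conn, subject, relation, value):
    """Insert one fact; an exact duplicate is silently skipped."""
    # Parameter markers keep scraped text out of the SQL string itself.
    conn.execute(
        "INSERT OR IGNORE INTO facts (subject, relation, value) VALUES (?, ?, ?)",
        (subject, relation, value),
    )
    conn.commit()


def facts_about(conn, subject):
    """Return (relation, value) pairs for one subject, sorted for stable output."""
    rows = conn.execute(
        "SELECT relation, value FROM facts WHERE subject = ? ORDER BY relation, value",
        (subject,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_store(connection)
    save_fact(connection, "Bill Gates", "birth_date", "1955-10-28")
    save_fact(connection, "Bill Gates", "company", "Microsoft")
    print(facts_about(connection, "Bill Gates"))
```

A single three-column table like this keeps the schema independent of which infobox fields happen to exist for a given page, at the cost of storing every value as text.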

So, my advice is to look into Selenium for finding the info on the webpage, pyodbc to store and retrieve it, and regular expressions when the identifiers are dynamic.
