Search by topic and extract keywords from Wikipedia articles
I'm working on a Java project in which I have to process a Wikipedia dump file, and I'm looking for a library that extracts keywords from Wikipedia articles. Basically, I want to read every page tag in the Wikipedia XML dump, compare it against a list of topics and categories, and, if it matches, add it to my results. I'm not interested in reading the dump or writing out results; I only want to know of a library that lets me search by topic in the title and text of a Wikipedia article. For example, if the input is "dog", I want the Wikipedia articles about dogs and, if possible, any page under the dog categories.
It doesn't matter whether the library is general-purpose or specific to Wikipedia. I need to pass the wikitext as an argument and receive a list of keywords, including categories. I've found some Wikipedia libraries that work fine, such as Wikipedia-Miner and the Java Wikipedia Library, but the first requires MySQL to be installed, and I want to analyze the text without saving it to a database.
Any kind of help or suggestion is welcome. :)
Comments (1)
It looks like this is your best bet: Java Wikipedia Library
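If no database is wanted at all, the filtering step described in the question can also be sketched with only the JDK. The following is a rough illustration, not the API of Wikipedia-Miner or the Java Wikipedia Library: the class and method names (WikiTopicFilter, extractCategories, matchesTopic, matchingTitles) are hypothetical, it assumes the usual page/title/revision/text layout of the XML dump and the English "Category:" prefix, and a case-insensitive substring test stands in for real keyword extraction.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams a dump and keeps pages whose title or
// categories mention a topic. Nothing is stored in a database.
class WikiTopicFilter {

    // Matches [[Category:Name]] and [[Category:Name|sort key]] in wikitext.
    private static final Pattern CATEGORY = Pattern.compile(
            "\\[\\[\\s*Category\\s*:\\s*([^\\]|]+?)\\s*(?:\\|[^\\]]*)?\\]\\]",
            Pattern.CASE_INSENSITIVE);

    // Returns the category names linked from one article's wikitext.
    static List<String> extractCategories(String wikitext) {
        List<String> categories = new ArrayList<>();
        Matcher m = CATEGORY.matcher(wikitext);
        while (m.find()) {
            categories.add(m.group(1).trim());
        }
        return categories;
    }

    // Assumption: a page "matches" when the topic occurs, ignoring case,
    // in its title or in any of its category names.
    static boolean matchesTopic(String title, String wikitext, String topic) {
        String needle = topic.toLowerCase(Locale.ROOT);
        if (title.toLowerCase(Locale.ROOT).contains(needle)) {
            return true;
        }
        for (String category : extractCategories(wikitext)) {
            if (category.toLowerCase(Locale.ROOT).contains(needle)) {
                return true;
            }
        }
        return false;
    }

    // Streams the dump with StAX and returns the titles of matching pages.
    static List<String> matchingTitles(Reader dump, String topic) {
        List<String> hits = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader xml = factory.createXMLStreamReader(dump);
            String title = "";
            String text = "";
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if (name.equals("page")) {
                        title = "";
                        text = "";
                    } else if (name.equals("title")) {
                        title = xml.getElementText();
                    } else if (name.equals("text")) {
                        text = xml.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && xml.getLocalName().equals("page")) {
                    if (matchesTopic(title, text, topic)) {
                        hits.add(title);
                    }
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the dump", e);
        }
        return hits;
    }
}
```

Calling matchingTitles with a Reader over the decompressed dump and the topic "dog" would collect every page whose title or category names contain "dog". Because this is a plain substring test, a title such as "Dogma" would also match, so a real project would likely want word-boundary matching or a proper keyword extractor on top of this.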