Indexing Wikipedia with Lucene
Is it possible to use Lucene Benchmark to index a Wikipedia dump? I want to be able to execute phrase queries on the latest English Wikipedia pages dump. I have been looking for example use cases, but haven't found any.
I downloaded the latest English dump, named:
enwiki-latest-pages-articles.xml.bz2
Then I ran the command in the terminal:
java org.apache.lucene.benchmark.utils.ExtractWikipedia -i ~/enwiki-latest-pages-articles.xml.bz2
which I believe extracted the pages into a directory labeled "enwiki".

Now, is there something else in the benchmark module that I need to run in order to index the wiki? The README.enwiki does not really give a clear set of instructions; in fact, I'm not even sure whether I was supposed to run the ExtractWikipedia class at all.
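For context, the end goal is phrase search over the indexed articles. Here is a minimal sketch of what I have in mind, written against a recent Lucene version; the index path "work/index" and the field name "body" are assumptions that depend on how the benchmark .alg file is configured:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.FSDirectory;

    public class WikiPhraseSearch {
        public static void main(String[] args) throws Exception {
            // Open the index that the benchmark run produced; the path is an
            // assumption and depends on the directory settings in the .alg file.
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("work/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // A phrase query matches the terms adjacent and in order.
                // "body" is the field the benchmark DocMaker conventionally
                // uses for article text, but verify it against your config.
                PhraseQuery query = new PhraseQuery("body", "free", "encyclopedia");
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println("doc=" + hit.doc + " score=" + hit.score);
                }
            }
        }
    }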
2 Answers
Just run "ant"; I posted a more thorough answer on the Lucene mailing list, but that is basically the gist of it. The build.xml file has a bunch of targets for running benchmarks.
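In case it helps, the indexing itself is driven by an algorithm (.alg) file that the run-task target executes. Below is a rough sketch of what one for the enwiki dump can look like, modeled on the sample files in the benchmark module's conf/ directory; the exact property names, the docs.file path, and whether the .bz2 can be read directly all depend on your Lucene version, so treat them as assumptions and compare against the shipped conf/*.alg files:

    # Sketch of a Wikipedia indexing algorithm file; verify the property
    # names against the conf/ samples in your Lucene checkout.
    analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
    content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource
    docs.file=temp/enwiki-latest-pages-articles.xml.bz2
    content.source.forever=false
    directory=FSDirectory
    doc.stored=true
    doc.tokenized=true

    # Build the index: add documents until the content source is exhausted.
    ResetSystemErase
    CreateIndex
    { AddDoc } : *
    CloseIndex

You would then invoke it from the benchmark directory with something like "ant run-task -Dtask.alg=conf/wikipedia.alg".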
The Wikimedia Foundation has been working on a new project called DiffDb. Using Hadoop, we create the diff between two revisions, and all those diffs are indexed using Lucene. You can find the code on GitHub:
The resulting index for just the English Wikipedia is 1.4 TB, but you can run really cool queries, such as who added "foo" in April 2005, who removed more than 10 KB of text, and so on.