I don't know exactly what you are looking for: method, library, tool?
If you want to process your large datasets really fast with distributed computing, you should check out MapReduce, e.g. using Hadoop on Amazon EC2/S3 services.
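For a sense of what that looks like, here is a minimal Hadoop sketch, assuming tab-separated input lines of the form `setId<TAB>element` (that format and all class names are illustrative, not from the question). It inverts the data so each element lists the sets containing it, then emits every unordered pair of those sets with a count of 1; summing those counts in a second, word-count-style job gives the size of each pairwise intersection.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class IntersectionCount {

    // (offset, "setId\telement") -> (element, setId)
    public static class InvertMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length == 2) {
                ctx.write(new Text(parts[1]), new Text(parts[0]));
            }
        }
    }

    // (element, [setIds...]) -> ("setA,setB", 1) for each unordered pair
    // of sets that share this element; sum per pair in a follow-up job.
    public static class PairReducer
            extends Reducer<Text, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void reduce(Text element, Iterable<Text> setIds, Context ctx)
                throws IOException, InterruptedException {
            List<String> ids = new ArrayList<>();
            for (Text id : setIds) ids.add(id.toString());
            for (int i = 0; i < ids.size(); i++) {
                for (int j = i + 1; j < ids.size(); j++) {
                    String a = ids.get(i), b = ids.get(j);
                    // canonical order so ("s1","s2") and ("s2","s1") merge
                    String pair = a.compareTo(b) < 0 ? a + "," + b
                                                     : b + "," + a;
                    ctx.write(new Text(pair), ONE);
                }
            }
        }
    }
}
```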
Lucene can easily scale to what you need. Solr will probably be easier to set up, and Hadoop is most likely overkill for only a few million data points.
Something you need to think about is which definition of "how intersected" you want to use. If all the sets have the same size I suppose the raw intersection count is enough, but Jaccard distance might make more sense in other contexts; Lucene's default scoring is often good too.
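To make the distinction concrete, here is a tiny sketch (plain Java, names made up) of Jaccard similarity, |A ∩ B| / |A ∪ B|, which normalizes raw overlap for differing set sizes:

```java
import java.util.HashSet;
import java.util.Set;

public class Jaccard {
    // Jaccard similarity: |A ∩ B| / |A ∪ B| (1 minus this is the distance).
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);                   // A ∩ B
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                      // A ∪ B
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> small = Set.of("x", "y", "z");
        Set<String> large = Set.of("y", "z", "a", "b", "c", "d");
        // Raw overlap is 2, but Jaccard penalizes the size mismatch:
        System.out.println(jaccard(small, large)); // 2 / 7 ≈ 0.286
    }
}
```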
My advice would be: try running the default Solr instance on your local workstation (it's a click-and-run jar type of deal). You'll know pretty quickly whether Solr/Lucene will work for you or if you'll have to custom-code your own thing via Hadoop etc.
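Once it's up, a first probe from Java could look like the sketch below, using SolrJ against the default port. The core name "sets" and the multi-valued field "elements" are assumptions for illustration only; the point is that with Lucene's default scoring, documents (sets) sharing more elements with the query simply rank higher.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrProbe {
    public static void main(String[] args) throws Exception {
        // Assumes a local Solr at the default port with a core named "sets".
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/sets").build()) {
            // Sets sharing more of the queried elements score higher.
            SolrQuery query = new SolrQuery("elements:(x y z)");
            query.setRows(10);
            QueryResponse rsp = client.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}
```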