Adapting text search for a graph/molecule comparison algorithm

Posted on 2024-10-11 23:58:59

I'm looking for a text search engine for a non-traditional sort of text search and I want advice on which tool (Lucene, Sphinx, Xapian, or something else) is most appropriate for me, plus pointers on where to get started.

I have molecules represented as graphs (atoms and bonds). I have a way to enumerate all subgraphs of up to size k. To be technical, the inputs are SMILES and the output is the canonical SMARTS for each subgraph along with the number of times that subgraph/SMARTS occurs.

For example, if the input molecule is "CCO" then the canonical results are {"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1}, and if the molecule is "SCO" then the canonical results are {"C": 1, "S": 1, "O": 1, "CS": 1, "OC": 1, "SCO": 1}. These are tiny examples. For a real molecule I get around 500 "words", which look like "CC(C)O", "CCCOCC", "cn" and "cccc(c)O".
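To make the feature extraction concrete, here is a small sketch. It is a simplification under stated assumptions: it only counts linear paths (no branched or cyclic subgraphs), and it canonicalizes a path as the lexicographic minimum of its atom labels read in both directions rather than as canonical SMARTS — so it emits "CO" where the convention above emits "OC". The `path_features` helper and its adjacency-list input format are hypothetical, not from the question.

```python
from collections import Counter

def path_features(atoms, bonds, k):
    """Count all simple paths of up to k atoms, keyed by a canonical
    label string (the lexicographic min of the path read both ways).
    Simplification: real subgraph enumeration also covers branched and
    cyclic fragments, and canonicalizes via SMARTS."""
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    counts = Counter()

    def extend(path):
        if len(path) <= k:
            label = "".join(atoms[i] for i in path)
            counts[min(label, label[::-1])] += 1
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    # Every path of 2+ atoms is discovered from both ends; halve those.
    return {w: (n if len(w) == 1 else n // 2) for w, n in counts.items()}

# "CCO" as a chain of three atoms: C(0)-C(1)-O(2).
print(path_features(["C", "C", "O"], [(0, 1), (1, 2)], k=3))
# → {'C': 2, 'O': 1, 'CC': 1, 'CO': 1, 'CCO': 1}
```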

Looking at molecules as collections of characteristic strings plus counts means I should be able to use a text search tool to do comparisons at the text level, with the hope that they are meaningful at the chemistry level.

For example, I can use cosine similarity, perhaps with tf-idf weighting, and find similar molecules by looking for similar subpatterns. With the "CCO" and "SCO" examples above, the cosine similarity is (2*1 + 1*1 + 1*1) / (sqrt(2*2 + 1*1 + 1*1 + 1*1 + 1*1) * sqrt(6*(1*1))) = 4/sqrt(8*6) = 0.58.
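That arithmetic can be checked with a few lines of Python. The `cosine_similarity` helper over sparse count dicts is hypothetical — it is not part of any of the engines under discussion:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

cco = {"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1}
sco = {"C": 1, "S": 1, "O": 1, "CS": 1, "OC": 1, "SCO": 1}
print(round(cosine_similarity(cco, sco), 2))  # → 0.58
```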

For another example, if I want to find molecules which contain a "CCS" substructure then I can do a fast inverted-index search based on the counts (the molecules must have at least 2 "C"s, at least 1 "CS", and so on) before tackling the NP-complete subgraph isomorphism problem. That is, text-based methods can act as a filter to reject obvious mismatches.
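A minimal sketch of that prefilter, assuming molecules are stored as the count dicts above; the names `build_index` and `candidates` are hypothetical. A molecule passes only if, for every word in the query's feature set, its count is at least the query's — a necessary but not sufficient condition for a substructure match, so survivors still go to the full subgraph-isomorphism check.

```python
from collections import defaultdict

def build_index(molecules):
    """Inverted index: feature word -> {molecule id: count}."""
    index = defaultdict(dict)
    for mol_id, counts in molecules.items():
        for word, n in counts.items():
            index[word][mol_id] = n
    return index

def candidates(index, query_counts):
    """Molecules whose counts dominate the query's on every word --
    a necessary condition for containing the query substructure."""
    result = None
    for word, needed in query_counts.items():
        hits = {m for m, n in index[word].items() if n >= needed}
        result = hits if result is None else result & hits
        if not result:
            break
    return result or set()

mols = {
    "CCO": {"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1},
    "SCO": {"C": 1, "S": 1, "O": 1, "CS": 1, "OC": 1, "SCO": 1},
}
# A query for a "CS" fragment: needs >= 1 "C", >= 1 "S", >= 1 "CS".
print(candidates(build_index(mols), {"C": 1, "S": 1, "CS": 1}))  # → {'SCO'}
```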

I'm trying to figure out which text solutions exist, but it's a bit daunting. I don't need stop words, I don't need stemming, I don't care about word order; I don't need quite a number of the features which exist. I do need the ability to keep word vectors, since it's important to know whether "C" appears 2 times or 3.

Which text search engine is most appropriate for me? It looks like Lucene, especially with the work in Mahout. Can you recommend which parts of the documentation to look at or relevant tutorials? The ones I've found are meant for full-text searches, with stemming and the other features I don't need.

Comments (3)

溺渁∝ 2024-10-18 23:58:59

EDIT: I may have understood this better now.
You want to compare graphs, represented as strings. The strings have "words" which may repeat.
You may use Lucene, in which case I second the suggestion to use Solr.
Basically, each Solr document will consist of a single field; the field will contain the string, which I suggest you unroll: write C C instead of C:2.
If you use a space to separate the words, you can use a WhiteSpaceAnalyzer. If you use another separator, you may need to write a custom analyzer, which is not so hard to do.
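The unrolling step might look like this — a sketch, with the `unroll` helper name being hypothetical; the space-joined output is what a WhitespaceAnalyzer field would then tokenize:

```python
def unroll(counts):
    """Expand a count dict like {"C": 2, "O": 1} into the repeated,
    whitespace-separated token stream a WhitespaceAnalyzer expects."""
    return " ".join(
        word for word, n in sorted(counts.items()) for _ in range(n))

print(unroll({"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1}))
# → C C CC CCO O OC
```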

Is this a good idea? I am not sure. Here's why:

  1. Lucene (and Solr) do not use cosine similarity as such, but rather Lucene Similarity, which mixes cosine, TF/IDF and boolean scoring, with some specific modifications. This works well for most textual use-cases, but may be different than what you need.
  2. Do you need to compare hits from different searches? If you do, that is hard to do with Solr, as it normalizes every search to a maximal value of 1.

I suggest you do try Solr for a small sample of your database. If Solr works for you, fine.
If not, shingling and min-hashes are probably the way to go. Mining of Massive Datasets by Rajaraman and Ullman is a recent free book about these subjects.
I suggest you read it. It covers search for similar strings in mountains of data.
I guess the differentiator is: Do you need a relatively large intersection? If so, use shingling and min-hashes. If not, maybe Solr is enough.
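For the min-hash route, a minimal sketch in Python, assuming plain feature sets (counts dropped). The names are hypothetical, and a real implementation would use a proper hash family rather than salted MD5; the point is only that the fraction of matching signature slots estimates the Jaccard similarity of the word sets:

```python
import hashlib

def minhash_signature(words, num_hashes=128):
    """MinHash signature of a feature-word set: for each of num_hashes
    salted hash functions, keep the minimum hash over the set."""
    sig = []
    for i in range(num_hashes):
        salt = str(i).encode()
        sig.append(min(
            int(hashlib.md5(salt + w.encode()).hexdigest(), 16)
            for w in words))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots estimates Jaccard(A, B)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

cco = {"C", "O", "CC", "OC", "CCO"}
sco = {"C", "S", "O", "CS", "OC", "SCO"}
est = estimated_jaccard(minhash_signature(cco), minhash_signature(sco))
# True Jaccard is |{C, O, OC}| / |union of 8 words| = 3/8 = 0.375;
# the estimate should land near that.
print(est)
```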

给不了的爱 2024-10-18 23:58:59

Hmm... I don't really know what SMARTS is, or how chemical similarity actually works. If you want to use Lucene, first consider using Solr. Since your data is in graphs, you could also take a look at neo4j with the Solr component. Also, might this problem be more closely related to near-duplicate document detection? To help with that there are a number of algorithms: LSH, SpotSigs, shingling, and simhash. Wish I could be of more help.

葬花如无物 2024-10-18 23:58:59

Don't use Lucene. Or Solr. Their internal models are antiquated and cobbled together, although they do a good job. Find an engine whose minimal criteria (if you want to map this inside a text engine) include full BM25F support. If I were after this and wanted scalability, performance, and a low-cost support community, frankly I'd go with SQL Server and cubes. Licensing with SQL Server could be a complete blocker, though. Good luck.
