Searching and matching elements across arrays

Posted on 2024-07-12 13:55:40

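I have two tables.

One table has two columns: one contains IDs and the other contains document abstracts about 300-500 words long. There are about 500 rows.

The other table has a single column and more than 18,000 rows. Each cell of that column contains a distinct acronym, such as NGF, EPO, or TPO.

I am interested in a script that will scan each abstract in table 1 and identify one or more acronyms in it that are also present in table 2.

Finally, the program should create a separate table in which the first column contains the content of the first column of table 1 (i.e. the ID), followed by the acronyms found in the document associated with that ID.

Can someone with expertise in Python, Perl, or any other scripting language help?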

楠木可依 2024-07-19 13:55:40

It seems to me that you are trying to join the two tables wherever an acronym appears in an abstract, i.e. (pseudo SQL):

SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(document.abstract)

Given the desired semantics, you can use the most straightforward approach:

acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]

joins = []

for doc_id, abstract in documents:
    for word in abstract.split():
        try:
            # list.index does a linear scan and raises ValueError on a miss
            index = acronyms.index(word)
            joins.append((doc_id, index))
        except ValueError:
            pass  # word is not an acronym

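This is a straightforward implementation; however, its running time is proportional to (number of documents) × (words per abstract) × (number of acronyms), because acronyms.index performs a linear search (of our largest array, no less). For the sizes in the question (about 500 abstracts of 300-500 words against more than 18,000 acronyms), that is a few billion string comparisons in the worst case. We can improve the algorithm by first building a hash index of the acronyms: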

acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]

# Map each acronym to its position so that lookups take constant time on average
index = {acronym: idx for idx, acronym in enumerate(acronyms)}
joins = []

for doc_id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((doc_id, index[word]))
        except KeyError:
            pass  # word is not an acronym
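
One caveat: abstract.split() splits only on whitespace, so an acronym followed by punctuation (for example "NGF,") would be missed. The following is a minimal sketch of the whole task, assuming the two tables are already loaded into Python lists. The function name find_acronyms, the sample data, and the regex tokenizer are hypothetical choices for illustration, not part of the approach above, and matching is case-sensitive.

```python
import re

# Hypothetical sample data standing in for the two tables.
acronyms = ["NGF", "EPO", "TPO"]
documents = [
    (1, "NGF and EPO were measured; NGF levels rose."),
    (2, "No growth factors are discussed here."),
]

# Assumption: a token is a run of letters/digits, so "NGF," yields "NGF".
TOKEN = re.compile(r"[A-Za-z0-9]+")


def find_acronyms(documents, acronyms):
    """Map each document ID to the sorted acronyms found in its abstract."""
    known = set(acronyms)  # hash lookup instead of a linear search
    result = {}
    for doc_id, abstract in documents:
        found = known.intersection(TOKEN.findall(abstract))
        result[doc_id] = sorted(found)
    return result


# One row per ID, followed by the acronyms found in that abstract.
table = find_acronyms(documents, acronyms)
for doc_id, found in table.items():
    print(doc_id, ", ".join(found))
```

Using a set here also collapses repeated mentions, so each acronym is listed once per document; if you need the positions in the acronym list instead, keep the dict index from the answer.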

Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
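
As a rough illustration of the database route, here is a sketch using Python's built-in sqlite3 module. SQLite has no explode() function, so this version splits the abstracts in Python and stores one row per word; the table names, column names, and sample data are all assumptions made for the example.

```python
import re
import sqlite3

# Hypothetical sample data; table and column names are illustrative.
acronyms = ["NGF", "EPO", "TPO"]
documents = [
    (1, "NGF and EPO were measured; NGF levels rose."),
    (2, "TPO is mentioned once."),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acronym (value TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE document_word (doc_id INTEGER, word TEXT)")

conn.executemany("INSERT INTO acronym VALUES (?)", [(a,) for a in acronyms])

# SQLite has no explode(), so split each abstract in Python and store
# one row per (document, word) pair.
rows = [
    (doc_id, word)
    for doc_id, abstract in documents
    for word in re.findall(r"[A-Za-z0-9]+", abstract)
]
conn.executemany("INSERT INTO document_word VALUES (?, ?)", rows)

# The join from the pseudo SQL, with repeated mentions collapsed.
matches = conn.execute(
    """
    SELECT DISTINCT w.doc_id, a.value
    FROM document_word AS w
    JOIN acronym AS a ON a.value = w.word
    ORDER BY w.doc_id, a.value
    """
).fetchall()
print(matches)
```

The primary key on acronym.value gives the database an index to use for the join, which plays the same role as the hash index built by hand.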

