Searching and matching elements across arrays

Posted on 2024-07-12 13:55:40

I have two tables.

The first table has two columns: one holds an ID and the other holds a document abstract of about 300-500 words. There are about 500 rows.

The other table has only one column and >18000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO etc.

I am interested in a script that will scan each abstract in table 1 and identify any acronyms in it that are also present in table 2.

Finally, the program should create a separate table in which the first column contains the content of the first column of table 1 (i.e. the ID) and the second contains the acronyms found in the document associated with that ID.

Can someone with expertise in Python, Perl or any other scripting language help?

Comments (3)

楠木可依 2024-07-19 13:55:40

It seems to me that you are trying to join the two tables wherever an acronym appears in an abstract, i.e. (pseudo-SQL):

SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(document.abstract)

Given the desired semantics, you can use the most straightforward approach:

acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]

# Each entry is (document ID, position of the matched acronym in `acronyms`).
joins = []

for doc_id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)  # linear search through the list
            joins.append((doc_id, index))
        except ValueError:
            pass  # word is not an acronym

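This is a straightforward implementation; however, it is slow, because acronyms.index performs a linear search (of our largest array, no less) for every word. The running time grows with the number of documents times the words per abstract times the number of acronyms, which for the tables described is roughly 500 × 400 × 18,000 ≈ 3.6 billion comparisons in the worst case. We can improve the algorithm by first building a hash index of the acronyms: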

acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]

# Map each acronym to its position in the list for constant-time lookup.
index = {acronym: idx for idx, acronym in enumerate(acronyms)}
joins = []

for doc_id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((doc_id, index[word]))
        except KeyError:
            pass  # word is not an acronym
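
One caveat with the loops above: abstract.split() splits on whitespace only, so a word like "NGF," or "(EPO)" keeps its punctuation and will not match. Here is a rough sketch of the same hash-lookup idea that tokenizes with a regular expression and reads and writes CSV. The function names (find_acronyms, match_tables), the CSV layout (ID in the first column, abstract in the second, one acronym per row) and case-sensitive matching are all my assumptions, not something the question specifies.

```python
import csv
import re

# Runs of ASCII letters/digits; this drops punctuation so "NGF," and
# "(EPO)" become "NGF" and "EPO". (Assumption: acronyms are alphanumeric.)
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def find_acronyms(abstract, acronym_set):
    """Return the known acronyms occurring in `abstract`, sorted, no duplicates."""
    tokens = set(TOKEN_RE.findall(abstract))
    return sorted(tokens & acronym_set)  # case-sensitive set intersection


def match_tables(documents_file, acronyms_file, output_file):
    """Read both tables as CSV and write one row per document: ID, acronyms found.

    All three arguments are open file objects (hypothetical layout: the
    documents file has ID, abstract; the acronyms file has one acronym per row).
    """
    acronym_set = {row[0].strip() for row in csv.reader(acronyms_file) if row}
    writer = csv.writer(output_file)
    for row in csv.reader(documents_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        doc_id, abstract = row[0], row[1]
        found = find_acronyms(abstract, acronym_set)
        writer.writerow([doc_id, " ".join(found)])
```

To run it on real files you would open them with open(path, newline="") and pass the file objects in; the paths are whatever your exported tables are called. Using a set rather than a dict is enough here because the output needs the acronym text itself, not its row position.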

Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
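
As an illustration of the database route, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. SQLite has no explode(), so this version splits each abstract into words in Python and stores them in a helper table; the table names, column names and the function name join_with_sqlite are hypothetical, and the whitespace-only splitting is kept from the loops above.

```python
import sqlite3


def join_with_sqlite(documents, acronyms):
    """Return sorted (document_id, acronym) pairs using an SQL join.

    `documents` is an iterable of (id, abstract) tuples and `acronyms`
    an iterable of strings, mirroring the lists used earlier.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE word (document_id INTEGER, word TEXT)")
    conn.execute("CREATE TABLE acronym (value TEXT)")

    # "Explode" each abstract into one row per word (whitespace split only).
    conn.executemany(
        "INSERT INTO word VALUES (?, ?)",
        ((doc_id, w) for doc_id, abstract in documents for w in abstract.split()),
    )
    conn.executemany("INSERT INTO acronym VALUES (?)", ((a,) for a in acronyms))

    # Plain equality join; SQLite compares TEXT case-sensitively by default.
    rows = conn.execute(
        """
        SELECT DISTINCT word.document_id, acronym.value
        FROM word
        JOIN acronym ON acronym.value = word.word
        ORDER BY word.document_id, acronym.value
        """
    ).fetchall()
    conn.close()
    return rows
```

For tables of this size an index is probably unnecessary, but on larger data a CREATE INDEX on acronym(value) would be the usual next step.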

幻想少年梦 2024-07-19 13:55:40

Thanks a lot for the quick response.
I assume the pseudo-SQL solution is meant for MySQL etc. However, it did not work in Microsoft Access.

I assume the second and third snippets are Python. Can I feed the acronyms and documents in as input files?
babru

风月客 2024-07-19 13:55:40

It didn't work in Access because tables are accessed differently (e.g. acronym.[id])
