在大数据集中搜索多个字符串

发布于 2024-12-01 17:26:35 字数 905 浏览 3 评论 0原文

我正在处理两组数据：

第一组在单列中有 500 个成员（所有字符串）集合 2 在两列（制表符分隔）中包含 30,000 个成员，其中第 1 列是数字 ID 号，第 2 列是分隔的字符串列表（10,000 种可能性）。

我需要在 Set 2 的第 2 列中搜索 Set 1 中的字符串，并标记（或隔离）Set 2 中至少有 1 个匹配项的所有行。点击也需要非常具体（即，我想抓取“杰克有一个网球”，而不是“我喜欢杰克有一个网球袋”）。如果有帮助的话，我可以找出我需要特别避免的所有字符串。

我从 excel ("if(isnumber(search(")) 开始，但发现一个人可以创建的嵌套 if 语句的数量是有限的。我也使用 grep 取得了一些成功，但我意识到它正在隔离我专门编写的行需要避免（“我喜欢杰克有一个网球袋”类型的行，

我开始认为 python 是可行的方法，但我不确定其他人是否有代码）。有建议吗？

这是第二组的样本（ID 被欺骗以保护匿名）以便更好地了解我正在处理的内容：

1230    DEVELOPMENTAL DELAY, LANGUAGE DELAY, MOTOR DELAY  
2257    MULTIPLE CONGENITAL ANOMALIES  
2344    MICROCEPHALY, AUTISM SPECTRUM DISORDER, SHORT STATURE  
3342    DEVELOPMENTAL DELAY, SEIZURE DISORDER, ATAXIA  
7651    CONGENITAL ANOMALY, UNSPECIFIED  
7869    FAMILY HISTORY OF AUTISM SPECTRUM DISORDER

在这种情况下，我将搜索术语“自闭症谱系障碍”，并且我想要 ID 2344，而不是 ID 7869。

感谢您的帮助！

原文

I am working with two sets of data:

Set 1 has 500 members in a single column (all strings)
Set 2 has 30,000 members in two columns (tab delimited) where column 1 is a numerical ID number and column 2 is a , delimited list of strings (10,000 possibilities).

I need to search column 2 of Set 2 for the strings in set 1 and tag (or isolate) all of the rows in Set 2 with at least 1 match. Hits also need to be very specific (ie, I want to grab "Jake has a tennis ball" but not "I like that Jake has a tennis ball bag"). If it helps, I can figure out all of the strings that I need to specifically avoid.

I began in excel ("if(isnumber(search(") but discovered that there is a limit to the number of nested if statements one can make. I also had some success using grep but I realized that it was isolating lines that I specifically need to avoid (the "I like that Jake has a tennis ball bag" type lines).

I'm starting to think that python is the way to go, but I'm not sure how to code it. Does anyone else have any suggestions?

here is a sample from set 2 (ID's spoofed to protect anonymity) to give a better idea of what I'm working with:

1230    DEVELOPMENTAL DELAY, LANGUAGE DELAY, MOTOR DELAY  
2257    MULTIPLE CONGENITAL ANOMALIES  
2344    MICROCEPHALY, AUTISM SPECTRUM DISORDER, SHORT STATURE  
3342    DEVELOPMENTAL DELAY, SEIZURE DISORDER, ATAXIA  
7651    CONGENITAL ANOMALY, UNSPECIFIED  
7869    FAMILY HISTORY OF AUTISM SPECTRUM DISORDER

In this scenario, I would be searching for the term "AUTISM SPECTRUM DISORDER" and I want ID 2344 but not ID 7869.

Thanks for the help!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

记忆で 2024-12-08 17:26:35

如果您只想与 Set 1 精确匹配，您可以通过查找来完成此操作。
在集合 2 的第 1 行第 3 列中输入此公式，其中 SET1 是 SET1 的范围地址
=IF(ISERROR(MATCH($a1,SET1,0),"未命中","命中")
然后将公式向下复制 30000 行（双击单元格东南角的小黑方块）。
（如果您可以对 Set 1 进行排序，还有一个更有效的变体）

回复收藏 0 原文

土豪 2024-12-08 17:26:35

不确定你如何保存你的集合，但不知何故你必须在一些Python数据结构中转换/导入它们。看起来一个非常简单的方法就是将 set1 保存在 python 集合（或列表）中，将 set2 保存在字典中，以 ids 作为键，以字符串作为值。然后通过键和适当的值搜索进行简单的迭代将产生您想要的结果。这应该足够了

set1 = set( ... ) # your 500 members must be here 
set2 = dict( ... ) # your 30000 other ids-strings stuff

# and a simple function to make queries

def handle_query( query_string ):
    for id in set2:
        if query_string in set2[id].split(","):
            yield id   

# now you can call it for every string in set1 (or whatever...)

for string in set1:
    print string, [ i for i in handle_query( string ) ]

，是的，Python 是正确的选择:)

但是，这不是最佳的（如果数据集较大，扩展性会很差），因为您必须迭代所有 id。这对于 30000 个 ids 来说还可以，但是“近乎完美”在于所谓的“倒排索引”，这是搜索引擎的核心概念。因此，“保存”set2 数据的最佳方法在某种意义上是：

DEVELOPMENTAL DELAY -> 1230, another_id, yet_another_id, ...
LANGUAGE DELAY -> 1230, ...
MOTOR DELAY -> 1230, ...
MULTIPLE CONGENITAL ANOMALIES -> 2257, ...
etc

现在，字符串是键，id（列表）是值。因此，“严格”字符串查询应该产生所有所需的 id...实现留给读者作为练习:)

Not sure about how you hold your sets, but somehow you have to transform/import them in some python data structures. It seems that a very simple way to do this is to hold set1 in a python set (or list) and set2 in a dict, with ids as keys and strings as values. Then a simple iteration through keys and appropriate value searching would yield your desired results. This should suffice

set1 = set( ... ) # your 500 members must be here 
set2 = dict( ... ) # your 30000 other ids-strings stuff

# and a simple function to make queries

def handle_query( query_string ):
    for id in set2:
        if query_string in set2[id].split(","):
            yield id   

# now you can call it for every string in set1 (or whatever...)

for string in set1:
    print string, [ i for i in handle_query( string ) ]

and yes, python is the way to go :)

However, this is not optimal (and scales badly if the dataset is larger), cause you have to iterate over all ids. This is ok for 30000 ids, but "almost perfection" lies in something called the "inverted index", which is a core concept of a search engine. So the optimal way to "hold" your data of set2 is in a sense:

DEVELOPMENTAL DELAY -> 1230, another_id, yet_another_id, ...
LANGUAGE DELAY -> 1230, ...
MOTOR DELAY -> 1230, ...
MULTIPLE CONGENITAL ANOMALIES -> 2257, ...
etc

Now, strings are keys and (a list of) ids are values. So, "strict" string queries should yield all the desired ids... Implementation is left as an exersice to the reader :)

回复收藏 0 原文

~没有更多了~