How to search a column of lists and extract keywords based on the elements of 3 lists?

Posted on 2025-02-12 23:41:21

Suppose I have 3 lists and a DataFrame column that holds an array of strings, and I want to search that column and extract any elements of the 3 lists that appear in it.

Dataframe:

id               desxription                 
1         ['this is bad', 'summerfull']     
2         ['city tehran, country iran']       
3         ['uA is a country', 'winternice'] 
5         ['this, is, summer']              
6         ['this is winter','uAsal']       
7         ['this is canada' ,'great']       
8         ['this is toronto']            

Lists:

L1 = ['summer', 'winter', 'fall']
L2 = ['iran', 'uA']
L3 = ['tehran', 'canada', 'toronto']

Now I want to create one new column per list (L1, L2, L3) and search each element of the array in the desxription column for the keywords in that list. If the row contains an element of the list, extract it into that list's column; otherwise the value should be NA:

Note: I want only exact (whole-word) matches to be extracted. For example, summerfull should not match summer.

id               desxription                  L1          L2         L3
1         ['this is bad', 'summerfull']       NA         NA          NA
2         ['city tehran, country iran']       NA         iran        tehran
3         ['uA is a country', 'winternice']   NA         uA          NA
5         ['this, is, summer']               summer      NA          NA
6         ['this is winter','uAsal']         winter      NA          NA
7         ['this is canada' ,'great']        NA         NA          canada
8         ['this is toronto']                NA         NA          toronto

Comments (1)

爱情眠于流年 2025-02-19 23:41:21

Annotated code

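from pyspark.sql import functions as F

# Map each new column name to its keyword list
L = {'L1': L1, 'L2': L2, 'L3': L3}

for key, vals in L.items():
    # Regex with word boundaries so only whole words match (e.g. not 'summerfull').
    # The backslashes are doubled because the pattern is embedded in a SQL string literal.
    pat = r'\\b(%s)\\b' % '|'.join(vals)

    # Join the array into one string and extract every matching keyword
    # (regexp_extract_all is a Spark SQL function available since Spark 3.1)
    col = F.expr("regexp_extract_all(array_join(desxription, ' '), '%s')" % pat)

    # Mask the rows with null when there are no matches
    df = df.withColumn(key, F.when(F.size(col) == 0, None).otherwise(col))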

>>> df.show()

+---+--------------------+--------+------+---------+
| id|         desxription|      L1|    L2|       L3|
+---+--------------------+--------+------+---------+
|  1|[this is bad, sum...|    null|  null|     null|
|  2|[city tehran, cou...|    null|[iran]| [tehran]|
|  3|[uA is a country,...|    null|  [uA]|     null|
|  5|  [this, is, summer]|[summer]|  null|     null|
|  6|[this is winter, ...|[winter]|  null|     null|
|  7|[this is canada, ...|    null|  null| [canada]|
|  8|   [this is toronto]|    null|  null|[toronto]|
+---+--------------------+--------+------+---------+
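
The whole-word behaviour of the pattern can be checked without a Spark session. The following is a minimal plain-Python sketch using the standard `re` module rather than Spark's Java regex engine, so it only illustrates the matching logic and is not a replacement for the Spark code. The names `rows`, `keyword_lists`, and `extract_keywords` are hypothetical, and the `re.escape` call is an added precaution that does not appear in the answer above.

```python
import re

# Sample rows and keyword lists copied from the question.
rows = {
    1: ['this is bad', 'summerfull'],
    2: ['city tehran, country iran'],
    3: ['uA is a country', 'winternice'],
    5: ['this, is, summer'],
    6: ['this is winter', 'uAsal'],
    7: ['this is canada', 'great'],
    8: ['this is toronto'],
}
keyword_lists = {
    'L1': ['summer', 'winter', 'fall'],
    'L2': ['iran', 'uA'],
    'L3': ['tehran', 'canada', 'toronto'],
}


def extract_keywords(items, keywords):
    """Return whole-word keyword matches found in a list of strings, or None."""
    # re.escape is an assumption added here: it protects against keywords
    # that contain regex metacharacters.
    pattern = r'\b(%s)\b' % '|'.join(re.escape(k) for k in keywords)
    # Join the list into one string, mirroring array_join in the Spark version.
    matches = re.findall(pattern, ' '.join(items))
    # Use None in place of an empty list, mirroring the null masking step.
    return matches or None


result = {
    row_id: {name: extract_keywords(items, kws)
             for name, kws in keyword_lists.items()}
    for row_id, items in rows.items()
}

for row_id, cols in result.items():
    print(row_id, cols)
```

Because `\b` only matches at a word boundary, `summerfull`, `winternice`, and `uAsal` yield no match, while `summer` in `'this, is, summer'` is still found despite the surrounding commas.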