How to search in a list of items and extract keywords based on the elements of 3 lists?

Posted on 2025-02-12 23:41:21 · 1387 characters · 3 views · 0 comments

Suppose I have 3 lists and a DataFrame column that holds an array of strings. I want to search that column and extract any elements of those 3 lists that appear in it.

DataFrame:

id               desxription                 
1         ['this is bad', 'summerfull']     
2         ['city tehran, country iran']       
3         ['uA is a country', 'winternice'] 
5         ['this, is, summer']              
6         ['this is winter','uAsal']       
7         ['this is canada' ,'great']       
8         ['this is toronto']

Lists:

L1 = ['summer', 'winter', 'fall']
L2 = ['iran', 'uA']
L3 = ['tehran', 'canada', 'toronto']

Now I want to create one new column for each list (L1, L2, L3), then search every element of the array in the desxription column for that list's values. If the row contains an element of the list, extract it into the corresponding column; otherwise the value should be NA:

Note: I only want exact (whole-word) matches to be extracted. For example, summerfull should not be matched by summer.

id               desxription                  L1          L2         L3
1         ['this is bad', 'summerfull']       NA         NA          NA
2         ['city tehran, country iran']       NA         iran        tehran
3         ['uA is a country', 'winternice']   NA         uA          NA
5         ['this, is, summer']               summer      NA          NA
6         ['this is winter','uAsal']         winter      NA          NA
7         ['this is canada' ,'great']        NA         NA          canada
8         ['this is toronto']                NA         NA          toronto

爱情眠于流年 2025-02-19 23:41:21

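Annotated code (uses the Spark SQL function regexp_extract_all, available from Spark 3.1)

from pyspark.sql import functions as F

# Map each new column name to its keyword list
L = {'L1': L1, 'L2': L2, 'L3': L3}

for key, vals in L.items():
    # Whole-word regex such as \b(summer|winter|fall)\b. The backslashes are
    # doubled because Spark SQL string literals unescape them once.
    pat = r'\\b(%s)\\b' % '|'.join(vals)

    # Join the array into one string and extract every matching keyword
    col = F.expr("regexp_extract_all(array_join(desxription, ' '), '%s')" % pat)

    # Mask the rows with null when there are no matches
    df = df.withColumn(key, F.when(F.size(col) == 0, None).otherwise(col))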

>>> df.show()

+---+--------------------+--------+------+---------+
| id|         desxription|      L1|    L2|       L3|
+---+--------------------+--------+------+---------+
|  1|[this is bad, sum...|    null|  null|     null|
|  2|[city tehran, cou...|    null|[iran]| [tehran]|
|  3|[uA is a country,...|    null|  [uA]|     null|
|  5|  [this, is, summer]|[summer]|  null|     null|
|  6|[this is winter, ...|[winter]|  null|     null|
|  7|[this is canada, ...|    null|  null| [canada]|
|  8|   [this is toronto]|    null|  null|[toronto]|
+---+--------------------+--------+------+---------+
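
The Spark output above holds arrays of matches, while the question asked for a single value or NA per column. To make the whole-word logic easy to check without a Spark session, here is a minimal plain-Python sketch of the same pattern using the standard `re` module. It is an illustration under assumptions, not part of the original answer: the helper name `first_exact_match` and the `rows`, `keyword_lists` and `result` variables are hypothetical, it returns only the first match (or `None`), and it adds `re.escape`, which the Spark code does not do.

```python
import re

# Sample data copied from the question
rows = {
    1: ['this is bad', 'summerfull'],
    2: ['city tehran, country iran'],
    3: ['uA is a country', 'winternice'],
    5: ['this, is, summer'],
    6: ['this is winter', 'uAsal'],
    7: ['this is canada', 'great'],
    8: ['this is toronto'],
}
keyword_lists = {
    'L1': ['summer', 'winter', 'fall'],
    'L2': ['iran', 'uA'],
    'L3': ['tehran', 'canada', 'toronto'],
}


def first_exact_match(description, keywords):
    """Return the first whole-word keyword found in the joined strings, else None."""
    # \b anchors the alternation to word boundaries, so 'summerfull' and
    # 'uAsal' do not match; re.escape guards against regex metacharacters.
    pattern = r'\b(%s)\b' % '|'.join(re.escape(k) for k in keywords)
    found = re.search(pattern, ' '.join(description))
    return found.group(1) if found else None


# One dict of {column name: match or None} per row id
result = {
    row_id: {name: first_exact_match(desc, kws) for name, kws in keyword_lists.items()}
    for row_id, desc in rows.items()
}

for row_id, cols in result.items():
    print(row_id, cols)
```

If a scalar column is wanted in Spark as well, taking the first element of each match array would be one option, but that variant is not shown in the answer and is left as an assumption here.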
