Python Function pandas

Python function: assign 1 if a link in one dataset is part of a link in another dataset, else 0

Posted on 2025-02-13 15:14:59

Sample datasets I have:

df1:

ID	Page Link
1	http://example1/path1/ru/path2/path3
2	https://example2.com/path1
3	https://example3.subdomain

df2:

ID	Link
1	http://example1/path1/ru
2	https://example2.com/path1
3	https://example3.subdomain/path2

In df1 I need to create a column ['Contains'] with the value 1 or 0. If a df1 link contains one of the links in df2, then ['Contains'] = 1, else 0,

so that the end result looks like this:

df1

ID	Page Link	Contains
1	http://example1/path1/ru/path2/path3	1
2	https://example2.com/path1	1
3	https://example3.subdomain	0

I tried this:

import re

def assign(column):
    for link in df2['Link']:
        if re.search(link, column):
            contains = 1
        else:
            contains = 0
    return contains

df1['Contains'] = df1['Page Link'].apply(assign)

This didn't return the result I expected.

leads['Marketing Team']=leads['Page Link'].apply(assign_marketing)

海夕 2025-02-20 15:14:59

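Your function does not stop when it finds a match, so a later non-matching link overwrites the result: it "changes its mind." Here is a fixed version:

import re

def assign(column):
    for link in df2['Link']:
        # re.escape makes '.' and other regex metacharacters match literally
        if re.search(re.escape(link), column):
            return 1  # Return at once!
    return 0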

If the match is always at the beginning of the page link, you can replace the expensive re.search() with a much faster startswith():

def assign(column):
    for link in df2['Link']:
        # The page link (column) must begin with the df2 link
        if column.startswith(link):
            return 1  # Return at once!
    return 0
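
To make the "change its mind" behavior concrete, here is a minimal sketch that compares the original loop with the early-return version. It makes a few assumptions for illustration: plain lists stand in for df1['Page Link'] and df2['Link'], the names assign_buggy and assign_fixed are hypothetical, and a plain substring check replaces re.search so that only the control flow differs.

```python
# Link values copied from the question; plain lists stand in for
# df1['Page Link'] and df2['Link'] (an assumption for this sketch).
page_links = [
    "http://example1/path1/ru/path2/path3",
    "https://example2.com/path1",
    "https://example3.subdomain",
]
links = [
    "http://example1/path1/ru",
    "https://example2.com/path1",
    "https://example3.subdomain/path2",
]

def assign_buggy(page_link):
    # Hypothetical name: mirrors the loop from the question.
    # Each iteration overwrites `contains`, so only the last link decides.
    contains = 0
    for link in links:
        if link in page_link:
            contains = 1
        else:
            contains = 0
    return contains

def assign_fixed(page_link):
    # Hypothetical name: returns as soon as any link matches.
    for link in links:
        if link in page_link:
            return 1
    return 0

buggy = [assign_buggy(p) for p in page_links]
fixed = [assign_fixed(p) for p in page_links]
print(buggy)  # [0, 0, 0]
print(fixed)  # [1, 1, 0]
```

The buggy version returns 0 for every row here because the last link in the list matches none of the page links, and that last comparison is the only one that survives the loop.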

风苍溪 2025-02-20 15:14:59

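The question is not fully explicit, so in case you want to check the link match per ID (each df1 row is compared only with the df2 row that has the same ID), you can use:

# Look up the df2 link for each df1 ID (NaN where the ID is missing from df2)
s = df1['ID'].map(df2.set_index('ID')['Link'])

# 1 if the df2 link is a substring of the page link; non-string values (NaN) count as 0
df1['Contains'] = [int(b in a) if isinstance(b, str) else 0 for a, b in zip(df1['Page Link'], s)]

Output: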

   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0
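
The per-ID approach needs some care when an ID in df1 has no counterpart in df2: Series.map yields a missing value for that row, and NaN is truthy, so a plain truthiness check does not filter it out. The sketch below uses made-up data in which ID 3 is absent from df2, and assumes such rows should simply count as 0.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Page Link": [
        "http://example1/path1/ru/path2/path3",
        "https://example2.com/path1",
        "https://example3.subdomain",
    ],
})
# Illustrative variation on the question's data: ID 3 is missing from df2.
df2 = pd.DataFrame({
    "ID": [1, 2],
    "Link": ["http://example1/path1/ru", "https://example2.com/path1"],
})

# Align df2 links to df1 rows by ID; ID 3 gets a missing value.
s = df1["ID"].map(df2.set_index("ID")["Link"])

# Only real strings are tested for containment; anything else counts as 0.
df1["Contains"] = [
    int(isinstance(b, str) and b in a)
    for a, b in zip(df1["Page Link"], s)
]
print(df1["Contains"].tolist())  # [1, 1, 0]
```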

天涯沦落人 2025-02-20 15:14:59

Use Series.str.contains with a regex built by escaping each value of df2.Link with re.escape and joining them with |:

import re

regex = '|'.join(re.escape(x) for x in df2.Link)
df1['Contains'] = df1['Page Link'].str.contains(regex).astype(int)

print (df1)
   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0
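
The re.escape call matters because URLs contain dots, and an unescaped dot in a regex matches any character. A small sketch of the difference, using only the standard re module; the near-miss URL in candidate is made up for illustration.

```python
import re

links = ["https://example2.com/path1"]        # one df2-style link
candidate = "https://example2xcom/path1"      # made-up near miss: 'x' instead of '.'

unescaped = "|".join(links)                   # '.' is a wildcard here
escaped = "|".join(re.escape(x) for x in links)

false_positive = bool(re.search(unescaped, candidate))
strict_match = bool(re.search(escaped, candidate))
print(false_positive)  # True: the bare '.' matched the 'x'
print(strict_match)    # False: the escaped '.' only matches a literal dot
```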

Alternatively, test every df1 link against every df2 link with a nested list comprehension; this will be slow if both DataFrames are large:

df1['Contains'] = [int(any(x in link for x in df2['Link'])) for link in df1['Page Link']]
print (df1)
   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0
