Python function: assign 1 if a link in one dataset is part of a link in another dataset, else 0

Posted on 2025-02-13 15:14:59

Sample datasets I have:

df1:

ID Page Link
1 http://example1/path1/ru/path2/path3
2 https://example2.com/path1
3 https://example3.subdomain

df2:

ID Link
1 http://example1/path1/ru
2 https://example2.com/path1
3 https://example3.subdomain/path2

In df1 I need to create a column ['Contains'] with the value 1 or 0. Judging by the expected result below, ['Contains'] = 1 when a link from df2 is part of the df1 link, and 0 otherwise.

so that end result looks like this:

df1

ID Page Link Contains
1 http://example1/path1/ru/path2/path3 1
2 https://example2.com/path1 1
3 https://example3.subdomain 0

I tried this:

import re

def assign(column):
    for link in df2['Link']:
        if re.search(link, column):
            contains=1
        else:
            contains=0
    return contains

df1['Contains']=df1['Page Link'].apply(assign)

This didn't return the result I expected

leads['Marketing Team']=leads['Page Link'].apply(assign_marketing)
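
For anyone who wants to reproduce the problem, here is a minimal sketch that builds the two sample frames and runs the attempt above with valid syntax. It assumes pandas is installed; `assign_attempt` is a name chosen here for illustration only.

```python
import re

import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Page Link": [
        "http://example1/path1/ru/path2/path3",
        "https://example2.com/path1",
        "https://example3.subdomain",
    ],
})
df2 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Link": [
        "http://example1/path1/ru",
        "https://example2.com/path1",
        "https://example3.subdomain/path2",
    ],
})


def assign_attempt(column):
    # Same logic as the attempt: `contains` is reassigned on every
    # iteration, so only the last df2 link decides the returned value.
    for link in df2["Link"]:
        if re.search(link, column):
            contains = 1
        else:
            contains = 0
    return contains


result = df1["Page Link"].apply(assign_attempt).tolist()
print(result)  # [0, 0, 0]
```

On the sample data every row is decided by the last df2 link alone, which is why the first two rows do not come out as 1.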

Comments (3)

海夕 2025-02-20 15:14:59

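Your function does not stop when it finds a match, so a later link that does not match can "change its mind" and overwrite the result. Here is a fixed version:

import re

def assign(column):
    for link in df2['Link']:
        # re.escape makes '.', '?' etc. in the link match literally
        if re.search(re.escape(link), column):
            return 1 # Return at once!
    return 0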

If the match is always at the beginning, you can replace the expensive re.search() with a much faster startswith(), called on the df1 value with the df2 link as the prefix:

def assign(column):
    for link in df2['Link']:
        # the df1 value must start with the df2 link
        if column.startswith(link):
            return 1 # Return at once!
    return 0
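
If prefix matching is enough, the explicit loop can probably be shortened further, because Python's `str.startswith` accepts a tuple of prefixes. The sketch below assumes the sample frames from the question and that df2['Link'] contains only strings; the variable name `prefixes` is illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Page Link": [
        "http://example1/path1/ru/path2/path3",
        "https://example2.com/path1",
        "https://example3.subdomain",
    ],
})
df2 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Link": [
        "http://example1/path1/ru",
        "https://example2.com/path1",
        "https://example3.subdomain/path2",
    ],
})

# str.startswith returns True if the string starts with any prefix in the tuple.
prefixes = tuple(df2["Link"])
df1["Contains"] = [int(page.startswith(prefixes)) for page in df1["Page Link"]]
print(df1["Contains"].tolist())  # [1, 1, 0]
```

This only covers the case where the df2 link sits at the very start of the df1 link; a link that appears in the middle of a page URL would not be found.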
风苍溪 2025-02-20 15:14:59

The question does not say explicitly how rows should be paired, so in case you want to check the link match per ID, you can use:

s = df1['ID'].map(df2.set_index('ID')['Link'])

df1['Contains'] = [int(b in a) if isinstance(b, str) else 0 for a,b in zip(df1['Page Link'], s)]

Output:

   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0
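
To make the difference between per-ID matching and matching against any df2 link concrete, here is a small sketch. It assumes pandas and adds one hypothetical row (ID 4) that is not in the question's data; the per-ID variant uses a left merge instead of `map`, which is just one possible way to pair the rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Page Link": [
        "http://example1/path1/ru/path2/path3",
        "https://example2.com/path1",
        "https://example3.subdomain",
        "https://example2.com/path1/extra",  # hypothetical extra row
    ],
})
df2 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Link": [
        "http://example1/path1/ru",
        "https://example2.com/path1",
        "https://example3.subdomain/path2",
    ],
})

# Per-ID: pair each df1 row with the df2 link that has the same ID.
# Rows without a partner get NaN, which the isinstance check maps to 0.
merged = df1.merge(df2, on="ID", how="left")
per_id = [
    int(isinstance(link, str) and link in page)
    for page, link in zip(merged["Page Link"], merged["Link"])
]

# Any link: a df1 row counts if it contains any df2 link at all.
any_link = [
    int(any(link in page for link in df2["Link"]))
    for page in df1["Page Link"]
]

print(per_id)    # [1, 1, 0, 0]
print(any_link)  # [1, 1, 0, 1]
```

The hypothetical fourth row contains the df2 link with ID 2 but has no df2 row of its own, so the two interpretations disagree on it.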
天涯沦落人 2025-02-20 15:14:59

Use Series.str.contains with a regex built by joining the escaped values of df2.Link with |:

import re

regex = '|'.join(re.escape(x) for x in df2.Link)
df1['Contains'] = df1['Page Link'].str.contains(regex).astype(int)

print (df1)
   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0
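
A short sketch of why the `re.escape` step matters, using hypothetical URLs that are not part of the question's data. It assumes pandas and also passes `na=False` so that missing values count as "no match" instead of propagating as NaN.

```python
import re

import pandas as pd

pages = pd.Series([
    "https://example.com/page?id=1",  # hypothetical URL with a query string
    "https://exampleXcom/other",      # hypothetical look-alike URL
    None,                             # missing value
])
links = ["https://example.com/page?id=1", "https://example.com/other"]

# Without escaping, '.' matches any character and '?' makes the
# preceding character optional instead of matching a literal '?'.
raw = "|".join(links)
escaped = "|".join(re.escape(x) for x in links)

unescaped_hits = pages.str.contains(raw, na=False).astype(int).tolist()
escaped_hits = pages.str.contains(escaped, na=False).astype(int).tolist()

print(unescaped_hits)  # [0, 1, 0]
print(escaped_hits)    # [1, 0, 0]
```

With the raw pattern the first URL fails to match itself and the look-alike is accepted; with the escaped pattern both rows come out as intended.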

If you need to test every df1 link against every df2 link without a regex, use a nested list comprehension; expect it to be slow if both DataFrames are large:

df1['Contains'] = [int(any(x in link for x in df2['Link'])) for link in df1['Page Link']]
print (df1)
   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0
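
The comprehension performs up to len(df1) × len(df2) substring checks, although `any` stops at the first hit. One way to trim the inner loop, sketched below under the assumption that df2 may contain missing or repeated links (the duplicate and the None are hypothetical additions to the sample data), is to drop those before looping.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Page Link": [
        "http://example1/path1/ru/path2/path3",
        "https://example2.com/path1",
        "https://example3.subdomain",
    ],
})
df2 = pd.DataFrame({
    "Link": [
        "http://example1/path1/ru",
        "https://example2.com/path1",
        "https://example2.com/path1",        # hypothetical duplicate
        None,                                # hypothetical missing value
        "https://example3.subdomain/path2",
    ],
})

# Drop missing values and duplicates once, then loop over a plain list.
links = df2["Link"].dropna().unique().tolist()
df1["Contains"] = [
    int(any(link in page for link in links))
    for page in df1["Page Link"]
]
print(df1["Contains"].tolist())  # [1, 1, 0]
```

Dropping the missing value also avoids the TypeError that `None in page` would raise inside the comprehension.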