Elegant way to get hashtags out of a string in Python?

Posted on 2024-11-15 07:54:50

I'm looking for a clean way to get a set (list, array, whatever) of words starting with # inside a given string.

In C#, I would write

var hashtags = input
    .Split (' ')
    .Where (s => s[0] == '#')
    .Select (s => s.Substring (1))
    .Distinct ();

What is comparatively elegant code to do this in Python?

EDIT

Sample input: "Hey guys! #stackoverflow really #rocks #rocks #announcement"
Expected output: ["stackoverflow", "rocks", "announcement"]

Comments (6)

故笙诉离歌 2024-11-22 07:54:50

With @inspectorG4dget's answer, if you want no duplicates, you can use set comprehensions instead of list comprehensions.

>>> tags="Hey guys! #stackoverflow really #rocks #rocks #announcement"
>>> {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
set(['announcement', 'rocks', 'stackoverflow'])

Note that the { } syntax for set comprehensions is only available from Python 2.7 onward.
If you're working with an older version, pass the output of a list comprehension ([ ]) to the set function, as suggested by @Bertrand.
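
Sets do not preserve order, so the result may not list the tags in the order they appear in the text. If order matters, one possible sketch is shown below. It assumes Python 3.7+, where dict keeps insertion order, and the function name unique_hashtags is made up for illustration:

```python
def unique_hashtags(text):
    """Return hashtags without the leading '#', first occurrence first."""
    # Hypothetical helper, not part of the answers in this thread.
    tags = (word[1:] for word in text.split() if word.startswith("#"))
    # dict.fromkeys drops duplicates while keeping insertion order (Python 3.7+).
    return list(dict.fromkeys(tags))


tags = "Hey guys! #stackoverflow really #rocks #rocks #announcement"
print(unique_hashtags(tags))  # ['stackoverflow', 'rocks', 'announcement']
```

This returns a list rather than a set, which matches the expected output format given in the question.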

柏林苍穹下 2024-11-22 07:54:50
[i[1:] for i in line.split() if i.startswith("#")]

This version avoids problems with empty strings (a concern raised in the comments), because split() with no arguments never produces them. Note that a token consisting only of "#" still passes the filter and yields an empty string. Also, as in Bertrand Marron's code, it's better to turn this into a set as follows (to avoid duplicates and for O(1) lookup time):

set([i[1:] for i in line.split() if i.startswith("#")])
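
As a minimal sketch (the name hashtags_from is hypothetical), the same comprehension can also skip tokens that have nothing after the '#', which would otherwise show up as empty strings:

```python
def hashtags_from(line):
    """Collect unique hashtags, ignoring tokens that are just '#'."""
    # str.split() with no arguments never yields empty strings,
    # so only a bare '#' needs the extra length check.
    return {word[1:] for word in line.split()
            if word.startswith("#") and len(word) > 1}


print(hashtags_from("a # lone hash and #real #tags #real"))
```

The set comprehension needs Python 2.7 or later; on older versions the same condition works inside set([...]).
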
陪你到最终 2024-11-22 07:54:50

The findall method of regular expression objects can get them all at once:

>>> import re
>>> s = "this #is a #string with several #hashtags"
>>> pat = re.compile(r"#(\w+)")
>>> pat.findall(s)
['is', 'string', 'hashtags']
>>> 
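
For comparison, here is a short sketch that applies the same pattern to the sample input from the question. The name HASHTAG is only illustrative. Unlike the split-based versions, the pattern also separates tags that are written back to back:

```python
import re

# Same pattern as above: '#' followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")

sample = "Hey guys! #stackoverflow really #rocks #rocks #announcement"
print(HASHTAG.findall(sample))
# ['stackoverflow', 'rocks', 'rocks', 'announcement']

# Tags written back to back are split at each '#'.
print(HASHTAG.findall("#one#two#"))  # ['one', 'two']

# Wrap the result in set() to drop duplicates (order is not preserved).
unique = set(HASHTAG.findall(sample))
```
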
夜血缘 2024-11-22 07:54:50

I'd say

hashtags = [word[1:] for word in input.split() if word[0] == '#']

Edit: to get a set without any duplicates, wrap the list in set:

set(hashtags)
花期渐远 2024-11-22 07:54:50

There are some problems with the answers presented here.

  1. {tag.strip("#") for tag in tags.split() if tag.startswith("#")}

    [i[1:] for i in line.split() if i.startswith("#")]

won't work if you have hashtags like '#one#two#'.

  2. re.compile(r"#(\w+)") won't work for many Unicode languages (even using re.UNICODE).

I have seen more ways to extract hashtags, but found that none of them handles all cases.

So I wrote some small Python code to handle most of the cases. It works for me.

def get_hashtagslist(string):
    # Characters that end a hashtag (e.g. '#happy,but i..' yields only 'happy').
    delimiters = ' .,():{}'
    ret = []
    s = ''
    hashtag = False
    for char in string:
        if char == '#':
            # A new '#' closes any tag in progress and starts the next one.
            hashtag = True
            if s:
                ret.append(s)
                s = ''
            continue

        if hashtag and char in delimiters:
            # A delimiter ends the current tag; a bare '#' yields nothing.
            if s:
                ret.append(s)
                s = ''
            hashtag = False
            continue

        if hashtag:
            s += char

    # Flush a tag that runs to the end of the string.
    if s:
        ret.append(s)

    return set(ret)
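
A more compact take on a similar idea, offered only as a sketch: a regular expression with a negated character class. It assumes the same delimiter characters as the function above (space and . , ( ) : { }) plus '#' itself, and the name get_hashtags_re is hypothetical. Because the class lists what ends a tag rather than what may appear in one, it does not depend on what \w matches:

```python
import re

# Everything after '#' up to the next delimiter or the next '#'.
_TAG = re.compile(r"#([^ #.,():{}]+)")


def get_hashtags_re(string):
    """Return the set of hashtags in string, without the leading '#'."""
    return set(_TAG.findall(string))


print(get_hashtags_re("#one#two#"))
print(get_hashtags_re("#happy,but i.."))
```

Other whitespace such as tabs or newlines is not treated as a delimiter here; add \s to the class if that is needed.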
乜一 2024-11-22 07:54:50

Another option is a regular expression:

import re

inputLine = "Hey guys! #stackoverflow really #rocks #rocks #announcement"

re.findall(r'(?i)\#\w+', inputLine) # includes the leading #
re.findall(r'(?i)(?<=\#)\w+', inputLine) # excludes the leading #