An elegant way to get hashtags out of a string in Python?
I'm looking for a clean way to get a set (list, array, whatever) of words starting with #
inside a given string.
In C#, I would write
var hashtags = input
.Split (' ')
.Where (s => s[0] == '#')
.Select (s => s.Substring (1))
.Distinct ();
What is comparatively elegant code to do this in Python?
EDIT
Sample input: "Hey guys! #stackoverflow really #rocks #rocks #announcement"
Expected output: ["stackoverflow", "rocks", "announcement"]
With @inspectorG4dget's answer, if you want no duplicates, you can use a set comprehension instead of a list comprehension.
Note that the { } syntax for set comprehensions only works starting with Python 2.7. If you're working with an older version, feed the list comprehension ([ ]) output to the set function, as suggested by @Bertrand.

This version will get rid of any empty strings (as I have read such concerns in the comments) and strings that are only "#". Also, as in Bertrand Marron's code, it's better to turn the result into a set (to avoid duplicates and for O(1) lookup time).
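As a rough sketch of the set-based approach described above, assuming hashtags are whitespace-separated words (the names hashtag_set, hashtag_set_legacy and text are illustrative, not taken from the answers):

```python
def hashtag_set(text):
    """Collect the words that start with '#', minus the '#', as a set."""
    # split() with no argument never yields empty strings, and
    # len(word) > 1 skips a bare "#".
    return {word[1:] for word in text.split()
            if word.startswith("#") and len(word) > 1}


def hashtag_set_legacy(text):
    """Same result on Python older than 2.7: list comprehension fed to set()."""
    return set([word[1:] for word in text.split()
                if word.startswith("#") and len(word) > 1])


text = "Hey guys! #stackoverflow really #rocks #rocks #announcement # "
print(sorted(hashtag_set(text)))  # ['announcement', 'rocks', 'stackoverflow']
```

A set does not keep the order in which the hashtags appeared, so sort it or use a list if order matters.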
The findall method of regular expression objects can get them all at once.
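A minimal sketch of the findall idea, using the r"#(\w+)" pattern quoted elsewhere in this thread (the names HASHTAG, found and unique are just for illustration):

```python
import re

# One capture group: the word characters that follow each '#'.
HASHTAG = re.compile(r"#(\w+)")

text = "Hey guys! #stackoverflow really #rocks #rocks #announcement"

found = HASHTAG.findall(text)
print(found)  # ['stackoverflow', 'rocks', 'rocks', 'announcement']

# Drop duplicates while keeping first-seen order (dicts keep insertion
# order in Python 3.7+); use set(found) if order does not matter.
unique = list(dict.fromkeys(found))
print(unique)  # ['stackoverflow', 'rocks', 'announcement']
```

Unlike the split-based versions, this pattern also picks up a '#' in the middle of a word, which may or may not be what you want.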
I'd say
Edit: this will create a set without any duplicates.
There are some problems with the answers presented here.
1
{tag.strip("#") for tag in tags.split() if tag.startswith("#")}
[i[1:] for i in line.split() if i.startswith("#")]
won't work if you have a hashtag like '#one#two#'.
2
re.compile(r"#(\w+)")
won't work for many Unicode languages (even using re.UNICODE).
I have seen more ways to extract hashtags, but found none of them handles all cases, so I wrote some small Python code to handle most of them. It works for me.
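The answerer's own code is not shown here, so what follows is only a guess at one way to cover the two cases listed, not their implementation. It avoids \w entirely by splitting on whitespace and then on '#', so any script is kept as-is; the function name extract_hashtags is made up for this sketch.

```python
def extract_hashtags(text):
    """Return unique hashtags in order of first appearance."""
    seen = []
    for word in text.split():
        if not word.startswith("#"):
            continue
        # "#one#two#" splits into ["", "one", "two", ""]; skip the empty pieces.
        for tag in word.split("#"):
            if tag and tag not in seen:
                seen.append(tag)
    return seen


print(extract_hashtags("#one#two# plain #one #three"))  # ['one', 'two', 'three']
```

The trade-off is that punctuation glued to a tag (for example "#rocks!") stays part of it, since nothing here decides what counts as a word character.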
Another option is a regular expression.
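One possible regex, offered as an assumption rather than what this answer originally contained: a lookbehind that only accepts a '#' at the start of the string or right after whitespace, which is closer to the "words starting with #" wording of the question. WORD_START_HASHTAG is a hypothetical name.

```python
import re

# (?<!\S) means "not preceded by a non-whitespace character", i.e. the '#'
# must open the string or follow whitespace.
WORD_START_HASHTAG = re.compile(r"(?<!\S)#(\w+)")

text = "ask me#notatag, see #python and #regex #python"

tags = WORD_START_HASHTAG.findall(text)
print(tags)  # ['python', 'regex', 'python']
print(sorted(set(tags)))  # ['python', 'regex']
```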