Splitting running text into words using Python?
I am writing a piece of code that will extract words from running text. The text can contain delimiters such as \r and \n.
I want to discard all these delimiters and extract only full words. How can I do this with Python? Is there a library available for crunching text in Python?
Answers (2)
Assuming your definition of "word" agrees with that of the regular expression module (`re`), that is, letters, digits and underscores, it's easy: call `re.findall` with the pattern `r"\w+"` on `thetext`, where `thetext` is the string in question (e.g., coming from an `f.read()` of a file object `f` open for reading, if that's where you get your text from).
If you define words differently (e.g., you want to include apostrophes so that "it's" is considered one word), it isn't much harder: just pass the appropriate pattern as the first argument of `findall`, e.g. `r"[\w']+"` for the apostrophe case.
If you need to be very, very sophisticated (e.g., deal with languages that use no breaks between words), then the problem suddenly becomes much harder and you'll need some third-party package like nltk.
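The two patterns mentioned above can be compared side by side. This is only a minimal sketch: the sample string in `thetext` is made up for illustration, and the variable names `words` and `words_with_apostrophes` are not from the original answer.

```python
import re

# Hypothetical sample text mixing \r, \n and tab delimiters.
thetext = "It's a test.\r\nSecond line,\twith tabs\nand under_scores."

# \w+ matches runs of letters, digits and underscores,
# so the apostrophe splits "It's" into "It" and "s".
words = re.findall(r"\w+", thetext)

# Adding the apostrophe to the character class keeps "It's" whole.
words_with_apostrophes = re.findall(r"[\w']+", thetext)

print(words)
print(words_with_apostrophes)
```

In both cases the line breaks, tab and punctuation are discarded, because they fall outside the character class being matched.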
Assuming your delimiters are whitespace characters (like space, `\r` and `\n`), then basic `str.split()` does what you want: called with no arguments, it splits on any run of whitespace and never returns empty strings.
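A short sketch of this behaviour, using made-up sample strings (the names `text`, `words` and `with_punctuation` are illustrative only):

```python
# Hypothetical sample text with \n, \r\n, a space and a tab between words.
text = "asdf\nfoo\r\nbar too\tbaz"

# With no arguments, split() treats any run of whitespace as one separator.
words = text.split()
print(words)

# Punctuation is not whitespace, so it stays attached to the words.
with_punctuation = "Hello, world!\r\n".split()
print(with_punctuation)
```

If punctuation should be stripped as well, a regular expression such as `r"\w+"` with `re.findall` is probably the better fit.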