Python String substring split words

Split a large string into multiple substrings of "n" words each in Python

Posted on 2024-08-15 15:35:13

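Source text: United States Declaration of Independence

How can I split the source text above into multiple substrings, each containing "n" words?

I use split(' ') to extract the individual words, but I don't know how to extract several words in one operation.

I could loop over the resulting list of words and build a second list by gluing words from the first list back together (adding the spaces as I go). However, that approach isn't very Pythonic.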

Comments (3)

時窥 2024-08-22 15:35:13

text = """
When in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, and to assume among the Powers of the Earth, the separate and equal Station to which the Laws of Nature and of Nature's God entitle them, a decent Respect to the Opinions of Mankind requires that they should declare the causes which impel them to the Separation.

We hold these Truths to be self-evident, that all Men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty, and the pursuit of Happiness--That to secure these Rights, Governments are instituted among Men, deriving their just Powers from the Consent of the Governed, that whenever any Form of Government becomes destructive of these Ends, it is the Right of the People to alter or abolish it, and to institute a new Government, laying its Foundation on such Principles, and organizing its Powers in such Form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient Causes; and accordingly all Experience hath shewn, that Mankind are more disposed to suffer, while Evils are sufferable, than to right themselves by abolishing the Forms to which they are accustomed. But when a long Train of Abuses and Usurpations, pursuing invariably the same Object, evinces a Design to reduce them under absolute Despotism, it is their Right, it is their Duty, to throw off such Government, and to provide new Guards for their future Security. Such has been the patient Sufferance of these Colonies; and such is now the Necessity which constrains them to alter their former Systems of Government. The History of the Present King of Great-Britain is a History of repeated Injuries and Usurpations, all having in direct Object the Establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid World.
"""

words = text.split()
subs = []
n = 4
for i in range(0, len(words), n):
    subs.append(" ".join(words[i:i+n]))
print(subs[:10])  # show the first ten 4-word substrings

prints:

['When in the course', 'of human Events, it', 'becomes necessary for one', 'People to dissolve the', 'Political Bands which have', 'connected them with another,', 'and to assume among', 'the Powers of the', 'Earth, the separate and', 'equal Station to which']

or, as a list comprehension:

subs = [" ".join(words[i:i+n]) for i in range(0, len(words), n)]
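
The comprehension above can be wrapped in a small reusable function. The following is only a sketch: the name `chunk_words` and the short `sample` string are made up for illustration and do not come from the answer. It also shows that the last substring is simply shorter when the word count is not a multiple of n.

```python
def chunk_words(text, n):
    """Split text on whitespace and return substrings of n words each."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    words = text.split()
    # Step through the word list n items at a time and glue each slice back together
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


sample = "When in the course of human Events, it becomes necessary"
print(chunk_words(sample, 4))
# ['When in the course', 'of human Events, it', 'becomes necessary']
```

Because `text.split()` with no argument splits on any run of whitespace, newlines and repeated spaces in the source text do not produce empty words, which `split(' ')` would.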


梦情居士 2024-08-22 15:35:13

You're trying to create n-grams? Here's how I do it, using NLTK.

import re
import nltk

# Strip leading and trailing non-alphanumeric characters from a token
punct = re.compile(r'^[^A-Za-z0-9]+|[^a-zA-Z0-9]+$')
# A token counts as a word if it contains at least one letter
is_word = re.compile(r'[a-z]', re.IGNORECASE)
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Note: PunktWordTokenizer may not be available in recent NLTK releases
word_tokenizer = nltk.tokenize.punkt.PunktWordTokenizer()

def get_words(sentence):
    return [punct.sub('', word) for word in word_tokenizer.tokenize(sentence) if is_word.search(word)]

def ngrams(text, n):
    # n-grams are built per sentence, so they never cross a sentence boundary
    for sentence in sentence_tokenizer.tokenize(text.lower()):
        words = get_words(sentence)
        for i in range(len(words) - (n - 1)):
            yield ' '.join(words[i:i+n])

Then

for ngram in ngrams(sometext, 3):
    print(ngram)
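
Note that n-grams are overlapping windows, whereas the question asks for consecutive, non-overlapping groups. To make the difference concrete, here is a minimal sliding-window sketch in plain Python. It assumes that whitespace tokenization is good enough, so unlike the NLTK code it does no sentence splitting and no punctuation stripping; `word_ngrams` is a hypothetical helper name.

```python
def word_ngrams(text, n):
    """Yield every run of n consecutive words (overlapping windows)."""
    words = text.split()
    # zip over n staggered views of the list; zip stops at the shortest view,
    # so no window runs past the end of the text
    for window in zip(*(words[i:] for i in range(n))):
        yield " ".join(window)


print(list(word_ngrams("to throw off such Government", 3)))
# ['to throw off', 'throw off such', 'off such Government']
```

Five words give three 3-grams here, while non-overlapping chunking of the same input would give two groups.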


薄凉少年不暖心 2024-08-22 15:35:13

For large strings, an iterator is recommended for speed and a low memory footprint.

import re, itertools

# Original text
text = "When in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, and to assume among the Powers of the Earth, the separate and equal Station to which the Laws of Nature and of Nature's God entitle them, a decent Respect to the Opinions of Mankind requires that they should declare the causes which impel them to the Separation."
n = 10

# An iterator which will extract words one by one from text when needed
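# (Python 3: the built-in map is already lazy, replacing itertools.imap)
words = map(lambda m: m.group(), re.finditer(r'\w+', text))
# The final iterator that combines words into n-length groups
# (Python 3: itertools.zip_longest replaces itertools.izip_longest)
word_groups = itertools.zip_longest(*(words,)*n)

for g in word_groups:
    print(g)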

This gives the following result:

('When', 'in', 'the', 'course', 'of', 'human', 'Events', 'it', 'becomes', 'necessary')
('for', 'one', 'People', 'to', 'dissolve', 'the', 'Political', 'Bands', 'which', 'have')
('connected', 'them', 'with', 'another', 'and', 'to', 'assume', 'among', 'the', 'Powers')
('of', 'the', 'Earth', 'the', 'separate', 'and', 'equal', 'Station', 'to', 'which')
('the', 'Laws', 'of', 'Nature', 'and', 'of', 'Nature', 's', 'God', 'entitle')
('them', 'a', 'decent', 'Respect', 'to', 'the', 'Opinions', 'of', 'Mankind', 'requires')
('that', 'they', 'should', 'declare', 'the', 'causes', 'which', 'impel', 'them', 'to')
('the', 'Separation', None, None, None, None, None, None, None, None)
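
As the last tuple shows, this grouping idiom pads the final group with None. If the padding is unwanted, one possible alternative is to pull n words at a time with `itertools.islice`. This is a sketch rather than the answer's method, and `iter_word_groups` is a hypothetical name. It stays lazy, and the final group is simply shorter.

```python
import re
from itertools import islice


def iter_word_groups(text, n):
    """Lazily yield tuples of up to n words; the last tuple may be shorter."""
    # Generator of words, extracted one at a time as they are requested
    words = (m.group() for m in re.finditer(r"\w+", text))
    while True:
        group = tuple(islice(words, n))
        if not group:  # the word iterator is exhausted
            return
        yield group


for g in iter_word_groups("it is their Right, it is their Duty", 3):
    print(" ".join(g))
# it is their
# Right it is
# their Duty
```

As in the answer, the `\w+` pattern drops punctuation, so the joined substrings differ from those produced by splitting on whitespace.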

About the author

我的影子我的梦
