How do I check whether items in a nested list are in a set?

Posted on 2025-01-09 07:05:06

I have a nested list containing every sentence from a corpus, and a set of all the words that occur more than once. How would I check whether each word in the nested list is in that set?
I then need to replace every word that occurs more than once with the string '<UNK>'.

I tried:

for sent in tokenized_sents:
    for word in sent:
        if word in set:
            word = '<UNK>'

刘备忘录 2025-01-16 07:05:06

You can use collections.Counter to build a dictionary that keeps track of how many times each word occurs in your corpus, then replace every word whose count is greater than one:

from collections import Counter

corpus = [['Hello', ',', 'my', 'name', 'is', 'Walter'], ['I', 'like', 'my', 'cats']]

# Flatten the nested list so every word in the corpus can be counted
corpus_unnested = []
for sentence in corpus:
    corpus_unnested += sentence
my_dict = Counter(corpus_unnested)

# Assign by index so the replacement is written back into the nested list
for i, sentence in enumerate(corpus):
    for j, word in enumerate(sentence):
        if my_dict[word] > 1:
            corpus[i][j] = '<UNK>'

print(corpus)
# [['Hello', ',', '<UNK>', 'name', 'is', 'Walter'], ['I', 'like', '<UNK>', 'cats']]
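
As for why the loop in the question appears to do nothing: `word = '<UNK>'` only rebinds the loop variable and never touches the list itself. It is also worth avoiding `set` as a variable name, since that shadows the built-in type. If you would rather work with a set of repeated words, as the question describes, and build a new nested list instead of modifying the original, a rough sketch could look like the following. The names `replace_repeated` and `repeated_words` are made up for illustration and do not come from the question.

```python
from collections import Counter


def replace_repeated(sentences, unk='<UNK>'):
    """Return a new nested list in which every word that occurs
    more than once across all sentences is replaced by `unk`."""
    # Count each word across all sentences without building a flat list first
    counts = Counter(word for sentence in sentences for word in sentence)
    # Hypothetical helper set: the words whose count is greater than one
    repeated_words = {word for word, n in counts.items() if n > 1}
    # Build new inner lists; the input is left unchanged
    return [[unk if word in repeated_words else word for word in sentence]
            for sentence in sentences]


tokenized_sents = [['Hello', ',', 'my', 'name', 'is', 'Walter'],
                   ['I', 'like', 'my', 'cats']]
result = replace_repeated(tokenized_sents)
print(result)
```

Because this returns a new list, `tokenized_sents` still holds the original words afterwards, which can be handy if you need both versions.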