What is the correct tokenization algorithm? & Error: TypeError: coercing to Unicode: need string or buffer, list found

Posted 2024-09-30 08:22:38

I'm doing an information retrieval task. As part of pre-processing, I want to do:

  1. Stopword removal
  2. Tokenization
  3. Stemming (Porter Stemmer)

Initially, I skipped tokenization. As a result I got terms like this:

broker
broker'
broker,
broker.
broker/deal
broker/dealer'
broker/dealer,
broker/dealer.
broker/dealer;
broker/dealers),
broker/dealers,
broker/dealers.
brokerag
brokerage,
broker-deal
broker-dealer,
broker-dealers,
broker-dealers.
brokered.
brokers,
brokers.

So now I realize the importance of tokenization. Is there any standard tokenization algorithm for English, based on string.whitespace and commonly used punctuation marks? I wrote:

def Tokenize(text):
    words = text.split(['.',',', '?', '!', ':', ';', '-','_', '(', ')', '[', ']', '\'', '`', '"', '/',' ','\t','\n','\x0b','\x0c','\r'])    
    return [word.strip() for word in words if word.strip() != '']
  1. I'm getting a TypeError: coercing to Unicode: need string or buffer, list found error! (See the note just below this list.)
  2. How can this tokenization routine be improved?
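
The error comes from str.split itself: it accepts a single separator string (or None), not a list of separators, so Python tries to coerce the list to a string and fails. A minimal Python 2 session reproducing it (assuming text is a unicode object, which is what the "coercing to Unicode" wording indicates):

>>> u'broker/dealer, broker'.split('/')
[u'broker', u'dealer, broker']
>>> u'broker/dealer'.split(['/', ','])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to Unicode: need string or buffer, list found

Splitting on several delimiters at once is what re.split is for, as the answers below show.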

Comments (2)

長街聽風 2024-10-07 08:22:38

There is no single perfect algorithm for tokenization, though your algorithm may suffice for information retrieval purposes. It will be easier to implement using a regular expression:

import re

def Tokenize(text):
    # one character class covers every delimiter; '+' collapses runs of them
    words = re.split(r'[-\.,?!:;_()\[\]\'`"/\t\n\r \x0b\x0c]+', text)
    return [word.strip() for word in words if word.strip() != '']

It can be improved in various ways, such as handling abbreviations properly:

>>> Tokenize('U.S.')
['U', 'S']

And watch out what you do with the dash (-). Consider:

>>> Tokenize('A-level')
['A', 'level']

If 'A' or 'a' occurs in your stop list, this will be reduced to just 'level'.

I suggest you check out Natural Language Processing with Python, chapter 3, and the NLTK toolkit.
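
Chapter 3 of that book, for instance, develops a verbose regular-expression tokenizer that keeps abbreviations and hyphenated words intact. A sketch adapted from that example (the exact pattern here is illustrative, not canonical):

>>> import nltk
>>> pattern = r'''(?x)        # verbose regex: whitespace and comments ignored
...       (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
...     | \w+(?:-\w+)*        # words with optional internal hyphens
...     | [][.,;"'?():_`-]    # punctuation marks as separate tokens
... '''
>>> nltk.regexp_tokenize("U.S. broker-dealers filed S.E.C. reports.", pattern)
['U.S.', 'broker-dealers', 'filed', 'S.E.C.', 'reports', '.']

This keeps 'U.S.' and 'broker-dealers' whole, which the plain character-class split above cannot do.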

自此以后,行同陌路 2024-10-07 08:22:38

As larsman mentions, nltk has a variety of different tokenizers that accept various options. Using the default:

>>> import nltk
>>> words = nltk.wordpunct_tokenize('''
... broker
... broker'
... broker,
... broker.
... broker/deal
... broker/dealer'
... broker/dealer,
... broker/dealer.
... broker/dealer;
... broker/dealers),
... broker/dealers,
... broker/dealers.
... brokerag
... brokerage,
... broker-deal
... broker-dealer,
... broker-dealers,
... broker-dealers.
... brokered.
... brokers,
... brokers.
... ''')
>>> words
['broker', 'broker', "'", 'broker', ',', 'broker', '.', 'broker', '/', 'deal', 'broker', '/', 'dealer', "'", 'broker', '/', 'dealer', ',', 'broker', '/', 'dealer', '.', 'broker', '/', 'dealer', ';', 'broker', '/', 'dealers', '),', 'broker', '/', 'dealers', ',', 'broker', '/', 'dealers', '.', 'brokerag', 'brokerage', ',', 'broker', '-', 'deal', 'broker', '-', 'dealer', ',', 'broker', '-', 'dealers', ',', 'broker', '-', 'dealers', '.', 'brokered', '.', 'brokers', ',', 'brokers', '.']
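
For context, wordpunct_tokenize is a regular-expression tokenizer built on the pattern \w+|[^\w\s]+ (runs of word characters, or runs of non-space punctuation), which is why a token like 'broker/dealers),' comes back as four pieces:

>>> nltk.wordpunct_tokenize('broker/dealers),')
['broker', '/', 'dealers', '),']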

If you want to filter out list items that are punctuation only, you could do something like this:

>>> filter_chars = "',.;()-/"
>>> def is_only_punctuation(s):
...     '''
...     True when s contains at least one character outside filter_chars,
...     i.e. when the token is NOT punctuation-only (so filter() keeps it).
...     '''
...     return not set(s) <= set(filter_chars)
>>> filter(is_only_punctuation, words)

returns

['broker', 'broker', 'broker', 'broker', 'broker', 'deal', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealer', 'broker', 'dealers', 'broker', 'dealers', 'broker', 'dealers', 'brokerag', 'brokerage', 'broker', 'deal', 'broker', 'dealer', 'broker', 'dealers', 'broker', 'dealers', 'brokered', 'brokers', 'brokers']
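
One caveat if you run this on Python 3: filter returns a lazy iterator there rather than a list, so wrap the call in list(). The same filtering can also be written against the standard string.punctuation constant instead of a hand-picked filter_chars (a sketch, reusing the words list from above):

>>> import string
>>> [w for w in words if not all(ch in string.punctuation for ch in w)]

which yields the same word list as the filter() call.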