What is the right tokenization algorithm? & Error: TypeError: coercing to Unicode: need string or buffer, list found
I'm doing an Information Retrieval task. As part of pre-processing, I want to do the following:
- Stopword removal
- Tokenization
- Stemming (Porter Stemmer)
Initially, I skipped tokenization. As a result, I got terms like these:
broker
broker'
broker,
broker.
broker/deal
broker/dealer'
broker/dealer,
broker/dealer.
broker/dealer;
broker/dealers),
broker/dealers,
broker/dealers.
brokerag
brokerage,
broker-deal
broker-dealer,
broker-dealers,
broker-dealers.
brokered.
brokers,
brokers.
So now I realize the importance of tokenization. Is there a standard tokenization algorithm for English? Based on string.whitespace and commonly used punctuation marks, I wrote:
def Tokenize(text):
    words = text.split(['.', ',', '?', '!', ':', ';', '-', '_', '(', ')', '[', ']', '\'', '`', '"', '/', ' ', '\t', '\n', '\x0b', '\x0c', '\r'])
    return [word.strip() for word in words if word.strip() != '']
but I'm getting this error:
TypeError: coercing to Unicode: need string or buffer, list found
How can this tokenization routine be improved?
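For reference (not part of the original question): the error occurs because str.split and unicode.split take a single separator string, not a list of separators, so Python tries and fails to coerce the list to a string. A rough Python 2 style reproduction, assuming the input text is unicode as the error message suggests:

# One separator string is fine:
u"broker, dealer".split(u", ")         # [u'broker', u'dealer']

# A list of separators is not; split() tries to coerce the list to unicode:
u"broker, dealer".split([u",", u" "])  # TypeError: coercing to Unicode: need string or buffer, list found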
2 Answers
There is no single perfect algorithm for tokenization, though your algorithm may suffice for information retrieval purposes. It will be easier to implement using a regular expression:
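(The code that originally accompanied this answer is not preserved in this copy. The sketch below is a reconstruction under the assumption that a single re.split over a non-word-character pattern was intended; it is not the answer's exact code.)

import re

def tokenize(text):
    # Split on runs of anything that is not a word character, which covers
    # whitespace and the usual punctuation marks in one pattern.
    return [t for t in re.split(r'\W+', text) if t]

# tokenize("broker/dealers, broker-dealer.") -> ['broker', 'dealers', 'broker', 'dealer']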
It can be improved in various ways, such as handling abbreviations properly:
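(Again, the original snippet is missing here. One illustrative way to keep abbreviations such as "U.S." intact is to match tokens with re.findall instead of splitting on delimiters; this is an assumption, not the answer's code.)

import re

def tokenize(text):
    # Match abbreviations like "U.S." first, then ordinary word characters.
    return re.findall(r"(?:[A-Za-z]\.)+|\w+", text)

# tokenize("the U.S. broker/dealer") -> ['the', 'U.S.', 'broker', 'dealer']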
And watch out what you do with the dash (-). Consider a hyphenated token such as 'A-level': if 'A' or 'a' occurs in your stop list, splitting on the dash will reduce it to just 'level'. I suggest you check out Natural Language Processing with Python, chapter 3, and the NLTK toolkit.
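(A tiny illustration of that pitfall, assuming a naive split on the dash and a stop list that contains 'a'; the stop list itself is made up for the example.)

stopwords = {"a", "an", "the"}           # assumed stop list
tokens = "A-level".lower().split("-")    # ['a', 'level']
kept = [t for t in tokens if t not in stopwords]
# kept == ['level']; the distinction between "A-level" and "level" is lost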
As larsman mentions, NLTK has a variety of different tokenizers that accept various options. Using the default:
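(The original example is missing from this copy; the sketch below assumes the package's default word_tokenize function, which may differ from what the answer actually showed.)

import nltk
# May require the tokenizer models once: nltk.download('punkt')

tokens = nltk.word_tokenize("They'll find the broker, the dealer.")
# e.g. ['They', "'ll", 'find', 'the', 'broker', ',', 'the', 'dealer', '.']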
If you want to filter out list items that are punctuation only, you could do something like the sketch below, which returns only the tokens containing at least one letter or digit.
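(A possible version of that filter, reusing the tokens from the previous snippet; not the answer's original code.)

# Keep only tokens that contain at least one alphanumeric character.
words = [t for t in tokens if any(ch.isalnum() for ch in t)]
# e.g. ['They', "'ll", 'find', 'the', 'broker', 'the', 'dealer']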