We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
I doubt that there is one openly available, but as a simple approximation, you could create a list of very frequent tokens in a reasonably large corpus. Then, depending on your need, you can use the list as such, or filter it manually, or do some trial-and-error with your algorithm to see how it works.
Here's a list of the 100 most common tokens from a pretty large news corpus I have. Note that for my purposes, I counted various punctuation characters as tokens. The number "1" represents all the numeric tokens, hence its high position in the list.
You are probably aware that a stop list is a problematic concept in Hebrew because of its morphology and orthography: some of the most useful candidates are simply attached to the words they accompany.
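The frequency-list approach described above can be sketched roughly as follows. This is only a minimal illustration, not the code that produced the list mentioned in this answer: the function names `tokenize` and `most_frequent_tokens` and the tokenizing regex are assumptions made up for the example. It follows the same conventions, though, counting punctuation characters as tokens and collapsing every numeric token into "1".

```python
import re
from collections import Counter

# Hypothetical tokenizer pattern: a number (optionally with . or , inside),
# a run of word characters (this covers Hebrew letters), or a single
# punctuation character.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, mapping every numeric token to "1"."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        tokens.append("1" if tok[0].isdigit() else tok)
    return tokens


def most_frequent_tokens(texts, n=100):
    """Return the n most frequent tokens over an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(n)


# Usage: feed it the documents of your own corpus, one string each.
corpus = ["של הבית 12, של 3.5!"]
candidates = most_frequent_tokens(corpus, n=2)
```

The result is only a list of candidates. Because whitespace tokenization leaves attached prefixes on the words, you would still want to filter it manually or tune it by trial and error, as suggested above.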
The Mila center has a list of high-frequency tokens compiled from the large corpora they work with. See the bottom of the page: http://www.mila.cs.technion.ac.il/hebrew/resources/corpora/index.html.
Another thing to take into account is stop word ambiguity, where a certain word can either carry no meaning at all or carry a very important one. For example, the words אלה and אשר are both Hebrew prepositions and valid personal names. More info on this Hebrew phenomenon can be found here: http://www.code972.com/blog/2010/05/challenges-indexing-hebrew/ (scroll to "Stop words ambiguity").
Because of this, I don't think it is possible to have a complete and absolute Hebrew stop list - it is too dependent on your corpora and use case.
The link provided earlier is broken.
This is the new link: http://www.mila.cs.technion.ac.il/index.html
The list in question has some missing terms (אתך, אתכן, אתכם, etc.).
Kind regards,
Yaron Shahrabani.
Here's a list of 500 Hebrew stop words (with and without the counts):
https://github.com/gidim/HebrewStopWords
Also available here:
I found this .xlsx file at https://yeda.cs.technion.ac.il/resources_lexicons_stopwords.html
It's a very rich list (23k rows), and you can extract a stop list from it pretty quickly.