Tokenizing Twitter posts in Lucene
My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?
More detailed version:
I want to index a number of tweets in Lucene and keep the terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (but it does other useful stuff like keeping domain names, email addresses or recognizing acronyms). How can I have an analyzer which does everything StandardTokenizer does but does not touch terms like @user and #hashtag?
My current solution is to preprocess the tweet text before feeding it into the analyzer and replace the characters by other alphanumeric strings. For example,
// text holds the raw tweet
String newText = text.replaceAll("#", "hashtag");
newText = newText.replaceAll("@", "addresstag");
Unfortunately this method breaks legitimate email addresses but I can live with that. Does that approach make sense?
Thanks in advance!
Amaç
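The preprocessing idea in the question can be narrowed so that it leaves email addresses alone. What follows is only a minimal sketch in plain Java, not part of Lucene: the class name TweetPreprocessor and the method protect are made up for illustration, and the rule "a marker counts only when it is not preceded by a letter, digit, underscore, dot or hyphen" is an assumption about what a mention or hashtag looks like, not Twitter's official grammar.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class.
final class TweetPreprocessor {
    // '#' or '@' is treated as a marker only when it starts a token, i.e. it is
    // not preceded by a letter, digit, underscore, dot or hyphen. The '@' in
    // "bob@example.com" follows a letter, so it is left untouched.
    private static final Pattern HASHTAG = Pattern.compile("(?<![\\w.-])#(\\w+)");
    private static final Pattern MENTION = Pattern.compile("(?<![\\w.-])@(\\w+)");

    // Replaces leading '#' and '@' with the alphanumeric prefixes from the question.
    static String protect(String tweet) {
        String out = HASHTAG.matcher(tweet).replaceAll("hashtag$1");
        return MENTION.matcher(out).replaceAll("addresstag$1");
    }
}
```

With this, "@alice see #lucene, mail bob@example.com" becomes "addresstagalice see hashtaglucene, mail bob@example.com". The same substitution has to be applied to query text as well, and an ordinary word that happens to start with "hashtag" or "addresstag" can no longer be told apart from a rewritten term.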
Comments (6)
StandardAnalyzer basically takes the tokens produced by StandardTokenizer and passes them through a StandardFilter (which cleans up the tokens, for example by removing 's at the ends of words), followed by a LowerCaseFilter (to lowercase your words) and finally a StopFilter. That last one removes insignificant words like "as", "in", "for", etc.
What you could easily do to get started is implement your own analyzer that does the same as StandardAnalyzer but uses a WhitespaceTokenizer as the first component that processes the input stream. Since it splits only on whitespace, terms like @user and #hashtag stay in one piece.
For more details on the inner workings of the analyzers you can have a look here
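To see what such a chain does to a tweet, here is a small plain-Java imitation of the three steps. It deliberately does not use the Lucene API, so the class name WhitespaceChainDemo, the method analyze and the tiny stop list are all made up for illustration; in a real analyzer the corresponding pieces would be WhitespaceTokenizer, LowerCaseFilter and StopFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical demo class; it only mimics the token chain, it is not Lucene code.
final class WhitespaceChainDemo {
    // Tiny illustrative stop list, not Lucene's default set.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "as", "for", "in", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Step 1: split on whitespace only, so '@' and '#' stay attached.
        for (String raw : text.trim().split("\\s+")) {
            // Step 2: lowercase each token.
            String token = raw.toLowerCase(Locale.ROOT);
            // Step 3: drop empty tokens and stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

For "Thanks @Bob for the #Lucene tip!" this yields thanks, @bob, #lucene, tip!. The mention and the hashtag survive, but so does the trailing "!" on the last word, which is the price of splitting on whitespace alone: punctuation handling that StandardTokenizer gave you for free would need an extra filter.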
It is cleaner to use a custom tokenizer that handles Twitter usernames natively. I have made one here: https://github.com/wetneb/lucene-twitter
This tokenizer recognizes Twitter usernames and hashtags, and a companion filter can be used to lowercase them (given that they are case-insensitive).
There's a Twitter-specific tokenizer here: https://github.com/brendano/ark-tweet-nlp/blob/master/src/cmu/arktweetnlp/Twokenize.java
A tutorial on a Twitter-specific tokenizer, which is a modified version of the ark-tweet-nlp API, can be found at http://preciselyconcise.com/apis_and_installations/tweet_pos_tagger.php
This API can identify emoticons, hashtags, interjections, etc. present in a tweet.
The Twitter API can be told to return all Tweets, Bios etc with the "entities" (hashtags, userIds, urls etc) already parsed out of the content into collections.
https://dev.twitter.com/docs/entities
So aren't you just looking for a way to re-do something that the folks at Twitter have already done for you?
Twitter has open-sourced its text-processing library, which implements token handlers for hashtags and so on,
such as HashtagExtractor:
https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/text/extractor/HashtagExtractor.java
It is based on Lucene's TokenStream.