Tokenizing Twitter posts in Lucene

Posted 2024-08-28 00:23:04

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?

More detailed version:

I want to index a number of tweets in Lucene and keep terms like @user or #hashtag intact. StandardTokenizer does not work because it discards punctuation (though it does other useful things, like keeping domain names and email addresses intact and recognizing acronyms). How can I get an analyzer that does everything StandardTokenizer does but leaves terms like @user and #hashtag alone?

My current solution is to preprocess the tweet text before feeding it into the analyzer, replacing the special characters with other alphanumeric strings. For example:

String newText = tweetText.replaceAll("#", "hashtag"); // tweetText holds the raw tweet
newText = newText.replaceAll("@", "addresstag");

Unfortunately, this method breaks legitimate email addresses, but I can live with that. Does this approach make sense?
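
A possible refinement (just a sketch, not part of the original approach): rewrite # and @ only when they begin a whitespace-delimited token, so addresses like user@example.com survive the preprocessing:

// Anchor on start-of-string or whitespace and keep it via $1,
// so an @ inside an email address is left alone.
String newText = tweetText
        .replaceAll("(^|\\s)#", "$1hashtag")
        .replaceAll("(^|\\s)@", "$1addresstag");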

Thanks in advance!

Amaç


Comments (6)

泡沫很甜 2024-09-04 00:23:04

The StandardTokenizer and StandardAnalyzer basically pass your tokens through a StandardFilter (which removes all kinds of characters from standard tokens, such as the 's at the end of a word), followed by a LowerCaseFilter (to lowercase your words) and finally by a StopFilter. That last one removes insignificant words like "as", "in", "for", etc.

What you could easily do to get started is implement your own analyzer that performs the same as the StandardAnalyzer but uses a WhitespaceTokenizer as the first item that processes the input stream.

For more details on the inner workings of the analyzers you can have a look over here.
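
A minimal sketch of that idea (assuming a recent Lucene, 7.x or later; these classes have moved between packages across major versions):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;

public class TwitterAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Split on whitespace only, so @user and #hashtag stay intact.
        Tokenizer source = new WhitespaceTokenizer();
        // Mirror the rest of StandardAnalyzer's chain: lowercasing and stop words.
        TokenStream stream = new LowerCaseFilter(source);
        stream = new StopFilter(stream, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(source, stream);
    }
}

One caveat: a pure whitespace split also keeps trailing punctuation ("#lucene!" remains a single token), so you may want an extra filter that trims it.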

阳光下慵懒的猫 2024-09-04 00:23:04

It is cleaner to use a custom tokenizer that handles Twitter usernames natively. I have made one here: https://github.com/wetneb/lucene-twitter

This tokenizer will recognize Twitter usernames and hashtags, and a companion filter can be used to lowercase them (given that they are case-insensitive):

<fieldType name="text_twitter" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
    <filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
    <filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
  </analyzer>
</fieldType>
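
Outside Solr, the same tokenizer can be wired into a plain Lucene Analyzer. A sketch (the TwitterTokenizer class name and its no-arg constructor are assumptions inferred from the factory names above; check the repository for the actual API):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.opentapioca.analysis.twitter.TwitterTokenizer;

public class TwitterFieldAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Assumed no-arg constructor, mirroring the Solr factory config above.
        Tokenizer source = new TwitterTokenizer();
        return new TokenStreamComponents(source);
    }
}
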
失去的东西太少 2024-09-04 00:23:04

A tutorial on a Twitter-specific tokenizer, a modified version of the ark-tweet-nlp API, can be found at http://preciselyconcise.com/apis_and_installations/tweet_pos_tagger.php
The API can identify emoticons, hashtags, interjections, etc. present in a tweet.
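
For just the tokenization step, ark-tweet-nlp also exposes its tokenizer directly; a small sketch (assuming the cmu.arktweetnlp artifact is on the classpath):

import java.util.List;
import cmu.arktweetnlp.Twokenize;

public class TwokenizeDemo {
    public static void main(String[] args) {
        // Keeps #hashtags, @mentions and emoticons as single tokens.
        List<String> tokens = Twokenize.tokenizeRawTweetText("Loving #lucene :) thanks @user!");
        System.out.println(tokens);
    }
}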

心在旅行 2024-09-04 00:23:04

The Twitter API can be told to return all tweets, bios, etc. with the "entities" (hashtags, user IDs, URLs and so on) already parsed out of the content into collections.

https://dev.twitter.com/docs/entities

So aren't you just looking for a way to re-do something that the folks at Twitter have already done for you?
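
For example, each tweet in a v1.1 API response carries an entities object. A sketch of reading it (assuming the org.json library; the payload below is abbreviated and hypothetical):

import org.json.JSONArray;
import org.json.JSONObject;

public class EntitiesDemo {
    public static void main(String[] args) {
        // Abbreviated, hypothetical v1.1-style payload; real responses
        // carry many more fields (indices, ids, ...).
        String payload = "{\"text\":\"Try #lucene with @user\","
                + "\"entities\":{"
                + "\"hashtags\":[{\"text\":\"lucene\"}],"
                + "\"user_mentions\":[{\"screen_name\":\"user\"}]}}";

        JSONObject entities = new JSONObject(payload).getJSONObject("entities");
        JSONArray hashtags = entities.getJSONArray("hashtags");
        for (int i = 0; i < hashtags.length(); i++) {
            System.out.println("#" + hashtags.getJSONObject(i).getString("text"));
        }
        JSONArray mentions = entities.getJSONArray("user_mentions");
        for (int i = 0; i < mentions.length(); i++) {
            System.out.println("@" + mentions.getJSONObject(i).getString("screen_name"));
        }
    }
}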

当爱已成负担 2024-09-04 00:23:04

Twitter open-sourced their text-processing library, which implements token handlers for hashtags and the like.

For example, HashtagExtractor:
https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/text/extractor/HashtagExtractor.java

It is based on Lucene's TokenStream.
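
Because those extractors are ordinary Lucene TokenStreams, they are consumed with the standard incrementToken() loop. A generic sketch of that pattern (shown with a stock WhitespaceAnalyzer in place of the Twitter extractor):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenStreamDemo {
    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new WhitespaceAnalyzer();
             TokenStream stream = analyzer.tokenStream("text",
                     new StringReader("Indexing #lucene tips from @user"))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString()); // one token per line
            }
            stream.end();
        }
    }
}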
