Tokenizing Twitter posts in Lucene

Posted on 2024-08-28 00:23:04

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?

More detailed version:

I want to index a number of tweets in Lucene and keep the terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (but it does other useful stuff like keeping domain names, email addresses or recognizing acronyms). How can I have an analyzer which does everything StandardTokenizer does but does not touch terms like @user and #hashtag?

My current solution is to preprocess the tweet text before feeding it into the analyzer and replace the characters by other alphanumeric strings. For example,

String newText = text.replaceAll("#", "hashtag");  // text holds the raw tweet
newText = newText.replaceAll("@", "addresstag");

Unfortunately this method breaks legitimate email addresses but I can live with that. Does that approach make sense?

Thanks in advance!

Amaç
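
On the preprocessing approach in the question: one way to stop it from breaking e-mail addresses is to rewrite `#` and `@` only when they begin a token, that is, when they are not preceded by a letter, digit, or underscore. The sketch below assumes that rule is good enough for the tweets being indexed. The class name `TweetPreprocessor` and the method `protectTags` are hypothetical; the placeholder strings are the ones from the question.

```java
import java.util.regex.Pattern;

class TweetPreprocessor {
    // '#' or '@' that is not preceded by a word character and is followed by one.
    // The lookbehind skips the '@' inside an address such as "bob@example.com".
    // Note: \w is ASCII-only by default in java.util.regex.
    private static final Pattern HASH = Pattern.compile("(?<!\\w)#(?=\\w)");
    private static final Pattern AT = Pattern.compile("(?<!\\w)@(?=\\w)");

    // Replaces token-initial '#' and '@' with alphanumeric placeholders.
    static String protectTags(String text) {
        String result = HASH.matcher(text).replaceAll("hashtag");
        return AT.matcher(result).replaceAll("addresstag");
    }

    public static void main(String[] args) {
        System.out.println(protectTags("@amac likes #lucene, mail bob@example.com"));
        // prints: addresstagamac likes hashtaglucene, mail bob@example.com
    }
}
```

The placeholders still have a weakness: an ordinary word that happens to start with `hashtag` or `addresstag` becomes indistinguishable from a rewritten tag, so less common placeholder strings may be a safer choice.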

泡沫很甜 2024-09-04 00:23:04

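StandardAnalyzer basically passes the tokens produced by StandardTokenizer through a StandardFilter (which removes all kinds of characters from standard tokens, such as the 's at the end of words), followed by a lowercase filter (which lowercases the words) and finally a StopFilter. That last one removes insignificant words such as "as", "in", "for", etc.

What you could easily do to get started is implement your own analyzer that does the same as StandardAnalyzer but uses a WhitespaceTokenizer as the first item processing the input stream.

For more details on the inner workings of analyzers, you can have a look here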
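
To get a feel for what such a whitespace-first chain would emit, here is a plain-Java simulation of three stages: split on whitespace, lowercase, drop stop words. It deliberately uses no Lucene classes, so it is only an approximation of the behavior. `WhitespacePipelineSketch`, `analyze`, and the tiny stop-word set are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class WhitespacePipelineSketch {
    // Illustrative stop words only, not Lucene's default list.
    private static final Set<String> STOP_WORDS = Set.of("as", "in", "for", "the", "a");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Stage 1: split on runs of whitespace, so '@' and '#' stay attached.
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) continue;  // input was empty or all whitespace
            // Stage 2: lowercase the token.
            String token = raw.toLowerCase(Locale.ROOT);
            // Stage 3: drop stop words.
            if (!STOP_WORDS.contains(token)) tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Thanks @Amac for the #Lucene tip, works!"));
        // prints: [thanks, @amac, #lucene, tip,, works!]
    }
}
```

Note the trailing punctuation in `tip,` and `works!`. Splitting on whitespace keeps `@user` and `#hashtag` whole, but it also keeps everything else that StandardTokenizer would have stripped, so some additional cleanup filter is likely needed.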

阳光下慵懒的猫 2024-09-04 00:23:04

It is cleaner to use a custom tokenizer that handles Twitter usernames natively. I have made one here: https://github.com/wetneb/lucene-twitter

This tokenizer will recognize Twitter usernames and hashtags, and a companion filter can be used to lowercase them (given that they are case-insensitive):

<fieldType name="text_twitter" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
    <filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
    <filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
  </analyzer>
</fieldType>
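
For readers who want to see the idea without pulling in the library, the following plain-Java sketch treats `@user` and `#hashtag` as single tokens and lowercases only those. It is not the code from the linked repository. The class `TwitterTokenSketch`, the regular expression, and the rule for what counts as a username or hashtag are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwitterTokenSketch {
    // Either '@'/'#' at the start of a token followed by word characters,
    // or a plain run of word characters. The lookbehind keeps the '@' in an
    // e-mail address from being read as the start of a username.
    private static final Pattern TOKEN = Pattern.compile("(?<!\\w)[@#]\\w+|\\w+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            boolean isTag = token.charAt(0) == '@' || token.charAt(0) == '#';
            // Lowercase only usernames and hashtags; leave other tokens untouched.
            tokens.add(isTag ? token.toLowerCase(Locale.ROOT) : token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hi @Bob, see #Lucene!"));
        // prints: [Hi, @bob, see, #lucene]
    }
}
```

Unlike a real Lucene tokenizer, this sketch splits domain names and e-mail addresses at punctuation, so it shows only the username and hashtag handling.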

绮烟 2024-09-04 00:23:04

There is a Twitter-specific tokenizer here: https://github.com/brendano/ark-tweet-nlp/blob/master/src/cmu/arktweetnlp/Twokenize.java

失去的东西太少 2024-09-04 00:23:04

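A tutorial on a Twitter-specific tokenizer (a modified version of the ark-tweet-nlp API) can be found at http://preciselyconcise.com/apis_and_installations/tweet_pos_tagger.php
The API can identify emoticons, hashtags, interjections, etc. present in a tweet.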
