Text Mining with PHP

Comments (6)

遗忘曾经 2024-09-07 09:52:27

If you're going to be using a Naive Bayes classifier, you don't really need a whole ton of NL processing. All you'll need is an algorithm to stem the words in the tweets and if you want, remove stop words.

Stemming algorithms abound and aren't difficult to code. Removing stop words is just a matter of searching a hash map or something similar. I don't see a justification to switch your development platform to accommodate the NLTK, although it is a very nice tool.
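
To make that concrete, here is a minimal PHP sketch (not from the answer itself) of the two steps it describes: an O(1) hash-map lookup for stop words and a deliberately crude suffix-stripping stemmer. The word list and stemming rules are placeholders; a real project would swap in a proper Porter stemmer.

<?php
// Minimal sketch: tokenize a tweet, drop stop words via a hash-map
// (associative array) lookup, and apply a crude suffix-stripping "stemmer".
// The stop-word list and suffix rules are placeholders for illustration.

$stopWords = array_flip(['the', 'a', 'an', 'is', 'to', 'and', 'of', 'in']);

function crudeStem(string $word): string
{
    // Strip a few common English suffixes; intentionally simplistic.
    foreach (['ing', 'ed', 'es', 's'] as $suffix) {
        if (strlen($word) > strlen($suffix) + 2 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, -strlen($suffix));
        }
    }
    return $word;
}

function preprocess(string $tweet, array $stopWords): array
{
    // Lowercase, split on non-letter characters, then filter and stem.
    $tokens = preg_split('/[^a-z]+/', strtolower($tweet), -1, PREG_SPLIT_NO_EMPTY);
    $kept = [];
    foreach ($tokens as $token) {
        if (!isset($stopWords[$token])) {   // O(1) stop-word lookup
            $kept[] = crudeStem($token);
        }
    }
    return $kept;
}

print_r(preprocess('Loving the new phone and the battery is amazing', $stopWords));
// Prints: lov, new, phone, battery, amaz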

单挑你×的.吻 2024-09-07 09:52:27

I did a very similar project a while ago - only classifying RSS news items instead of Twitter - also using PHP for the front-end and WEKA for the back-end. I used PHP/Java Bridge, which was relatively simple to use - a couple of lines added to your Java (WEKA) code and it allows your PHP to call its methods. Here's an example of the PHP-side code from their website:

<?php 
require_once("http://localhost:8087/JavaBridge/java/Java.inc");

$world = new java("HelloWorld");
echo $world->hello(array("from PHP"));
?>

Then (as someone has already mentioned), you just need to filter out the stop words. Keeping a txt file for this is pretty handy for adding new words (they tend to pile up when you start filtering out irrelevant words and accounting for typos).
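
A hedged sketch of that file-based filtering, assuming a stopwords.txt with one word per line (the file name is an assumption):

<?php
// Sketch of the stop-word txt file idea above. "stopwords.txt" (one word
// per line) is an assumed file name; append new words to it as they come up.

$stopWords = array_flip(
    array_map('trim', file('stopwords.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES))
);

function filterStopWords(array $tokens, array $stopWords): array
{
    // Keep only tokens that are not in the stop-word map.
    return array_values(array_filter(
        $tokens,
        fn ($token) => !isset($stopWords[strtolower($token)])
    ));
}

print_r(filterStopWords(['RT', 'this', 'is', 'sooo', 'cool'], $stopWords));
// Output depends on the contents of stopwords.txt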

The Naive Bayes model makes strong feature-independence assumptions, i.e. it doesn't account for words that are commonly paired (such as an idiom or phrase) - it just treats each word as an independent occurrence. However, it can outperform some of the more complex methods (such as word-stemming, IIRC) and should be perfect for a college class without making it needlessly complex.
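
To show where that independence assumption lives in the math, here is a compact, illustrative multinomial Naive Bayes in PHP with add-one smoothing - it simply sums per-word log probabilities, treating every word as independent given the label. This is a toy sketch, not the WEKA classifier the answer actually used.

<?php
// Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
// Illustrative only; not the WEKA implementation referred to above.

class TinyNaiveBayes
{
    private array $wordCounts = [];   // [label][word] => count
    private array $totalWords = [];   // [label] => total word count
    private array $docCounts  = [];   // [label] => number of training docs
    private array $vocabulary = [];   // word => true

    public function train(string $label, array $tokens): void
    {
        $this->docCounts[$label] = ($this->docCounts[$label] ?? 0) + 1;
        foreach ($tokens as $word) {
            $this->wordCounts[$label][$word] = ($this->wordCounts[$label][$word] ?? 0) + 1;
            $this->totalWords[$label] = ($this->totalWords[$label] ?? 0) + 1;
            $this->vocabulary[$word] = true;
        }
    }

    public function classify(array $tokens): string
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        $best = '';
        $bestScore = -INF;

        foreach ($this->docCounts as $label => $docCount) {
            // Log prior + sum of per-word log likelihoods: each word is
            // treated as independent given the label (the NB assumption).
            $score = log($docCount / $totalDocs);
            foreach ($tokens as $word) {
                $count = $this->wordCounts[$label][$word] ?? 0;
                $score += log(($count + 1) / ($this->totalWords[$label] + $vocabSize));
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $best = $label;
            }
        }
        return $best;
    }
}

$nb = new TinyNaiveBayes();
$nb->train('positive', ['love', 'this', 'phone']);
$nb->train('negative', ['hate', 'this', 'battery']);
echo $nb->classify(['love', 'this', 'screen']); // prints "positive" on this toy data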

日裸衫吸 2024-09-07 09:52:27

You can also use the uClassify API to do something similar to Naive Bayes. You basically train a classifier as you would with any algorithm (except here you're doing it via the web interface or by sending XML documents to the API). Then whenever you get a new tweet (or batch of tweets), you call the API to have it classify them. It's fast and you don't have to worry about tuning it. Of course, that means you lose the flexibility you get by controlling the classifier yourself, but that also means less work for you if that in itself is not the goal of the class project.
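
The shape of such an API call from PHP might look like the hedged sketch below. The endpoint URL, parameter names and response format are placeholders, not the real uClassify API - consult their documentation for the actual request/response schema and keys.

<?php
// Hedged sketch of calling a hosted classifier over HTTP from PHP.
// The URL, query parameters and response format below are PLACEHOLDERS,
// not the actual uClassify API.

function classifyTweet(string $tweet, string $apiKey): string
{
    $url = 'https://api.example-classifier.test/classify'   // placeholder endpoint
         . '?key=' . urlencode($apiKey)
         . '&text=' . urlencode($tweet);

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);

    if ($response === false) {
        throw new RuntimeException('Classification request failed');
    }
    return $response;   // e.g. a class label or probabilities, per the API docs
}

// Usage: loop over a batch of tweets and call the hosted classifier for each.
// foreach ($tweets as $tweet) { echo classifyTweet($tweet, 'YOUR_READ_KEY'), "\n"; }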

妳是的陽光 2024-09-07 09:52:27

Try Open Calais - http://viewer.opencalais.com/ . It has an API, PHP classes and more. LingPipe also handles this task - http://alias-i.com/lingpipe/index.html

天暗了我发光 2024-09-07 09:52:27

You can check out this library: https://github.com/Dachande663/PHP-Classifier - it's very straightforward.

等待圉鍢 2024-09-07 09:52:27

You can also use Thrift or Gearman to hand the NLP work off to NLTK running in a separate process.
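
As a hedged illustration of the Gearman route: the PHP side (using the PECL gearman extension) just submits the text as a job, and a separate Python/NLTK worker - not shown here - registers under the same function name and returns the processed result. The function name nltk_tokenize and the JSON response format are assumptions.

<?php
// Requires the PECL gearman extension and a running gearmand server.
// A Python worker using NLTK would register a function under the same
// name ("nltk_tokenize" is an assumed name) and do the actual NLP work.

$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);   // default gearmand host/port

$tweet = 'Loving the new phone, battery life is amazing';

// Synchronous call: blocks until the (hypothetical) NLTK worker replies,
// e.g. with a JSON list of stemmed tokens.
$result = $client->doNormal('nltk_tokenize', $tweet);

if ($client->returnCode() === GEARMAN_SUCCESS) {
    print_r(json_decode($result, true));
} else {
    echo 'Gearman job failed: ' . $client->error() . "\n";
}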
