How to choose a feature selection algorithm? - Advice

Posted on 2024-08-18 05:55:09

Is there a research paper or book I can read that would tell me which feature selection algorithm works best for the problem at hand?

I am simply trying to classify Twitter messages as pos/neg (to begin with). I started out with frequency-based feature selection (having started with the NLTK book), but soon realised that for similar problems different people have chosen different algorithms.

I could try frequency-based selection, mutual information, information gain and various other algorithms, but the list seems endless, and I was wondering whether there is a more efficient way than trial and error.

Any advice?

Comments (3)

满地尘埃落定 2024-08-25 05:55:09

Have you tried the book I recommended in response to your last question? It's freely available online and is entirely about the task you are dealing with: Opinion Mining and Sentiment Analysis by Pang and Lee. Chapter 4 ("Classification and Extraction") is just what you need!

十级心震 2024-08-25 05:55:09

I did an NLP course last term, and it became pretty clear that sentiment analysis is something that nobody really knows how to do well (yet). Doing this with unsupervised learning is of course even harder.

There's quite a lot of research going on in this area, some of it commercial and thus not open to the public. I can't point you to any research papers, but the book we used for the course was this book (Google Books preview). That said, the book covers a lot of material and might not be the quickest way to find a solution to this particular problem.

The only other thing I can suggest is to search around, perhaps on scholar.google.com, for "sentiment analysis" or "opinion mining".

Have a look at the NLTK movie_reviews corpus. The reviews are already categorized as pos/neg and might help you with training your classifier, although the language you find on Twitter is probably very different from the language of movie reviews.
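To make the frequency-based approach from the question concrete, here is a minimal sketch in plain Python. The toy documents, the helper names `top_k_words` and `document_features`, and the cutoff `k=5` are all illustrative assumptions, not anything from the NLTK book or this thread. With NLTK and the corpus installed, the toy list could be replaced by the words and categories read from `nltk.corpus.movie_reviews`.

```python
from collections import Counter

# Hypothetical toy documents standing in for a real labelled corpus.
# With NLTK, each pair could instead be built from
# movie_reviews.words(fileid) and the file's category.
labeled_docs = [
    ("a great and moving film".split(), "pos"),
    ("great acting and a great script".split(), "pos"),
    ("a dull and boring film".split(), "neg"),
    ("boring plot and dull acting".split(), "neg"),
]

def top_k_words(docs, k):
    """Frequency-based selection: keep the k most frequent words overall."""
    counts = Counter(word for words, _ in docs for word in words)
    return [word for word, _ in counts.most_common(k)]

def document_features(words, vocabulary):
    """Binary 'contains(word)' features restricted to the selected words."""
    present = set(words)
    return {f"contains({w})": (w in present) for w in vocabulary}

vocabulary = top_k_words(labeled_docs, k=5)
featuresets = [
    (document_features(words, vocabulary), label)
    for words, label in labeled_docs
]

print(vocabulary)
print(featuresets[0])
```

Note that the most frequent word in this toy data is "and", which says nothing about sentiment. That is the usual weakness of raw frequency as a selection criterion and one reason people reach for class-aware scores such as mutual information. The `(feature dict, label)` pairs are in the shape that `nltk.NaiveBayesClassifier.train` accepts, should you want to plug them into NLTK.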

As a last note, please post any successes (or failures for that matter) here. This issue will come up later for sure at some point.

断爱 2024-08-25 05:55:09

Unfortunately, there is no silver bullet in machine learning. This is usually referred to as the "No Free Lunch" theorem. Basically, a number of algorithms will work for a given problem; some do better on some problems and worse on others, and averaged over all problems they perform about the same. The same feature set may cause one algorithm to perform better and another to perform worse on a given data set. For a different data set, the situation could be completely reversed.

Usually what I do is pick a few feature selection algorithms that have worked for others on similar tasks and then start with those. If the performance I get using my favorite classifiers is acceptable, scrounging for another half percentage point probably isn't worth my time. But if it's not acceptable, then it's time to re-evaluate my approach, or to look for more feature selection methods.
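The workflow above (pick a few selection methods, run each through the same classifier, compare) can be sketched as follows. This is a minimal illustration under stated assumptions: the toy tweets, the function names, the cutoff `k=4`, and the small Naive Bayes classifier are all made up for the example, and the result on six toy sentences says nothing about which method wins on real data.

```python
import math
from collections import Counter

# Hypothetical toy data; in practice use your own labelled tweets.
train = [
    ("i love this phone".split(), "pos"),
    ("i love the great screen".split(), "pos"),
    ("this is a great day".split(), "pos"),
    ("i hate this phone".split(), "neg"),
    ("i hate the awful screen".split(), "neg"),
    ("this is an awful day".split(), "neg"),
]
test = [
    ("love the great phone".split(), "pos"),
    ("hate this awful phone".split(), "neg"),
]

def by_frequency(docs, k):
    """Rank words by document frequency (ties broken alphabetically)."""
    df = Counter(w for words, _ in docs for w in set(words))
    return sorted(df, key=lambda w: (-df[w], w))[:k]

def mutual_information(docs, word):
    """Mutual information (bits) between word presence and class label."""
    n = len(docs)
    joint = Counter((word in words, label) for words, label in docs)
    presence, labels = Counter(), Counter()
    for (present, label), count in joint.items():
        presence[present] += count
        labels[label] += count
    # Cells with zero count are absent from `joint` and contribute nothing.
    return sum(
        (count / n) * math.log2(count * n / (presence[p] * labels[l]))
        for (p, l), count in joint.items()
    )

def by_mutual_information(docs, k):
    """Rank words by mutual information with the label."""
    vocab = {w for words, _ in docs for w in words}
    return sorted(vocab, key=lambda w: (-mutual_information(docs, w), w))[:k]

def train_nb(docs, vocab):
    """Multinomial Naive Bayes with add-one smoothing over `vocab` only."""
    vocab = set(vocab)
    labels = sorted({label for _, label in docs})
    doc_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in labels}
    for words, label in docs:
        word_counts[label].update(w for w in words if w in vocab)

    def classify(words):
        def log_score(label):
            total = sum(word_counts[label].values()) + len(vocab)
            score = math.log(doc_counts[label] / len(docs))
            for w in words:
                if w in vocab:
                    score += math.log((word_counts[label][w] + 1) / total)
            return score
        return max(labels, key=log_score)

    return classify

def accuracy(classify, docs):
    return sum(classify(words) == label for words, label in docs) / len(docs)

selectors = {"frequency": by_frequency, "mutual_information": by_mutual_information}
results = {}
for name, select in selectors.items():
    vocab = select(train, k=4)
    results[name] = accuracy(train_nb(train, vocab), test)
    print(name, vocab, results[name])
```

On real data the held-out set should be much larger (or use cross-validation), and the ranking of methods may well flip from one data set to the next, which is exactly the point made above.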
