Simple interface for tagging research data?

Posted 2024-10-07 16:25:25

Before I can build a system that automatically classifies text, I need to manually classify a whole bunch of samples as a training/evaluation set. Is there some existing tool that will let me manually tag thousands of items without too much pain? And if not, what's the quickest way to whip something together?

As an example, imagine you have a bunch of Twitter messages. You'd like to put them in particular buckets: happy, sad, funny, angry, and spam. Some things go in multiple buckets. You could just dump everything into a file and insert some tags with vi, but that's error-prone and kinda slow. More importantly, having a nice interface means maybe you can talk your colleagues into doing a bunch of the work. Web, GUI, or console doesn't matter much; just as long as it's quick and easy. Is there anything like that?

I'm hoping yes, although I can't find anything with Google. If I have to build something, is there a good place to start? From rummaging, my first impression is that Rails + jQuery + acts_as_taggable_on + jQuery Tokenizing Autocomplete seems ok, but I'm open to other things.

Comments (5)

不…忘初心 2024-10-14 16:25:25

I think Rails + jQuery + acts_as_taggable_on + jQuery Tokenizing Autocomplete, as you mentioned, is a good choice!

烟花肆意 2024-10-14 16:25:25

Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome) is designed specifically for the use case you describe. It lets you upload data, create a form, and farm out the classification to people; the results come back as a file.

月亮坠入山谷 2024-10-14 16:25:25

Why not just go simple and use Excel (or any other spreadsheet program)?

Just put the messages (to be tagged) in the first column, and then create a little macro that lets the user (you/colleagues/...) click the adjacent cell to select one of the buckets. If a message belongs in multiple buckets, let the user click the next adjacent cell to choose another bucket. (If you want, you can cap the number of chosen buckets by restricting the number of cells that can be edited.)

You will then have all messages tagged in a format that is very easy to upload to a database for further processing.

There is nothing high tech here, which is good for colleagues who may not be computer savvy. Everybody knows how to enter data into a spreadsheet!
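
To illustrate the "easy to upload" step, here is a minimal Ruby sketch that reads such a sheet once it has been exported as CSV. It assumes the layout described above (message in the first column, one bucket per adjacent cell) and the bucket names from the question. The function name `parse_labeled_sheet` and the typo check are illustrative assumptions, not part of the answer.

```ruby
require "csv"

# Bucket names taken from the question; any other cell value is
# treated as a typo and reported instead of being stored.
VALID_BUCKETS = %w[happy sad funny angry spam].freeze

# Parse exported spreadsheet text. Column 1 is the message, every later
# non-empty cell is one bucket. Returns [records, errors].
def parse_labeled_sheet(csv_text)
  records = []
  errors = []
  CSV.parse(csv_text).each_with_index do |row, index|
    message, *cells = row
    next if message.nil? || message.strip.empty? # skip blank rows

    labels = cells.compact.map { |cell| cell.strip.downcase }.reject(&:empty?).uniq
    unknown = labels - VALID_BUCKETS
    errors << "row #{index + 1}: unknown bucket(s) #{unknown.join(', ')}" unless unknown.empty?
    records << { message: message, labels: labels - unknown }
  end
  [records, errors]
end
```

The returned records can then be inserted into whatever database or training file the classifier uses, and the error list can be handed back to whoever did the tagging.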

如何视而不见 2024-10-14 16:25:25

If you want to go high tech (compared with my earlier low tech Excel answer), you could just use Weka Tools, which "...contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes."

清醇 2024-10-14 16:25:25

In my case, I ended up building something with Ruby's HighLine module for command-line interfaces. It's not as fancy as a web-based interface, but it was simple to build and, thanks to its single-character mode, very fast to use.
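
For readers who want a starting point, here is a rough sketch of the same idea using only Ruby's standard library. It is not the HighLine-based tool described above: it reads one line of keys per message rather than using a single-character mode, and the key-to-bucket mapping and the names `buckets_for` and `label_items` are hypothetical.

```ruby
require "json"
require "stringio"

# Hypothetical one-key shortcuts for the buckets named in the question.
BUCKETS = {
  "h" => "happy",
  "s" => "sad",
  "f" => "funny",
  "a" => "angry",
  "x" => "spam"
}.freeze

# Turn a string of typed keys (e.g. "hf") into bucket names, ignoring
# unknown keys and duplicates.
def buckets_for(keys)
  keys.downcase.chars.map { |key| BUCKETS[key] }.compact.uniq
end

# Show each message, read one line of keys, and collect the labels.
# An empty line (or end of input) leaves the message unlabeled.
def label_items(messages, input: $stdin, output: $stdout)
  legend = BUCKETS.map { |key, name| "[#{key}] #{name}" }.join("  ")
  messages.map do |text|
    output.puts text
    output.print "#{legend} > "
    { "text" => text, "labels" => buckets_for(input.gets.to_s.strip) }
  end
end

# Demo with scripted input so the sketch runs unattended;
# drop the input: and output: arguments to label interactively.
typed = StringIO.new("hf\nx\n")
labeled = label_items(["just got a puppy!!", "WIN A FREE PHONE"],
                      input: typed, output: StringIO.new)
puts JSON.generate(labeled)
```

Because several keys can be typed for one message, items that belong in multiple buckets are handled without extra prompts.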
