Natural language / text mining and Reddit / social news sites
I think there is a wealth of natural language data associated with sites like Reddit, Digg, or news.google.com.
I have done a little research on text mining, but I can't figure out how to apply those tools to parse something like Reddit.
What kinds of applications can you come up with?
Comments (3)
I have found in the past that the best way to mine data on sites like Reddit or Digg is to start with the developer API they provide. Typically you have a focused interest in a topic or a trend, and the only way to get that data is through an established public interface. You can also parse their feeds, and combine the two sources to uncover about 90% of what you would want to know. If you want to do deep research on data that is not available through an API, be prepared to spend a significant amount of time writing custom wrappers around a tool like cURL. If you have the budget, you can also call them and ask whether they offer paid research data on users.
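To make the "combine feeds with API data" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the JSON shape (`items`, `url`, `score`), the sample documents, and the function names are all hypothetical, not any site's real API. A real feed or API response would need its own field mapping, and you would fetch the text over HTTP first.

```python
import json
import xml.etree.ElementTree as ET


def parse_rss_items(rss_text):
    """Return a list of {'title', 'link'} dicts from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items


def parse_api_items(json_text):
    """Return {'title', 'link', 'score'} dicts from a HYPOTHETICAL API payload.

    Assumed shape: {"items": [{"title": ..., "url": ..., "score": ...}]}
    """
    payload = json.loads(json_text)
    return [
        {"title": it["title"], "link": it["url"], "score": it.get("score", 0)}
        for it in payload.get("items", [])
    ]


def combine(feed_items, api_items):
    """Merge both sources keyed by link; API fields win where they overlap."""
    merged = {}
    for it in feed_items:
        merged[it["link"]] = dict(it)
    for it in api_items:
        merged.setdefault(it["link"], {}).update(it)
    return list(merged.values())


# Made-up sample data standing in for a downloaded feed and API response.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
<item><title>Post A</title><link>http://example.com/a</link></item>
<item><title>Post B</title><link>http://example.com/b</link></item>
</channel></rss>"""

SAMPLE_JSON = '{"items": [{"title": "Post A", "url": "http://example.com/a", "score": 42}]}'

combined = combine(parse_rss_items(SAMPLE_RSS), parse_api_items(SAMPLE_JSON))
```

Keying on the link is just one way to join the two sources; an item ID would work equally well if both the feed and the API expose one.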
I'd start with the RSS feeds, and after that I might use Nutch (the Apache web crawler); what to actually do with the data is more your call.
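As one possible first step once the feed titles are collected, here is a small sketch that counts the most frequent terms across them. The stopword list, the sample titles, and the `top_terms` helper are made up for illustration; a real text-mining pipeline would likely use a proper tokenizer and a larger stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "for", "on", "is"}


def top_terms(titles, n=3):
    """Return the n most common non-stopword terms across the given titles."""
    counts = Counter()
    for title in titles:
        # Lowercase, then pull out runs of letters, digits, and apostrophes.
        for word in re.findall(r"[a-z0-9']+", title.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)


# Made-up titles standing in for items pulled from a feed.
titles = [
    "Python text mining for beginners",
    "Mining Reddit data with Python",
    "An introduction to text mining",
]
terms = top_terms(titles)
```

Running this over titles collected at different times would give a rough view of which topics are trending.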
These are good ideas. I can get the data, but what applications can be built around it?