According to the context you have provided, this is a supervised learning problem. Therefore, you are doing classification, not clustering. If I misunderstood, please update your question to say so.
I would start with the simplest features: tokenize the Unicode text of the pages, use a dictionary to map every new token to a number, and simply treat the presence of a token as a binary feature.
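For concreteness, here is a minimal sketch of that feature extraction in Python; the toy `pages` list and the crude regex tokenizer are just my assumptions for illustration:

```python
import re

# Toy stand-in for the crawled pages (assumption: you already have their text).
pages = [
    "Machine learning on web pages",
    "Aprendizaje automático en páginas web",
]

vocab = {}  # token -> feature index

def tokenize(text):
    # Very crude Unicode-aware tokenizer: lowercase, keep runs of word characters.
    return re.findall(r"\w+", text.lower())

def to_feature_set(text):
    # Presence-only features: the set of token indices that occur in the page.
    indices = set()
    for token in tokenize(text):
        if token not in vocab:
            vocab[token] = len(vocab)  # assign a number to every new token
        indices.add(vocab[token])
    return indices

features = [to_feature_set(p) for p in pages]
print(vocab)
print(features)
```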
Next, I would use the simplest algorithm I can - I tend to go with Naive Bayes, but if you have an easy way to run an SVM, that is also fine.
Compare your results with some baseline - say assigning the most frequent class to all the pages.
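Here is a sketch of that comparison, assuming you have scikit-learn available and your data in a list of page strings `texts` with matching class labels `labels` (both names are assumptions, not from the question); the vectorizer plays the role of the hand-rolled dictionary above:

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# texts: list of page strings, labels: list of class labels (assumed to exist).
nb = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
baseline = make_pipeline(CountVectorizer(binary=True),
                         DummyClassifier(strategy="most_frequent"))

print("Naive Bayes:            ", cross_val_score(nb, texts, labels, cv=5).mean())
print("Most-frequent baseline: ", cross_val_score(baseline, texts, labels, cv=5).mean())
```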
Is the simplest approach good enough? If not, start iterating over algorithms and features.
If you go the supervised route, then the fact that the web pages are in multiple languages shouldn't make a difference. If you go with, say, lexical features (bag-o'-words style), then each language will end up yielding disjoint sets of features, but that's okay. All of the standard algorithms will likely give comparable results, so just pick one and go with it. I agree with Yuval that Naive Bayes is a good place to start, and only if that doesn't meet your needs should you try something like SVMs or random forests.
If you go the unsupervised route, though, the fact that the texts aren't all in the same language might be a big problem. Any reasonable clustering algorithm will first group the texts by language, and then within each language cluster by something like topic (if you're using content words as features). Whether that's a bug or a feature will depend entirely on why you want to classify these texts. If the point is to group documents by topic, irrespective of language, then it's no good. But if you're okay with having different categories for each language, then yeah, you've just got as many separate classification problems as you have languages.
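A toy illustration of why that happens (made-up sentences, scikit-learn assumed): with bag-of-words features the vocabularies of different languages barely overlap, so cross-language similarities are essentially zero, and any distance-based clusterer will split on language before anything else.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sleeps on the sofa",        # English
    "the dog chases the cat outside",    # English
    "el gato duerme sobre el sofá",      # Spanish
    "el perro persigue al gato fuera",   # Spanish
]

X = CountVectorizer().fit_transform(docs)
print(cosine_similarity(X).round(2))
# Within-language similarities are clearly positive; cross-language ones are ~0
# (here exactly 0, since the toy vocabularies do not overlap at all).
```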
If you do want a unified set of classes, then you'll need some way to link similar documents across languages. Are there any documents in more than one language? If so, you could use them as a kind of statistical Rosetta Stone, to link words in different languages. Then, using something like Latent Semantic Analysis, you could extend that to second-order relations: words in different languages that don't ever occur in the same document, but which tend to co-occur with words which do. Or maybe you could use something like anchor text or properties of the URLs to assign a rough classification to documents in a language-independent manner and use that as a way to get started.
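Here is one rough sketch of that Rosetta Stone idea, under the assumption that you have some pages available in two languages and are happy to use scikit-learn: concatenate the two language versions of each such page into one training document, fit LSA on those, and then project the monolingual pages into the resulting space.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Assumed inputs: (english_text, spanish_text) pairs for pages available in both
# languages, plus monolingual pages we want to place in the same latent space.
bilingual_pairs = [
    ("economy inflation market prices", "economía inflación mercado precios"),
    ("football match goal team player", "fútbol partido gol equipo jugador"),
    ("rain storm weather forecast wind", "lluvia tormenta clima pronóstico viento"),
]
monolingual_docs = [
    "the market reacted to inflation",       # English, economy topic
    "el equipo marcó un gol en el partido",  # Spanish, football topic
]

# Concatenating the two halves makes translated terms co-occur in one document,
# so LSA can pull them into the same latent dimensions.
training_docs = [en + " " + es for en, es in bilingual_pairs]
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
lsa.fit(training_docs)

print(lsa.transform(monolingual_docs).round(2))
# On a toy corpus these numbers mean little; with many bilingual pages, documents
# about the same topic end up with similar vectors in this space, whichever
# language they are written in.
```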
But, honestly, it seems strange to go into a classification problem without a clear idea of what the classes are (or at least what would count as a good classification). Coming up with the classes is the hard part, and it's the part that'll determine whether the project is a success or failure. The actual algorithmic part is fairly rote.
The main answer is: try different approaches. Without actual testing it's very hard to predict which method will give the best results, so I'll just suggest some methods that I would try first and describe their pros and cons.
First of all, I would recommend supervised learning. Even if the resulting classification is not very accurate, it may still give better results than unsupervised clustering. One of the reasons is the number of random factors involved in clustering. For example, the k-means algorithm relies on randomly selected starting points, which can lead to very different results across runs (though the x-means modification seems to stabilize this behavior). Clustering gives good results only if the underlying elements form well-separated regions in the feature space.
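A small standalone sketch of that sensitivity to initialization, using synthetic data and scikit-learn (my illustration, not part of the original argument):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Poorly separated synthetic data: five overlapping blobs in two dimensions.
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=3.0, random_state=0)

# A single random initialization per run (n_init=1) exposes the variability.
for seed in range(3):
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")
# Different seeds can land in different local optima; the usual remedies are a
# larger n_init or k-means++ seeding, but some sensitivity remains.
```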
One approach to handling multilingual data is to use multilingual resources as reference points. For example, you can index some Wikipedia articles and create "bridges" between the same topics in different languages. Alternatively, you can create a multilingual association dictionary, as this paper describes.
As for methods, the first thing that comes to mind is instance-based semantic methods like LSI. It uses the vector space model to compute distances between words and/or documents, and unlike other methods it can handle synonymy and polysemy efficiently. The disadvantages of this method are its computational cost and the lack of implementations. One phase of LSI uses a very large co-occurrence matrix, which for a large corpus of documents will require distributed computing and other special treatment. There is a modification of LSA called Random Indexing that does not construct the full co-occurrence matrix, but you will hardly find a suitable implementation of it. Some time ago I wrote a Clojure library for this method, but it is pre-alpha now, so I can't recommend using it. Nevertheless, if you decide to give it a try, you can find the project 'Clinch' by the user 'faithlessfriend' on GitHub (I won't post a direct link to avoid unnecessary advertisement).
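Since implementations of Random Indexing really are hard to find, here is a minimal document-level sketch in plain NumPy; it is my own illustration (not the Clinch library), and the toy documents are made up. Each document gets a sparse random index vector, and a word's vector is the sum of the index vectors of the documents it occurs in.

```python
import numpy as np

def random_index_vector(dim, nonzeros, rng):
    # A sparse ternary vector: a few +1/-1 entries, zeros elsewhere.
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzeros, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=nonzeros)
    return v

def random_indexing(docs, dim=300, nonzeros=10, seed=0):
    rng = np.random.default_rng(seed)
    word_vectors = {}
    for doc in docs:
        doc_vec = random_index_vector(dim, nonzeros, rng)  # one index vector per document
        for word in doc.lower().split():
            # Accumulate the document's index vector into every word it contains.
            word_vectors.setdefault(word, np.zeros(dim))
            word_vectors[word] += doc_vec
    return word_vectors

docs = ["the economy and the market", "the market and inflation", "goal for the team"]
wv = random_indexing(docs)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(wv["economy"], wv["market"]))  # co-occur in a document -> clearly positive
print(cos(wv["economy"], wv["team"]))    # never co-occur -> near zero in expectation
```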
Beyond specialized semantic methods, the rule "simplicity first" applies. From this point of view, Naive Bayes is the right place to start. The only note here is that the multinomial version of Naive Bayes is preferable: my experience is that word counts really do matter.
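A quick way to check that on your own data, again assuming scikit-learn and that `texts` holds the page strings and `labels` their classes (placeholder names):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

binary_nb = make_pipeline(CountVectorizer(binary=True), BernoulliNB())  # presence only
count_nb = make_pipeline(CountVectorizer(), MultinomialNB())            # word counts

print("Bernoulli (presence):", cross_val_score(binary_nb, texts, labels, cv=5).mean())
print("Multinomial (counts):", cross_val_score(count_nb, texts, labels, cv=5).mean())
```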
SVM is, at its core, a technique for separating data with a hyperplane, and text data is almost always not cleanly linearly separable (at least a few common words appear in almost any pair of documents). This doesn't mean that SVM cannot be used for text classification - you should still try it, but the results may be much lower than for other machine learning tasks.
I don't have enough experience with decision trees, but using them for efficient text classification seems strange to me. I have seen examples where they gave excellent results, but when I tried the C4.5 algorithm on this task, the results were terrible. I believe you should get some software where decision trees are implemented and test them yourself - it is always better to know than to guess.
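In that spirit, here is a sketch for testing a few classifiers side by side (scikit-learn assumed, with the same placeholder `texts` and `labels`; note that scikit-learn's decision tree is CART rather than C4.5, which is close enough for a quick test):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "Multinomial NB": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Decision tree": DecisionTreeClassifier(),
}

# Same features for every model, so the comparison is between classifiers only.
for name, clf in classifiers.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    score = cross_val_score(model, texts, labels, cv=5).mean()
    print(f"{name:15s} {score:.3f}")
```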
There's much more to say on each of these topics, so feel free to ask more questions about any specific one.