Categorizing a set of phrases into groups of similar phrases
I have a few apps that generate textual tracing information (logs) to log files. The tracing information is typical printf() style - i.e. there are a lot of log entries that are similar (same format argument to printf), but that differ where the format string had parameters.
What algorithm (URLs, books, articles, ...) would allow me to analyze the log entries and categorize them into several bins/containers, where each bin has one associated format?
Essentially, what I would like is to transform the raw log entries into (formatA, arg0 ... argN) instances, where formatA is shared among many log entries. formatA does not have to be the exact format used to generate the entry (even more so if that makes the algorithm simpler).
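For example, given hypothetical entries like the ones below, a naive sketch of the desired (format, args) transformation might look like this in Python (the masking rule is purely illustrative; discovering the formats automatically is the part I am asking about):

```python
import re

# Hypothetical log lines, purely for illustration.
log_lines = [
    "connection from 10.0.0.5 closed after 1342 ms",
    "connection from 10.0.0.9 closed after 87 ms",
    "cache miss for key user:4711",
]

# Naive placeholder rule: treat runs of digits (optionally dotted, e.g. IPs)
# as the variable parts of the entry.
VARIABLE = re.compile(r"\d+(?:\.\d+)*")

def to_format_and_args(line):
    """Return (format, args) with variable tokens replaced by '{}'."""
    return VARIABLE.sub("{}", line), VARIABLE.findall(line)

for line in log_lines:
    print(to_format_and_args(line))
# ('connection from {} closed after {} ms', ['10.0.0.5', '1342'])
# ('connection from {} closed after {} ms', ['10.0.0.9', '87'])
# ('cache miss for key user:{}', ['4711'])
```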
Most of the literature and web info I found deals with exact matching, maximum-substring matching, or k-difference matching (with k known/fixed ahead of time). It also focuses on matching a pair of (long) strings, or on producing a single bin (one match among all inputs). My case is somewhat different: I have to discover what constitutes a (good-enough) match (generally a sequence of discontinuous strings), and then categorize each input entry into one of the discovered matches.
Lastly, I'm not looking for a perfect algorithm, but something simple/easy to maintain.
Thanks!
1 Answer
You can use the famous Bag of Words technique to help group texts: build a sparse matrix with one row per piece of text (generally the text is stemmed with the Porter stemmer for better results). After computing the bag of words (counting the number of times each word appears in each piece of text) and the row and column totals of the matrix, you calculate the tf-idf for each cell, so that the angular (cosine) distance between texts, the measure that works best here, is driven by the informative words. After doing all of this you can run a clustering algorithm that groups the related pieces of text, and you can even extract the main keywords of each text from there. There is a program called cluto that does all of this automatically, and I strongly recommend it.
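A rough sketch of that pipeline, assuming scikit-learn in place of cluto (cluto itself is a standalone program) and a handful of hypothetical log lines; the tokenization and cluster count are illustrative choices, and stemming is omitted:

```python
# Bag of words + tf-idf + cosine-distance clustering, sketched with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# Hypothetical printf-style log lines standing in for the "pieces of text".
logs = [
    "connection from 10.0.0.5 closed after 1342 ms",
    "connection from 10.0.0.9 closed after 87 ms",
    "cache miss for key user:4711",
    "cache miss for key session:99",
]

# Bag of words with tf-idf weighting in one step. The token pattern keeps only
# word-like tokens, so the numeric parameters barely influence the vectors.
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]+")
X = vectorizer.fit_transform(logs)

# Group rows by angular (cosine) distance; lines that share most of their
# constant words end up in the same cluster. Older scikit-learn releases name
# the `metric` parameter `affinity`.
clustering = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
labels = clustering.fit_predict(X.toarray())

for label, line in sorted(zip(labels, logs)):
    print(label, line)
```

For printf-style entries, the constant words of the format dominate each row's tf-idf weights, so entries generated from the same format string tend to land in the same cluster; the cluster label then plays the role of the discovered format.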