Finding patterns in hex files
I have two files, each holding data collected from a different data stream. I want to search the files for any sort of pattern so that, at a later stage, when I collect more data from the streams, I can tell which data belongs to which stream (based on the patterns I found earlier).
An example of the data contained in a file: b0 82 91 a2 c3 89 b0 82 4a e3....(more bytes)...
Though I have taken very few bytes here, we can see the pattern "b0 82" occurring twice above. So the output should show the pattern and the number of times it occurs. Similarly, there can be 3-byte patterns or even longer ones.
Another example: aa 00 a7 2f 7b 4c ....(more bytes).....aa 01 a7.........(more bytes)......aa 05 a7.....
I think even this can be considered a 3-byte pattern where two bytes (aa and a7) are fixed and the middle one varies from 00 to 05.
These are two examples that I could think of, though there may be more kinds of patterns. There may even be hidden patterns that are not immediately visible. The whole idea is that any pattern will do as long as it helps to distinguish between the two streams at a later stage. I hope this states my problem more clearly. Please let me know the following:
How can we do this type of pattern finding?
Are there any tools or libraries that can help with this?
Which language or tool should I use for efficient and fast development?
Can the field of data mining help with this? If yes, how should I go about it?
Comments (4)
This seems like a pretty typical n-gram finding problem. Here is a link to some n-gram solutions:
Quicker way to detect n-grams in a string?
You should treat your hex just like any other string.
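As a rough illustration of the n-gram idea, here is a minimal sketch that counts every overlapping n-byte window in a byte string. The function name `count_ngrams` is made up for this example, and the sample bytes are the ones from the question; in practice you would read each capture with `open(path, "rb").read()`.

```python
from collections import Counter

def count_ngrams(data: bytes, n: int) -> Counter:
    """Count every overlapping n-byte window in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# Sample bytes taken from the question.
sample = bytes.fromhex("b0 82 91 a2 c3 89 b0 82 4a e3")

counts = count_ngrams(sample, 2)
for gram, freq in counts.most_common(3):
    # Print each pattern as space-separated hex followed by its count.
    print(" ".join(f"{b:02x}" for b in gram), freq)
```

Running the same count on both files for several values of n, and comparing which n-grams are frequent in one file but rare in the other, gives a first set of candidate distinguishing patterns.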
You can train Markov models or even hidden Markov models on your streams and use these models to decide which stream new data most probably belongs to. There are supposedly dozens of libraries which can do this in your programming language of choice.
Maybe start by reading a book. I suggest Pattern Recognition and Machine Learning by C. Bishop.
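To make the Markov model suggestion concrete, here is a minimal sketch without any library: a first-order model over bytes with add-one smoothing, trained once per stream, where new data is assigned to the model that gives it the higher log-likelihood. The class `ByteMarkovModel`, the function `classify`, and the training bytes are all assumptions made for illustration, not something from a particular library.

```python
import math
from collections import defaultdict

class ByteMarkovModel:
    """First-order Markov model over bytes with add-one smoothing."""

    def __init__(self):
        # transitions[prev][cur] = how often byte `cur` followed byte `prev`
        self.transitions = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, data: bytes) -> None:
        for prev, cur in zip(data, data[1:]):
            self.transitions[prev][cur] += 1
            self.totals[prev] += 1

    def log_likelihood(self, data: bytes) -> float:
        """Sum of smoothed log transition probabilities along data."""
        score = 0.0
        for prev, cur in zip(data, data[1:]):
            count = self.transitions.get(prev, {}).get(cur, 0)
            total = self.totals.get(prev, 0)
            # Add-one smoothing over the 256 possible next bytes.
            score += math.log((count + 1) / (total + 256))
        return score

def classify(data: bytes, model_a: ByteMarkovModel, model_b: ByteMarkovModel) -> str:
    """Return "A" or "B" depending on which model explains data better."""
    if model_a.log_likelihood(data) >= model_b.log_likelihood(data):
        return "A"
    return "B"

# Hypothetical training captures, one per stream.
model_a, model_b = ByteMarkovModel(), ByteMarkovModel()
model_a.train(bytes.fromhex("b08291a2") * 50)
model_b.train(bytes.fromhex("aa00a7aa01a7") * 50)

print(classify(bytes.fromhex("b08291a2b082"), model_a, model_b))
```

A first-order model only looks at pairs of adjacent bytes; a higher-order or hidden Markov model would capture longer dependencies at the cost of needing more training data.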
Here is another idea. Whether it works for you depends on how much data you are processing, how much memory you can use, and whether the kinds of patterns it detects end up being useful for your purposes.
With all these qualifications in mind, you may want to try using a suffix tree or suffix array. Especially for suffix trees, there are algorithms that let you keep updating the tree as you append characters to the text (so-called online suffix tree construction), most notably Ukkonen's algorithm. This might work particularly well with data streams (as opposed to a fixed-length, fully defined input text).
A suffix tree (and, in a similar way, a suffix array) represents all suffixes (in the sense of string ends, not linguistic suffixes) of a text. As such, it is particularly suitable for (a) checking whether any given string is a substring of the text, and (b) detecting repeated substrings in the text. With a few modifications in the right places, it can be used to detect repeated substrings with slight alterations (like your example of a pattern that is repeated with one character exchanged in the middle).
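To give a feel for point (b), here is a small sketch that uses a suffix array to find the longest repeated byte sequence. Note that it builds the array naively by sorting all suffixes, which is only reasonable for small inputs; it is not Ukkonen's algorithm and not an online construction. The function names are made up for this example.

```python
def common_prefix_len(x: bytes, y: bytes) -> int:
    """Length of the longest common prefix of x and y."""
    n = 0
    for p, q in zip(x, y):
        if p != q:
            break
        n += 1
    return n

def longest_repeated_substring(data: bytes) -> bytes:
    """Longest byte sequence that occurs at least twice in data."""
    # Naive suffix array: start offsets sorted by the suffix they begin.
    suffix_array = sorted(range(len(data)), key=lambda i: data[i:])
    best = b""
    # Repeated substrings appear as common prefixes of suffixes that
    # sit next to each other in sorted order.
    for a, b in zip(suffix_array, suffix_array[1:]):
        length = common_prefix_len(data[a:], data[b:])
        if length > len(best):
            best = data[a:a + length]
    return best

sample = bytes.fromhex("b0 82 91 a2 c3 89 b0 82 4a e3")
print(" ".join(f"{b:02x}" for b in longest_repeated_substring(sample)))  # b0 82
```

The same adjacent-suffix comparison can be extended to report every repeat above some minimum length rather than only the longest one.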
For a thorough introduction to these data structures, and if you have access to a university library or have the money, Dan Gusfield's book Algorithms on Strings, Trees, and Sequences will be very helpful. But there are a number of questions and answers related to this on SO, too.
If, after some further reading, you think it's worth a try, I could elaborate further on how I think suffix trees could be used for your purposes, in an answer to a new question specifically about the use of these algorithms for repeated pattern detection.
Your question isn't completely defined, but I will try to give you a few pointers:
Your patterns are probably expressible as regular expressions. If you don't know what these are, I would look for a concrete example in your favorite programming language. Python is a good option (the re module is part of the standard library). For C++, use boost::regex, and for other languages, use Google :)
Now - for using regexes to search binaries (hex) instead of text, try looking at something like this.
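For instance, in Python the `re` module accepts bytes patterns, so the "aa ?? a7" example from the question (fixed first and last byte, middle byte between 00 and 05) could be sketched roughly like this. The helper name `find_wildcard_matches` and the sample bytes are assumptions for illustration only.

```python
import re

# aa, then one byte in the range 00..05 (captured), then a7.
PATTERN = re.compile(rb"\xaa([\x00-\x05])\xa7")

def find_wildcard_matches(data: bytes) -> list:
    """Return (offset, middle_byte) for every match of PATTERN in data."""
    return [(m.start(), m.group(1)[0]) for m in PATTERN.finditer(data)]

# Hypothetical sample built from the bytes shown in the question.
data = bytes.fromhex("aa00a72f7b4c aa01a7 1122 aa05a7")

for offset, middle in find_wildcard_matches(data):
    print(f"offset {offset}: middle byte {middle:02x}")
```

This only helps once you already suspect a pattern shape; it does not discover patterns on its own.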
Good luck :)