Inspecting data piped through a C program: border cases
I'm receiving from socket A and writing that to socket B on the fly (like a proxy server might). I would like to inspect and possibly modify the data passing through. My question is how to handle border cases, i.e. where the regular expression I'm searching for would match across two successive iterations of reading from socket A and writing to socket B.
char buffer[4096];
int socket_A, socket_B;
/* Setting up the connection goes here */
for (;;) {
    /* recv() may return fewer bytes than requested; 0 means the peer closed */
    ssize_t n = recv(socket_A, buffer, sizeof buffer, 0);
    if (n <= 0)
        break;
    /* Inspect, and possibly modify buffer */
    send(socket_B, buffer, (size_t)n, 0);
    /* Oops, the matches I was looking for were at the end of buffer,
     * and will be at the beginning of buffer next iteration :( */
}
Comments (6)
My suggestion: have two buffers and rotate between them, so that the tail of the previous read is still at hand when the next read arrives.
Or something like that?
Assuming you know the maximum length M of the possible regular expression matches (or can live with an arbitrary value, or just use the whole buffer), you could handle it by not passing on the full buffer but holding back the last M-1 bytes. In the next iteration, put the newly received data after those M-1 bytes and apply the regular expression to the combined data.
If you know the format of the data transmitted (e.g. HTTP), you should be able to parse the contents to know when you have reached the end of the communication, at which point you should send out the trailing bytes you may have cached. If you do not know the format, then you'd need to implement a timeout on the recv so that you do not hold on to the end of the communication for too long. What counts as too long is something you will have to decide on your own.
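The hold-back idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `struct window`, `window_feed`, `window_flush` and `inspect` are hypothetical names, M is assumed to be 3, and a fixed same-length replacement of "dog" by "cat" stands in for the real regular expression step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_MATCH  3      /* M: assumed upper bound on the length of a match */
#define WINDOW_CAP 8192   /* room for one 4096-byte chunk plus the held-back tail */

/* Hypothetical carry-over window: bytes not yet forwarded live in data[0..len). */
struct window {
    char data[WINDOW_CAP];
    size_t len;
};

/* Stand-in for the real inspect/modify step: overwrite every "dog" with "cat".
 * The replacement has the same length, so offsets in the window stay valid. */
static void inspect(char *p, size_t n)
{
    for (size_t i = 0; i + 3 <= n; i++)
        if (memcmp(p + i, "dog", 3) == 0)
            memcpy(p + i, "cat", 3);
}

/* Append a freshly received chunk, inspect the whole window, then copy the
 * bytes that are safe to forward into out. Returns the number of bytes
 * written to out. The last MAX_MATCH - 1 bytes are held back, because a
 * match could start there and finish in the next chunk. */
static size_t window_feed(struct window *w, const char *chunk, size_t n, char *out)
{
    assert(w->len + n <= WINDOW_CAP);
    memcpy(w->data + w->len, chunk, n);
    w->len += n;
    inspect(w->data, w->len);

    size_t keep = MAX_MATCH - 1;
    if (w->len <= keep)
        return 0;                      /* not enough data to release anything */
    size_t ready = w->len - keep;
    memcpy(out, w->data, ready);
    memmove(w->data, w->data + ready, keep);
    w->len = keep;
    return ready;
}

/* At end of stream (recv() returned 0, or a timeout fired), release the tail. */
static size_t window_flush(struct window *w, char *out)
{
    size_t n = w->len;
    memcpy(out, w->data, n);
    w->len = 0;
    return n;
}
```

Feeding "hot d" and then "og days" yields "hot cat days" once the tail is flushed, even though "dog" straddles the two chunks. A replacement that changes the length of the data would need extra bookkeeping that this sketch leaves out.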
You need to know and/or say something about your regular expression.
Depending on the regular expression, you might need to buffer a lot more than you are buffering now.
A worst-case scenario might be something like a regular expression which says, "find everything, starting from the beginning up until the first occurrence of the word 'dog', and replace that with something else". If you have a regular expression like that, then you need to buffer (without forwarding) everything from the beginning until the first occurrence of the word 'dog', which might never happen, i.e. there might be an unbounded amount to buffer.
In the sense you're talking about (and in every sense for, say, TCP), sockets are streams. It follows from your question that you have some structure in the data. So you must do something along these lines: accumulate incoming data until you have a complete record, process that record, then forward it.
That handles most cases. If you have one of the rare cases where there's really no "record" then you have to build some sort of state machine (DFA). By this I mean you must be able to accumulate data until either a) it can't possibly match your regex, or b) it's a completed match.
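As a rough illustration of such a state machine, here is a sketch for the simplest case, a single fixed pattern, using a Knuth-Morris-Pratt failure table as the automaton. The names `struct matcher`, `matcher_init` and `matcher_feed` are hypothetical, and a real regular expression would need a full DFA rather than this table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAT_MAX 64

/* Hypothetical streaming matcher for one fixed pattern. `state` is the number
 * of pattern bytes matched so far; it survives from one chunk to the next. */
struct matcher {
    const char *pat;
    size_t len;
    size_t fail[PAT_MAX];   /* KMP failure table */
    size_t state;
};

static void matcher_init(struct matcher *m, const char *pat)
{
    m->pat = pat;
    m->len = strlen(pat);
    assert(m->len > 0 && m->len <= PAT_MAX);
    m->state = 0;
    m->fail[0] = 0;
    for (size_t i = 1, k = 0; i < m->len; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = m->fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        m->fail[i] = k;
    }
}

/* Feed one chunk; returns how many matches were completed inside it.
 * A match that starts in one chunk and ends in a later one is still found,
 * because m->state carries the partial progress across calls. */
static size_t matcher_feed(struct matcher *m, const char *chunk, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        while (m->state > 0 && chunk[i] != m->pat[m->state])
            m->state = m->fail[m->state - 1];
        if (chunk[i] == m->pat[m->state])
            m->state++;
        if (m->state == m->len) {          /* case b): a completed match */
            hits++;
            m->state = m->fail[m->state - 1];
        }
    }
    return hits;
}
```

After each call, only the last `m->state` bytes seen can still belong to a pending match, so everything before them falls under case a) and can be forwarded.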
EDIT:
If you're matching fixed strings instead of a true regex then you should be able to use the Boyer-Moore algorithm, which can actually run in sub-linear time (by skipping characters). If you do it right, as you move over the input you can throw previously seen data to the output buffer as you go, decreasing latency and increasing throughput significantly.
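For the fixed-string case, a minimal sketch of the skipping idea follows. Note that it uses the Horspool simplification of Boyer-Moore (bad-character shifts only) rather than the full algorithm, and `horspool_find` is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search: returns the offset of the first occurrence of
 * pat (length m) in text (length n), or n if there is none. */
static size_t horspool_find(const char *text, size_t n, const char *pat, size_t m)
{
    size_t shift[UCHAR_MAX + 1];
    assert(m > 0);

    /* By default a mismatch lets us skip a whole pattern length. */
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        shift[c] = m;
    /* Bytes that occur in the pattern (except its last position) allow a
     * smaller skip that lines them up with the text. */
    for (size_t i = 0; i + 1 < m; i++)
        shift[(unsigned char)pat[i]] = m - 1 - i;

    size_t pos = 0;
    while (pos + m <= n) {
        if (memcmp(text + pos, pat, m) == 0)
            return pos;
        /* Skip according to the text byte under the pattern's last position. */
        pos += shift[(unsigned char)text[pos + m - 1]];
    }
    return n;
}
```

When the search comes back empty, only the last m-1 bytes of the text could still begin a match, so everything before them can go to the output buffer right away.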
Basically, the problem with your code is that the recv/send loop is operating on a lower network layer than your modifications. How you solve this problem depends on what modifications you're making, but it probably involves buffering data until all local modifications can be made.
EDIT: I don't know of any regex library that can filter a stream like that. How hard this is going to be will depend on your regex and the protocol it's filtering.
One alternative is to use a poll(2)-like strategy with non-blocking sockets. On a read event, grab a buffer from the socket, push it onto the incoming queue, and call the lexer/parser/matcher that assembles the buffers into a stream and then pushes chunks onto the output queue. On a write event, take a chunk from the output queue, if any, and write it to the socket. This sounds kind of complicated, but it's not really, once you get used to the inverted control model.
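A stripped-down sketch of that event loop might look like the following. It makes several simplifying assumptions: one byte queue stands in for the separate incoming and output queues, the matcher is reduced to a placeholder comment, `pump` is a hypothetical name, and read(2)/write(2) are used instead of recv/send so the same code also works on pipes. The caller is assumed to have put both descriptors into non-blocking mode.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_CAP 8192

/* Hypothetical relay: copies in_fd to out_fd until EOF, driven by poll(2).
 * Returns 0 on clean EOF with everything written, -1 on error. */
static int pump(int in_fd, int out_fd)
{
    char queue[QUEUE_CAP];
    size_t len = 0;      /* bytes waiting in the queue */
    int eof = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    while (!eof || len > 0) {
        /* A negative fd tells poll() to ignore that entry. We only ask for
         * read events while there is room, and for write events while there
         * is something to send. */
        struct pollfd fds[2] = {
            { .fd = (!eof && len < QUEUE_CAP) ? in_fd : -1, .events = POLLIN },
            { .fd = (len > 0) ? out_fd : -1, .events = POLLOUT },
        };
        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (fds[0].revents & (POLLERR | POLLNVAL))
            return -1;
        if (fds[1].revents & (POLLERR | POLLHUP | POLLNVAL))
            return -1;

        if (fds[0].revents & (POLLIN | POLLHUP)) {       /* read event */
            ssize_t n = read(in_fd, queue + len, QUEUE_CAP - len);
            if (n < 0 && errno != EAGAIN && errno != EINTR)
                return -1;
            if (n == 0)
                eof = 1;
            if (n > 0) {
                /* The lexer/parser/matcher would run here on the new bytes. */
                len += (size_t)n;
            }
        }
        if (fds[1].revents & POLLOUT) {                  /* write event */
            ssize_t n = write(out_fd, queue, len);
            if (n < 0 && errno != EAGAIN && errno != EINTR)
                return -1;
            if (n > 0) {
                memmove(queue, queue + n, len - (size_t)n);
                len -= (size_t)n;
            }
        }
    }
    return 0;
}
```

A real proxy would call poll once over all its connections rather than loop per pair, and would release bytes to the output side only after the matcher has decided they cannot be part of a pending match.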