Filter (search and replace) array of bytes in an InputStream
I have an InputStream which takes an HTML file as its input. I have to get the bytes from the input stream.
I have a string: "XYZ". I'd like to convert this string to bytes and check whether the byte sequence I obtained from the InputStream contains a match for it. If it does, I have to replace the match with the byte sequence for some other string.
Could anyone help me with this? I have used regex to find and replace, but I don't know how to find and replace in a byte stream.
Previously, I used jsoup to parse the HTML and replace the string; however, due to some UTF encoding problems, the file appears corrupted when I do that.
TL;DR: My question is:
Is there a way to find and replace a string in byte format in a raw InputStream in Java?
Not sure you have chosen the best approach to solve your problem.
That said, I don't like to (and as a policy don't) answer questions with "don't", so here goes...
Have a look at FilterInputStream. Per its documentation, a FilterInputStream wraps another input stream, which it uses as its basic source of data, possibly transforming the data along the way.
It was a fun exercise to write it up. Here's a complete example for you:
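A minimal sketch of the FilterInputStream idea might look like the following. It is not the answer's original code: the class name ReplacingInputStreamSketch, the look-ahead queue design, and the replaceAll helper are all assumptions made for illustration, and the helper relies on InputStream.readAllBytes, so it assumes Java 9 or later. The sketch does not override skip() or available(), which would still bypass the filtering.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;

// Hypothetical name; an illustrative sketch, not a library class.
class ReplacingInputStreamSketch extends FilterInputStream {
    private final byte[] search;
    private final byte[] replacement;
    // Bytes read from the source but not yet handed to the caller.
    private final ArrayDeque<Integer> lookahead = new ArrayDeque<>();
    // Replacement bytes waiting to be handed to the caller.
    private final ArrayDeque<Integer> pending = new ArrayDeque<>();
    private boolean eof;

    ReplacingInputStreamSketch(InputStream in, byte[] search, byte[] replacement) {
        super(in);
        if (search.length == 0) throw new IllegalArgumentException("search must not be empty");
        this.search = search.clone();
        this.replacement = replacement.clone();
    }

    @Override
    public int read() throws IOException {
        while (true) {
            if (!pending.isEmpty()) return pending.poll();
            // Keep up to search.length bytes of look-ahead.
            while (!eof && lookahead.size() < search.length) {
                int b = in.read();
                if (b < 0) eof = true; else lookahead.add(b);
            }
            if (lookahead.isEmpty()) return -1;
            if (!windowMatches()) return lookahead.poll();
            // Full match: drop it and queue the replacement instead.
            lookahead.clear();
            for (byte b : replacement) pending.add(b & 0xFF);
        }
    }

    private boolean windowMatches() {
        if (lookahead.size() < search.length) return false;
        int i = 0;
        for (int b : lookahead) {
            if (b != (search[i++] & 0xFF)) return false;
        }
        return true;
    }

    // FilterInputStream's bulk read would bypass the filtering, so route it through read().
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int b = read();
            if (b < 0) break;
            buf[off + n++] = (byte) b;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public boolean markSupported() {
        return false;
    }

    // Convenience wrapper for in-memory strings (UTF-8 is an assumption).
    static String replaceAll(String input, String search, String replacement) {
        try (InputStream in = new ReplacingInputStreamSketch(
                new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)),
                search.getBytes(StandardCharsets.UTF_8),
                replacement.getBytes(StandardCharsets.UTF_8))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints: Hello abc world
        System.out.println(replaceAll("Hello xyz world", "xyz", "abc"));
    }
}
```

The queue of boxed integers keeps the sketch short; a production version would more likely use a byte ring buffer and a real bulk read.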
Example usage: given the bytes for the string "Hello xyz world", it prints the string with each match replaced.
The following approach will work, but I don't know how big the impact on performance is.
Wrap the InputStream with an InputStreamReader, wrap the InputStreamReader with a FilterReader that replaces the strings, then wrap the FilterReader with a ReaderInputStream.
It is crucial to choose the appropriate encoding; otherwise the content of the stream will become corrupted.
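To illustrate the encoding point only, here is a small sketch that decodes with an explicit charset, replaces on characters, and re-encodes with the same charset. It is a simplification rather than the Reader chain described above: it buffers the whole stream in memory instead of streaming, and it avoids ReaderInputStream because that class is not part of the JDK (Apache Commons IO provides one). The class name CharsetAwareReplace is hypothetical, and readAllBytes assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; buffers everything in memory, so only suitable for small inputs.
class CharsetAwareReplace {
    // Decode with the data's real charset, replace on characters,
    // then re-encode with the same charset so untouched text round-trips.
    static byte[] replace(byte[] data, Charset charset, String search, String replacement) {
        String text = new String(data, charset);
        return text.replace(search, replacement).getBytes(charset);
    }

    // Same operation for streams; the result is a new in-memory stream.
    static InputStream replace(InputStream in, Charset charset,
                               String search, String replacement) throws IOException {
        return new ByteArrayInputStream(replace(in.readAllBytes(), charset, search, replacement));
    }

    public static void main(String[] args) throws IOException {
        // The sample HTML contains a non-ASCII character (U+00E9) on purpose.
        byte[] html = "<p>caf\u00e9 XYZ</p>".getBytes(StandardCharsets.UTF_8);
        InputStream out = replace(new ByteArrayInputStream(html),
                StandardCharsets.UTF_8, "XYZ", "abc");
        System.out.println(new String(out.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```

Passing the wrong charset for the input (for example US-ASCII for UTF-8 data) is the kind of mistake that corrupts non-ASCII characters, so the charset should come from the document itself or its HTTP headers rather than from the platform default.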
If you want to use regular expressions to replace the strings, you can use Streamflyer, a tool of mine, which is a convenient alternative to FilterReader. You will find an example for byte streams on the Streamflyer web page. Hope this helps.
I needed something like this as well and decided to roll my own solution instead of using the example above by @aioobe. Have a look at the code. You can pull the library from Maven Central or just copy the source code.
This is how you use it. In this case, I'm using a nested instance to replace two patterns, to fix DOS and Mac line endings.
Here's the full source code:
There isn't any built-in functionality for search-and-replace on byte streams (InputStream). And a method for completing this task efficiently and correctly is not immediately obvious. I have implemented the Boyer-Moore algorithm for streams, and it works well, but it took some time. Without an algorithm like this, you have to resort to a brute-force approach where you look for the pattern starting at every position in the stream, which can be slow.
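To give a feel for what such an algorithm buys over brute force, here is a sketch of the Boyer-Moore-Horspool variant (a simplification of Boyer-Moore, swapped in for brevity) searching a byte array. It is not the streaming implementation mentioned above, which is not shown; the class name HorspoolSearch and the method indexOf are hypothetical, and the sketch works on data already held in memory.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: Boyer-Moore-Horspool search over an in-memory byte array.
class HorspoolSearch {
    // Returns the index of the first occurrence of needle at or after from, or -1.
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        if (needle.length == 0 || from < 0) {
            throw new IllegalArgumentException("needle must be non-empty and from non-negative");
        }
        // For each byte value: how far the window may shift when that byte
        // is aligned with the last position of the needle.
        int[] shift = new int[256];
        Arrays.fill(shift, needle.length);
        for (int i = 0; i < needle.length - 1; i++) {
            shift[needle[i] & 0xFF] = needle.length - 1 - i;
        }
        int pos = from;
        while (pos <= haystack.length - needle.length) {
            // Compare right to left, as Boyer-Moore style searches do.
            int j = needle.length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j]) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Skip ahead based on the haystack byte under the needle's last position.
            pos += shift[haystack[pos + needle.length - 1] & 0xFF];
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = "Hello xyz world".getBytes(StandardCharsets.US_ASCII);
        byte[] pattern = "xyz".getBytes(StandardCharsets.US_ASCII);
        // Prints: 6
        System.out.println(indexOf(data, pattern, 0));
    }
}
```

Brute force checks the pattern at every position; the shift table lets this version skip up to the pattern length at a time, which is where the speed-up on long patterns comes from.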
Even if you decode the HTML as text, using a regular expression to match patterns might be a bad idea, since HTML is not a "regular" language.
So, even though you've run into some difficulties, I suggest you pursue your original approach of parsing the HTML as a document. While you are having trouble with the character encoding, it will probably be easier, in the long run, to fix the right solution than it will be to jury-rig the wrong solution.
I needed a solution to this, but found that the answers here incurred too much memory and/or CPU overhead. Based on simple benchmarking, the solution below significantly outperforms the others here in these terms.
This solution is especially memory-efficient, incurring no measurable cost even with streams larger than a gigabyte.
That said, this is not a zero-CPU-cost solution. The CPU/processing-time overhead is probably reasonable for all but the most demanding or resource-sensitive scenarios, but the overhead is real and should be considered when evaluating whether this solution is worth employing in a given context.
In my case, the largest real-world file we process is about 6MB, where we see added latency of about 170ms with 44 URL replacements. This is for a Zuul-based reverse proxy running on AWS ECS with a single CPU share (1024). For most of the files (under 100KB), the added latency is sub-millisecond. Under high concurrency (and thus CPU contention), the added latency could increase; however, we are currently able to process hundreds of files concurrently on a single node with no humanly noticeable latency impact.
The solution we are using:
I came up with this simple piece of code when I needed to serve a template file in a Servlet, replacing a certain keyword with a value. It should be pretty fast and low on memory. With piped streams, I guess you can use it for all sorts of things.
/JC
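As a rough sketch of the template use case (not the answer's own code), the following copies an InputStream to an OutputStream while replacing a keyword on the fly. The class name TemplateCopy, the method names, and the choice of a Knuth-Morris-Pratt failure table to handle partial matches are assumptions made for illustration. Only the bytes of a partial match are held back, so memory use stays proportional to the keyword length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: streaming keyword replacement while copying.
class TemplateCopy {
    static void copyReplacing(InputStream in, OutputStream out,
                              byte[] keyword, byte[] value) throws IOException {
        if (keyword.length == 0) throw new IllegalArgumentException("keyword must not be empty");
        int[] fail = failureTable(keyword);
        int matched = 0; // number of keyword bytes currently held back
        int b;
        while ((b = in.read()) != -1) {
            byte cur = (byte) b;
            // On a mismatch, flush the held-back bytes that can no longer start a match.
            while (matched > 0 && cur != keyword[matched]) {
                int keep = fail[matched - 1];
                out.write(keyword, 0, matched - keep);
                matched = keep;
            }
            if (cur == keyword[matched]) {
                matched++;
                if (matched == keyword.length) { // full match: emit the value instead
                    out.write(value);
                    matched = 0;
                }
            } else {
                out.write(b);
            }
        }
        out.write(keyword, 0, matched); // trailing partial match is ordinary data
    }

    // fail[i] = length of the longest proper prefix of p[0..i] that is also its suffix.
    static int[] failureTable(byte[] p) {
        int[] fail = new int[p.length];
        for (int i = 1, k = 0; i < p.length; i++) {
            while (k > 0 && p[i] != p[k]) k = fail[k - 1];
            if (p[i] == p[k]) k++;
            fail[i] = k;
        }
        return fail;
    }

    // In-memory convenience wrapper (UTF-8 is an assumption).
    static String render(String template, String keyword, String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copyReplacing(new ByteArrayInputStream(template.getBytes(StandardCharsets.UTF_8)),
                    out, keyword.getBytes(StandardCharsets.UTF_8),
                    value.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints: Hello World!
        System.out.println(render("Hello ${name}!", "${name}", "World"));
    }
}
```

In a Servlet, the output stream would be the response's output stream; because the sketch reads one byte at a time, wrapping the source in a BufferedInputStream is advisable.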
You can do it with the help of the ReplacingInputStream class from the Apache POI library.
Java:
Gradle: