What is the fastest way in C# to parse text with custom delimiters and some very, very large field values?

I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that don't show up often between the delimiters. After searching around, I seem to have found no solutions in .NET that suit my needs, and the custom libraries people have written for this seem to have flaws when it comes to gigantic input (a 4GB file where some field values can easily run to several million characters).

While this seems to be a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in python using the csv module with no problems.

Here's an example input:

Field delimiter = 
quote character = þ

þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...

Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution, as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this, and I have a feeling I probably didn't need to write a parser from scratch for this anyway.

Comments (6)

紫轩蝶泪 2024-07-16 11:00:53

Use the FileHelpers API. It's .NET and open source. It's extremely high-performance, using compiled IL code to set fields on strongly typed objects, and it supports streaming.

It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
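
For illustration, here's a minimal sketch of what a streaming FileHelpers setup might look like. The record layout, the delimiter byte (\u0014 here), and the file name are assumptions, so check the library's documentation for the exact attribute semantics:

using System;
using FileHelpers;

// Hypothetical record layout; the real delimiter and field set come from your load file spec.
[DelimitedRecord("\u0014")]
public class EddRecord
{
    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string FieldName1;

    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string FieldName2;
}

class Program
{
    static void Main()
    {
        // FileHelperAsyncEngine streams records one at a time instead of loading the whole file.
        var engine = new FileHelperAsyncEngine<EddRecord>();
        using (engine.BeginReadFile("c:\\test.file"))
        {
            foreach (EddRecord record in engine)
                Console.WriteLine(record.FieldName1);
        }
    }
}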

If for some reason that doesn't do it for you, try just reading line by line with a string.Split:

public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    // Stream one line at a time and split on the quote character.
    // Note this assumes no field value contains an embedded newline.
    while ((line = input.ReadLine()) != null)
    {
        yield return line.Split('þ');
    }
}

That'll give you simple string arrays representing the lines in a streamy fashion that you can even LINQ into ;) Remember, however, that the IEnumerable is lazily loaded, so don't close or alter the StreamReader until you've iterated (or caused a full-load operation like ToList/ToArray - given your file size, however, I assume you won't do that!).

Here's a good sample use of it:

using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };
    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();

This will skip the header line, then print out the first two fields from the file where the fourth field contains the string "something". It will do this without loading the entire file into memory.

≈。彩虹 2024-07-16 11:00:53

For Windows and high-performance I/O, use I/O completion ports. You may have to do some extra plumbing to get it working in your case.

This is with the understanding that you want to use C#/.NET, and according to Joe Duffy:

18) Don’t use Windows Asynchronous Procedure Calls (APCs) in managed
code.

I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
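
As a sketch of what that can look like from C# without touching raw Win32: a FileStream opened with FileOptions.Asynchronous does its ReadAsync/BeginRead work through the Windows I/O completion port machinery under the hood. The file name and buffer size below are assumptions:

using System;
using System.IO;
using System.Threading.Tasks;

class AsyncIoDemo
{
    static async Task Main()
    {
        var buffer = new byte[64 * 1024];
        // FileOptions.Asynchronous opens the handle for overlapped I/O,
        // which .NET services via an I/O completion port on Windows.
        using (var fs = new FileStream("c:\\test.file", FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 64 * 1024,
                                       FileOptions.Asynchronous | FileOptions.SequentialScan))
        {
            int read;
            while ((read = await fs.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                // Feed buffer[0..read) to an incremental parser here.
            }
        }
    }
}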

As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.

单身情人 2024-07-16 11:00:53

I would be inclined to use a combination of memory-mapped files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding back an IEnumerable of your records / text lines (or whatever).
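
A minimal sketch of that combination, assuming a single-byte encoding (so 'þ' is the byte 0xFE) and treating whatever sits between fields as the delimiter; the names and file layout are illustrative:

using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

static class MmfFieldReader
{
    // Yields each þ-quoted field in order; bytes between fields (the delimiter) are skipped.
    public static IEnumerable<string> ReadFields(string path)
    {
        long length = new FileInfo(path).Length;
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewStream(0, length, MemoryMappedFileAccess.Read))
        {
            var field = new StringBuilder();
            bool inField = false;
            int b;
            while ((b = view.ReadByte()) != -1)
            {
                if (b == 0xFE) // 'þ' in Latin-1 (assumed encoding)
                {
                    if (inField) { yield return field.ToString(); field.Clear(); }
                    inField = !inField;
                }
                else if (inField)
                {
                    field.Append((char)b);
                }
            }
        }
    }
}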

窝囊感情。 2024-07-16 11:00:53

You mention that some fields are very, very big; if you try to read them into memory in their entirety, you may be getting yourself into trouble. I would read through the file in 8K (or other small) chunks, parse the current buffer, and keep track of state.

What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
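
A minimal sketch of that chunked approach; the buffer size, quote byte, and what you do with each field byte are assumptions:

using System.IO;

class ChunkedScan
{
    static void Main()
    {
        var buffer = new byte[8192];
        bool inField = false; // parser state carried across chunk boundaries
        using (var fs = File.OpenRead("c:\\test.file"))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (buffer[i] == 0xFE) // 'þ' in Latin-1 (assumed encoding)
                        inField = !inField;
                    else if (inField)
                    {
                        // Handle one byte of field content here (copy it to an
                        // output stream, match a search term, transform it)
                        // without ever holding a whole 5MB field in memory.
                    }
                }
            }
        }
    }
}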

a√萤火虫的光℡ 2024-07-16 11:00:53

I don't see a problem with you writing a custom parser. The requirements seem sufficiently different from anything already provided by the BCL, so go right ahead.

"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".

As for the large data sizes, make your parser work by reading one byte at a time and use a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.

Example of how you might use such a parser class:

using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192)))
{
    // Read a small field
    string smallField = reader.ReadFieldAsText();
    // Read a large field
    Stream largeField = reader.ReadFieldAsStream();
}
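
And a hypothetical skeleton of such a class matching the usage above. Only ReadFieldAsText is fleshed out; the single-byte-encoding assumption applies again, and ReadFieldAsStream is left unimplemented in this sketch:

using System;
using System.IO;
using System.Text;

public class EddReader : IDisposable
{
    private readonly Stream _stream;

    public EddReader(Stream stream) { _stream = stream; }

    // Reads the next þ-quoted field fully into a string (fine for small fields).
    public string ReadFieldAsText()
    {
        int b;
        // Skip delimiter bytes until the opening quote (0xFE is 'þ' in Latin-1).
        while ((b = _stream.ReadByte()) != -1 && b != 0xFE) { }
        if (b == -1) return null; // end of file
        var sb = new StringBuilder();
        // Accumulate until the closing quote.
        while ((b = _stream.ReadByte()) != -1 && b != 0xFE)
            sb.Append((char)b);
        return sb.ToString();
    }

    // Would hand back the next field as a forward-only Stream for huge fields.
    public Stream ReadFieldAsStream() { throw new NotImplementedException(); }

    public void Dispose() { _stream.Dispose(); }
}
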
紧拥背影 2024-07-16 11:00:53

While this doesn't help address the large-input issue, a possible solution to the parsing issue might include a custom parser that uses the strategy pattern to supply the delimiter.
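
For illustration, a minimal sketch of that idea; the interface, class names, and delimiter choice are all assumptions:

// The parser consults an injected strategy instead of hard-coding characters.
public interface IDelimiterStrategy
{
    char Quote { get; }
    bool IsDelimiter(char c);
}

public sealed class EddDelimiterStrategy : IDelimiterStrategy
{
    public char Quote { get { return 'þ'; } }
    public bool IsDelimiter(char c) { return c == '\u0014'; } // assumed delimiter
}

// A parser would take an IDelimiterStrategy in its constructor and call
// Quote / IsDelimiter wherever it would otherwise hard-code 'þ'.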
