Performance - Python vs C#/C++/C reading character by character


So I have these giant XML files (and by giant, I mean like 1.5GB+) and they don't have CRLFs. I'm trying to run a diff-like program to find the differences between these files.

Since I've yet to find a diff program that won't explode due to memory exhaustion, I've decided the best bet was to add CRLFs after closing tags.

I wrote a python script to read char-by-char and add new-lines after '>'. The problem is I'm running this on a single-core PC circa 1995 or something ridiculous, and it's only processing about 20MB/hour when I have both files converting at the same time.

Any idea if writing this in C#/C/C++ instead will yield any benefits? If not, does anyone know of a diff program that will go byte-by-byte? Thanks.


EDIT:

Here's the code for my processing function...

import codecs

def read_and_format(inputfile, outputfile):
    ''' Open input and output files, then read char-by-char and add new lines after ">" '''
    infile = codecs.open(inputfile,"r","utf-8")
    outfile = codecs.open(outputfile,"w","utf-8")

    char = infile.read(1) 
    while(1):
        if char == "":
            break
        else:
            outfile.write(char)
            if(char == ">"):
                outfile.write("\n")
        char = infile.read(1)

    infile.close()
    outfile.close()

EDIT2:
Thanks for the awesome responses. Increasing the read size created an unbelievable speed increase. Problem solved.


宁愿没拥抱 2024-08-10 14:06:15


Reading and writing a single character at a time is almost always going to be slow, because disks are block-based devices, rather than character-based devices - it will read a lot more than just the one byte you're after, and the surplus parts need to be discarded.

Try reading and writing more at a time, say, 8192 bytes (8KB) and then finding and adding newlines in that string before writing it out - you should save a lot in performance because a lot less I/O is required.

As LBushkin points out, your I/O library may be doing buffering, but unless there is some form of documentation that shows this does indeed happen (for reading AND writing), it's a fairly easy thing to try before rewriting in a different language.
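
For illustration, here is a minimal sketch of that buffered approach in Python, reusing the codecs-based UTF-8 handling from the question; the add_newlines name and the 8192-character chunk size are just placeholders:

import codecs

def add_newlines(inputfile, outputfile, chunk_size=8192):
    ''' Read in large chunks and add a newline after each ">" before writing. '''
    infile = codecs.open(inputfile, "r", "utf-8")
    outfile = codecs.open(outputfile, "w", "utf-8")
    try:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                break
            # ">" is a single character, so it can never straddle a chunk boundary.
            outfile.write(chunk.replace(">", ">\n"))
    finally:
        infile.close()
        outfile.close()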

〃安静 2024-08-10 14:06:15


Why don't you just use sed?
cat giant.xml | sed 's/>/>\x0d\x0a/g' > giant-with-linebreaks.xml

做个ˇ局外人 2024-08-10 14:06:15


Rather than reading byte by byte, which incurs a disk access for each byte read, try reading ~20 MB at a time and doing your search + replace on that :)

You can probably do this in Notepad....

Billy3

余生再见 2024-08-10 14:06:15


For the type of problem you describe, I suspect the algorithm you employ for comparing the data will have a much more significant effect than the I/O model or language. In fact, string allocation and search may be more expensive here than anything else.

Some general suggestions before you write this yourself:

  1. Try running on a faster machine if you have one available. That will make a huge difference.
  2. Look for an existing tool online for doing XML diffs ... don't write one yourself.

If you are going to write this in C# (or Java or C/C++), I would do the following:

  1. Read a fairly large block into memory all at once (let's say between 200k and 1M)
  2. Allocate an empty output block that's three times that size (this assumes the worst case of every character being a '>', each of which gets a two-character CRLF appended)
  3. Copy from the input block to the output block conditionally appending a CRLF after each '>' character.
  4. Write the new block out to disk.
  5. Repeat until all the data has been processed.

Additionally, you could also write such a program to run on multiple threads, so that while one thread is performing CRLF insertions in memory, a separate thread is reading blocks in from disk. This type of parallelization is complicated ... so I would only do so if you really need maximum performance.

Here's a really simple C# program to get you started, if you need it. It accepts an input file path and an output path on the command line, and performs the substitution you are looking for ('>' ==> CRLF). This sample leaves much to be improved (parallel processing, streaming, some validation, etc)... but it should be a decent start.

using System;
using System.IO;

namespace ExpandBrackets
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length == 2)
            {
                using( StreamReader input = new StreamReader( args[0] ) )
                using( StreamWriter output = new StreamWriter( args[1] ) )
                {
                    int readSize = 0;
                    int blockSize = 100000;
                    char[] inBuffer = new char[blockSize];
                    char[] outBuffer = new char[blockSize*3];
                    while( ( readSize = input.ReadBlock( inBuffer, 0, blockSize ) ) > 0 )
                    {
                        int writeSize = TransformBlock( inBuffer, outBuffer, readSize );
                        output.Write( outBuffer, 0, writeSize );
                    }
                }
            }
            else
            {
                Console.WriteLine( "Usage:  repchar {inputfile} {outputfile}" );
            }
        }

        private static int TransformBlock( char[] inBuffer, char[] outBuffer, int size )
        {
            int j = 0;
            for( int i = 0; i < size; i++ )
            {
                outBuffer[j++] = inBuffer[i];
                if (inBuffer[i] == '>') // append CR LF
                {
                    outBuffer[j++] = '\r';
                    outBuffer[j++] = '\n';
                }
            }
            return j;
        }
    }
}

一江春梦 2024-08-10 14:06:15


All of the languages mentioned typically, at some point, revert to the C runtime library for byte by byte file access. Writing this in C will probably be the fastest option.

However, I doubt it will provide a huge speed boost. Python is fairly speedy, if you're doing things correctly.

The main way to really get a big speed improvement would be to introduce threading. If you read the data in from the file in a large block in one thread, and had a separate thread that did your newline processing + diff processing, you could dramatically improve the speed of this algorithm. This would probably be easier to implement in C++, C#, or IronPython than in C or CPython directly, since they provide very easy, high-level synchronization tools for handling the threading issues (especially when using .NET).
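
As a rough illustration of that pipeline, here is a minimal Python sketch: one thread reads large blocks from disk while the main thread inserts the newlines and writes them out. The names, the 1MB block size, and the queue depth are illustrative, and in CPython the gain comes mainly from overlapping I/O with processing rather than from true parallelism:

import threading
import queue

def read_blocks(path, q, block_size=1 << 20):
    ''' Producer: read large blocks from disk and hand them to the queue. '''
    with open(path, "r", encoding="utf-8") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            q.put(block)
    q.put(None)  # sentinel: end of input

def process_blocks(path, q):
    ''' Consumer: insert a newline after each ">" and write the result out. '''
    with open(path, "w", encoding="utf-8") as f:
        while True:
            block = q.get()
            if block is None:
                break
            f.write(block.replace(">", ">\n"))

def convert(inputfile, outputfile):
    q = queue.Queue(maxsize=4)  # bounded queue keeps memory use in check
    reader = threading.Thread(target=read_blocks, args=(inputfile, q))
    reader.start()
    process_blocks(outputfile, q)
    reader.join()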

静若繁花 2024-08-10 14:06:15


You could try xmldiff - http://msdn.microsoft.com/en-us/library/aa302294.aspx

I haven't used it for data this huge, but I think it would be reasonably optimized.

这样的小城市 2024-08-10 14:06:15


I put this as a comment on another answer, but in case you miss it--you might want to look at The Shootout. It's a highly optimized set of code for various problems in many languages.

According to those results, Python tends to be about 50x slower than C (but it is faster than the other interpreted languages). In comparison, Java is about 2x slower than C. If you went to one of the faster compiled languages, I don't see why you wouldn't see a similar increase.

By the way, the figures attained from the shootout are wonderfully un-assailable, you can't really challenge them, instead if you don't believe the numbers are fair because the code to solve a problem in your favorite language isn't optimized properly, then you can submit better code yourself. The act of many people doing this means most of the code on there is pretty damn optimized for every popular language. If you show them a more optimized compiler or interpreter, they may include the results from it as well.

Oh: except C#, which is only represented by Mono, so if Microsoft's compiler is more optimized, it's not shown. All the tests seem to run on Linux machines. My guess is Microsoft's C# should run at about the same speed as Java, but the shootout lists Mono as a bit slower (about 3x as slow as C).

草莓味的萝莉 2024-08-10 14:06:15


As others said, if you do it in C it will be pretty much unbeatable, because C buffers I/O, and getc() is inlined (as I recall).

Your real performance issue will be in the diff.

Maybe there's a pretty good one out there, but for files of that size I doubt it. For fun, I'm a do-it-yourselfer. The strategy I would use is to have a rolling window into each file, several megabytes long. The search strategy for mismatches is a diagonal search: if you are at lines i and j, compare in this sequence:

line(i+0) == line(j+0)

line(i+0) == line(j+1)
line(i+1) == line(j+0)

line(i+0) == line(j+2)
line(i+1) == line(j+1)
line(i+2) == line(j+0)

and so on. No doubt there's a better way, but if I'm going to code it myself and manage the rolling windows, that's what I'd try.
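
As a rough sketch of that diagonal comparison order, assuming the two windows are held as lists of lines and the current mismatch is at indices i and j (the function name and the search depth cap are illustrative):

def find_resync(left, right, i, j, max_depth=1000):
    ''' Scan diagonals of increasing size until a matching pair of lines is found. '''
    for depth in range(max_depth + 1):
        for k in range(depth + 1):
            ii, jj = i + k, j + (depth - k)
            if ii < len(left) and jj < len(right) and left[ii] == right[jj]:
                return ii, jj  # lines match again; everything before this is a difference
    return None  # no resync point found within the window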
