Finding the N longest lines in a file: how can this be done better?

Posted 2025-01-02 02:46:40

I was recently rejected by a potential employer after submitting this code. They suggested I wasn't technically capable enough. I'm wondering if someone could shed light on how to make this better/more efficient.

The question was to find the N longest lines in a file of multiple lines. This ultimately boiled down to a sorting problem, so I built an algorithm to find the N largest numbers in a list of numbers, like so:

def selection(numbers, n):

    maximum = []

    for x in range (0, n):

        maximum.append(numbers[x])
        ind = x

        for y in range ( x, len(numbers) ):
            if numbers[y] > maximum[len(maximum)-1]:
                maximum[len(maximum)-1] = numbers[y]
                numbers[ind], numbers[y] = numbers[y], numbers[ind]

    return maximum

This runs in O(n), unless N = n, in which case it runs in O(n^2). I was surprised to hear them doubt my technical abilities, so I thought I'd bring it to you, SO. How do I make this better?

EDIT: Thanks for the feedback. To clarify: I populated a list with the line-by-line word-counts from the file, and ran it through this function.

EDIT2: Some people mentioned syntax. I've only been doing Python for about a day or two. My employer suggested I write it in Python (and I mentioned that I didn't know Python), so I assumed small syntax errors and methods wouldn't be such an issue.

EDIT3: It turns out my initial reasoning about the selection sort was flawed. I had it in my head that a min-heap would be n log n, but I forgot that the average complexity of my code is n^2. Thanks for the help, everyone.
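
One way to see where the quadratic cost comes from, taking N as the number of values requested and n as the length of the list: on outer pass x the inner loop of `selection` runs over `range(x, len(numbers))`, which is n − x iterations, so the total number of comparisons is roughly:

```latex
\[
\sum_{x=0}^{N-1} (n - x) \;=\; N n - \frac{N(N-1)}{2}
\]
```

This grows like N·n, so it is linear in n only if N is treated as a small constant, and it becomes n(n+1)/2 when N = n.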

秋意浓 2025-01-09 02:46:40

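from heapq import nlargest

def longest_lines(n, filename):
    # Iterating over the open file yields one line at a time,
    # so the whole file is never held in memory at once.
    with open(filename) as f:
        return nlargest(n, f, key=len)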
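
As a small illustration of how `nlargest` behaves here, the sketch below feeds it an in-memory stand-in for a file (the `sample` data is made up for this example). It shows that the result comes back longest first, and that each line still carries its trailing newline, which `len` counts.

```python
from heapq import nlargest
from io import StringIO

# Hypothetical in-memory stand-in for an open text file.
sample = StringIO("aa\nbbbbb\nc\ndddd\n")

# Iterating a file-like object yields lines with their trailing newline,
# so key=len counts that newline as well.
top_two = nlargest(2, sample, key=len)

# Strip the newlines only for display.
result = [line.rstrip("\n") for line in top_two]
print(result)
```

Because every line ends with the same newline, the extra character does not change the ranking, except possibly for a final line that lacks one.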

Alright, addressing the comments below:

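def longest_lines(n, filename):
    heap = []  # min-heap ordered by line length, holding at most n lines
    with open(filename) as f:
        for ln in f:  # iterate over the file's lines, not the filename string
            push(heap, ln)
            if len(heap) > n:
                pop(heap)  # evict the shortest line seen so far
    return heap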

where push and pop are the good old min-heap insert and delete-min algorithms that can be found in any textbook (and that I never get right in one go, so I'm not posting them now), comparing lines by their length. This runs in O(N×lg(n)) time, where N is the number of lines in the file and n is the number of lines requested, and it uses O(n) temporary space.

Note that the resulting list is not sorted by length. To sort it, pop elements until the heap is empty and reverse the result.
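
For readers who want something runnable, here is a minimal sketch of the same idea, assuming the standard library's `heapq.heappush` and `heapq.heappop` as stand-ins for the textbook push and pop. The function name `longest_lines_sorted` and the choice to take any iterable of lines (rather than a filename) are assumptions made for this example. It also includes the pop-and-reverse step described above.

```python
from heapq import heappush, heappop

def longest_lines_sorted(n, lines):
    """Return the n longest items of `lines`, longest first."""
    heap = []  # min-heap of (length, line) pairs; the shortest kept line is at heap[0]
    for ln in lines:
        # Stand-in for push. Ties in length fall back to comparing the strings.
        heappush(heap, (len(ln), ln))
        if len(heap) > n:
            heappop(heap)  # stand-in for pop: drop the shortest line kept so far
    # Popping yields the shortest first, so reverse to get longest first.
    ordered = []
    while heap:
        ordered.append(heappop(heap)[1])
    ordered.reverse()
    return ordered

print(longest_lines_sorted(2, ["aa", "bbbbb", "c", "dddd"]))
```

To use it on a file, pass the open file object as `lines`, for example `with open(path) as f: longest_lines_sorted(5, f)`.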

掩于岁月 2025-01-09 02:46:40

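I would use a heap, but a min-heap, not a max-heap, which may seem counterintuitive. The heap holds the N longest lines seen so far, so the element you need quick access to is the shortest of them: it is the one to evict when a longer line arrives.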

Create an empty heap.
For each line, 
  if there are fewer than N lines in the heap, add the current line;
  otherwise,
    if the current line is longer than the minimum element in the heap,
      remove the minimum element from the heap, and
      add the current line to the heap.
Return the contents of the heap.
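
The steps above can be sketched in Python roughly as follows, assuming the standard library's `heapq` module for the heap operations. The function name `n_longest` and the `(length, line)` tuple layout are choices made for this example, not part of the answer.

```python
from heapq import heappush, heapreplace

def n_longest(lines, n):
    """Return the n longest items of `lines`, in no particular order."""
    heap = []  # min-heap of (length, line); heap[0] is the shortest kept line
    for ln in lines:
        item = (len(ln), ln)
        if len(heap) < n:
            heappush(heap, item)
        elif heap and item[0] > heap[0][0]:
            # Remove the minimum and add the current line in one step.
            heapreplace(heap, item)
    return [ln for _, ln in heap]

print(sorted(n_longest(["aa", "bbbbb", "c", "dddd"], 2)))
```

Checking against the current minimum first means a line that cannot make the cut costs only one comparison, with no heap operation at all.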
