从文件中查找 N 个最大的行:如何做得更好?
最近,我在提交此代码后被潜在雇主拒绝。他们认为我技术能力不够。我想知道是否有人可以阐明如何使其更好/更高效。
问题是从多行文件中找到最长的 N 行。这最终归结为一个排序问题,因此我构建了一个算法来从数字列表中查找 N 个最大数字,如下所示:
def selection(numbers, n):
maximum = []
for x in range (0, n):
maximum.append(numbers[x])
ind = x
for y in range ( x, len(numbers) ):
if numbers[y] > maximum[len(maximum)-1]:
maximum[len(maximum)-1] = numbers[y]
numbers[ind], numbers[y] = numbers[y], numbers[ind]
return maximum
运行时间为 O(n),除非 N = n,其中其运行时间为O(n^2)。我很惊讶听到他们怀疑我的技术能力,所以我想我会把它带给你。我该如何改善?
编辑:感谢您的反馈。澄清一下:我用文件中的逐行字数填充了一个列表,并通过此函数运行它。
编辑2:有些人提到了语法。我只用了大约一两天的Python。我的雇主建议我用 Python 编写它(我提到我不懂 Python),所以我认为小的语法错误和方法不会是这样的问题。
EDIT3:结果是我最初的选择排序的推理存在缺陷。我脑子里认为最小堆是 nlogn,但我忘记了我的代码的平均复杂度是 n^2。感谢大家的帮助。
I was recently rejected from a potential employer after submitting this code. They suggested I wasn't technically capable enough. I'm wondering if someone could shed light on to how to make this better/more efficient.
The question was to find the N longest lines from a file of multiple lines. This ultimately boiled down to a sorting problem, so I built an algorithm to find the N largest numbers from a list of numbers as so:
def selection(numbers, n):
maximum = []
for x in range (0, n):
maximum.append(numbers[x])
ind = x
for y in range ( x, len(numbers) ):
if numbers[y] > maximum[len(maximum)-1]:
maximum[len(maximum)-1] = numbers[y]
numbers[ind], numbers[y] = numbers[y], numbers[ind]
return maximum
This runs in O(n), unless N = n, in which case it runs in O(n^2). I was surprised to hear them doubt my technical abilities, so I thought I'd bring it to you SO. How do I make this better?
EDIT: Thanks for the feedback. To clarify: I populated a list with the line-by-line word-counts from the file, and ran it through this function.
EDIT2: Some people mentioned syntax. I've only been doing Python for about a day or two. My employer suggested I write it in Python (and I mentioned that I didn't know Python), so I assumed small syntax errors and methods wouldn't be such an issue.
EDIT3: Turns out my initial reasoning was flawed with the selection sort. I had it in my head that a min-heap would be nlogn, but I forgot that the average complexity for my code is n^2. Thanks for the help everyone.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
好吧,解决下面的评论:
其中
push
和pop
是很好的旧最小堆插入和删除最小算法,可以在任何教科书中找到(而且我从来没有一口气完成,所以我现在不发布它们),按长度比较行。运行时间为 O(N×lg(n)),其中 N 是文件中的行数,消耗 O(n)) 时间>n) 临时空间。请注意,结果列表不是按长度排序的,而是可以通过弹出元素直到堆为空并反转结果来完成添加。
Alright, addressing the comments below:
where
push
andpop
are the good old min-heap insert and delete-min algorithms that can be found in any textbook (and that I never get right in one go, so I'm not posting them now), comparing lines by their length. This runs in O(N×lg(n)) time where N is the number of lines in the file, consuming O(n) temporary space.Note that the resulting list is not sorted by length, but adding that can be done by popping the elements until the heap is empty and reversing the result of that.
我会使用堆,但使用最小堆,而不是最大堆,这似乎违反直觉。
I would use a heap, but a min-heap, not a max-heap, which may seem counterintuitive.
这归结为排序吗?如果您有一个可以容纳 N 行的列表,则可以保留几个整数来保存迄今为止找到的最短行的长度及其在列表中的索引。如果读取到的任何行比该行长,则只需迭代列表一次即可找到新的最短行并替换旧的最短行。
Does this boil down to sorting? If you have a list that can hold N lines, you could keep a couple of ints that hold the length of the shortest line so far found and its index in the list. If any line is read that is longer than this line, you only need to iterate the list once to find the new shortest line and also to replace the old shortest line.