Downsampling the number of entries in a list (without interpolation)

Posted on 2024-09-05 01:23:29

I have a Python list with a number of entries, which I need to downsample using either:

  • A maximum number of rows. For example, limiting a list of 1234 entries to 1000.
  • A proportion of the original rows. For example, making the list 1/3 its original length.

(I need to be able to do both ways, but only one is used at a time).

I believe that for the maximum number of rows I can just calculate the proportion needed and pass that to the proportional downsizer:

def downsample_to_max(self, rows, max_rows):
    return self.downsample_to_proportion(rows, max_rows / float(len(rows)))

...so I really only need one downsampling function. Any hints, please?

EDIT: The list contains objects, not numeric values so I do not need to interpolate. Dropping objects is fine.

SOLUTION:

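def downsample_to_proportion(self, rows, proportion):
    counter = 0.0
    last_counter = None  # None never equals an int, so the first row is always kept
    results = []

    for row in rows:
        counter += proportion
        # Keep a row each time the running total reaches a new whole number.
        if int(counter) != last_counter:
            results.append(row)
            last_counter = int(counter)

    return results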

Thanks.
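
One thing worth checking in the counter-based solution above is the size of the result: the first row is always kept, and the running total is a float, so the output may come out one row longer than requested. The following is a minimal sketch of an index-based alternative, not the original poster's method. The helper name `downsample_to_count` is hypothetical, the functions are written as plain functions rather than methods, and the proportion is rounded to the nearest whole row count as an assumption.

```python
def downsample_to_count(rows, count):
    """Return `count` evenly spaced items from `rows`, in original order."""
    n = len(rows)
    if count >= n:
        return list(rows)
    if count <= 0:
        return []
    # Integer arithmetic avoids accumulated floating-point error;
    # because n > count, the chosen indices are strictly increasing.
    return [rows[i * n // count] for i in range(count)]


def downsample_to_proportion(rows, proportion):
    # Assumption: round to the nearest whole number of rows.
    return downsample_to_count(rows, round(len(rows) * proportion))


def downsample_to_max(rows, max_rows):
    return downsample_to_count(rows, min(max_rows, len(rows)))


print(downsample_to_count(list(range(10)), 4))  # [0, 2, 5, 7]
```

With this shape, the maximum-rows case and the proportion case both reduce to a target count, so the 1234-to-1000 example returns exactly 1000 rows.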

Comments (6)

风轻花落早 2024-09-12 01:23:29

You can use islice from itertools:

from itertools import islice

def downsample_to_proportion(rows, proportion=1):
    return list(islice(rows, 0, len(rows), int(1/proportion)))

Usage:

x = list(range(1, 10))
print(downsample_to_proportion(x, 0.3))
# [1, 4, 7]
时光是把杀猪刀 2024-09-12 01:23:29

Instead of islice() + list() it is more efficient to use slice syntax directly if the input is already a sequence type:

def downsample_to_proportion(rows, proportion):
    return rows[::int(1 / proportion)]
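
A caveat on the stride-based answers: `int(1 / proportion)` truncates to a whole step, so the result only matches the requested size when the proportion is close to 1/n. The short check below illustrates this with the 1234-to-1000 case from the question; the function name `stride_downsample` is just a label for this demonstration.

```python
def stride_downsample(rows, proportion):
    # Same technique as the answers above: keep every step-th item.
    step = int(1 / proportion)
    return rows[::step]


rows = list(range(1234))

# A proportion of 1/3 gives step 3, keeping 412 of the 1234 rows.
third = stride_downsample(rows, 1 / 3)

# A proportion of 1000/1234 gives step int(1.234) == 1, so nothing is dropped.
capped = stride_downsample(rows, 1000 / 1234)

print(len(third), len(capped))  # 412 1234
```

Any proportion above 0.5 truncates to a step of 1, so a fixed stride cannot limit a list to a maximum that is more than half its length.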
高速公鹿 2024-09-12 01:23:29

This solution might be a bit overkill for the original poster, but I thought I would share the code that I've been using to solve this and similar problems.

It's a bit lengthy (about 90 lines), but if you often have this need, want an easy-to-use one-liner, and need a pure-Python, dependency-free solution, then I reckon it might be of use.

Basically, the only thing you have to do is pass your list to the function and tell it what length you want your new list to be, and the function will either:

  • downsize your list by dropping items if the new length is smaller, much like the previous answers already suggested.
  • stretch/upscale your list (the opposite of downsizing) if the new length is larger, with the added option that you can decide whether to:
    • linearly interpolate between the known values (automatically chosen if list contains ints or floats)
    • duplicate each value so they occupy a proportional size of the new list (automatically chosen if the list contains non-numbers)
    • pull the original values apart and leave gaps in between

Everything is collected inside one function so if you need it just copy and paste it to your script and you can start using it right away.

For instance you might say:

origlist = [0,None,None,30,None,50,60,70,None,None,100]
resizedlist = ResizeList(origlist, 21)
print(resizedlist)

and get

[0, 5.00000000001, 9.9999999999900009, 15.0, 20.000000000010001, 24.999999999989999, 30, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70, 75.000000000010004, 79.999999999989996, 85.0, 90.000000000010004, 94.999999999989996, 100]

Note that minor inaccuracies will occur due to floating point limitations. Also, I wrote this for Python 2.x, so to use it on Python 3.x just add a single line that says xrange = range.

And here is a nifty trick to interpolate between corresponding subitems in a list of lists. For instance, you can easily interpolate between RGB color tuples to create a color gradient with a given number of steps. Assuming a list of RGB color 3-tuples named rgbtuples and a desired GRADIENTLENGTH variable, you do this with:

crosssections = zip(*rgbtuples)
grad_crosssections = ( ResizeList(spectrum,GRADIENTLENGTH) for spectrum in crosssections )
rgb_gradient = [list(each) for each in zip(*grad_crosssections)]
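
To try the transpose trick on its own, here is a self-contained sketch. The helper `lerp_channel` is a hypothetical stand-in for ResizeList that only does plain linear interpolation (no gap handling), and the two-color input and gradient length are made-up sample values.

```python
def lerp_channel(values, length):
    """Hypothetical stand-in for ResizeList: linear interpolation only.

    Assumes len(values) >= 2 and length >= 2.
    """
    last = len(values) - 1
    out = []
    for i in range(length):
        pos = i * last / (length - 1)   # fractional index into values
        lo = min(int(pos), last - 1)    # clamp so lo + 1 stays in range
        frac = pos - lo
        out.append(values[lo] + (values[lo + 1] - values[lo]) * frac)
    return out


rgbtuples = [(255, 0, 0), (0, 0, 255)]  # sample input: red to blue
GRADIENTLENGTH = 5

# Transpose to per-channel sequences, resize each, then transpose back.
crosssections = zip(*rgbtuples)
grad_crosssections = (lerp_channel(spectrum, GRADIENTLENGTH) for spectrum in crosssections)
rgb_gradient = [list(each) for each in zip(*grad_crosssections)]

print(rgb_gradient[2])  # [127.5, 0.0, 127.5]
```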

It could probably use quite a few optimizations; I had to do quite a bit of experimentation. If you feel you can improve it, feel free to edit my post. Here is the code:

def ResizeList(rows, newlength, stretchmethod="not specified", gapvalue=None):
    """
    Resizes (up or down) and returns a new list of a given size, based on an input list.
    - rows: the input list, which can contain any type of value or item (except if using the interpolate stretchmethod which requires floats or ints only)
    - newlength: the new length of the output list (if this is the same as the input list then the original list will be returned immediately)
    - stretchmethod: if the list is being stretched, this decides how to do it. Valid values are:
      - 'interpolate'
        - linearly interpolate between the known values (automatically chosen if list contains ints or floats)
      - 'duplicate'
        - duplicate each value so they occupy a proportional size of the new list (automatically chosen if the list contains non-numbers)
      - 'spread'
        - drags the original values apart and leaves gaps as defined by the gapvalue option
    - gapvalue: a value that will be used as gaps to fill in between the original values when using the 'spread' stretchmethod
    """
    #return input as is if no difference in length
    if newlength == len(rows):
        return rows
    #set auto stretchmode
    if stretchmethod == "not specified":
        if isinstance(rows[0], (int,float)):
            stretchmethod = "interpolate"
        else:
            stretchmethod = "duplicate"
    #reduce newlength 
    newlength -= 1
    #assign first value
    outlist = [rows[0]]
    writinggapsflag = False
    if rows[1] == gapvalue:
        writinggapsflag = True
    relspreadindexgen = (index/float(len(rows)-1) for index in xrange(1,len(rows))) #warning a little hacky by skipping first index cus is assigned auto
    relspreadindex = next(relspreadindexgen)
    spreadflag = False
    gapcount = 0
    for outlistindex in xrange(1, newlength):
        #relative positions
        rel = outlistindex/float(newlength)
        relindex = (len(rows)-1) * rel
        basenr,decimals = str(relindex).split(".")
        relbwindex = float("0."+decimals)
        #determine equivalent value
        if stretchmethod=="interpolate":
            #test for gap
            maybecurrelval = rows[int(relindex)]
            maybenextrelval = rows[int(relindex)+1]
            if maybecurrelval == gapvalue:
                #found gapvalue, so skipping and waiting for valid value to interpolate and add to outlist
                gapcount += 1
                continue
            #test whether to interpolate for previous gaps
            if gapcount > 0:
                #found a valid value after skipping gapvalues so this is where it interpolates all of them from last valid value to this one
                startvalue = outlist[-1]
                endindex = int(relindex)
                endvalue = rows[endindex]
                gapstointerpolate = gapcount 
                allinterpolatedgaps = ResizeList([startvalue,endvalue],gapstointerpolate+3)
                outlist.extend(allinterpolatedgaps[1:-1])
                gapcount = 0
                writinggapsflag = False
            #interpolate value
            currelval = rows[int(relindex)]
            lookahead = 1
            nextrelval = rows[int(relindex)+lookahead]
            if nextrelval == gapvalue:
                if writinggapsflag:
                    continue
                relbwval = currelval
                writinggapsflag = True
            else:
                relbwval = currelval + (nextrelval - currelval) * relbwindex #basenr pluss interindex percent interpolation of diff to next item
        elif stretchmethod=="duplicate":
            relbwval = rows[int(round(relindex))] #no interpolation possible, so just copy each time
        elif stretchmethod=="spread":
            if rel >= relspreadindex:
                spreadindex = int(len(rows)*relspreadindex)
                relbwval = rows[spreadindex] #spread values further apart so as to leave gaps in between
                relspreadindex = next(relspreadindexgen)
            else:
                relbwval = gapvalue
        #assign each value
        outlist.append(relbwval)
    #assign last value
    if gapcount > 0:
        #this last value also has to interpolate for previous gaps       
        startvalue = outlist[-1]
        endvalue = rows[-1]
        gapstointerpolate = gapcount 
        allinterpolatedgaps = ResizeList([startvalue,endvalue],gapstointerpolate+3)
        outlist.extend(allinterpolatedgaps[1:-1])
        outlist.append(rows[-1])
        gapcount = 0
        writinggapsflag = False
    else:
        outlist.append(rows[-1])
    return outlist
瑶笙 2024-09-12 01:23:29

Keep a counter, which you increment by the second value. Floor it each time, and yield the value at that index.

眼泪都笑了 2024-09-12 01:23:29

Can't random.choices() solve your problem?
More examples are available here
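
Note that `random.choices()` draws with replacement, so the same row can appear more than once and the original order is lost. If a random subset is acceptable, a sketch along the following lines may fit better: it uses `random.sample()` on the indices and sorts them to keep the original order. The function name `random_downsample` and the `seed` parameter are hypothetical, added here only for illustration.

```python
import random


def random_downsample(rows, count, seed=None):
    """Keep `count` randomly chosen rows, without repeats, in original order."""
    rng = random.Random(seed)          # seed only makes runs reproducible
    count = min(count, len(rows))      # never ask for more rows than exist
    keep = sorted(rng.sample(range(len(rows)), count))
    return [rows[i] for i in keep]


subset = random_downsample(list("abcdefghij"), 4, seed=1)
print(subset)
```

Unlike the stride or counter approaches, the kept rows are not evenly spaced, which may or may not matter depending on the data.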

何止钟意 2024-09-12 01:23:29

With reference to the answer from Ignacio Vazquez-Abrams:

Print 3 numbers from the 7 available:

import math

msg_cache = [0, 1, 2, 3, 4, 5, 6]
msg_n = 3
inc = len(msg_cache) / msg_n
inc_total = 0
for _ in range(0, msg_n):
    msg_downsampled = msg_cache[math.floor(inc_total)]
    print(msg_downsampled)
    inc_total += inc

Output:

0
2
4

Useful for down-sampling many log messages to a smaller subset.
