String vs. integer as a map key for memory utilization in Golang?

Posted 2025-01-28 15:03:07

I have the read function below, which is called by multiple goroutines to read S3 files. It populates two concurrent maps as shown.

  • During server startup, it calls the read function to populate the two concurrent maps.
  • Every 30 seconds after that, it calls the read function again to read new S3 files and update the two concurrent maps with the new data.

So at any given time during the lifecycle of this app, both concurrent maps hold some data, and both are periodically updated.

func (r *clientRepository) read(file string, bucket string) error {
    var err error
    //... read s3 file

    for {
        rows, err := pr.ReadByNumber(r.cfg.RowsToRead)
        if err != nil {
            return errs.Wrap(err)
        }
        if len(rows) <= 0 {
            break
        }

        byteSlice, err := json.Marshal(rows)
        if err != nil {
            return errs.Wrap(err)
        }
        var productRows []ParquetData
        err = json.Unmarshal(byteSlice, &productRows)
        if err != nil {
            return errs.Wrap(err)
        }

        for i := range productRows {
            var flatProduct definitions.CustomerInfo
            err = r.ConvertData(spn, &productRows[i], &flatProduct)
            if err != nil {
                return errs.Wrap(err)
            }

            // populate first concurrent map here
            r.products.Set(strconv.FormatInt(flatProduct.ProductId, 10), &flatProduct)
            for _, catalogId := range flatProduct.Catalogs {
                strCatalogId := strconv.FormatInt(int64(catalogId), 10)
                // upsert second concurrent map here
                r.productCatalog.Upsert(strCatalogId, flatProduct.ProductId, func(exists bool, valueInMap interface{}, newValue interface{}) interface{} {
                    productID := newValue.(int64)
                    if valueInMap == nil {
                        return map[int64]struct{}{productID: {}}
                    }
                    oldIDs := valueInMap.(map[int64]struct{})
                    // value is irrelevant, no need to check if key exists
                    oldIDs[productID] = struct{}{}
                    return oldIDs
                })
            }
        }
    }
    return nil
}

In the code above, flatProduct.ProductId and catalogId are integers, but I convert them to strings (strCatalogId for the catalog) because the concurrent map works with string keys only. My main application threads then use the three functions below to get data from the concurrent maps populated above.

func (r *clientRepository) GetProductMap() *cmap.ConcurrentMap {
    return r.products
}

func (r *clientRepository) GetProductCatalogMap() *cmap.ConcurrentMap {
    return r.productCatalog
}

func (r *clientRepository) GetProductData(pid string) *definitions.CustomerInfo {
    pd, ok := r.products.Get(pid)
    if ok {
        return pd.(*definitions.CustomerInfo)
    }
    return nil
}

My use case is to populate the maps from multiple goroutines and then read from them in a bunch of main application threads, so the maps need to be thread-safe and fast, without much locking.

Problem Statement

I am dealing with a lot of data: around 30-40 GB from all these files, which I read into memory. The concurrent map solves most of my concurrency issues, but its key is a string and it has no implementation where the key can be an integer. In my case the key is just a product ID, which fits in an int32, so is it worth storing all those product IDs as strings in this concurrent map? I think allocating strings takes more memory than storing the keys as integers. At least it does in C/C++, so I assume the same holds in Golang.

Is there anything I can improve here with respect to map usage, so that I reduce memory utilization without losing performance when reading from these maps in the main threads?

I am using the concurrent map from this repo, which doesn't have an implementation for integer keys.

Update

I am trying out cmap_int in my code.

type clientRepo struct {
    customers       *cmap.ConcurrentMap
    customersCatalog *cmap.ConcurrentMap
}

func NewClientRepository(logger log.Logger) (ClientRepository, error) {
  // ....
    customers := cmap.New[string]()
    customersCatalog := cmap.New[string]()

    r := &clientRepo{
        customers:       &customers,
        customersCatalog: &customersCatalog,
    }

  // ....
    return r, nil
}

But I am getting this error:

Cannot use '&products' (type *ConcurrentMap[V]) as the type *cmap.ConcurrentMap

What do I need to change in my clientRepo struct so that it works with the new version of the concurrent map, which uses generics?

Comments (2)

带刺的爱情 2025-02-04 15:03:07

I don't know the implementation details of the concurrent map in Go, but if it uses a string as a key, I'm guessing that behind the scenes it stores both the string and a hash of the string (which is used for the actual indexing operations).

That is going to be something of a memory hog, and nothing can be done about it as long as the concurrent map accepts only strings as keys.

If there were some sort of map that did use integers, it would likely be using hashes of those integers anyway. A smooth hash distribution is necessary for good, uniform lookup performance when the key data itself is not uniformly distributed. It's almost like you need a very simple map implementation!

I'm wondering if a simple array would do, if your product IDs fit within 32 bits (or can be munged to do so, or down to some other acceptable integer length). Yes, that way you'd have a large amount of memory allocated, possibly with large tracts unused. However, indexing is super-rapid, and the OS's virtual memory subsystem would ensure that areas of the array you never index aren't swapped in. Caveat: I'm thinking very much in terms of C and fixed-size objects here, less so Go, so this may be a bogus suggestion.

To persevere: so long as nothing about the array implies initialisation-on-allocation (e.g. in C the array wouldn't get initialised by the compiler), allocation doesn't automatically mean it's all in memory at once, and only the most commonly used areas of the array will be in RAM, courtesy of the OS's virtual memory subsystem.

EDIT

You could have a map of arrays, where each array covers a range of product IDs. This would come close to the same effect, trading storage of hashes and strings for storage of null references. If product IDs are clumped in some structured way, this could work well.

Also, just a thought, and I'm showing a total lack of knowledge of Go here: does Go store objects by reference? In that case, wouldn't an array of objects actually be an array of references (so, fixed in size), with the actual objects allocated only as needed (i.e. much of the array is null references)? That doesn't sound good for my one-big-array suggestion...

天赋异禀 2025-02-04 15:03:07

The library you use is relatively simple, and you could just replace every string key with int32 (and modify the hashing function); it will still work fine.

I ran a tiny (and not that rigorous) benchmark against the replaced version:

$ go test -bench=. -benchtime=10x -benchmem
goos: linux
goarch: amd64
pkg: maps
BenchmarkCMapAlloc-4             10  174272711 ns/op  49009948 B/op    33873 allocs/op
BenchmarkCMapAllocSS-4           10  369259624 ns/op 102535456 B/op  1082125 allocs/op
BenchmarkCMapUpdateAlloc-4       10  114794162 ns/op         0 B/op        0 allocs/op
BenchmarkCMapUpdateAllocSS-4     10  192165246 ns/op  16777216 B/op  1048576 allocs/op
BenchmarkCMap-4                  10 1193068438 ns/op      5065 B/op       41 allocs/op
BenchmarkCMapSS-4                10 2195078437 ns/op 536874022 B/op 33554471 allocs/op

Benchmarks with an SS suffix are the original string version. So using integers as keys takes less memory and runs faster, as anyone would expect. The string version allocates about 50 more bytes per insertion. (This is not the actual memory usage, though.)

Basically, a string in Go is just a struct:

type stringStruct struct {
    str unsafe.Pointer
    len int
}

So on a 64-bit machine, storing a string takes at least 8 bytes (pointer) + 8 bytes (length) + len(underlying bytes) bytes. Turning it into an int32 or int64 will definitely save memory. However, I assume that CustomerInfo and the catalog sets take the most memory, so I don't think there will be a great improvement.

(By the way, tuning the SHARD_COUNT in the library might also help a bit.)
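For readers who want to see the shape of such an integer-keyed variant, here is a minimal sketch of a sharded map with int32 keys. It is not the library's code: IntMap, shardCount, and the multiplicative hash are assumptions chosen for illustration, and it needs Go 1.18 or later for generics.

```go
package main

import (
	"fmt"
	"sync"
)

// shardCount is a hypothetical value; it plays the role of the
// library's SHARD_COUNT.
const shardCount = 32

// shard is one independently locked slice of the key space.
type shard[V any] struct {
	mu    sync.RWMutex
	items map[int32]V
}

// IntMap is a sharded map keyed by int32 (illustrative, not the library's type).
type IntMap[V any] struct {
	shards [shardCount]*shard[V]
}

func NewIntMap[V any]() *IntMap[V] {
	m := &IntMap[V]{}
	for i := range m.shards {
		m.shards[i] = &shard[V]{items: make(map[int32]V)}
	}
	return m
}

// shardFor mixes the key bits with a multiplicative hash so that
// clustered IDs still spread across shards.
func (m *IntMap[V]) shardFor(key int32) *shard[V] {
	h := uint32(key) * 2654435761
	return m.shards[h%shardCount]
}

func (m *IntMap[V]) Set(key int32, v V) {
	s := m.shardFor(key)
	s.mu.Lock()
	s.items[key] = v
	s.mu.Unlock()
}

func (m *IntMap[V]) Get(key int32) (V, bool) {
	s := m.shardFor(key)
	s.mu.RLock()
	v, ok := s.items[key]
	s.mu.RUnlock()
	return v, ok
}

// Upsert runs fn under the shard lock and stores its result.
func (m *IntMap[V]) Upsert(key int32, fn func(exists bool, old V) V) {
	s := m.shardFor(key)
	s.mu.Lock()
	old, ok := s.items[key]
	s.items[key] = fn(ok, old)
	s.mu.Unlock()
}

func main() {
	m := NewIntMap[string]()
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				id := int32(w*1000 + i)
				m.Set(id, fmt.Sprintf("product-%d", id))
			}
		}(w)
	}
	wg.Wait()
	v, ok := m.Get(2500)
	fmt.Println(v, ok)
}
```

Readers take only a shard's read lock, so lookups on different shards never contend, and writers block just the one shard they touch. Raising shardCount reduces contention at the cost of more small maps.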
