Cassandra: buffered read of millions of columns

Posted on 2024-11-03 02:28:43

I've got a cassandra cluster with a small number of rows (< 100). Each row has about 2 million columns. I need to get a full row (all 2 million columns), but things start failing all over the place before I can finish my read. I'd like to do some kind of buffered read.

Ideally I'd like to do something like this using Pycassa (no, this isn't the proper way to call get; it's just so you can get the idea):

results = {}
start = 0
while True:
    # Fetch blocks of size 500
    buffer = column_family.get(key, column_offset=start, column_count=500)
    if len(buffer) == 0:
        break

    # Merge these results into the main one
    results.update(buffer)

    # Update the offset
    start += len(buffer)

Pycassa (and by extension Cassandra) doesn't let you do this. Instead you need to specify column names for column_start and column_finish. This is a problem since I don't actually know what the start or end column names will be. The special value "" can indicate the start or end of the row, but that doesn't work for any of the values in the middle.

So how can I accomplish a buffered read of all the columns in a single row? Thanks.

Comments (2)

少女七分熟 2024-11-10 02:28:43

According to the pycassa 1.0.8 documentation, it appears that you could use something like the following:

results = {}
startColumn = ""
while True:
    # Fetch blocks of up to 100 columns, beginning at startColumn
    buffer = column_family.get(key, column_start=startColumn, column_finish="", column_count=100)
    # Merge the block; the repeated startColumn just overwrites itself
    results.update(buffer)
    # Fewer than 100 columns means the end of the row has been reached
    if len(buffer) < 100:
        break
    # Resume from the last column name returned in this block
    startColumn = list(buffer.keys())[-1]

Remember that on each subsequent call you only get 99 new results, because the call also returns startColumn, which you have already seen. The column names are the keys of buffer, so the last key of one block becomes the startColumn of the next.
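
The paging idea in this answer can be tried without a cluster. Below is a minimal sketch, assuming a hypothetical in-memory stand-in (FakeColumnFamily) whose get mimics only the column_start / column_count behaviour described above. It is not pycassa itself, and read_row is an illustrative name rather than a library function.

```python
from bisect import bisect_left


class FakeColumnFamily:
    """Hypothetical in-memory stand-in for a column family (not pycassa)."""

    def __init__(self, rows):
        # rows: {key: {column_name: value}}
        self.rows = rows

    def get(self, key, column_start="", column_count=100):
        # Return up to column_count columns with name >= column_start,
        # in sorted column-name order (the start column is included).
        names = sorted(self.rows[key])
        first = bisect_left(names, column_start)
        page = names[first:first + column_count]
        return {name: self.rows[key][name] for name in page}


def read_row(column_family, key, page_size=100):
    """Read a whole row page by page, resuming from the last column seen."""
    if page_size < 2:
        # With a page size of 1 every page would only repeat the start column.
        raise ValueError("page_size must be at least 2")
    results = {}
    start_column = ""
    while True:
        page = column_family.get(key, column_start=start_column,
                                 column_count=page_size)
        results.update(page)  # the repeated start column overwrites itself
        if len(page) < page_size:
            break  # a short page means the end of the row
        start_column = list(page)[-1]  # last column name in this page
    return results


cf = FakeColumnFamily({"row1": {"col%05d" % i: i for i in range(1234)}})
row = read_row(cf, "row1", page_size=100)
print(len(row))
```

Because each page after the first repeats one column, a row of N columns takes roughly N / (page_size - 1) calls, so a larger page size reduces round trips at the cost of bigger responses.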

网名女生简单气质 2024-11-10 02:28:43

In pycassa v1.7.1+ you can use xget, which pages through the columns for you, and read a row as wide as 2**63-1 columns.

for col in cf.xget(key, column_count=2**63-1):
    # do something with the column; each col is a (column name, value) pair
    print(col)
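
Since a generator hands columns over one at a time, a 2-million-column row does not have to be collected into a single dict before any work starts. The sketch below shows one way to process such a stream in fixed-size batches. It assumes the generator yields (column name, value) pairs, and it uses a hypothetical fake_xget stand-in instead of a real cluster; in_batches is an illustrative helper, not part of pycassa.

```python
from itertools import islice


def fake_xget(n):
    """Hypothetical stand-in for a lazy column generator (not pycassa)."""
    for i in range(n):
        yield ("col%07d" % i, i)


def in_batches(columns, size):
    """Yield lists of at most `size` items from any iterator of columns."""
    iterator = iter(columns)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return  # the underlying iterator is exhausted
        yield batch


total = 0
batches = 0
for batch in in_batches(fake_xget(2500), 1000):
    batches += 1
    # Only one batch is held in memory at a time.
    total += sum(value for _name, value in batch)
print(batches, total)
```

With a real column family, the same helper would presumably wrap the generator returned by xget in place of fake_xget.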