Cassandra: buffered read of millions of columns
I've got a cassandra cluster with a small number of rows (< 100). Each row has about 2 million columns. I need to get a full row (all 2 million columns), but things start failing all over the place before I can finish my read. I'd like to do some kind of buffered read.
Ideally I'd like to do something like this using Pycassa (no, this isn't the proper way to call get, it's just so you can get the idea):
    results = {}
    start = 0
    while True:
        # Fetch blocks of size 500
        buffer = column_family.get(key, column_offset=start, column_count=500)
        if len(buffer) == 0:
            break
        # Merge these results into the main one
        results.update(buffer)
        # Update the offset
        start += len(buffer)
Pycassa (and by extension Cassandra) doesn't let you do this. Instead, you need to specify column names for column_start and column_finish. This is a problem since I don't actually know what the start or end column names will be. The special value "" can indicate the start or end of the row, but that doesn't work for any of the values in the middle.
So how can I accomplish a buffered read of all the columns in a single row? Thanks.
Comments (2)
From the pycassa 1.0.8 documentation, it would appear that you could page through the row by calling get repeatedly with a column_count of 100, passing the last column name you received as column_start (startColumn) for the next call.
Remember that on each subsequent call you'll only get 99 new results, because the call also returns startColumn, which you've already seen. I'm not skilled enough in Python yet to iterate on buffer to extract the column names.
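The paging idea in this answer could be sketched roughly as follows. This is only a sketch under assumptions: `FakeColumnFamily` is a hypothetical in-memory stand-in so the example runs without a cluster, and `read_row_paged` is an illustrative helper name, not part of pycassa. A real pycassa `ColumnFamily.get` accepts the same `column_start` and `column_count` keyword arguments, but it raises `NotFoundException` for an empty row instead of returning an empty dict, and column ordering depends on the column family's comparator.

```python
from collections import OrderedDict


class FakeColumnFamily:
    """Hypothetical in-memory stand-in for a column family (not pycassa).

    get() mimics a column slice: columns are sorted by name, column_start
    is inclusive, and at most column_count columns are returned.
    """

    def __init__(self, rows):
        self.rows = {k: sorted(v.items()) for k, v in rows.items()}

    def get(self, key, column_start="", column_count=100):
        cols = [(n, v) for n, v in self.rows[key] if n >= column_start]
        return OrderedDict(cols[:column_count])


def read_row_paged(cf, key, page_size=500):
    """Read every column of one row, page_size columns per request."""
    if page_size < 2:
        # With a page of 1 the inclusive start column would be the only
        # result each time and the loop would never advance.
        raise ValueError("page_size must be at least 2")
    results = OrderedDict()
    start = ""  # "" means "from the beginning of the row"
    while True:
        page = cf.get(key, column_start=start, column_count=page_size)
        results.update(page)  # the repeated start column is simply overwritten
        if len(page) < page_size:
            break  # a short page means the end of the row was reached
        # Last column name seen; it is returned again by the next call.
        start = list(page.keys())[-1]
    return results


row = {"col%07d" % i: i for i in range(1234)}
cf = FakeColumnFamily({"row1": row})
columns = read_row_paged(cf, "row1", page_size=100)
print(len(columns))
```

Taking the last key of each page answers the "how do I extract the column names" part: the dict returned by `get` is ordered, so its final key is the column to resume from.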
In v1.7.1+ of pycassa you can use xget and get a row as wide as 2**63-1 columns.
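As a rough illustration of consuming such a generator without holding the whole row in memory, the sketch below groups the yielded columns into fixed-size batches. The names `batches` and `fake_xget` are hypothetical: `fake_xget` only imitates the (name, value) pairs that `xget` yields, so the example runs without a cluster. The commented lines show how the real call might look; the keyspace, server, and column family names there are placeholders.

```python
from itertools import islice


def batches(column_iter, batch_size):
    """Yield lists of up to batch_size (name, value) pairs from an iterator."""
    it = iter(column_iter)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def fake_xget(n):
    # Hypothetical stand-in for column_family.xget(key): lazily yields
    # (column_name, column_value) pairs in column order.
    for i in range(n):
        yield ("col%07d" % i, i)


# With a real cluster (pycassa 1.7.1+), something along these lines:
#   from pycassa.pool import ConnectionPool
#   from pycassa.columnfamily import ColumnFamily
#   pool = ConnectionPool("Keyspace1", ["localhost:9160"])  # placeholders
#   column_family = ColumnFamily(pool, "ColumnFamily1")     # placeholder
#   source = column_family.xget(key, buffer_size=500)
source = fake_xget(1050)

batch_sizes = []
for batch in batches(source, 500):
    # Process each batch here instead of accumulating the whole row.
    batch_sizes.append(len(batch))
print(batch_sizes)
```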