摆脱记忆错误以加入算法

发布于 2025-02-12 13:33:40 字数 7322 浏览 0 评论 0原文

我得到了一个数据集,坐在.txt文件中,由RDF三元组的1000万行组成,例如:

wsdbm:User0 wsdbm:follows   wsdbm:User300 .
wsdbm:User6 wsdbm:likes wsdbm:Product92 .
wsdbm:Product0  rev:hasReview   wsdbm:Review478 .
wsdbm:User2 wsdbm:friendOf  wsdbm:User119 .
....

由于这些是RDF三元组,因此,在我们的情况下,我们

Subjects: User0, User6, Product, User2
Predicates: follows, likes, hasReview, friendOf
Objects: User300, Product92, Review478, User119

的目标是以SQL形式编写查询:

SELECT follows.subject, follows.object, friendOf.object, 
       likes.object, hasReview.object
FROM follows, friendOf, likes, hasReview
WHERE follows.object = friendOf.subject
      AND friendOf.object = likes.subject
      AND likes.object = hasReview.subject

到目前为止,我创建了一个称为ProtertyTables的类,该类具有通过初始文件迭代的方法,并将每个主题,谓词和对象转换为一个整数,以改善Join并保存内存的计算时间:

class PropertyTables():
    """
    This class holds all 4 Property Tables necessary for the required query.
    Each Property Table is an instance of the class 'PropertyTable'.
    """

    def __init__(self):

        self.property_tables = defaultdict()
        self.hash_map = HashDict()

    def parse_file(self, file_path, remove_prefix = False):
    
        data = open(file_path, 'r')
        for line in data:
            subj, prop, *obj = line.rstrip('\n.').split('\t')
            obj = obj[0].rstrip()
    
            if remove_prefix:
                subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
            
            if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
                self.hash_and_store(subj, prop, obj)
    
        data.close()

该类propertytable,docstring中提到的属性:

class PropertyTable():
    """
    This class represents a single Property Table, i.e. it holds every Subject and Object
    """

    def __init__(self):

        self.table = []


    def insert(self, r, s):
        
        # If r and s are already tuples, they get appended to the Property Table. 
        # Otherwise, we convert them to a tuple beforehand. This is mostly relevant when creating the 
        # Property Tables when reading the data.

        if type(r) == tuple:
            self.table.append(r + s)

        else:
            self.table.append((r, s))

hashdict()是一个简单的字典,可以在加入后再次检索它们。

为了不走一篇帖子,我现在有一个单个哈希算法:

def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):

    ht = defaultdict(list)

    # Create Hash Table for table1

    for s in property_1.table:
        ht[s[index_0]].append(s)

    # Join Tables

    joined_table = PropertyTable()

    for r in property_2.table:
        for s in ht[r[index_1]]:
            joined_table.insert(s, r)

    return joined_table

我使用此功能顺序加入每个表,给定以前的要求。

WHERE follows.object = friendOf.subject
      AND friendOf.object = likes.subject
      AND likes.object = hasReview.subject

join_follows_friendOf = hash_join(pt.property_tables['follows'], 1, pt.property_tables['friendOf'], 0)
join_friendOf_likes = hash_join(join_follows_friendOf, 3, pt.property_tables['likes'], 0)
join_likes_hasReview = hash_join(join_friendOf_likes, 5, pt.property_tables['hasReview'], 0)

结果对于小表来说是正确的,但是1000万行只是导致了不足的记忆错误,我正在寻找避免这种情况的方法。对此非常广泛的帖子,我感到抱歉,但是我想为了一些建议,需要一些细节!

编辑:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    53     68.0 MiB     68.0 MiB           1       @profile
    54                                             def hash_and_store(self, subj, prop, obj):
    55                                         
    56     68.0 MiB      0.0 MiB           1           hashed_subj, hashed_obj = self.hash_map.hash_values(subj, obj)
    57                                         
    58     68.0 MiB      0.0 MiB           1           if prop not in self.property_tables: 
    59                                                     self.property_tables[prop] = PropertyTable()
    60     68.0 MiB      0.0 MiB           1           self.property_tables[prop].insert(hashed_subj, hashed_obj)


Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    32     68.1 MiB     68.1 MiB           1       @profile
    33                                             def parse_file(self, file_path, remove_prefix = False):
    34                                         
    35     68.1 MiB      0.0 MiB           1           data = open(file_path, 'r')
    36                                         
    37                                                 
    38                                                 
    39                                                
    40                                         
    41     80.7 MiB      0.3 MiB      109311           for line in data:
    42     80.7 MiB      0.0 MiB      109310               subj, prop, *obj = line.rstrip('\n.').split('\t')
    43     80.7 MiB      0.5 MiB      109310               obj = obj[0].rstrip()
    44                                         
    45     80.7 MiB      0.0 MiB      109310               if remove_prefix:
    46     80.7 MiB      9.0 MiB      655860                   subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
    47                                                     
    48     80.7 MiB      0.0 MiB      109310               if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
    49     80.7 MiB      2.8 MiB       80084                   self.hash_and_store(subj, prop, obj)
    50                                         
    51     80.7 MiB      0.0 MiB           1           data.close()
    
    
   Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    38     80.7 MiB     80.7 MiB           1       @profile
    39                                             def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):
    40                                         
    41     80.7 MiB      0.0 MiB           1           ht = defaultdict(list)
    42                                         
    43                                                 # Create Hash Table for table1
    44                                         
    45     81.2 MiB      0.0 MiB       31888           for s in property_1.table:
    46     81.2 MiB      0.5 MiB       31887               ht[s[index_0]].append(s)
    47                                         
    48                                                 # Join Tables
    49                                         
    50     81.2 MiB      0.0 MiB           1           joined_table = PropertyTable()
    51                                         
    52    203.8 MiB      0.0 MiB       45713           for r in property_2.table:
    53    203.8 MiB      0.0 MiB     1453580               for s in ht[r[index_1]]:
    54    203.8 MiB    122.6 MiB     1407868                   joined_table.insert(s, r)
    55                                             
    56    203.8 MiB      0.0 MiB           1           return joined_table

I got a dataset, sitting in a .txt file, consisting of 10 million rows in the form of RDF triples, like such:

wsdbm:User0 wsdbm:follows   wsdbm:User300 .
wsdbm:User6 wsdbm:likes wsdbm:Product92 .
wsdbm:Product0  rev:hasReview   wsdbm:Review478 .
wsdbm:User2 wsdbm:friendOf  wsdbm:User119 .
....

Since these are RDF triples, in our case we have

Subjects: User0, User6, Product, User2
Predicates: follows, likes, hasReview, friendOf
Objects: User300, Product92, Review478, User119

My goal is to write a query in the SQL form:

SELECT follows.subject, follows.object, friendOf.object, 
       likes.object, hasReview.object
FROM follows, friendOf, likes, hasReview
WHERE follows.object = friendOf.subject
      AND friendOf.object = likes.subject
      AND likes.object = hasReview.subject

So far, I create a class called PropertyTables, which has a method that iterates over the initial file and convert each subject, predicate and object into an integer to improve computational time on the join and save memory:

class PropertyTables():
    """
    This class holds all 4 Property Tables necessary for the required query.
    Each Property Table is an instance of the class 'PropertyTable'.
    """

    def __init__(self):

        self.property_tables = defaultdict()
        self.hash_map = HashDict()

    def parse_file(self, file_path, remove_prefix = False):
    
        data = open(file_path, 'r')
        for line in data:
            subj, prop, *obj = line.rstrip('\n.').split('\t')
            obj = obj[0].rstrip()
    
            if remove_prefix:
                subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
            
            if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
                self.hash_and_store(subj, prop, obj)
    
        data.close()

the class PropertyTable, mentioned in the docstring:

class PropertyTable():
    """
    This class represents a single Property Table, i.e. it holds every Subject and Object
    """

    def __init__(self):

        self.table = []


    def insert(self, r, s):
        
        # If r and s are already tuples, they get appended to the Property Table. 
        # Otherwise, we convert them to a tuple beforehand. This is mostly relevant when creating the 
        # Property Tables when reading the data.

        if type(r) == tuple:
            self.table.append(r + s)

        else:
            self.table.append((r, s))

The class HashDict() is a simple dictionary that hashes values, so we can retrieve them again after the join.

To not go to far with one post, I have now a single hash join algorithm:

def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):

    ht = defaultdict(list)

    # Create Hash Table for table1

    for s in property_1.table:
        ht[s[index_0]].append(s)

    # Join Tables

    joined_table = PropertyTable()

    for r in property_2.table:
        for s in ht[r[index_1]]:
            joined_table.insert(s, r)

    return joined_table

I use this function to sequentially join each table, given the requirements from before.

WHERE follows.object = friendOf.subject
      AND friendOf.object = likes.subject
      AND likes.object = hasReview.subject

join_follows_friendOf = hash_join(pt.property_tables['follows'], 1, pt.property_tables['friendOf'], 0)
join_friendOf_likes = hash_join(join_follows_friendOf, 3, pt.property_tables['likes'], 0)
join_likes_hasReview = hash_join(join_friendOf_likes, 5, pt.property_tables['hasReview'], 0)

The result is correct for small tables, but 10 million rows simply result in an Out of Memory Error and I am looking for ways to avoid this. I am sorry for this very extensive post, but I guess some details are necessary in order for some advice!

Edit:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    53     68.0 MiB     68.0 MiB           1       @profile
    54                                             def hash_and_store(self, subj, prop, obj):
    55                                         
    56     68.0 MiB      0.0 MiB           1           hashed_subj, hashed_obj = self.hash_map.hash_values(subj, obj)
    57                                         
    58     68.0 MiB      0.0 MiB           1           if prop not in self.property_tables: 
    59                                                     self.property_tables[prop] = PropertyTable()
    60     68.0 MiB      0.0 MiB           1           self.property_tables[prop].insert(hashed_subj, hashed_obj)


Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    32     68.1 MiB     68.1 MiB           1       @profile
    33                                             def parse_file(self, file_path, remove_prefix = False):
    34                                         
    35     68.1 MiB      0.0 MiB           1           data = open(file_path, 'r')
    36                                         
    37                                                 
    38                                                 
    39                                                
    40                                         
    41     80.7 MiB      0.3 MiB      109311           for line in data:
    42     80.7 MiB      0.0 MiB      109310               subj, prop, *obj = line.rstrip('\n.').split('\t')
    43     80.7 MiB      0.5 MiB      109310               obj = obj[0].rstrip()
    44                                         
    45     80.7 MiB      0.0 MiB      109310               if remove_prefix:
    46     80.7 MiB      9.0 MiB      655860                   subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
    47                                                     
    48     80.7 MiB      0.0 MiB      109310               if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
    49     80.7 MiB      2.8 MiB       80084                   self.hash_and_store(subj, prop, obj)
    50                                         
    51     80.7 MiB      0.0 MiB           1           data.close()
    
    
   Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    38     80.7 MiB     80.7 MiB           1       @profile
    39                                             def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):
    40                                         
    41     80.7 MiB      0.0 MiB           1           ht = defaultdict(list)
    42                                         
    43                                                 # Create Hash Table for table1
    44                                         
    45     81.2 MiB      0.0 MiB       31888           for s in property_1.table:
    46     81.2 MiB      0.5 MiB       31887               ht[s[index_0]].append(s)
    47                                         
    48                                                 # Join Tables
    49                                         
    50     81.2 MiB      0.0 MiB           1           joined_table = PropertyTable()
    51                                         
    52    203.8 MiB      0.0 MiB       45713           for r in property_2.table:
    53    203.8 MiB      0.0 MiB     1453580               for s in ht[r[index_1]]:
    54    203.8 MiB    122.6 MiB     1407868                   joined_table.insert(s, r)
    55                                             
    56    203.8 MiB      0.0 MiB           1           return joined_table

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

亽野灬性zι浪 2025-02-19 13:33:40

您问题的核心是:

结果对于小表来说是正确的,但是1000万行只会导致记忆错误,我正在寻找避免这种情况的方法。

遵循您的顶级问题语句,但使用较少通用的结构,我们可以做类似的事情:

def runQuery(dataLines):
    from collections import defaultdict
    pred = dict(zip(['follows','friendOf','likes','hasReview'],range(4)))
    tables = [defaultdict(list) for _ in pred]

    def encode(s):
        if s[-1].isdigit():
            i = 0
            while s[-1 - i].isdigit():
                i += 1
            return int(s[-i:])
        if any(s.endswith(k) for k in pred):
            return sum(v for k, v in pred.items() if s.endswith(k))
        return None
    
    for line in dataLines:
        if not line:
            continue
        subj, prop, *obj = line.rstrip('\n.').split('\t')
        obj = obj[0].rstrip()
        subj, prop, obj = [encode(s) for s in (subj, prop, obj)]
        if prop is not None:
            tables[prop][subj].append(obj)

    tables = [{k:tuple(v) for k, v in table.items()} for table in tables]
    #[print(list(pred.keys())[i], tables[i], sep='\n') for i in range(len(pred))]

    # create reverse index for subject, object where subject [user] follows object [user]
    object_of_follows = defaultdict(set)
    for k, v in tables[pred['follows']].items():
        for user in v:
            object_of_follows[user].add(k)
    # create reverse index for subject, object where subject [user] is friendOf object [user]
    object_of_friendOf = defaultdict(set)
    for k, v in tables[pred['friendOf']].items():
        if k in object_of_follows:
            for user in v:
                object_of_friendOf[user].add(k)
    # create reverse index for subject, object where subject [user] likes object [product]
    object_of_likes = defaultdict(set)
    for k, v in tables[pred['likes']].items():
        if k in object_of_friendOf:
            for product in v:
                object_of_likes[product].add(k)
    # create reverse index for subject, object where subject [product] hasReview object [review]
    object_of_hasReview = defaultdict(set)
    for k, v in tables[pred['hasReview']].items():
        if k in object_of_likes:
            for review in v:
                object_of_hasReview[review].add(k)

    def addToResult(result, e):
        d = object_of_hasReview[e]
        c = {y for x in d for y in object_of_likes[x]}
        b = {y for x in c for y in object_of_friendOf[x]}
        a = {y for x in b for y in object_of_follows[x]}
        toAdd = [(ax, bx, cx, dx, e) for dx in d for cx in c for bx in b for ax in a]
        result += toAdd

    result = []
    for e in object_of_hasReview:
        addToResult(result, e)
    print(f'result row count {len(result):,}')
    return result

说明:

  • 创建4个表的列表(conters,friendf,likes,likes,hasreview),每个词典映射受对象元组的约束,
  • 创建4个反向索引(object_of_follows,object_of_friendof,object_of_likes, object_of_hasreview);例如:
    • object_of_follows是一个dict,它映射每个用户中的对象都遵循 映射到一组用户,每个用户都是中的主题,请的代码>
      friendf at object_of_follows(换句话说,是中一个或多个主题的对象, conters
    • 等。
  • object_of_hasreview中包含每个唯一结果的多个结果行contrys.subject,closts.Object,object,friendsof.object,likes.object, hasreview.Object,如查询中指定的
  • 返回所有此类爆炸行的列表。

1000万行的测试代码:

dataLines = []
numFollowers = 1000
numChildren = 10
overlapFactor = max(1, numChildren // 2)
def largerPowerOfTen(x):
    y = 1
    while x >= y:
        y *= 10
    return y

aCeil = largerPowerOfTen(numFollowers)
bCeil = largerPowerOfTen(aCeil * numChildren)
cCeil = largerPowerOfTen(bCeil * numChildren)
dCeil = largerPowerOfTen(cCeil * numChildren)
friendOf, likes = set(), set()
for a in range(numFollowers):
    for b in range(aCeil + a * overlapFactor, aCeil + a * overlapFactor + numChildren):
        dataLines.append(f'wsdbm:User{a}    wsdbm:follows   wsdbm:User{b} .\n')
        for c in range(bCeil + b * overlapFactor, bCeil + b * overlapFactor + numChildren):
            if (b,c) not in friendOf:
                dataLines.append(f'wsdbm:User{b}    wsdbm:friendOf  wsdbm:User{c} .\n')
                friendOf.add((b,c))
            for d in range(cCeil + c * overlapFactor, cCeil + c * overlapFactor + numChildren):
                if (c,d) not in likes:
                    dataLines.append(f'wsdbm:User{c}    wsdbm:likes wsdbm:Product{d} .\n')
                    likes.add((c,d))
                for e in range(dCeil * (d + 1), dCeil * (d + 1) + numChildren):
                    dataLines.append(f'wsdbm:Product{d} wsdbm:hasReview wsdbm:Review{e} .\n')

print(f'dataLines row count {len(dataLines):,}')

from timeit import timeit
n = 1
print(f'Timeit results:')
t = timeit(f"runQuery(dataLines)", setup=f"from __main__ import dataLines, runQuery", number=n) / n
print(f'======== runQuery ran in {t} seconds using {n} iterations')
'''
result = runQuery(dataLines)
print(f'result row count {len(result):,}')
print(f'{"follows.subject":>20}{"follows.object":>20}{"friendsOf.object":>20}{"likes.object":>20}{"hasReview.object":>20}')
[print(f'{a:20}{b:20}{c:20}{d:20}{e:20}') for a,b,c,d,e in result]
'''

输出:

dataLines row count 10,310,350
Timeit results:
result row count 12,398,500
======== runQuery ran in 81.53253880003467 seconds using 1 iterations

较小规模的样本运行中的输入/输出:

参数

numFollowers = 3
numChildren = 3
overlapFactor = 2

输入(存储在表中):

follows
{0: (10, 11, 12), 1: (12, 13, 14), 2: (14, 15, 16)}
friendOf
{10: (120, 121, 122), 11: (122, 123, 124), 12: (124, 125, 126), 13: (126, 127, 128), 14: (128, 129, 130), 15: (130, 131, 132), 16: (132, 133, 134)}
likes
{120: (1240, 1241, 1242), 121: (1242, 1243, 1244), 122: (1244, 1245, 1246), 123: (1246, 1247, 1248), 124: (1248, 1249, 1250), 125: (1250, 1251, 1252), 126: (1252, 1253, 1254), 127: (1254, 1255, 1256), 128: (1256, 1257, 1258), 129: (1258, 1259, 1260), 130: (1260, 1261, 1262), 131: (1262, 1263, 1264), 132: (1264, 1265, 1266), 133: (1266, 1267, 1268), 134: (1268, 1269, 1270)}
hasReview
{1240: (12410000, 12410001, 12410002), 1241: (12420000, 12420001, 12420002), 1242: (12430000, 12430001, 12430002, 12430000, 12430001, 12430002), 1243: (12440000, 12440001, 12440002), 1244: (12450000, 12450001, 12450002, 12450000, 12450001, 12450002, 12450000, 12450001, 12450002), 1245: (12460000, 12460001, 12460002, 12460000, 12460001, 12460002), 1246: (12470000, 12470001, 12470002, 12470000, 12470001, 12470002, 12470000, 12470001, 12470002), 1247: (12480000, 12480001, 12480002), 1248: (12490000, 12490001, 12490002, 12490000, 12490001, 12490002, 12490000, 12490001, 12490002, 12490000, 12490001, 12490002), 1249: (12500000, 12500001, 12500002, 12500000, 12500001, 12500002, 12500000, 12500001, 12500002), 1250: (12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002), 1251: (12520000, 12520001, 12520002, 12520000, 12520001, 12520002), 1252: (12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002), 1253: (12540000, 12540001, 12540002, 12540000, 12540001, 12540002, 12540000, 12540001, 12540002), 1254: (12550000, 12550001, 12550002, 12550000, 12550001, 12550002, 12550000, 12550001, 12550002, 12550000, 12550001, 12550002), 1255: (12560000, 12560001, 12560002), 1256: (12570000, 12570001, 12570002, 12570000, 12570001, 12570002, 12570000, 12570001, 12570002, 12570000, 12570001, 12570002), 1257: (12580000, 12580001, 12580002, 12580000, 12580001, 12580002, 12580000, 12580001, 12580002), 1258: (12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002), 1259: (12600000, 12600001, 12600002, 12600000, 12600001, 12600002), 1260: (12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002), 1261: (12620000, 12620001, 12620002, 12620000, 12620001, 12620002, 12620000, 12620001, 12620002), 1262: (12630000, 12630001, 12630002, 12630000, 12630001, 12630002, 12630000, 12630001, 12630002, 12630000, 12630001, 12630002), 1263: (12640000, 12640001, 12640002), 1264: (12650000, 12650001, 12650002, 12650000, 12650001, 12650002, 12650000, 12650001, 12650002), 1265: (12660000, 12660001, 12660002, 12660000, 12660001, 12660002), 1266: (12670000, 12670001, 12670002, 12670000, 12670001, 12670002, 12670000, 12670001, 12670002), 1267: (12680000, 12680001, 12680002), 1268: (12690000, 12690001, 12690002, 12690000, 12690001, 12690002), 1269: (12700000, 12700001, 12700002), 1270: (12710000, 12710001, 12710002)}

输出

result row count 351
     follows.subject      follows.object    friendsOf.object        likes.object    hasReview.object
                   0                  10                 120                1240            12410000
                   0                  10                 120                1240            12410001
                   0                  10                 120                1240            12410002
                   0                  10                 120                1241            12420000
                   0                  10                 120                1241            12420001
                   0                  10                 120                1241            12420002
                   0                  10                 120                1242            12430000
                   0                  10                 121                1242            12430000
                   0                  10                 120                1242            12430001
                   0                  10                 121                1242            12430001
                   0                  10                 120                1242            12430002
                   0                  10                 121                1242            12430002
                   0                  10                 121                1243            12440000
                   0                  10                 121                1243            12440001
                   0                  10                 121                1243            12440002
                   0                  10                 121                1244            12450000
                   0                  11                 121                1244            12450000
                   0                  10                 122                1244            12450000
                   0                  11                 122                1244            12450000
                   0                  10                 121                1244            12450001
                   0                  11                 121                1244            12450001
                   0                  10                 122                1244            12450001
                   0                  11                 122                1244            12450001
                   0                  10                 121                1244            12450002
                   0                  11                 121                1244            12450002

etc.

The core of your question is this:

The result is correct for small tables, but 10 million rows simply result in an Out of Memory Error and I am looking for ways to avoid this.

Following your top-level problem statement but with a less generic structure, we can do something like this:

def runQuery(dataLines):
    from collections import defaultdict
    pred = dict(zip(['follows','friendOf','likes','hasReview'],range(4)))
    tables = [defaultdict(list) for _ in pred]

    def encode(s):
        if s[-1].isdigit():
            i = 0
            while s[-1 - i].isdigit():
                i += 1
            return int(s[-i:])
        if any(s.endswith(k) for k in pred):
            return sum(v for k, v in pred.items() if s.endswith(k))
        return None
    
    for line in dataLines:
        if not line:
            continue
        subj, prop, *obj = line.rstrip('\n.').split('\t')
        obj = obj[0].rstrip()
        subj, prop, obj = [encode(s) for s in (subj, prop, obj)]
        if prop is not None:
            tables[prop][subj].append(obj)

    tables = [{k:tuple(v) for k, v in table.items()} for table in tables]
    #[print(list(pred.keys())[i], tables[i], sep='\n') for i in range(len(pred))]

    # create reverse index for subject, object where subject [user] follows object [user]
    object_of_follows = defaultdict(set)
    for k, v in tables[pred['follows']].items():
        for user in v:
            object_of_follows[user].add(k)
    # create reverse index for subject, object where subject [user] is friendOf object [user]
    object_of_friendOf = defaultdict(set)
    for k, v in tables[pred['friendOf']].items():
        if k in object_of_follows:
            for user in v:
                object_of_friendOf[user].add(k)
    # create reverse index for subject, object where subject [user] likes object [product]
    object_of_likes = defaultdict(set)
    for k, v in tables[pred['likes']].items():
        if k in object_of_friendOf:
            for product in v:
                object_of_likes[product].add(k)
    # create reverse index for subject, object where subject [product] hasReview object [review]
    object_of_hasReview = defaultdict(set)
    for k, v in tables[pred['hasReview']].items():
        if k in object_of_likes:
            for review in v:
                object_of_hasReview[review].add(k)

    def addToResult(result, e):
        d = object_of_hasReview[e]
        c = {y for x in d for y in object_of_likes[x]}
        b = {y for x in c for y in object_of_friendOf[x]}
        a = {y for x in b for y in object_of_follows[x]}
        toAdd = [(ax, bx, cx, dx, e) for dx in d for cx in c for bx in b for ax in a]
        result += toAdd

    result = []
    for e in object_of_hasReview:
        addToResult(result, e)
    print(f'result row count {len(result):,}')
    return result

Explanation:

  • Create a list of 4 tables (follows, friendOf, likes, hasReview), each a dictionary mapping subject to a tuple of objects
  • Create 4 reverse indexes (object_of_follows, object_of_friendOf, object_of_likes, object_of_hasReview); for example:
    • object_of_follows is a dict that maps each user that is an object in follows to a set of users, each of which is a subject in follows that follows the object
    • object_of_friendOf is a dict that maps each object (user) in friendOf to a set of users, each of which is a subject (user) associated with the object in friendOf and is in object_of_follows (in other words, is an object for one or more subjects in follows)
    • etc.
  • Explode each review that survived in object_of_hasReview into multiple result rows containing each unique result follows.subject, follows.object, friendsOf.object, likes.object, hasReview.object as specified in the query
  • Return the list of all such exploded rows.

Test code for 10 million lines:

dataLines = []
numFollowers = 1000
numChildren = 10
overlapFactor = max(1, numChildren // 2)
def largerPowerOfTen(x):
    y = 1
    while x >= y:
        y *= 10
    return y

aCeil = largerPowerOfTen(numFollowers)
bCeil = largerPowerOfTen(aCeil * numChildren)
cCeil = largerPowerOfTen(bCeil * numChildren)
dCeil = largerPowerOfTen(cCeil * numChildren)
friendOf, likes = set(), set()
for a in range(numFollowers):
    for b in range(aCeil + a * overlapFactor, aCeil + a * overlapFactor + numChildren):
        dataLines.append(f'wsdbm:User{a}    wsdbm:follows   wsdbm:User{b} .\n')
        for c in range(bCeil + b * overlapFactor, bCeil + b * overlapFactor + numChildren):
            if (b,c) not in friendOf:
                dataLines.append(f'wsdbm:User{b}    wsdbm:friendOf  wsdbm:User{c} .\n')
                friendOf.add((b,c))
            for d in range(cCeil + c * overlapFactor, cCeil + c * overlapFactor + numChildren):
                if (c,d) not in likes:
                    dataLines.append(f'wsdbm:User{c}    wsdbm:likes wsdbm:Product{d} .\n')
                    likes.add((c,d))
                for e in range(dCeil * (d + 1), dCeil * (d + 1) + numChildren):
                    dataLines.append(f'wsdbm:Product{d} wsdbm:hasReview wsdbm:Review{e} .\n')

print(f'dataLines row count {len(dataLines):,}')

from timeit import timeit
n = 1
print(f'Timeit results:')
t = timeit(f"runQuery(dataLines)", setup=f"from __main__ import dataLines, runQuery", number=n) / n
print(f'======== runQuery ran in {t} seconds using {n} iterations')
'''
result = runQuery(dataLines)
print(f'result row count {len(result):,}')
print(f'{"follows.subject":>20}{"follows.object":>20}{"friendsOf.object":>20}{"likes.object":>20}{"hasReview.object":>20}')
[print(f'{a:20}{b:20}{c:20}{d:20}{e:20}') for a,b,c,d,e in result]
'''

Output:

dataLines row count 10,310,350
Timeit results:
result row count 12,398,500
======== runQuery ran in 81.53253880003467 seconds using 1 iterations

Here's input/output from a smaller-scale sample run:

Params

numFollowers = 3
numChildren = 3
overlapFactor = 2

Input (after storing in tables):

follows
{0: (10, 11, 12), 1: (12, 13, 14), 2: (14, 15, 16)}
friendOf
{10: (120, 121, 122), 11: (122, 123, 124), 12: (124, 125, 126), 13: (126, 127, 128), 14: (128, 129, 130), 15: (130, 131, 132), 16: (132, 133, 134)}
likes
{120: (1240, 1241, 1242), 121: (1242, 1243, 1244), 122: (1244, 1245, 1246), 123: (1246, 1247, 1248), 124: (1248, 1249, 1250), 125: (1250, 1251, 1252), 126: (1252, 1253, 1254), 127: (1254, 1255, 1256), 128: (1256, 1257, 1258), 129: (1258, 1259, 1260), 130: (1260, 1261, 1262), 131: (1262, 1263, 1264), 132: (1264, 1265, 1266), 133: (1266, 1267, 1268), 134: (1268, 1269, 1270)}
hasReview
{1240: (12410000, 12410001, 12410002), 1241: (12420000, 12420001, 12420002), 1242: (12430000, 12430001, 12430002, 12430000, 12430001, 12430002), 1243: (12440000, 12440001, 12440002), 1244: (12450000, 12450001, 12450002, 12450000, 12450001, 12450002, 12450000, 12450001, 12450002), 1245: (12460000, 12460001, 12460002, 12460000, 12460001, 12460002), 1246: (12470000, 12470001, 12470002, 12470000, 12470001, 12470002, 12470000, 12470001, 12470002), 1247: (12480000, 12480001, 12480002), 1248: (12490000, 12490001, 12490002, 12490000, 12490001, 12490002, 12490000, 12490001, 12490002, 12490000, 12490001, 12490002), 1249: (12500000, 12500001, 12500002, 12500000, 12500001, 12500002, 12500000, 12500001, 12500002), 1250: (12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002, 12510000, 12510001, 12510002), 1251: (12520000, 12520001, 12520002, 12520000, 12520001, 12520002), 1252: (12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002, 12530000, 12530001, 12530002), 1253: (12540000, 12540001, 12540002, 12540000, 12540001, 12540002, 12540000, 12540001, 12540002), 1254: (12550000, 12550001, 12550002, 12550000, 12550001, 12550002, 12550000, 12550001, 12550002, 12550000, 12550001, 12550002), 1255: (12560000, 12560001, 12560002), 1256: (12570000, 12570001, 12570002, 12570000, 12570001, 12570002, 12570000, 12570001, 12570002, 12570000, 12570001, 12570002), 1257: (12580000, 12580001, 12580002, 12580000, 12580001, 12580002, 12580000, 12580001, 12580002), 1258: (12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002, 12590000, 12590001, 12590002), 1259: (12600000, 12600001, 12600002, 12600000, 12600001, 12600002), 1260: (12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002, 12610000, 12610001, 12610002), 1261: (12620000, 12620001, 12620002, 12620000, 12620001, 12620002, 12620000, 12620001, 12620002), 1262: (12630000, 12630001, 12630002, 12630000, 12630001, 12630002, 12630000, 12630001, 12630002, 12630000, 12630001, 12630002), 1263: (12640000, 12640001, 12640002), 1264: (12650000, 12650001, 12650002, 12650000, 12650001, 12650002, 12650000, 12650001, 12650002), 1265: (12660000, 12660001, 12660002, 12660000, 12660001, 12660002), 1266: (12670000, 12670001, 12670002, 12670000, 12670001, 12670002, 12670000, 12670001, 12670002), 1267: (12680000, 12680001, 12680002), 1268: (12690000, 12690001, 12690002, 12690000, 12690001, 12690002), 1269: (12700000, 12700001, 12700002), 1270: (12710000, 12710001, 12710002)}

Output

result row count 351
     follows.subject      follows.object    friendsOf.object        likes.object    hasReview.object
                   0                  10                 120                1240            12410000
                   0                  10                 120                1240            12410001
                   0                  10                 120                1240            12410002
                   0                  10                 120                1241            12420000
                   0                  10                 120                1241            12420001
                   0                  10                 120                1241            12420002
                   0                  10                 120                1242            12430000
                   0                  10                 121                1242            12430000
                   0                  10                 120                1242            12430001
                   0                  10                 121                1242            12430001
                   0                  10                 120                1242            12430002
                   0                  10                 121                1242            12430002
                   0                  10                 121                1243            12440000
                   0                  10                 121                1243            12440001
                   0                  10                 121                1243            12440002
                   0                  10                 121                1244            12450000
                   0                  11                 121                1244            12450000
                   0                  10                 122                1244            12450000
                   0                  11                 122                1244            12450000
                   0                  10                 121                1244            12450001
                   0                  11                 121                1244            12450001
                   0                  10                 122                1244            12450001
                   0                  11                 122                1244            12450001
                   0                  10                 121                1244            12450002
                   0                  11                 121                1244            12450002

etc.
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文