Getting rid of memory errors for a join algorithm
I have a dataset, sitting in a .txt file, consisting of 10 million rows of RDF triples, such as:
wsdbm:User0 wsdbm:follows wsdbm:User300 .
wsdbm:User6 wsdbm:likes wsdbm:Product92 .
wsdbm:Product0 rev:hasReview wsdbm:Review478 .
wsdbm:User2 wsdbm:friendOf wsdbm:User119 .
....
Since these are RDF triples, in our case we have
Subjects: User0, User6, Product0, User2
Predicates: follows, likes, hasReview, friendOf
Objects: User300, Product92, Review478, User119
My goal is to run a query that, in SQL form, looks like this:
SELECT follows.subject, follows.object, friendOf.object,
likes.object, hasReview.object
FROM follows, friendOf, likes, hasReview
WHERE follows.object = friendOf.subject
AND friendOf.object = likes.subject
AND likes.object = hasReview.subject
So far, I have created a class called PropertyTables, which has a method that iterates over the initial file and converts each subject, predicate, and object into an integer, to improve computation time on the join and to save memory:
from collections import defaultdict

class PropertyTables():
    """
    This class holds all 4 Property Tables necessary for the required query.
    Each Property Table is an instance of the class 'PropertyTable'.
    """

    def __init__(self):
        self.property_tables = defaultdict()  # no default factory, so this behaves like a plain dict
        self.hash_map = HashDict()

    def parse_file(self, file_path, remove_prefix=False):
        data = open(file_path, 'r')
        for line in data:
            # Strip the trailing ' .' and newline, then split on tabs; the
            # object can carry trailing whitespace, hence the extra rstrip.
            subj, prop, *obj = line.rstrip('\n.').split('\t')
            obj = obj[0].rstrip()
            if remove_prefix:
                subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
            if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
                self.hash_and_store(subj, prop, obj)
        data.close()
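The helper hash_and_store is another method on PropertyTables; its source also appears in the profiler output in the edit below. It interns the subject and object as integers and stores the resulting pair:

def hash_and_store(self, subj, prop, obj):
    # Intern subject and object as integers, then append the pair to the
    # property's table, creating that table on first encounter.
    hashed_subj, hashed_obj = self.hash_map.hash_values(subj, obj)
    if prop not in self.property_tables:
        self.property_tables[prop] = PropertyTable()
    self.property_tables[prop].insert(hashed_subj, hashed_obj)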
The class PropertyTable, mentioned in the docstring above:
class PropertyTable():
    """
    This class represents a single Property Table, i.e. it holds every Subject and Object.
    """

    def __init__(self):
        self.table = []

    def insert(self, r, s):
        # If r is already a tuple (an existing row being extended during a
        # join), r and s are concatenated and appended to the Property Table.
        # Otherwise we build a new (r, s) tuple first, which is the case when
        # creating the Property Tables while reading the data.
        if type(r) == tuple:
            self.table.append(r + s)
        else:
            self.table.append((r, s))
The class HashDict is a simple dictionary that hashes values, so we can retrieve the original values again after the join.
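For reference, a minimal version consistent with how it is used would look something like the following; the attribute names are just illustrative, only hash_values appears in the profiled code below:

class HashDict:
    """Interning dictionary: assigns each string a stable integer ID and
    keeps the reverse mapping, so originals can be recovered after the join."""

    def __init__(self):
        self.value_to_id = {}
        self.id_to_value = {}

    def hash_values(self, *values):
        # Return the integer ID for each value, assigning new IDs on first sight.
        ids = []
        for value in values:
            if value not in self.value_to_id:
                new_id = len(self.value_to_id)
                self.value_to_id[value] = new_id
                self.id_to_value[new_id] = value
            ids.append(self.value_to_id[value])
        return tuple(ids)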
To not go too far with one post, I currently have a single hash join algorithm:
def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):
    ht = defaultdict(list)

    # Create the hash table for the first table, keyed on its join column.
    for s in property_1.table:
        ht[s[index_0]].append(s)

    # Join the tables: probe the hash table with each row of the second table.
    joined_table = PropertyTable()
    for r in property_2.table:
        for s in ht[r[index_1]]:
            joined_table.insert(s, r)
    return joined_table
I use this function to sequentially join each table, given the requirements from before (the integer arguments are the positions of the join columns inside the growing result tuples).
WHERE follows.object = friendOf.subject
AND friendOf.object = likes.subject
AND likes.object = hasReview.subject
join_follows_friendOf = hash_join(pt.property_tables['follows'], 1, pt.property_tables['friendOf'], 0)
join_friendOf_likes = hash_join(join_follows_friendOf, 3, pt.property_tables['likes'], 0)
join_likes_hasReview = hash_join(join_friendOf_likes, 5, pt.property_tables['hasReview'], 0)
The result is correct for small tables, but 10 million rows simply result in an out-of-memory error, and I am looking for ways to avoid this. I am sorry for this very extensive post, but I guess some details are necessary in order to get some advice!
Edit: here is memory_profiler output for the relevant methods:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
53 68.0 MiB 68.0 MiB 1 @profile
54 def hash_and_store(self, subj, prop, obj):
55
56 68.0 MiB 0.0 MiB 1 hashed_subj, hashed_obj = self.hash_map.hash_values(subj, obj)
57
58 68.0 MiB 0.0 MiB 1 if prop not in self.property_tables:
59 self.property_tables[prop] = PropertyTable()
60 68.0 MiB 0.0 MiB 1 self.property_tables[prop].insert(hashed_subj, hashed_obj)
Line # Mem usage Increment Occurrences Line Contents
=============================================================
32 68.1 MiB 68.1 MiB 1 @profile
33 def parse_file(self, file_path, remove_prefix = False):
34
35 68.1 MiB 0.0 MiB 1 data = open(file_path, 'r')
36
37
38
39
40
41 80.7 MiB 0.3 MiB 109311 for line in data:
42 80.7 MiB 0.0 MiB 109310 subj, prop, *obj = line.rstrip('\n.').split('\t')
43 80.7 MiB 0.5 MiB 109310 obj = obj[0].rstrip()
44
45 80.7 MiB 0.0 MiB 109310 if remove_prefix:
46 80.7 MiB 9.0 MiB 655860 subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
47
48 80.7 MiB 0.0 MiB 109310 if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
49 80.7 MiB 2.8 MiB 80084 self.hash_and_store(subj, prop, obj)
50
51 80.7 MiB 0.0 MiB 1 data.close()
Line # Mem usage Increment Occurrences Line Contents
=============================================================
38 80.7 MiB 80.7 MiB 1 @profile
39 def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):
40
41 80.7 MiB 0.0 MiB 1 ht = defaultdict(list)
42
43 # Create Hash Table for table1
44
45 81.2 MiB 0.0 MiB 31888 for s in property_1.table:
46 81.2 MiB 0.5 MiB 31887 ht[s[index_0]].append(s)
47
48 # Join Tables
49
50 81.2 MiB 0.0 MiB 1 joined_table = PropertyTable()
51
52 203.8 MiB 0.0 MiB 45713 for r in property_2.table:
53 203.8 MiB 0.0 MiB 1453580 for s in ht[r[index_1]]:
54 203.8 MiB 122.6 MiB 1407868 joined_table.insert(s, r)
55
56 203.8 MiB 0.0 MiB 1 return joined_table
1 Answer:
The core of your question is this:

"The result is correct for small tables, but 10 million rows simply result in an Out of Memory Error and I am looking for ways to avoid this."
Following your top-level problem statement but with a less generic structure, we can do something like this:
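A sketch of what that can look like, assuming each input table is stored as a dictionary mapping subject to a set of objects (run_query and all variable names here are illustrative):

from collections import defaultdict

def run_query(follows, friendOf, likes, hasReview):
    # Reverse index for follows: object -> set of subjects that follow it.
    object_of_follows = defaultdict(set)
    for subj, objs in follows.items():
        for obj in objs:
            object_of_follows[obj].add(subj)

    # friendOf reversed, keeping only subjects that appear as an object in
    # follows (join condition: follows.object = friendOf.subject).
    object_of_friendOf = defaultdict(set)
    for subj, objs in friendOf.items():
        if subj in object_of_follows:
            for obj in objs:
                object_of_friendOf[obj].add(subj)

    # likes reversed, keeping only subjects that survived the previous level
    # (join condition: friendOf.object = likes.subject).
    object_of_likes = defaultdict(set)
    for subj, objs in likes.items():
        if subj in object_of_friendOf:
            for obj in objs:
                object_of_likes[obj].add(subj)

    # hasReview reversed (join condition: likes.object = hasReview.subject).
    object_of_hasReview = defaultdict(set)
    for subj, objs in hasReview.items():
        if subj in object_of_likes:
            for obj in objs:
                object_of_hasReview[obj].add(subj)

    # Walk the chain backwards and expand it into result rows
    # (follows.subject, follows.object, friendOf.object, likes.object,
    # hasReview.object), yielding lazily so no joined table is materialized.
    for review, likes_objects in object_of_hasReview.items():
        for likes_object in likes_objects:                            # = likes.object
            for friendOf_object in object_of_likes[likes_object]:     # = friendOf.object
                for follows_object in object_of_friendOf[friendOf_object]:  # = follows.object
                    for follows_subject in object_of_follows[follows_object]:
                        yield (follows_subject, follows_object,
                               friendOf_object, likes_object, review)

Because rows are yielded one at a time, peak memory is bounded by the four reverse dictionaries rather than by a materialized join result.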
Explanation:

- Store each of the four input tables (follows, friendOf, likes, hasReview), each a dictionary mapping subject to a tuple of objects.
- Build reverse-lookup dictionaries (object_of_follows, object_of_friendOf, object_of_likes, object_of_hasReview); for example:
  - object_of_follows is a dict that maps each user that is an object in follows to a set of users, each of which is a subject in follows that follows the object.
  - object_of_friendOf is a dict that maps each object (user) in friendOf to a set of users, each of which is a subject (user) associated with the object in friendOf and is in object_of_follows (in other words, is an object for one or more subjects in follows).
  - object_of_likes and object_of_hasReview are built the same way, each keeping only subjects that survived the previous level.
- Walk object_of_hasReview and expand each surviving chain into multiple result rows containing each unique result follows.subject, follows.object, friendOf.object, likes.object, hasReview.object, as specified in the query.

Test code for 10 million lines:
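A sketch of such a harness, using run_query from above; the table sizes and the random generation scheme are assumptions, chosen only so the row counts total roughly 10 million:

import random
import time
from collections import defaultdict

def random_table(num_subjects, num_objects, num_rows, subj_prefix, obj_prefix):
    # Build a subject -> set-of-objects dict with uniformly random pairs.
    table = defaultdict(set)
    for _ in range(num_rows):
        subj = subj_prefix + str(random.randrange(num_subjects))
        obj = obj_prefix + str(random.randrange(num_objects))
        table[subj].add(obj)
    return table

follows   = random_table(1_000_000, 1_000_000, 4_000_000, 'User', 'User')
friendOf  = random_table(1_000_000, 1_000_000, 4_000_000, 'User', 'User')
likes     = random_table(1_000_000,   100_000, 1_000_000, 'User', 'Product')
hasReview = random_table(  100_000, 1_000_000, 1_000_000, 'Product', 'Review')

start = time.perf_counter()
# Consume the generator lazily; only the count is kept in memory.
num_rows = sum(1 for _ in run_query(follows, friendOf, likes, hasReview))
print(num_rows, 'result rows in', round(time.perf_counter() - start, 1), 'seconds')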