Loading multiple CSV files (silos) to compose a TensorFlow Federated dataset
I am working with pre-processed data that has already been siloed into separate CSV files, representing separate local datasets for federated learning.
To correctly implement federated learning with these multiple CSVs on TensorFlow Federated, I am first trying to reproduce the approach with a toy example on the iris dataset. However, when trying to use the method tff.simulation.datasets.TestClientData, I am getting the error:
TypeError: can't pickle _thread.RLock objects
The current code is as follows. First, load the three iris dataset CSV files (50 samples each) into a dictionary from the filenames iris1.csv, iris2.csv, and iris3.csv:
import collections

import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff

silos = {}
for silo in silos_files:
    silo_name = silo.replace(".csv", "")
    silos[silo_name] = pd.read_csv(silos_path + silo)
    # Encode the string class labels as integers.
    silos[silo_name]["variety"].replace(
        {"Setosa": 0, "Versicolor": 1, "Virginica": 2}, inplace=True)
Then create a new dict of tf.data.Dataset objects, one per silo:
silos_tf = collections.OrderedDict()
for key, silo in silos.items():
    silos_tf[key] = tf.data.Dataset.from_tensor_slices(
        (silo.drop(columns=["variety"]).values, silo["variety"].values))
Finally, try to convert the TensorFlow datasets into a TensorFlow Federated dataset:
tff_dataset = tff.simulation.datasets.TestClientData(
silos_tf
)
That raises the error:
TypeError Traceback (most recent call last)
<ipython-input-58-a4b5686509ce> in <module>()
1 tff_dataset = tff.simulation.datasets.TestClientData(
----> 2 silos_tf
3 )
/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/simulation/datasets/from_tensor_slices_client_data.py in __init__(self, tensor_slices_dict)
59 """
60 py_typecheck.check_type(tensor_slices_dict, dict)
---> 61 tensor_slices_dict = copy.deepcopy(tensor_slices_dict)
62 structures = list(tensor_slices_dict.values())
63 example_structure = structures[0]
...
/usr/lib/python3.7/copy.py in deepcopy(x, memo, _nil)
167 reductor = getattr(x, "__reduce_ex__", None)
168 if reductor:
--> 169 rv = reductor(4)
170 else:
171 reductor = getattr(x, "__reduce__", None)
TypeError: can't pickle _thread.RLock objects
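The traceback shows the failure originates in the copy.deepcopy call inside TestClientData's constructor: a tf.data.Dataset evidently holds an internal _thread.RLock, and Python cannot pickle (and therefore cannot deep-copy) lock objects. The same error can be reproduced with only the standard library; HoldsLock below is a toy stand-in, not TFF code:

```python
import copy
import threading


class HoldsLock:
    """Toy object that, like the tf.data.Dataset here, contains an RLock."""

    def __init__(self):
        self.lock = threading.RLock()


try:
    copy.deepcopy(HoldsLock())
except TypeError as err:
    # Same failure mode as in the traceback above: RLock objects
    # cannot be pickled, so deepcopy's fallback path raises TypeError.
    print(type(err).__name__, err)
```

Any dict whose values contain such an object will fail TestClientData's deep copy the same way, regardless of whether a plain dict or an OrderedDict is used.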
I also tried using a plain Python dict instead of an OrderedDict, but the error is the same. For this experiment, I am using Google Colab with this notebook as a reference, running TensorFlow 2.8.0 and TensorFlow Federated 0.20.0. I also used these previous questions as references:
Is there a reasonable way to create tff clients datat sets?
'tensorflow_federated.python.simulation' has no attribute 'FromTensorSlicesClientData' when using tff-nightly
I am not sure whether this approach holds up for cases beyond the toy example, so I would be thankful for any suggestion on how to bring already-siloed data into TFF tests.
I did some searching of public code on GitHub using the class tff.simulation.datasets.TestClientData, and found the following implementation (source here):

I understood from this snippet that the tff.simulation.datasets.TestClientData class requires as its argument an OrderedDict composed of NumPy arrays, rather than a dict of tensors (as in my previous implementation), so I changed the code to the following:

Followed by:

That runs correctly, with the following output:
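As a hedged sketch of that fix (the names client_data, x, and y are my own; the real code builds the arrays from the per-silo DataFrames shown in the question), the structure described above is an OrderedDict mapping each client ID to an OrderedDict of NumPy arrays:

```python
import collections

import numpy as np

# Stand-in features/labels; in the real code these come from each silo's
# DataFrame, e.g. silo.drop(columns=["variety"]).values and
# silo["variety"].values.
rng = np.random.default_rng(0)
silos = {
    "iris1": (rng.random((50, 4)), rng.integers(0, 3, size=50)),
    "iris2": (rng.random((50, 4)), rng.integers(0, 3, size=50)),
}

# One OrderedDict of NumPy arrays per client, keyed by client ID.
client_data = collections.OrderedDict(
    (name, collections.OrderedDict(
        x=features.astype(np.float32),
        y=labels.astype(np.int32)))
    for name, (features, labels) in silos.items()
)

# Plain arrays deep-copy fine, so this no longer hits the RLock error:
# tff_dataset = tff.simulation.datasets.TestClientData(client_data)
```

Because the values are plain NumPy arrays rather than tf.data.Dataset objects, the deepcopy inside TestClientData succeeds; TFF then builds the per-client tf.data.Dataset itself from these slices.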