Loading multiple CSV files (silos) to compose a TensorFlow Federated dataset

Posted 2025-01-12 11:14:54


I am working with pre-processed data that has already been siloed into separate CSV files, each file representing the local data of one client for federated learning.

To correctly implement federated learning with these multiple CSVs in TensorFlow Federated, I am first trying to reproduce the approach with a toy example based on the iris dataset. However, when trying to use tff.simulation.datasets.TestClientData, I get the error:

TypeError: can't pickle _thread.RLock objects

The current code is as follows. First, load the three iris dataset CSV files (50 samples each) into a dictionary, from the filenames iris1.csv, iris2.csv, and iris3.csv:

    import collections
    import pandas as pd
    import tensorflow as tf
    import tensorflow_federated as tff

    silos = {}
    for silo in silos_files:
        silo_name = silo.replace(".csv", "")
        silos[silo_name] = pd.read_csv(silos_path + silo)
        # Encode the class labels as integers.
        silos[silo_name]["variety"].replace({"Setosa": 0, "Versicolor": 1, "Virginica": 2}, inplace=True)
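For reference, the same per-silo parsing step can be sketched with the standard library alone; the inline CSV content and the label map below are assumptions mirroring the iris files described in the question:

```python
import csv
import io

# Hypothetical content standing in for one silo file (e.g. iris1.csv).
CSV_TEXT = """sepal.length,sepal.width,petal.length,petal.width,variety
5.1,3.5,1.4,0.2,Setosa
7.0,3.2,4.7,1.4,Versicolor
"""

LABELS = {"Setosa": 0, "Versicolor": 1, "Virginica": 2}

def load_silo(fileobj):
    """Parse one silo CSV into (feature rows, integer labels)."""
    features, labels = [], []
    for row in csv.DictReader(fileobj):
        labels.append(LABELS[row.pop("variety")])
        features.append([float(v) for v in row.values()])
    return features, labels

x, y = load_silo(io.StringIO(CSV_TEXT))
# x[0] == [5.1, 3.5, 1.4, 0.2], y == [0, 1]
```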

Creating a new dict with tensors:

    silos_tf = collections.OrderedDict()
    for key, silo in silos.items():
        silos_tf[key] = tf.data.Dataset.from_tensor_slices((silo.drop(columns=["variety"]).values, silo["variety"].values))

Finally, trying to convert the TensorFlow datasets into a TensorFlow Federated dataset:

    tff_dataset = tff.simulation.datasets.TestClientData(
        silos_tf
    )

That raises the error:

TypeError                                 Traceback (most recent call last)
<ipython-input-58-a4b5686509ce> in <module>()
      1 tff_dataset = tff.simulation.datasets.TestClientData(
----> 2     silos_tf
      3 )

/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/simulation/datasets/from_tensor_slices_client_data.py in __init__(self, tensor_slices_dict)
     59     """
     60     py_typecheck.check_type(tensor_slices_dict, dict)
---> 61     tensor_slices_dict = copy.deepcopy(tensor_slices_dict)
     62     structures = list(tensor_slices_dict.values())
     63     example_structure = structures[0]

...

/usr/lib/python3.7/copy.py in deepcopy(x, memo, _nil)
    167                     reductor = getattr(x, "__reduce_ex__", None)
    168                     if reductor:
--> 169                         rv = reductor(4)
    170                     else:
    171                         reductor = getattr(x, "__reduce__", None)

TypeError: can't pickle _thread.RLock objects
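The deepcopy at line 61 of the traceback is what fails: tf.data.Dataset objects hold thread locks internally, and copy.deepcopy falls back to the pickle protocol, which refuses locks. A minimal stand-in class (no TensorFlow needed) reproduces the same TypeError:

```python
import copy
import threading

class Holder:
    """Stand-in for a tf.data.Dataset, which holds locks internally."""
    def __init__(self):
        self._lock = threading.RLock()

try:
    # Mimics TestClientData.__init__ deep-copying the client dict.
    copy.deepcopy({"iris1": Holder()})
    raised = False
except TypeError:
    raised = True
# raised is True: deepcopy cannot pickle the RLock inside the object
```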

I also tried using a plain Python dictionary instead of an OrderedDict, but the error is the same. For this experiment, I am using Google Colab, with this notebook as a reference, running TensorFlow 2.8.0 and TensorFlow Federated 0.20.0. I also used these previous questions as references:

Is there a reasonable way to create tff client data sets?

'tensorflow_federated.python.simulation' has no attribute 'FromTensorSlicesClientData' when using tff-nightly

I am not sure whether this approach generalizes beyond the toy example, so I would appreciate any suggestions on how to bring already-siloed data into TFF for testing.


Comments (1)

給妳壹絲溫柔 2025-01-19 11:14:55


I searched public code on GitHub for uses of the tff.simulation.datasets.TestClientData class, and I found the following implementation (source here):

def to_ClientData(clientsData: np.ndarray, clientsDataLabels: np.ndarray,
                  ds_info, is_train=True) -> tff.simulation.datasets.TestClientData:
    """Transform dataset to be fed to fedjax
    :param clientsData: dataset for each client
    :param clientsDataLabels: labels for each client
    :param ds_info: dataset information
    :param is_train: True if processing the train split
    :return: dataset for each client cast into TestClientData
    """
    num_clients = ds_info['num_clients']

    client_data = collections.OrderedDict()

    for i in range(num_clients if is_train else 1):
        client_data[str(i)] = collections.OrderedDict(
            x=clientsData[i],
            y=clientsDataLabels[i])

    return tff.simulation.datasets.TestClientData(client_data)

I understood from this snippet that tff.simulation.datasets.TestClientData expects as its argument an OrderedDict of NumPy arrays, not a dict of tf.data.Dataset objects (as in my previous implementation), so I changed the code to the following:

silos_tf = collections.OrderedDict()
for key, silo in silos.items():
    silos_tf[key] = collections.OrderedDict(
            x=silo.drop(columns=["variety"]).values,
            y=silo["variety"].values)
    

Followed by:

tff_dataset = tff.simulation.datasets.TestClientData(
    silos_tf
)

That runs correctly, with the following output:

>>> tff_dataset.client_ids
['iris3', 'iris1', 'iris2']
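Why this version works: as the traceback above shows, the TestClientData constructor starts by deep-copying its argument, and a nested OrderedDict of plain arrays survives that copy, unlike a dict of tf.data.Dataset objects. A minimal check, with Python lists standing in for the NumPy arrays:

```python
import collections
import copy

# Structure analogous to silos_tf in the answer: client id -> OrderedDict
# of feature rows (x) and labels (y); plain lists stand in for NumPy arrays.
silos_tf = collections.OrderedDict(
    iris1=collections.OrderedDict(x=[[5.1, 3.5, 1.4, 0.2]], y=[0]),
    iris2=collections.OrderedDict(x=[[7.0, 3.2, 4.7, 1.4]], y=[1]),
)

# TestClientData.__init__ begins with copy.deepcopy(tensor_slices_dict);
# this structure, unlike a dict of datasets, survives it intact.
cloned = copy.deepcopy(silos_tf)
assert cloned == silos_tf and cloned is not silos_tf
```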