How can I export data from a C++ object wrapped with pybind11 into a pandas DataFrame with as few copies as possible?

Posted on 2025-01-20 06:02:54

I have a C++ class Plants that contains data in the form of nested/linked/... objects.
This class is accessible in Python via pybind11.

I want to access some of the contained data in "tabular" form and work with it using pandas (in a DataFrame).
Since there will be a lot of flowers, this export should be as efficient as possible, so I want to avoid unnecessary copies/conversions of the data.

The code below illustrates my first attempts.
Note that in this example the height is a 32-bit integer.
If I store it in an STL vector, it seems to be converted to Python int objects (which pandas then stores as 64-bit), so my DataFrame has dtype int64 for that column.
I can "work around that" by using an Eigen vector with int32, and the resulting column then really has dtype int32.
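To make the dtype observation concrete, here is a minimal sketch using only the Python standard library. It relies on the fact that pybind11/stl.h turns a std::vector into a Python list, and it uses a ctypes array purely as a stand-in for the NumPy array that the Eigen path produces; the variable names are illustrative and not part of the module above.

```python
import ctypes

heights = [13, 17, 100, 12]  # same sample heights as in the C++ constructor

# What a std::vector<std::int32_t> becomes via pybind11/stl.h: a list of
# Python int objects. Each element is a separate arbitrary-precision object,
# so the list itself carries no 32-bit width.
as_list = list(heights)
as_list.append(2**40)  # accepted: Python ints are not limited to 32 bits

# A typed, contiguous buffer keeps the element width as part of its type.
# ctypes is only a stand-in here for a NumPy int32 array.
Int32Array = ctypes.c_int32 * len(heights)
as_buffer = Int32Array(*heights)
view = memoryview(as_buffer)

print(type(as_list[0]).__name__)   # int
print(view.itemsize, view.nbytes)  # 4 16
```

Because the list has lost the width information, whatever consumes it has to pick an integer dtype on its own, whereas a typed buffer can be taken over as it is.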

Now I am wondering how much conversion/copying/... is going on in these cases and whether there is an even better approach.
I was looking at the py::capsule and py::array_t features; maybe I have to set up one capsule per column and work with raw pointers/arrays to store the data there and pass it out.
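The difference between handing out a view and handing out a copy can be sketched with the standard library alone. This is an analogy, not pybind11 code: the array below stands in for a column buffer owned by the C++ side, and the memoryview plays the role of a NumPy array whose base object (for example a py::capsule) keeps the allocation alive.

```python
from array import array

# Stand-in for a column buffer owned by the C++ side (illustrative data).
owner = array("i", [13, 17, 100, 12])

# Zero-copy: the view refers to the owner's memory and holds a reference
# to it, so the buffer stays alive as long as the view does.
view = memoryview(owner)

# Copy: the values are duplicated into a new, independent list.
copied = owner.tolist()

owner[0] = 99  # mutate the original buffer

print(view[0])            # 99  (the view sees the change)
print(copied[0])          # 13  (the copy does not)
print(view.obj is owner)  # True
```

A view costs nothing to create but ties the lifetime of the Python object to the underlying buffer; a copy is independent but costs one pass over the data per column.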

Here is the C++ code; I build it with

c++ -O3 -Wall -shared -std=c++17 -fPIC $(python3 -m pybind11 --includes) -I /usr/include/eigen3 todataframe.cpp -o plants.so
#include <Eigen/Core>
#include <cstdint> // std::int32_t, std::int64_t
#include <memory>
#include <pybind11/eigen.h> // support returning eigen vectors
#include <pybind11/pybind11.h>
#include <pybind11/stl.h> // support returning stl vectors, strings, ...
#include <string>
#include <vector>

namespace py = pybind11;

// the C++ data structures
struct Tree {
  std::string name;
  std::int64_t age;
  std::int32_t height_cm;
};
struct Rose {
  std::int32_t height_cm;
  std::string name;
  std::vector<Tree *> trees_close_by{};
};
struct Plants {
  std::vector<Rose> roses{};
  std::vector<Tree> trees{};
};

// python binding code
PYBIND11_MODULE(plants, m) {
  m.doc() = "Show all available plants";

  py::class_<Plants>(m, "Plants")
      .def(py::init([]() {
             auto plants_ptr = std::make_unique<Plants>();
             plants_ptr->roses.push_back(Rose{13, "Petunia"});
             plants_ptr->roses.push_back(Rose{17, "Alberta"});
             plants_ptr->trees.push_back(Tree{"Joseph", 12, 100});
             plants_ptr->trees.push_back(Tree{"Georg", 2, 12});
             return plants_ptr;
           }),
           py::return_value_policy::take_ownership)
      .def("__repr__",
           [](const Plants &self) {
             return "Plants(num roses: " + std::to_string(self.roses.size()) +
                    ", num trees: " + std::to_string(self.trees.size()) + ")";
           })
      // This function returns a dict which is suitable for DataFrame to
      // "ingest".
      .def(
          "as_dict",
          [](const Plants &self, bool use_eigen_vectors) {
            const auto number_of_plants = self.roses.size() + self.trees.size();
            std::vector<std::string> names;
            names.reserve(number_of_plants);
            std::vector<std::int32_t> heights;
            heights.reserve(number_of_plants);

            for (const auto &rose : self.roses) {
              names.push_back(rose.name);
              heights.push_back(rose.height_cm);
            }
            for (const auto &tree : self.trees) {
              names.push_back(tree.name);
              heights.push_back(tree.height_cm);
            }
            pybind11::dict data{};
            data["names"] = names;

            // There are two different vector kinds to use -- Eigen and stl.
            if (use_eigen_vectors) {
              // In the Eigen case, make a copy of the data, since the stl
              // vector will delete the data when going out of scope.
              Eigen::Matrix<std::int32_t, Eigen::Dynamic, 1> heights_eigen =
                  Eigen::Matrix<std::int32_t, Eigen::Dynamic, 1>::Map(
                      heights.data(), heights.size());
              data["heights"] = heights_eigen;

            } else {
              data["heights"] = heights;
            }
            return data;
          },
          py::arg("use_eigen_vector") = false);
}

And here is some Python code that uses this module and creates a DataFrame:

import plants
import pandas as pd

p = plants.Plants()

print("column as stl vector")
df = pd.DataFrame(p.as_dict(use_eigen_vector=False))
print(df.dtypes)

print("column as Eigen vector")
df = pd.DataFrame(p.as_dict(use_eigen_vector=True))
print(df.dtypes)
