What is a good model for importing Open Icecat data and populating products in a catalog database?

Posted 2024-12-18 10:07:51


I am working on building a catalog of products based on Open Icecat (the Icecat Open Catalog), and I am looking for advice from someone who may have experience with this, or possibly with another similar service (like C-Net, maybe).

My question is, what is a good model for populating a product catalog's database?

Here is what I have so far...

  1. I GET the XML feed of the entire catalog
  2. I extract the data about the products that I need based on the Category ID
  3. At this point, I insert all the data into a table, so now I have a table for, say, 'Printer cats', which contains the URL to the image and the id of the XML for each product in the category... easy enough
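The filtering in steps 1-2 can be sketched as a streaming parse, assuming the index feed uses `<file>` elements carrying `Catid`, `path`, and `Product_ID` attributes (as in Icecat's `files.index.xml`; attribute names are an assumption here). `iterparse` keeps memory flat even when the full index is hundreds of megabytes:

```python
import xml.etree.ElementTree as ET

def filter_index(index_file, wanted_cats):
    """Stream the catalog index and keep only products in wanted categories.

    Assumes each product is a <file> element with Catid, path, and
    Product_ID attributes, as in Icecat's files.index.xml.
    """
    rows = []
    for _event, elem in ET.iterparse(index_file, events=("end",)):
        if elem.tag == "file":
            if elem.get("Catid") in wanted_cats:
                rows.append({
                    "product_id": elem.get("Product_ID"),
                    "xml_path": elem.get("path"),
                    "catid": elem.get("Catid"),
                })
            elem.clear()  # free the element so memory stays flat
    return rows
```

The resulting rows are exactly what the per-category table in step 3 holds: one id and one path per product.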

Here is where I run into questions/concerns...
I find it easy to write an ad-hoc script that issues a GET request for each XML file and image, then dumps them into directories, but Icecat does NOT want you to rip very large amounts of data. My categories contain thousands of products (over 40k, for instance).

What I feel I need to do is GET the XML for each product, grab the image, and store both. I feel this way because it's the obvious solution and it's what the client keeps asking for... which doesn't mean it's correct, though. I could then parse the individual XML files to extract the description, SKU, etc. into a table so I can build the catalog (for use with Magento, for example), later adding/changing things as needed (prices, related products, etc.). Seems easy enough, but after around 3-4k GET requests I get booted because I'm ripping too much data. Once I have the entire catalog (my catalog of wanted categories), it will be easy enough to grab the update files (XML, and small in comparison) and make changes accordingly. That would be a great point to be at, but first I need to get all the data and build the product table(s).
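One way to survive the 3-4k-request cutoff is to make the bulk download throttled and resumable: skip anything already on disk, so a mid-run ban only pauses the job instead of restarting it from zero. A minimal sketch; `fetch` is a placeholder for whatever HTTP client is actually used:

```python
import os
import time

def sync_products(rows, dest_dir, fetch, delay=2.0):
    """Download each product's XML exactly once.

    Files already present are skipped, so the job can simply be re-run
    after a rate-limit ban and resume where it left off.
    `fetch(path) -> bytes` stands in for the real HTTP GET.
    """
    os.makedirs(dest_dir, exist_ok=True)
    fetched = skipped = 0
    for row in rows:
        target = os.path.join(dest_dir, row["product_id"] + ".xml")
        if os.path.exists(target):
            skipped += 1
            continue
        data = fetch(row["xml_path"])
        with open(target, "wb") as fh:
            fh.write(data)
        fetched += 1
        time.sleep(delay)  # stay under the server's request rate limit
    return fetched, skipped
```

The same pattern works for the images; only the target filename changes.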

So here is what I kick around...

One idea is to get the data in real time as needed, but neither the client nor I want this. The client wants the catalog, understandably... and I've noticed that real-time fetching adds a performance hit and does not plug in (easily) to many solutions. But, expanding on the 'real-time' idea: use a real-time GET of the XML data, then store the data as it comes in, with logic like 'if it's not present locally, go get it and then store it; if it is present locally, check whether the info is up to date, and if not, update it'. Of course, if I'm going to check whether it's up to date on every request, then there is really no point in storing the data, because I'm making a request every time no matter what... I may as well just fetch it and throw it away, which seems inefficient.
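There is a middle ground this paragraph circles around: keep the freshness check local. If the daily index carries an update stamp per product (an assumption about the feed), staleness can be decided by comparing stamps, with no network round-trip per view. A hedged sketch:

```python
def needs_refresh(local_meta, index_meta):
    """Decide from local data alone which products must be re-fetched.

    local_meta: {product_id: updated_stamp} for files already held.
    index_meta: {product_id: updated_stamp} parsed from the latest index.
    Returns the set of product ids to GET; everything else is served
    straight from disk with no remote check.
    """
    stale = set()
    for pid, stamp in index_meta.items():
        if local_meta.get(pid) != stamp:  # missing locally, or stamp changed
            stale.add(pid)
    return stale
```

This keeps the 'store it locally' part useful: only one index download per day, then a handful of targeted GETs.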

-or-

Maybe everything is in real time: products are fetched and displayed in real time, when the admin views products for manipulation they are presented in real time, and so on. Always grab what's needed, in real time, based on the metadata in the database that was (is) already populated from the 'main' catalog file describing the entire catalog available from Icecat. But this doesn't plug into many solutions and takes a performance hit, plus some hosts won't let us GET anyhow... so many limitations here, but it sounds like an awesome solution for making sure you always have super-current info (which is not needed here, though).

Here is where I am sort of headed already...

I have the metadata based on the main catalog (over 500K items). I have already populated tables based on the desired categories... Now I am headed towards this: building an app (tool) that refines what I am working with down to, say, a single category. Then produce a job: 'take a category ID and get all its XML feeds'... then 'take a cat. ID (probably the same one again) and fetch the images'... then take the same cat. ID and build products by grabbing the SKU, description, image filename, etc., and build the catalog. At this point in the workflow I have all the info and can use the SKU (or whatever is needed) to grab prices etc. from other feeds, manipulate descriptions, rename images if needed (SEO), or whatever.
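The per-category job chain described above can be sketched as a small pipeline: each named stage (fetch XML, fetch images, build products...) takes the category ID and the running state, and completed stages are recorded so a re-run picks up where the last one stopped. Stage names and signatures here are hypothetical:

```python
def run_category(cat_id, stages, state=None):
    """Run a chain of named stages for one category.

    stages: list of (name, callable) where callable(cat_id, state) -> state.
    Stages already listed in state["done"] are skipped, so an interrupted
    job can be resumed, and new stages (price merge, SEO renames) can be
    appended later without redoing earlier work.
    """
    state = dict(state or {})
    done = state.setdefault("done", [])
    for name, stage in stages:
        if name in done:
            continue  # completed in an earlier run
        state = stage(cat_id, state)
        done.append(name)
    return state
```

Each of the jobs in the quoted plan ('get all XML feeds', 'fetch images', 'build products') would be one stage.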

Then I will need to build a model for updating prices and shipping weights from a different feed... Synnex in this case. But that seems much easier, because shipping and price should be real-time... so it's a different story, and much less data at once: only what's in the cart, I'm thinking.

Still not sure how to go about doing this... Supposedly others have built a catalog like this for the same client by ripping the Icecat repository, but they never made/provided tools for future manipulation etc., which is where I am headed. Plus, the old catalog is very old/stale, and I have never seen 'proof' that they actually ripped the data and built a catalog, not the full set anyhow.

OK, to help with the confusion...

The source I am using has a repository of over 600,000 products across many categories. I only need about 45,000 products (over several categories). As it is, it takes hours to download the XML file for each one, around 1,000 per hour (we know this from past experience).

Part of the problem is that not every XML file is exactly the same, and we need different information from different categories. These requirements are most likely going to change (probably more at first), so I cannot just use a single schema to store all of them. Once the 45,000 (or so) files are downloaded, we only have to get the changes/updates in the future. So what I am really trying to do is build a local repository of only the categories that we need, so we can work with them more efficiently. We don't plan to use the related categories right away either, so I want those files locally for when we go back to do that too.
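Since the files differ by category and the wanted fields will change, one approach that avoids committing to a single schema is to keep the raw XML verbatim and store only the currently-wanted fields as a JSON blob per product; changing the requirements then means re-extracting locally, never re-downloading. A sketch using SQLite (the table and column names are made up for illustration):

```python
import json
import sqlite3

def make_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS products (
        product_id TEXT PRIMARY KEY,
        catid      TEXT,
        raw_xml    TEXT,   -- untouched source file, for later re-parsing
        fields     TEXT    -- JSON of whichever fields we extract today
    )""")
    return db

def save_product(db, product_id, catid, raw_xml, fields):
    db.execute(
        "INSERT OR REPLACE INTO products VALUES (?,?,?,?)",
        (product_id, catid, raw_xml, json.dumps(fields)),
    )

def load_fields(db, product_id):
    row = db.execute(
        "SELECT fields FROM products WHERE product_id=?", (product_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Keeping `raw_xml` is the point: when a category later needs a field nobody extracted, it is already on hand locally.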

Comments (2)

别念他 2024-12-25 10:07:51


Did you ever get this resolved? Have you tried installing http://sourceforge.net/projects/icecatpim/ ? It's very difficult to get going, but it allows you to download the product database into a local MySQL database. Once you have the local DB, you could write a cron script to periodically update it as needed, then access/manipulate the data using your own program.

There is also this project, which could help:
http://code.google.com/p/icecat-import/
It has code to write data to a local DB. It was originally designed to download the full DB, but there is a lot of data missing.

I've posted a patch for an InnoDB version of the database:
http://sourceforge.net/tracker/?func=detail&aid=3578405&group_id=312187&atid=1314070
I hope that helps. There are a lot of bad references in the database that you have to watch out for.

Here is a cheat sheet for checking and setting DB referential integrity.

-- Is the DB holding referential integrity? (find rows in A with no match in B)
SELECT * FROM Aclass a LEFT JOIN Bclass b ON a.column=b.column WHERE b.column IS NULL;

-- Set NULL on orphaned rows (manual equivalent of ON UPDATE SET NULL)
UPDATE Aclass a LEFT JOIN Bclass b ON a.column=b.column SET a.column=NULL WHERE b.column IS NULL;

-- Cascade on delete (remove rows in A with no parent in B)
DELETE FROM A WHERE id NOT IN (SELECT DISTINCT B.id FROM B);
影子是时光的心 2024-12-25 10:07:51


I have never seen an application as hard to install as IcePIM.
I found this PHP script:

http://snipplr.com/view/33098/icecat-product-specifications-drawer/
