Advice on transforming data collected from a screen scraper
Good day, folks,
I have a screen scraper (Scrapy) collecting property listings from several property websites. The sites share several common fields, such as price and floor area. However, as with all scraped data, the field values are messy right now. For instance, in price I have clean values like $1,000,000,000, but I also have values like $1,000,000,000 Price on Ask and Price on Ask. So for now, I store all my scraped fields as char in my database.
I would like to convert these fields in my database from character data to the appropriate types, e.g. string to int, so that I can index them properly. Can someone advise me on a sensible procedure and method to begin transforming the data?
Do you want to throw away the "Price on Ask" string, or is that valuable information?
If there is a lot of noise in the data and none of it is of interest, I'd run a filter to remove all non-digits.
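As a rough illustration of such a filter, here is a minimal Python sketch (Python because Scrapy is a Python tool). The function name digits_only is made up for this example, and it assumes prices are whole numbers: a value with cents, such as $1,000.50, would be mangled into 100050.

```python
import re
from typing import Optional


def digits_only(raw: str) -> Optional[int]:
    """Drop every character that is not 0-9 and return the rest as an int.

    Returns None when no digits remain (e.g. "Price on Ask"), so the
    caller can store NULL rather than zero.
    """
    digits = re.sub(r"[^0-9]", "", raw)
    return int(digits) if digits else None


if __name__ == "__main__":
    for sample in ("$1,000,000,000", "$1,000,000,000 Price on Ask", "Price on Ask"):
        print(repr(sample), "->", digits_only(sample))
```

This throws the "Price on Ask" information away, which is only acceptable if that text really is noise.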
But, if time allows, I prefer to process the data explicitly with pattern matching (sample code is PHP):
That gives me a structure in which I can set flags for particular strings. Also, when a new string appears in the data, my script gets noisy about it, and I can decide whether it matters.
(Note: setting $price to null implies putting a NULL in the database, not a zero.)
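The PHP sample referred to in this answer is not reproduced here. As a rough illustration of the same idea in Python, the sketch below matches known patterns explicitly, sets a flag for "Price on Ask", logs a warning for any text it does not recognise, and uses None where the database should get NULL. The names parse_price, ParsedPrice, PRICE_RE and ON_ASK_RE, and the patterns themselves, are assumptions based only on the sample values in the question, not the original code.

```python
import logging
import re
from typing import NamedTuple, Optional

logger = logging.getLogger(__name__)

# Hypothetical patterns, derived only from the examples in the question.
PRICE_RE = re.compile(r"\$\s*(\d[\d,]*)")                   # e.g. $1,000,000,000
ON_ASK_RE = re.compile(r"price\s+on\s+ask", re.IGNORECASE)  # e.g. Price on Ask


class ParsedPrice(NamedTuple):
    price: Optional[int]  # None should be written to the database as NULL, not 0
    price_on_ask: bool    # flag set when the listing says "Price on Ask"


def parse_price(raw: str) -> ParsedPrice:
    """Extract a price and a price-on-ask flag; warn about anything else."""
    text = raw.strip()
    price = None

    match = PRICE_RE.search(text)
    if match:
        price = int(match.group(1).replace(",", ""))
        # Remove the matched price so only unexplained text is left over.
        text = text[:match.start()] + text[match.end():]

    on_ask = bool(ON_ASK_RE.search(text))
    leftover = ON_ASK_RE.sub("", text).strip()
    if leftover:
        # A new, unknown string showed up: make noise so a human can decide.
        logger.warning("Unrecognised price text %r in %r", leftover, raw)

    return ParsedPrice(price, on_ask)
```

With this shape, the price can go into an integer column and the flag into a boolean column, while the warnings show which new strings need a pattern of their own.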