Data recognition, parsing, filtering, and transformation: GUI?
Looking for a non-cloud-based open source app for doing data transformation; though for a killer (and I mean killer) app built just for data transformations, I might be willing to spend up to $1000.
I've looked at Perl, Kapow Katalyst, Pentaho Kettle, and more.
Perl, Python, and Ruby are clearly languages, but I've been unable to find any frameworks/DSLs built just for processing data; meaning they're really not great development environments for this: there are no built-in GUIs for building regexes or wiring up input/output (CSV, XML, JDBC, REST, etc.), and no debugger for testing rows and rows of data. They're not bad either, just not what I'm looking for, which is a GUI built for complex data transformations; that said, I'd love it if the GUI/app file were stored in a scripting language, and NOT just in some non-human-readable XML/ASCII file.
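To show what I mean by the no-GUI route, here's roughly what a transformation looks like in plain Python; a minimal sketch using only the standard library, with hypothetical file and column names:

    # Minimal sketch of the scripting-language route: read a CSV, filter rows,
    # reshape columns, and write the result. "input.csv", "output.csv", and the
    # column names are hypothetical; a real job would add validation/logging.
    import csv
    import re

    with open("input.csv", newline="") as src, \
         open("output.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["id", "email"])
        writer.writeheader()
        for row in reader:
            # Keep only rows whose email field looks superficially valid.
            if re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", row.get("email", "")):
                writer.writerow({"id": row["id"], "email": row["email"].lower()})

It works, but there's no GUI for building that regex and no row-by-row debugger, which is exactly my complaint.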
Kapow Katalyst is made for accessing data via HTTP (HTML, CSS, RSS, JavaScript, etc.). It's got a nice GUI for transforming unstructured text, but that's not its core value offering, and it's way, way too expensive. It does an okay job of traversing document namespace paths; I'm guessing it's just XPath on the back-end, since the syntax appears to be the same.
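If it really is XPath under the hood, the same traversal is available for free in any scripting language; a minimal sketch with Python's standard library, run against a made-up RSS snippet:

    # XPath-style traversal of an RSS document using only the standard library.
    # ElementTree supports a limited XPath subset, which covers simple
    # channel/item/title paths like the ones these GUI tools generate.
    import xml.etree.ElementTree as ET

    rss = """<rss><channel>
        <item><title>First post</title></item>
        <item><title>Second post</title></item>
    </channel></rss>"""

    root = ET.fromstring(rss)
    for title in root.findall("./channel/item/title"):
        print(title.text)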
Pentaho Kettle has a nice GUI for input/output of most common data stores, and its own take on handling data processing, which is okay and has only a small learning curve. Kettle's debugger is OK in that the data is easy to see, but errors and exceptions are not threaded with the output, and there's no way to really debug an issue; meaning you can't reload the output/error/exception, though you are able to view the system feedback. All that said, Kettle data transformation is _______ well, let's just say it left me feeling like I must be missing something, because I was completely puzzled by "if it's not possible, just write the transformation in JavaScript"; umm, what?
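For context on that fallback: as far as I can tell, it boils down to writing a per-row function over named fields. Sketched here in Python rather than Kettle's JavaScript step, with made-up field names:

    # The shape of a row-level scripted transform: one function applied to each
    # row, deriving or normalizing fields. The field names are hypothetical.
    def transform(row):
        row["full_name"] = f'{row["first_name"]} {row["last_name"]}'.strip()
        row["country"] = row["country"].upper()
        return row

    rows = [{"first_name": "ada", "last_name": "lovelace", "country": "uk"}]
    print([transform(dict(r)) for r in rows])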
So, any suggestions? Do realize that I haven't really spec'd out any transformations, but figure if you really use a product for data munging, I'd like to know about it; even Excel, I guess.
In general, though, I'm currently looking for a product that can handle 1,000-100,000 rows with 10-100 columns. It'd be super cool if it could profile data sets, which is a feature Kettle sort of has, but doesn't do super well. I'd also like built-in unit testing, meaning I'm able to build out control sets of data and run my changes against the control set.

Then I'd like to be able to selectively filter out rows and columns as I build out the transformation, without altering the build; for example, I run a data set through the transformation and filter the results, and on the next run those sets are automatically blocked at the first "logical" occurrence, which in turn would mean less data to "look at" and a reduced runtime for each enhanced iteration. What would be crazy nice is if, as I filtered out rows/columns, the app tracked them (with the output filtered accordingly) and unit tested/highlighted any changes. If I made a change that broke a branch, affecting the application's logs and its ability to track the unit tests, it'd give me a warning, let me dump the stored data branch... and/or track the primary keys for differences in the next generation of output, or even attempt to match them using fuzzy logic. And yes, I know this is a pipe dream, but hey, figured I'd ask, just in case there's something out there I've just never seen.
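To make the control-set idea concrete, here's a minimal sketch of the kind of regression test I mean, in plain Python; transform(), the control rows, and the expected rows are all made up:

    # Control-set testing sketch: run the current transformation over a frozen
    # control set and compare against its known-good output. Everything here
    # (transform, rows, expected values) is hypothetical.
    def transform(row):
        return {"id": row["id"], "email": row["email"].strip().lower()}

    CONTROL_SET = [
        {"id": "1", "email": " Ada@Example.COM "},
        {"id": "2", "email": "grace@example.com"},
    ]
    EXPECTED = [
        {"id": "1", "email": "ada@example.com"},
        {"id": "2", "email": "grace@example.com"},
    ]

    actual = [transform(row) for row in CONTROL_SET]
    assert len(actual) == len(EXPECTED)
    for got, want in zip(actual, EXPECTED):
        assert got == want, f"regression: {got} != {want}"
    print("control set still matches")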
Feel free to comment; I'd be happy to answer any questions or offer additional info.
Comments (3)
Google Refine?
Talend will need more than 5 minutes of your time, perhaps closer to an hour, to begin to wire up a basic transformation and to fulfill your requirement of keeping version-controlled transformations as well. You described a pipeline process that can be done easily in Talend once you know how: a project with multiple inputs and outputs, where the same raw data goes through various transformations and filters until it arrives at the final output you want. Then you can schedule your jobs to repeat the process over similar data. Go back and spend more time with Talend, and you'll succeed at what you need, I'm sure.
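To illustrate the pipeline shape I mean (sketched in Python only because it's compact; in Talend these would be components wired together on the canvas), with made-up branch names and predicates:

    # One raw input fanned out through several filter/transform branches,
    # each producing its own output, as in a Talend job. Branch names and
    # predicates are hypothetical.
    raw = [{"id": 1, "status": "active"}, {"id": 2, "status": "closed"}]

    branches = {
        "active_rows": lambda r: r["status"] == "active",
        "closed_rows": lambda r: r["status"] == "closed",
    }

    outputs = {name: [r for r in raw if keep(r)] for name, keep in branches.items()}
    print(outputs)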
I also happen to be one of the committers on Google Refine, and I use Talend in my daily work as well. I actually sometimes model my transformations for Talend first in Google Refine. (Sometimes I even use Refine to perform cleanup on borked ETL transforms themselves! LOL) I can tell you that my experience with Talend played a small part in a few of the features of Google Refine. For instance, both Talend and Google Refine have the concept of an expression editor for your transformations (Talend drops down to the Java language for this if need be).
Google Refine will never be an ETL tool, in the sense that we have not designed it to compete in that space, where ETL is typically used for large data-warehouse back-end processing and transformations. However, we designed Google Refine to complement existing ETL tools like Talend by allowing easy live previewing so you can make informed decisions about your transformations and cleanup, and if your data isn't incredibly huge, you might opt to perform what you need within Refine itself.
I'm not sure exactly what kind of data or exactly what kind of transformations you're trying to do, but if it's primarily mathematical transformations, perhaps you can try FreeMat, Octave, or SciLab. If it's more data-warehouse-style munging, try open source ETL tools like Clover, Talend, JasperETL Community Edition, or Jitterbit.