用Java设计高性能状态机
我正在开始编写一个 Java 库来实现高性能有限状态机。 知道有很多库,但我想从头开始编写自己的库,因为几乎所有库都构造了针对一次仅处理一个进行优化的自动机。
我 了解 SO 社区中涉足状态机设计的人们在实现此类高性能库时认为最重要/最好的设计原则是什么。
注意事项
- 生成的自动机通常不会很大。 (~ 100-500 个状态)。
- 不过,实施应该能够扩展。
- 实施应该能够实现快速转换(最小化、确定性等)。
- 希望实现 DFA、NFA、GNFA、PDA 以及可能的树自动机。如果可能的话,希望在单一界面下。
- 应该在内存使用和性能之间取得良好的平衡。
目前对我来说有关设计的问题是:
应该为
State
、Symbol
和Transition
类被定义?或者应该使用“隐藏”的内部结构。就我个人而言,我认为使用类会浪费大量内存,因为相同的信息可以以更简洁的形式存储。但是,这是否能够实现更快的转型?它还有其他优点/缺点吗?内部存储数据的最佳方式是什么?使用
HashMap
和HashSet
等数据结构可以实现分摊常量时间查找,但会涉及一定的开销。这是最好的方法吗?将转换信息存储为原始(或非)数组似乎浪费了大量内存。特别是当图书馆需要一次处理大量自动机时。不同数据结构的优缺点是什么?
我很感激任何意见。谢谢!
I am in the process of starting to write a Java library to implement high-performance Finite State Machines. I know there are a lot of libraries out there, but I want to write my own from scratch, as almost all the libraries out there construct automatons optimized for handling only one at a time.
I would like to know what the people in the SO community who have dabbled in state machine design feels are the most important / best design principles when it comes to implementing high-performance libraries like these.
Considerations
- The automatons generated are typically not massive. (~ 100-500 states).
- The implementation should be able to scale though.
- The implementation should enable fast transformations (minimization, determinization etc.).
- Looking to implement DFA, NFA, GNFA, PDA and possibly Tree Automata. Hopefully under a single interface if possible.
- Should have a good balance between memory use and performance.
Current questions regarding design for me at the moment are:
Should classes for
State
,Symbol
andTransition
be defined? Or should a "hidden" internal structure be used. Personally I feel that using classes as such would waste a lot of memory since the same information can be stored in a much more condensed form. But, does this enable faster transformations? Does it hold any other pros / cons?What would be the best way to store the data internally? Using data structures like
HashMap
andHashSet
enables amortized constant time lookups, but there is an element of overhead involved. Is this the best way? Storing the transition information as a primitive (or not) array seems to waste quite a bit of memory. Especially when the library needs to handle a lot of automatons at a time. What are the pros / cons of the different data structures?
I appreciate any input. Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
那么你希望它有多快? brics.dk/automaton 中的代码确实声明了自己的 State 和 Transition 类,但显然,这些可以使用原语重写(哎呀,整个Transition类的状态显然很容易适合
long
)。问题是,例如,如果您将
Transition
类移动为简单的原语,那么您就不必再使用缓慢的HashMap默认 Java 集合:您可以使用 Trove 的
TLongObjectHashMap
(或TLongInt
... 或TLongLong
等)等库,这些库拥有默认的HashMap
大时代(Trove库基本上提供了超级高效的映射和集合,既快又小,当您使用基元时:您不会生成无数的垃圾,也不会生成常量不必要地围绕原语,因此减少 GC 等。如果您注重性能,那么您确实想检查 Trove...并且他们即将发布的 3.0 版本比 Trove 2.0 快 20%)。但它真的有用吗?显然该库已经足够快了。毫无疑问,通过不浪费地创建对象并使用实际上性能良好的集合,可以使速度更快,但尚不清楚这是否是可取的。
除此之外,我很确定上面的库不是线程安全的。 State 构造函数通过这样做创建一个唯一的 ID:
并且该构造函数是从... 90 个不同的地方调用的!
在多线程场景中不创建唯一ID的教科书示例(哎呀,即使使
next_id
易失性也不够,例如,您需要一个 AtomicInteger )。我不太了解这个库,但这个 ID 的东西对我来说看起来非常可疑。Well how fast do you want it to be? The code at brics.dk/automaton does declare its own State and Transition classes altough, obviously, these could be rewritten using primitives (heck, the entire Transition class's state apparently would easily fit on a
long
).Thing is, if you move, for example, the
Transition
class to simply a primitive, then you're not forced to use anymore the slowHashMap<Transition,...>
default Java collections: you can use libraries like Trove'sTLongObjectHashMap
(orTLongInt
... orTLongLong
, whatever) which owns the defaultHashMap
big times (the Trove libraries basically provides maps and sets that are super efficient, both fast and small, when you work with primitives: you don't generate countless garbage nor constant needless wrapping around primitives, so less GC etc. If you're into performance, then you do want to check Trove... And their 3.0 upcoming release is 20% faster than Trove 2.0).But is it really useful? Apparently that library is already plenty of fast. There's no doubt it can be made faster by not wastefully creating objects and by using collections that do actually perform well but it's not clear that it would be desirable.
Besides that, I'm pretty sure that the library above is not thread safe. The State constructor creates a unique ID by doing this:
and that constructor is called from... 90 different places!
Textbook example of a way to not create a unique ID in a multi-threaded scenario (heck, even making
next_id
volatile wouldn't be sufficient, you want, say, an AtomicInteger here). I don't know the library well enough but this ID thinggy looks very fishy to me.我有一些问题:
哪一个部分需要快,FSA 的输入、FSA 的构建,还是执行< FSA 的 /em>?
FSA的输入来自哪里?人类是否会输入状态和弧线,或者某种自动过程?真正的输入是否来自转换为 FSA 的正则表达式?
FSA 多久更改一次?一秒钟一次?一年一次?
你知道你需要什么。除了学术图灵机之外,我从未见过任何重要的状态机不是从文本表示开始的,无论是作为正则表达式还是结构化程序。
在我处理过的每种情况下,首选实现是将正则表达式直接转换为简单的结构化程序并编译它。
没有什么比这执行得更快了。
I have some questions:
Which part do you need to be fast, the inputting of the FSA, the building of the FSA, or the execution of the FSA?
Where does the input of the FSA come from? Does a human put in states and arcs, or some automatic process? Does the real input come from a regular expression that's converted to a FSA?
How often can the FSA change? Once a second? Once a year?
You know what you need. Aside from academic Turing machines, I've never seen a significant state machine that didn't start from a textual representation, either as a regular expression or a structured program.
In every case I've dealt with, the preferred implementation was to convert the regular expression directly into a simple structured program and compile it.
Nothing will execute any faster than that.