String chunking algorithm with natural language context
I have an arbitrarily large string of text from the user that needs to be split into 10k chunks (potentially an adjustable value) and sent off to another system for processing.
- Chunks cannot be longer than 10k (or other arbitrary value)
- Text should be broken with natural language context in mind
- split on punctuation when possible
- split on spaces if no punctuation exists
- break a word as a last resort
I'm trying not to re-invent the wheel with this. Any suggestions before I roll this from scratch?
Using C#.
This may not handle every case as you need, but it should get you on your way.
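A minimal sketch of the three fallback rules from the question (break after punctuation, else after whitespace, else cut mid-word) could look like the following. The names `TextChunker`, `Split`, and `Punctuation` are hypothetical and chosen only for illustration. The sketch assumes the limit is counted in UTF-16 `char` values and that the listed punctuation marks are acceptable break points. It will also break at a period inside something like "3.14", and it does not guard against splitting a surrogate pair.

```csharp
using System;
using System.Collections.Generic;

public static class TextChunker
{
    // Assumed set of preferred break characters; adjust for your language.
    private static readonly char[] Punctuation = { '.', '!', '?', ';', ':', ',' };

    public static List<string> Split(string text, int maxLength = 10000)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (maxLength < 1) throw new ArgumentOutOfRangeException(nameof(maxLength));

        var chunks = new List<string>();
        int start = 0;

        // Keep cutting while the remaining text does not fit in one chunk.
        while (text.Length - start > maxLength)
        {
            int last = start + maxLength - 1; // last index allowed in this chunk

            // 1. Prefer the last punctuation mark inside the window.
            int cut = text.LastIndexOfAny(Punctuation, last, maxLength);

            // 2. Otherwise fall back to the last whitespace character.
            if (cut < 0)
            {
                for (int i = last; i >= start; i--)
                {
                    if (char.IsWhiteSpace(text[i])) { cut = i; break; }
                }
            }

            // 3. Last resort: break the word at the size limit.
            if (cut < 0) cut = last;

            // The break character stays at the end of the current chunk.
            chunks.Add(text.Substring(start, cut - start + 1));
            start = cut + 1;
        }

        if (start < text.Length) chunks.Add(text.Substring(start));
        return chunks;
    }
}
```

Calling `TextChunker.Split("Hello world. This is a test", 15)` returns two chunks, `"Hello world."` and `" This is a test"`. No characters are dropped, so concatenating the chunks reproduces the input. Punctuation always wins over whitespace here, even when the punctuation mark sits early in the window, which can produce short chunks.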
I'm sure this will end up being more difficult than you're expecting (as most natural language things do), but check out Sharp Natural Language Parser.
I'm currently using SharpNLP; it works pretty well, but there are always gotchas.
Let me know if this isn't what you're looking for.
Mark