NLP project, Python or C++

Posted 2024-08-22 17:39:17 | 349 words | 7 views | 0 comments

We are working on an Arabic natural language processing (NLP) project and have narrowed our choices to writing the code in either Python or C++ (with the Boost libraries). We are weighing these points:

  • Python

    • Slower than C++ (though there is ongoing work to make Python faster)
    • Better UTF-8 support
    • Faster for writing tests and trying out different algorithms
  • C++

    • Faster than Python
    • Familiar code: every programmer knows C or C-like code

Once the project is done, porting it to another programming language should not be very hard.

What do you think is better suited to the project?

Comments (5)

笔落惊风雨 2024-08-29 17:39:17

Although this is subjective and argumentative, there is evidence that you can write a successful NLP project in Python, such as NLTK. They also have a comparison of NLP functionality in different languages:


(Quoting from the comparison)

Many programming languages have been used for NLP. As explained in the Preface, we have chosen Python because we believe it is well-suited to the special requirements of NLP. Here we present a brief survey of several programming languages, for the simple task of reading a text and printing the words that end with ing. We begin with the Python version, which we believe is readily interpretable, even by non Python programmers:

import sys
for line in sys.stdin:
    for word in line.split():
        if word.endswith('ing'):
            print(word)

[...]

The C programming language is a highly-efficient low-level language that is popular for operating system and networking software:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
   int i = 0;
   int c = 1;
   char buffer[1024];

   while (c != EOF) {
       c = fgetc(stdin);
       if ( (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ) {
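           /* Keep room for the terminating NUL; extra characters of an over-long word are dropped. */
           if (i < (int) sizeof(buffer) - 1)
               buffer[i++] = (char) c;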
           continue;
       } else {
           if (i > 2 && (strncmp(buffer+i-3, "ing", 3) == 0 || strncmp(buffer+i-3, "ING", 3) == 0 ) ) {
               buffer[i] = 0;
               puts(buffer);
           }
           i = 0;
       }
   }
   return 0;
}

Edit: I didn't include comparable code in C++/Boost, so I am adding a code sample from the Boost documentation that does something similar, although not identical. Note that this isn't the cleanest version.

// char_sep_example_1.cpp
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

int main()
{
  std::string str = ";;Hello|world||-foo--bar;yow;baz|";
  typedef boost::tokenizer<boost::char_separator<char> >
    tokenizer;
  // Split on '-', ';' and '|'; empty tokens are dropped by default.
  boost::char_separator<char> sep("-;|");
  tokenizer tokens(str, sep);
  for (tokenizer::iterator tok_iter = tokens.begin();
       tok_iter != tokens.end(); ++tok_iter)
    std::cout << "<" << *tok_iter << "> ";
  std::cout << "\n";
  return EXIT_SUCCESS;
}
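
Because the question concerns Arabic text, here is a minimal sketch of the same "words ending with a suffix" task from the quoted Python example, adapted to Python 3 and an Arabic suffix. It is not part of the NLTK comparison. The helper names `strip_diacritics` and `words_with_suffix` and the sample sentence are illustrative assumptions, and dropping every combining mark is a simplification rather than proper Arabic normalization.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, which include the Arabic short-vowel signs.

    Assumption: discarding every combining character is acceptable for this
    illustration; a real project would likely need a more careful normalizer.
    """
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def words_with_suffix(text, suffix):
    """Return the whitespace-separated words of `text` that end with `suffix`."""
    return [word for word in strip_diacritics(text).split() if word.endswith(suffix)]


if __name__ == "__main__":
    # Python 3 strings are sequences of Unicode code points, so the suffix
    # test works on Arabic letters exactly as it does on ASCII ones.
    sentence = "المعلمون يكتبون الدرس"
    for word in words_with_suffix(sentence, "ون"):
        print(word)
```

The point of the sketch is only that no byte-level handling is needed: the text is decoded once, and `split()` and `endswith()` then operate on characters.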
酸甜透明夹心 2024-08-29 17:39:17

Write it in Python, profile it, and if you need to speed parts of it up, rewrite those parts in C++. Python and C++ are similar enough that the "familiar" advantage of C++ will become irrelevant pretty quickly.
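
As a rough illustration of the "profile it" step, the sketch below uses the standard-library `cProfile` and `pstats` modules. The function `words_ending_with`, the wrapper `profile_call`, and the sample data are hypothetical stand-ins for real project code, not anything from the answer.

```python
import cProfile
import io
import pstats


def words_ending_with(lines, suffix):
    """Return every whitespace-separated word that ends with `suffix`."""
    return [word for line in lines for word in line.split() if word.endswith(suffix)]


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    # Sort by cumulative time and keep only the five most expensive entries.
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


if __name__ == "__main__":
    sample = ["I am reading and writing", "no match here"] * 1000
    found, report = profile_call(words_ending_with, sample, "ing")
    print(len(found))
    print(report)
```

Only the functions that dominate such a report would be candidates for a C++ rewrite; everything else can stay in Python.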

I say this as someone who has developed primarily in C++ and has recently gotten serious with Python. I like them both, but I can get Python code working a lot faster than C++. Seriously, dict beats std::map in usability.
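
To give a feel for the usability point, here is a small sketch of counting word frequencies with a plain `dict`. The function name `count_words` and the input are made up for illustration, and the example makes no claim about performance relative to `std::map`.

```python
def count_words(text):
    """Map each whitespace-separated word in `text` to its number of occurrences."""
    counts = {}
    for word in text.split():
        # dict.get supplies a default for unseen words, so no explicit
        # "find, then insert" step is needed.
        counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    for word, count in count_words("to be or not to be").items():
        print(word, count)
```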

P.S. Here's some information on how to call C code from Python.
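
One way to call C from Python, not necessarily the one the link above described, is the standard-library `ctypes` module. The sketch below assumes a POSIX system (Linux or macOS), where `ctypes.CDLL(None)` exposes the C library already loaded into the process; the wrapper name `c_strlen` is hypothetical.

```python
import ctypes

# Assumption: a POSIX system. CDLL(None) gives access to the symbols already
# loaded into the Python process, which include the C library.
libc = ctypes.CDLL(None)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Return the byte length of `text` encoded as UTF-8, computed by C strlen."""
    return libc.strlen(text.encode("utf-8"))


if __name__ == "__main__":
    word = "كتاب"
    # len() counts code points; strlen() counts UTF-8 bytes.
    print(len(word), c_strlen(word))
```

The same pattern extends to a library you compile yourself, which is where a profiled hot spot rewritten in C or C++ (behind an `extern "C"` interface) would plug in.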

鹤仙姿 2024-08-29 17:39:17

This is more or less a reply/supplement to Otto Almendinger's answer. If you honestly wanted to implement something (roughly) similar to his Python example in C++, I think something like this would be closer:

#include <string>
#include <iostream>

int main() { 
    std::string temp;
    while (std::cin>>temp) 
        if (temp.size()>2 && temp.substr(temp.size()-3, 3)=="ing")
            std::cout << temp << '\n';  // one word per line, like Python's print
}

This does essentially the same thing as the Python version, and is about the same length as well: the C++ has more syntactic "fluff", but both have exactly the same number of lines of code that really do anything (though there's no question that the individual lines in the C++ version are longer).

Don't get me wrong: I'm certainly not trying to claim that development with C++ will be as quick or easy as with Python. I do think the margin might be a tad smaller than some of the code presented here might imply though.

Edit: If you did want to claim C++ would be faster and easier, you could present code like:

for (std::string temp; std::cin>>temp; )
    temp.size()>2 && temp.substr(temp.size()-3, 3)=="ing" && std::cout << temp << '\n';

...along with a factually accurate (though grossly misleading) claim like: "The C++ code has only half as many statements as the Python implementation."

堇年纸鸢 2024-08-29 17:39:17

Familiar code, every programmer knows C or C-like code

Many devs are familiar with C or C-like code, but that doesn't make them competent in C++.
Inexperienced C++ devs can do a lot of harm to such a complex project, and you would have to take extra care.

I can't speak for Python, but I have heard it's more beginner-friendly.

I'd say, once again, you should go for the language you (as a team) know best.

地狱即天堂 2024-08-29 17:39:17

IMO go for C/C++ simply because of the 'familiar' factor. Though the line count (LOC) will be higher in C/C++, you will save time in understanding and testing.
