Camelot scraping problem with non-English (Tamil) PDFs

Posted on 2025-01-13 15:27:32

Python Camelot works like a charm on English text, but it does not scrape Tamil words properly. The output is close to the original text but garbled: some characters are out of order and others are missing. I would like to understand what the issue is and how Camelot captures non-English data.

Work Done So Far:
I am trying to scrape data from a PDF from the Tamil Nadu Election Commission. Sample single page data here.
For example, the word

[image of the Tamil word]

is scraped as ெபயர்.
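To make the symptom concrete, here is a minimal diagnostic sketch that lists the code points of the scraped word next to the word as it would be stored in logical Unicode order. It assumes the scraped string is exactly the five code points it appears to be, and that the intended word is பெயர், the spelling used in the expected header; `describe` is a hypothetical helper name, not part of Camelot.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# Assumption: these escapes match the scraped and the intended word.
scraped = "\u0bc6\u0baa\u0baf\u0bb0\u0bcd"   # vowel sign E stored before PA
expected = "\u0baa\u0bc6\u0baf\u0bb0\u0bcd"  # PA stored before vowel sign E

for label, word in (("scraped", scraped), ("expected", expected)):
    print(label)
    for code_point, name in describe(word):
        print("  ", code_point, name)
```

Under these assumptions both strings contain the same five characters, and only the position of TAMIL VOWEL SIGN E differs. That would be consistent with the text coming out in visual (left-to-right glyph) order rather than logical order, although this check alone does not establish where in the extraction pipeline the reordering happens.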

Reference: the CSV output for the first table only is shown below:

"வ.
எண்.","ெபயர்","பானம்","தந்ைத /கணவர்
ெபயர்","கட்ச","ெபற்ற
வாக்கள்","சதவதம்
%",""
"1","இந்தராேதவ.ப","ெபண்","பழனச்சாம ஆர்","நா.த.க.","144","2.97","ைவப்த்
ெதாைக
இழப்"
"2","கீதா.வ","ெபண்","ேகாப ேஜா","அ.இ.அ.த..க","1355","27.97","ேதால்வ"
"3","சவகாம.ம","ெபண்","மேகஸ்வரன் ேக
ஆர்","ப.ேஜ.ப","341","7.04","ைவப்த்
ெதாைக
இழப்"
"4","ெசல்லம்மாள்.ஆ","ெபண்","ஆகம்","ேயட்ைச
ேவட்பாளர்","184","3.80","ைவப்த்
ெதாைக
இழப்"
"5","பாமத.","ெபண்","மார்","ேயட்ைச
ேவட்பாளர்","31","0.64","ைவப்த்
ெதாைக
இழப்"
"6","ஜனா ராண.வ","ெபண்","வஸ்வநாதன் எம்","த..க","2790","57.59","ெவற்ற"

Code used for scraping:

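# coding: utf8
import camelot

# Parse every page of the PDF; the encoding argument is kept from the original attempt.
tables = camelot.read_pdf('2.pdf', encoding='utf-8', pages='1-end')

print("No of tables", tables.n)

# Write each detected table to its own CSV file.
tables.export('ariyalur.csv', f='csv')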

Addition / edit for clarity, as pointed out by @tripleee, for non-Tamil readers.
This is the header of the table:
[image: header of the table in Tamil]
The expected output is:
வ.எண் பெயர்‌ பாலினம்‌ பெயர்‌ கட்சி வாக்குகள்‌ % முடிவு
But the output I actually get is:
"வ.எண்.","ெபயர்","பானம்","தந்ைத /கணவர் ெபயர்","கட்ச","ெபற்ற வாக்கள்","சதவதம்
%",""
