Camelot scraping problem with non-English (Tamil) PDFs

Posted on 2025-01-13 15:27:32

Python Camelot works like a charm when it comes to English. But with Tamil it does not scrape the words properly: it returns junk characters that are close to, but not the same as, the characters in the PDF. I would like to understand what the issue is and how Camelot captures non-English data.

Work done so far:
I am trying to scrape data from a PDF published by the Tamil Nadu Election Commission. Sample single-page data here.
For example, the word

is getting scraped as ெபயர்.
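One way to see what actually differs is to compare the two strings code point by code point. This is only a minimal sketch: it assumes the intended word is பெயர் (the second header in the expected output further down), and `describe` is a hypothetical helper written for this post, not part of Camelot.

```python
import unicodedata

def describe(text):
    """Return a (code point, Unicode name) pair for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")) for ch in text]

# Assumption: the intended word is the table header "பெயர்".
# Escapes are used so the strings survive copy/paste unchanged.
expected = "\u0baa\u0bc6\u0baf\u0bb0\u0bcd"  # பெயர்
scraped = "\u0bc6\u0baa\u0baf\u0bb0\u0bcd"   # ெபயர், as it appears in the CSV

for label, word in (("expected", expected), ("scraped", scraped)):
    print(label)
    for code, name in describe(word):
        print("   ", code, name)

# Compare the two strings as unordered collections of characters.
print(sorted(expected) == sorted(scraped))
```

Both strings hold the same five code points. In the scraped one, TAMIL VOWEL SIGN E (U+0BC6) comes before TAMIL LETTER PA (U+0BAA), which is the order in which the glyphs are drawn, not the order in which Unicode stores them.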

Reference: the CSV output for just the first table is shown below.

"வ.
எண்.","ெபயர்","பானம்","தந்ைத /கணவர்
ெபயர்","கட்ச","ெபற்ற
வாக்கள்","சதவதம்
%",""
"1","இந்தராேதவ.ப","ெபண்","பழனச்சாம ஆர்","நா.த.க.","144","2.97","ைவப்த்
ெதாைக
இழப்"
"2","கீதா.வ","ெபண்","ேகாப ேஜா","அ.இ.அ.த..க","1355","27.97","ேதால்வ"
"3","சவகாம.ம","ெபண்","மேகஸ்வரன் ேக
ஆர்","ப.ேஜ.ப","341","7.04","ைவப்த்
ெதாைக
இழப்"
"4","ெசல்லம்மாள்.ஆ","ெபண்","ஆகம்","ேயட்ைச
ேவட்பாளர்","184","3.80","ைவப்த்
ெதாைக
இழப்"
"5","பாமத.","ெபண்","மார்","ேயட்ைச
ேவட்பாளர்","31","0.64","ைவப்த்
ெதாைக
இழப்"
"6","ஜனா ராண.வ","ெபண்","வஸ்வநாதன் எம்","த..க","2790","57.59","ெவற்ற"

Code used for scraping:

# coding: utf8
import camelot

# Parse every page of the PDF.
tables = camelot.read_pdf('2.pdf', encoding='utf-8', pages='1-end')

# Report how many tables were found, then export them as CSV.
x = tables.n
print("No of tables", x)
tables.export('ariyalur.csv', f='csv')

Addition / edit for clarity, as pointed out by @tripleee, for non-Tamil readers.
This is the header of the table:

The Expected output is
வ.எண் பெயர்‌ பாலினம்‌ பெயர்‌ கட்சி வாக்குகள்‌ % முடிவு
But the output I actually get is
"வ.எண்.","ெபயர்","பானம்","தந்ைத /கணவர் ெபயர்","கட்ச","ெபற்ற வாக்கள்","சதவதம்
%",""
