Camelot scraping issue with non-English (Tamil) PDFs
Python Camelot works like a charm on English PDFs, but it does not extract Tamil words properly. The output is close to the expected text but contains garbled or misplaced characters. I would like to understand what the issue is and how Camelot captures non-English data.
Work done so far:
I am trying to scrape data from a PDF published by the Tamil Nadu Election Commission. Sample single-page data here.
For example, the word பெயர் is scraped as ெபயர்.
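To see exactly how the two strings differ, it can help to print their Unicode code points. The following is a minimal sketch using only the standard library. The helper name `describe` is hypothetical, and the expected spelling is taken from the expected header given later in this post.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Expected spelling (from the expected header) and the scraped spelling,
# written with escapes so the character order is unambiguous.
expected = "\u0baa\u0bc6\u0baf\u0bb0\u0bcd"  # PA, VOWEL SIGN E, YA, RA, VIRAMA
scraped = "\u0bc6\u0baa\u0baf\u0bb0\u0bcd"   # VOWEL SIGN E, PA, YA, RA, VIRAMA

for label, word in (("expected", expected), ("scraped", scraped)):
    print(label)
    for code, name in describe(word):
        print("  ", code, name)

# The two words contain the same characters; only their order differs.
print("same characters:", sorted(expected) == sorted(scraped))
```

For this word, the scraped text starts with the vowel sign instead of the consonant it belongs to, which suggests the characters are coming out in the order they are drawn on the page rather than in Unicode's logical order.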
Reference: the CSV output for the first table only is shown below.
"வ.
எண்.","ெபயர்","பானம்","தந்ைத /கணவர்
ெபயர்","கட்ச","ெபற்ற
வாக்கள்","சதவதம்
%",""
"1","இந்தராேதவ.ப","ெபண்","பழனச்சாம ஆர்","நா.த.க.","144","2.97","ைவப்த்
ெதாைக
இழப்"
"2","கீதா.வ","ெபண்","ேகாப ேஜா","அ.இ.அ.த..க","1355","27.97","ேதால்வ"
"3","சவகாம.ம","ெபண்","மேகஸ்வரன் ேக
ஆர்","ப.ேஜ.ப","341","7.04","ைவப்த்
ெதாைக
இழப்"
"4","ெசல்லம்மாள்.ஆ","ெபண்","ஆகம்","ேயட்ைச
ேவட்பாளர்","184","3.80","ைவப்த்
ெதாைக
இழப்"
"5","பாமத.","ெபண்","மார்","ேயட்ைச
ேவட்பாளர்","31","0.64","ைவப்த்
ெதாைக
இழப்"
"6","ஜனா ராண.வ","ெபண்","வஸ்வநாதன் எம்","த..க","2790","57.59","ெவற்ற"
Code used for scraping:
# coding: utf8
import camelot

# Read all pages of the PDF.
tables = camelot.read_pdf('2.pdf', encoding='utf-8', pages='1-end')

# Report how many tables were detected.
print("No of tables", tables.n)

# Export the detected tables in CSV format.
tables.export('ariyalur.csv', f='csv')
Addition / edit for clarity, as pointed out by @tripleee, for readers who do not know Tamil.
For the header of the table, the expected output is:
வ.எண் பெயர் பாலினம் பெயர் கட்சி வாக்குகள் % முடிவு
But the actual output is:
"வ.எண்.","ெபயர்","பானம்","தந்ைத /கணவர் ெபயர்","கட்ச","ெபற்ற வாக்கள்","சதவதம்
%",""
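Comparing the expected header with the actual output, one visible pattern is that the vowel signs written to the left of a consonant (U+0BC6, U+0BC7, U+0BC8) come out before that consonant instead of after it. The sketch below is one possible post-processing heuristic for that pattern only, not a Camelot feature. The names `PREBASE_SIGNS`, `is_tamil_consonant` and `reorder_prebase_signs` are hypothetical. It cannot restore characters that are missing from the output altogether (for example கட்ச where கட்சி was expected).

```python
import unicodedata

# Tamil vowel signs that are drawn to the left of their consonant:
# U+0BC6 (E), U+0BC7 (EE), U+0BC8 (AI).
PREBASE_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}


def is_tamil_consonant(ch):
    """True for code points in the Tamil consonant range (KA to HA)."""
    return "\u0b95" <= ch <= "\u0bb9"


def reorder_prebase_signs(text):
    """Move each left-side vowel sign to after the consonant that follows it.

    The result is NFC-normalized so that a sign followed by U+0BBE is
    composed into the single two-part vowel sign (U+0BCA or U+0BCB).
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in PREBASE_SIGNS and nxt and is_tamil_consonant(nxt):
            # Swap: consonant first, then its vowel sign.
            out.append(nxt)
            out.append(ch)
            i += 2
        else:
            out.append(ch)
            i += 1
    return unicodedata.normalize("NFC", "".join(out))


# Two cells from the scraped CSV, written with escapes for clarity.
scraped_cells = [
    "\u0bc6\u0baa\u0baf\u0bb0\u0bcd",  # header cell scraped for the "name" column
    "\u0bc6\u0ba4\u0bbe\u0bc8\u0b95",  # second line of the last column in row 1
]
for cell in scraped_cells:
    print(cell, "->", reorder_prebase_signs(cell))
```

With Camelot, a function like this could be applied to every cell of the `df` DataFrame that each extracted table exposes, before exporting. Whether that is sufficient depends on how the PDF's fonts map glyphs to characters, so the result should be checked against the source document.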