Image to Malayalam Text with Tesseract
First, install ImageMagick.
If you need to OCR a PDF made of scanned images, convert each page to PNG with ImageMagick first.
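The page conversion can be sketched in Python as below. This is a minimal sketch, assuming ImageMagick 7 (the `magick` command) is on PATH and Ghostscript is installed for PDF support. The function names, the 300 DPI density, and the pdfPage-%d.png output pattern are illustrative choices, not something the notes above prescribe.

```python
import subprocess
from pathlib import Path


def pdf_to_png_command(pdf_path, out_dir, density=300):
    """Build an ImageMagick command that writes one PNG per PDF page.

    Assumption: ImageMagick 7 ("magick") is on PATH and Ghostscript is
    available for reading PDFs. "%d" in the output name is replaced by the
    page index, giving pdfPage-0.png, pdfPage-1.png, ...
    """
    out_pattern = Path(out_dir) / "pdfPage-%d.png"
    # -density is placed before the input so it controls rasterization.
    return ["magick", "-density", str(density), str(pdf_path), str(out_pattern)]


def convert_pdf(pdf_path, out_dir, density=300):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pdf_to_png_command(pdf_path, out_dir, density), check=True)
```

For example, convert_pdf("scan.pdf", "pages") would write the page images into a pages folder (scan.pdf is a hypothetical file name). A higher density usually gives the OCR engine more detail at the cost of larger files.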
The Tesseract Windows installer is at https://github.com/UB-Mannheim/tesseract/wiki
After installing, add C:\Program Files\Tesseract-OCR to the PATH environment variable.
Test the installation with
tesseract --list-langs
The languages available by default are eng and osd. More languages can be added.
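The same check can be scripted. The sketch below assumes that tesseract is on PATH and prints the list on standard output, with a header line followed by one language code per line. The function names are hypothetical helpers, not part of Tesseract.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumption: the output is a header line starting with
    "List of available languages" followed by one code per line.
    """
    codes = []
    for line in output.splitlines():
        line = line.strip()
        if line and not line.startswith("List of available languages"):
            codes.append(line)
    return codes


def installed_languages():
    """Ask the tesseract on PATH which languages it can use."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    return parse_list_langs(result.stdout)
```

Checking "mal" in installed_languages() before starting a long batch avoids discovering a missing language file halfway through.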
Extract text with
tesseract pdfPage-0.png OCR/pdfPage-0
Tesseract appends .txt to the output name itself, so this writes OCR/pdfPage-0.txt.
Command for OCR of Korean text. This works only if the corresponding language data is installed; confirm with tesseract --list-langs
tesseract pdfPage-1.png OCR/pdfPage-1 -l kor
(writes OCR/pdfPage-1.txt)
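When a PDF has many pages, the per-page command can be looped. This is a minimal sketch, assuming the page images are named pdfPage-*.png as in the commands above and that tesseract is on PATH. The function names and folder arguments are illustrative.

```python
import subprocess
from pathlib import Path


def ocr_command(image_path, out_dir, lang="eng"):
    """Build the tesseract command for one image.

    Tesseract appends ".txt" to the output base itself, so the base is the
    image name without its extension.
    """
    out_base = Path(out_dir) / Path(image_path).stem
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_pages(page_dir, out_dir, lang="eng"):
    """Run tesseract on every pdfPage-*.png found in page_dir."""
    # Create the output folder up front so tesseract has somewhere to write.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(page_dir).glob("pdfPage-*.png")):
        subprocess.run(ocr_command(image, out_dir, lang), check=True)
```

For example, ocr_pages(".", "OCR", lang="kor") would produce one text file per page in the OCR folder.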
For that, the matching language data file, tessdata\<lang>.traineddata (for example tessdata\kor.traineddata), must be present in the Tesseract folder.
You can create one using tools like jTessBoxEditor, a Java box editor for Tesseract OCR data (https://www.youtube.com/watch?v=-GBQcgA14PQ).
Training Tesseract 5 for a New Font
Malayalam
Download mal.traineddata
https://groups.google.com/g/tesseract-ocr/c/U1JjX5ZNn1Q/m/BCqy_2Ge3F4J
Download from https://tesseract-ocr.github.io/tessdoc/Data-Files.html#latest-data-files-september-15-2017
or https://git.archive.org/archivecd/tessdata_fast
e.g. https://git.archive.org/archivecd/tessdata_fast/-/blob/master/mal.traineddata?ref_type=heads
Copy the file to C:\Program Files\Tesseract-OCR\tessdata
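The copy step can also be scripted. The sketch below assumes the default install folder named above. The function names are hypothetical, and writing under C:\Program Files normally needs an elevated (administrator) prompt.

```python
import shutil
from pathlib import Path

# Default location from these notes; adjust if Tesseract is installed elsewhere.
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def install_traineddata(downloaded_file, tessdata_dir=TESSDATA_DIR):
    """Copy a downloaded <lang>.traineddata file into the tessdata folder."""
    source = Path(downloaded_file)
    if source.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {source.name}")
    target = Path(tessdata_dir) / source.name
    shutil.copy2(source, target)
    return target


def has_language(lang, tessdata_dir=TESSDATA_DIR):
    """Return True if <lang>.traineddata exists in the tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()
```

After install_traineddata("mal.traineddata"), has_language("mal") should return True, and mal should also show up in tesseract --list-langs.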
Command for Malayalam OCR
tesseract GSBVPNotice.jpeg GSBVPNotice -l mal
(writes GSBVPNotice.txt)