Image to Malayalam Text with Tesseract

First, install ImageMagick.

If you need to OCR a PDF made of scanned images, first convert each page to PNG with ImageMagick.
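
The conversion step can be sketched in Python as below. This is only a minimal sketch, assuming ImageMagick 7 (which provides the magick command) and Ghostscript are installed, since ImageMagick relies on Ghostscript to read PDFs. The file name scan.pdf, the 300 DPI density and the helper names are illustrative assumptions, not part of the original steps.

```python
import subprocess


def build_pdf_to_png_command(pdf_path, out_pattern="pdfPage-%d.png", dpi=300):
    """Return the ImageMagick command that renders each PDF page to a PNG.

    -density comes before the input so the PDF is rasterized at that DPI.
    %d in the output name is replaced by the page index (0, 1, 2, ...).
    """
    return ["magick", "-density", str(dpi), pdf_path, out_pattern]


def convert_pdf_to_png(pdf_path, out_pattern="pdfPage-%d.png", dpi=300):
    """Run the conversion; raises CalledProcessError if magick fails."""
    subprocess.run(build_pdf_to_png_command(pdf_path, out_pattern, dpi), check=True)
```

Calling convert_pdf_to_png("scan.pdf") should write pdfPage-0.png, pdfPage-1.png and so on into the current folder. A higher density usually helps OCR accuracy at the cost of larger files.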

Windows Installer at

After installing, add C:\Program Files\Tesseract-OCR to the PATH environment variable.

Test with
tesseract --list-langs
The default languages available are eng and osd. More languages can be added.
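
If you want to check the installed languages from a script, something like the following may work. It is a sketch that assumes the output is a header line starting with "List of available languages" followed by one language code per line; the exact header wording can differ between Tesseract versions, and the function names are made up for illustration.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_langs(tesseract="tesseract"):
    """Run tesseract and return the installed language codes."""
    result = subprocess.run(
        [tesseract, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr in case the list is printed there
        text=True,
        check=True,
    )
    return parse_langs(result.stdout)
```

For example, "mal" in installed_langs() tells you whether Malayalam data is already available. Because stderr is merged in, any warning lines Tesseract prints would also show up in the list, so treat the result as a rough check.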

Extract text with

tesseract pdfPage-0.png OCR/pdfPage-0

Tesseract appends .txt to the output name itself, so this writes OCR/pdfPage-0.txt.

Command for OCR of Korean text. This works only if the respective language data is available; confirm with tesseract --list-langs.
tesseract pdfPage-1.png OCR/pdfPage-1 -l kor
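
To run the same command over every converted page, a small loop helps. The sketch below assumes the pages are named pdfPage-*.png in one folder and that tesseract is on PATH; the helper names and the default OCR output folder are illustrative. Note that Tesseract adds the .txt extension to the output base on its own, so the output base is built without it.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image, out_dir="OCR", lang=None):
    """Return the tesseract command for one image.

    lang is a Tesseract language code such as "kor" or "mal";
    several codes can be joined with "+", for example "mal+eng".
    """
    image = Path(image)
    # Output base has no extension: tesseract appends ".txt" itself.
    out_base = Path(out_dir) / image.stem
    cmd = ["tesseract", str(image), str(out_base)]
    if lang:
        cmd += ["-l", lang]
    return cmd


def ocr_folder(folder=".", out_dir="OCR", lang=None):
    """OCR every pdfPage-*.png in folder into out_dir."""
    Path(out_dir).mkdir(exist_ok=True)  # make sure the output folder exists
    for image in sorted(Path(folder).glob("pdfPage-*.png")):
        subprocess.run(build_ocr_command(image, out_dir, lang), check=True)
```

Calling ocr_folder(lang="kor") would then produce one text file per page under OCR.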

For that, the matching language file, named tessdata\<lang>.traineddata (for example tessdata\kor.traineddata), must be present in the Tesseract folder.
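
A quick way to verify that the file is in place is sketched below. It assumes the default install folder used in these notes; if the TESSDATA_PREFIX environment variable is set, Tesseract may look somewhere else, which this sketch does not handle. The function names are illustrative.

```python
from pathlib import Path

# Default install location used in these notes (an assumption; adjust if different).
DEFAULT_TESSERACT_DIR = r"C:\Program Files\Tesseract-OCR"


def traineddata_path(lang, tesseract_dir=DEFAULT_TESSERACT_DIR):
    """Return the expected path of <lang>.traineddata inside tessdata."""
    return Path(tesseract_dir) / "tessdata" / f"{lang}.traineddata"


def has_traineddata(lang, tesseract_dir=DEFAULT_TESSERACT_DIR):
    """True if the language data file exists at the expected path."""
    return traineddata_path(lang, tesseract_dir).is_file()
```

For example, has_traineddata("kor") should be True before running the Korean command above.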

You can also create language data yourself using tools like jTessBoxEditor, a Java box editor for Tesseract OCR data.
Training Tesseract 5 for a New Font


Download mal.traineddata
download from


eg :

Copy to C:\Program Files\Tesseract-OCR\tessdata

Command for Malayalam OCR (Tesseract appends .txt itself, so this writes GSBVPNotice.txt)
tesseract GSBVPNotice.jpeg GSBVPNotice -l mal
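
Tesseract writes its text output as UTF-8, so when reading the result in a script on Windows it is safer to name the encoding explicitly. The sketch below also counts characters in the Malayalam Unicode block (U+0D00 to U+0D7F) as a rough sanity check that the right language data was used; both helper names are illustrative.

```python
from pathlib import Path


def read_ocr_text(txt_path):
    """Read a Tesseract output file, decoding it explicitly as UTF-8."""
    return Path(txt_path).read_text(encoding="utf-8")


def count_malayalam_chars(text):
    """Count code points that fall in the Malayalam block (U+0D00 to U+0D7F)."""
    return sum(1 for ch in text if "\u0d00" <= ch <= "\u0d7f")
```

If count_malayalam_chars(read_ocr_text("GSBVPNotice.txt")) comes back as 0, the OCR most likely ran with the wrong language.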