ocr - I'd like to recognize the text of all pdfs on my computer and save them without moving them from their locations. Is it possible? -

September 15, 2011

I've tried using Adobe Acrobat X Pro's "Recognize Text in Multiple Files" feature.

When you start the process, it asks for a directory, so I chose C:, my main hard drive.

It took hours to load, and when it did, the list of files it generated included Word documents as well. Adobe said it couldn't proceed until I removed the problem files.

Once I removed the PDFs that Adobe had flagged as having errors (like password protection), the prompt remained, so I assumed it meant the Word documents in the list.

So I manually removed those too. Adobe still said it couldn't proceed until the problem files were removed, but there weren't any remaining files in the list that Adobe had flagged as having issues.

My firm is trying to make sure all of our PDFs are searchable. Currently, not all of them are. Our goal is to make them searchable without removing them from their varied locations.

I think you can do this using a combination of:

Regular Java: list the files in a directory that match a given criterion (e.g. the name ends with '.pdf')
iText: iterate over a PDF document and extract its images
Tess4J: a Java wrapper for Tesseract (Google's OCR engine), to turn the extracted images into text
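
The first of those steps, finding the PDFs, needs nothing beyond the standard library. Here is a minimal sketch, assuming Java 7 or later; the class name PdfFinder and the method findPdfs are made up for illustration. It skips entries it can't read rather than aborting, which matters if you point it at all of C:. The iText and Tess4J steps are not shown.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: collects the paths of all PDFs under a root directory.
public class PdfFinder {

    /** Returns every regular file under root whose name ends in ".pdf" (any case). */
    public static List<Path> findPdfs(Path root) throws IOException {
        final List<Path> pdfs = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                Path name = file.getFileName();
                if (attrs.isRegularFile() && name != null
                        && name.toString().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                    pdfs.add(file);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Skip unreadable entries (e.g. protected system folders) and keep going.
                return FileVisitResult.CONTINUE;
            }
        });
        return pdfs;
    }

    public static void main(String[] args) throws IOException {
        // Root defaults to the current directory; pass e.g. C:\ as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path pdf : findPdfs(root)) {
            // Each path keeps its original location, so output can be written beside it.
            System.out.println(pdf.toAbsolutePath());
        }
    }
}
```

Because you get the full path of each file back, whatever does the OCR can save its result in the same folder as the original, so nothing has to move.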

Unless I'm mistaken, Tesseract offers a crude version of this workflow for you, one PDF at a time. You'd still need some Windows/Linux scripting to pipe in the files of a given directory.
