data extraction - How can I extract a table from a badly formatted PDF?


My client needs to have a CSV of name,surname,dob from the accounting database.

The problem is that the accounting software is "in the cloud" (hence on someone else's computer, freely accessible to the world), and the webapp can only generate a badly formatted "welcome card" PDF, like this:

Hi <newline> <lots of spaces>My name is %name% <lots of spaces> %surname% <lots of newlines and spaces to simulate right-aligned text>I was born in %dob% <newpage>

So I can get a 500-page PDF of unusable content.

Is there a way to extract the data from such a file?

It is important to know whether you have to do this multiple times or just once for one 500-page file. I will assume once.

In that case, the PDF can be converted to XML (if at all possible) or to a text file (many converters are available; search Google).
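
As a rough sketch of the conversion step, assuming the Poppler `pdftotext` command-line tool is installed (it is only one of the many converters, picked here for illustration), the call could be wrapped as below. The function names and file paths are hypothetical.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path):
    # -layout asks pdftotext to keep the physical layout of the page,
    # so the padding spaces survive but the fields stay in reading order.
    return ["pdftotext", "-layout", pdf_path, txt_path]


def convert_pdf_to_text(pdf_path, txt_path):
    # check=True raises CalledProcessError on a non-zero exit status;
    # FileNotFoundError is raised if pdftotext is not installed.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
```

By default `pdftotext` ends each page with a form feed character, which is handy later for splitting the text back into one record per page.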

Then it is important to know whether the 'records' are all formatted the same way, e.g. in the format: .... firstname...lastname...dob...addressline1.... (where ... is stuff you don't want).

Are there 'labels' or 'tags' that tell you the next thing is 'address line 1', so that if a value is missing you can tell?

If the structure is always the same and you can tell when a value is not on a record, you have a fighting chance of writing regular expressions to transform it into a decent format. Otherwise it is hard, but you might still be able to harvest a lot (if not all) of the info.
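
To make the regex idea concrete, here is a minimal sketch in Python. It assumes the converted text keeps one card per page separated by form feeds, that the literal words "my name" and "born in" act as the labels, and that name and surname are separated by at least two spaces. Those assumptions, the function names and the sample data are all illustrative and need checking against the real file.

```python
import csv
import io
import re

# Labels "my name (is)" and "(was) born in" anchor the three values.
# Name and surname may contain single spaces; 2+ spaces separate them.
RECORD_RE = re.compile(
    r"my\s+name\s+(?:is\s+)?(?P<name>\S[^\n]*?)[ \t]{2,}"
    r"(?P<surname>\S[^\n]*?)\s+I\s+(?:was\s+)?born\s+in\s+"
    r"(?P<dob>\S[^\n\f]*)",
    re.IGNORECASE,
)


def parse_cards(text):
    """Return (records, failed_pages) for form-feed separated pages."""
    records, failed_pages = [], []
    for number, page in enumerate(text.split("\f"), start=1):
        if not page.strip():
            continue  # skip empty pages, e.g. after the last form feed
        match = RECORD_RE.search(page)
        if match is None:
            failed_pages.append(number)  # check these pages by hand
            continue
        records.append({k: v.strip() for k, v in match.groupdict().items()})
    return records, failed_pages


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "surname", "dob"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Pages that do not match are reported by number rather than silently dropped, so a missing value shows up as a page to inspect instead of a wrong row in the CSV.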

