data extraction - How can I extract a table from a badly formatted PDF?

- January 15, 2012

My client needs a CSV of name, surname, and DOB from their accounting database.

The problem is that the accounting software is "in the cloud" (hence on someone else's computer, freely accessible to the world), and the web app can only generate a badly formatted "welcome card" PDF, like this:

Hi <newline> <lots of spaces>my name is %name% <lots of spaces> %surname% <lots of newlines and spaces to simulate right text alignment>I was born in %dob% <newpage>

So I can get a 500-page PDF with otherwise unusable content.

Is there a way to extract the data from such a file?

It is important to know whether you have to do this multiple times or just once, for one 500-page file. I'll assume once.

In that case, the PDF can be converted to XML (if at all possible) or to a text file (many converters are available; search Google).

Then it is important to know whether the 'records' are all formatted the same way, e.g. format: ....firstname...lastname...dob...addressline1.... (where ... is the stuff you don't want).

Are there 'labels' or 'tags' that tell you the next thing is 'address line 1'? And if a value is missing, can you tell?
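A quick way to answer that for a converted text file is to check every page for the labels you expect. This is only a sketch: it assumes the converter marks page breaks with a form feed character (`\f`, which is what `pdftotext` emits), and the label wording ("my name is", "I was born in") and the sample text are my guesses, not taken from your real file.

```python
import re

# Assumed label phrases; replace them with the wording in the real file.
LABELS = {"name": "my name is", "dob": "i was born in"}

def split_pages(text):
    """Split converter output into non-empty pages on the form feed character."""
    return [page for page in text.split("\f") if page.strip()]

def missing_labels(page):
    """Return the fields whose label phrase does not occur on this page."""
    # Collapse the padding (runs of spaces and newlines) before searching.
    flat = re.sub(r"\s+", " ", page).lower()
    return [field for field, label in LABELS.items() if label not in flat]

# Made-up sample: page 1 is complete, page 2 has no date of birth.
sample = (
    "Hi\n      my name is John      Smith\n\n\n        I was born in 1980-01-02\n"
    "\f"
    "Hi\n   my name is Jane    Doe\n\n"
)

report = {
    number: missing_labels(page)
    for number, page in enumerate(split_pages(sample), start=1)
}
print(report)
```

If the report is empty lists for all 500 pages, the records are probably uniform enough for a regex. Pages with a non-empty list are the ones to look at by hand.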

If the structure is always the same and you can tell when a value is not on a record, you have a fighting chance of writing regular expressions that transform it into a decent format. Otherwise it is hard, but you might still be able to harvest a lot (if not all) of the info.
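For what it's worth, here is roughly what the regex step could look like in Python. Treat it as a sketch under assumptions: one record per page, pages separated by a form feed (`\f`) in the converted text, the wording "my name is ... I was born in ...", name and surname separated by at least two spaces, and a date of birth with no spaces in it. The sample text and the helper names (`extract_records`, `to_csv`) are made up for illustration.

```python
import csv
import io
import re

# Assumed layout: "my name is <name>  <surname>" then, later on the page,
# "I was born in <dob>". Name and surname may contain single spaces, but the
# gap between them must be two or more whitespace characters.
RECORD = re.compile(
    r"my\s+name\s+is\s+(?P<name>\S+(?: \S+)*)\s{2,}(?P<surname>\S+(?: \S+)*)"
    r".*?I\s+was\s+born\s+in\s+(?P<dob>\S+)",
    re.IGNORECASE | re.DOTALL,
)

def extract_records(text):
    """Return (records, failed_pages) for form-feed separated page text."""
    records, failed_pages = [], []
    for number, page in enumerate(text.split("\f"), start=1):
        if not page.strip():
            continue  # skip empty pages, e.g. after a trailing form feed
        match = RECORD.search(page)
        if match:
            records.append((match["name"], match["surname"], match["dob"]))
        else:
            failed_pages.append(number)  # keep track of what needs manual work
    return records, failed_pages

def to_csv(records):
    """Serialise the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "surname", "dob"])
    writer.writerows(records)
    return buffer.getvalue()

# Made-up sample: page 2 has no date of birth and should be reported.
sample = (
    "Hi\n        my name is Mary Ann      Smith\n\n\n\n"
    "                    I was born in 1980-01-02\n"
    "\f"
    "Hi\n     my name is Jane      Doe\n\n"
    "\f"
    "Hi\n   my name is John    O'Neil\n\n      I was born in 1975-12-31\n"
)

records, failed = extract_records(sample)
print(to_csv(records))
print("pages needing manual attention:", failed)
```

The list of failed pages is the important part: it tells you how much of the file the pattern harvested and which pages you still have to fix by hand or cover with a second pattern.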
