data extraction - How can I extract a table from a badly formatted PDF?
My client needs a CSV with name, surname, dob for their accounting database.
The problem is that the accounting software is "in the cloud" (hence on someone else's computer, freely accessible to the world), and the webapp can generate a badly formatted "welcome card PDF", like this:
hi <newline> <lots of spaces>my name %name% <lots of spaces> %surname% <lots of newlines and spaces to simulate right text alignment>i born in %dob% <newpage>
So I can get a 500-page PDF with unusable content.
Is there a way to extract the data from such a file?
It is important to know whether you have to do this multiple times or just once, with one 500-page file. I assume once.
In that case, the PDF can be converted to XML (if at all possible) or to a text file (many converters are available; Google for them).
Then it is important to know whether the 'records' are all formatted the same way, e.g. the format is: ....firstname...lastname...dob...addressline1.... (where ... is the stuff you don't want).
Are there 'labels' or 'tags' that tell you the next thing is 'address line 1', or can you tell when a value is missing?
If the structure is always the same, and you can tell when a value is not on a record, you have a fighting chance of writing some regex expressions to transform it into a decent format. Otherwise it will be hard, but you might still be able to harvest a lot (if not all) of the info.
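As a rough illustration of the regex approach, here is a minimal Python sketch. It assumes the PDF has already been converted to plain text, that the converter marks page breaks with a form feed character (pdftotext does this by default), and that each page follows the "my name ... / i born in ..." template from the question. The names RECORD, extract_records and to_csv are made up for this example, and the sample text is invented. Pages where a value is missing are reported rather than silently skipped.

```python
import csv
import io
import re

# Assumed page layout (from the template in the question):
#   "my name <name><2+ spaces><surname>" on one line,
#   then, further down the same page, "i born in <dob>".
RECORD = re.compile(
    r"my name\s+(?P<name>\S.*?)\s{2,}(?P<surname>\S.*?)\s*$"  # name line
    r"[\s\S]*?"                                               # padding to skip
    r"born in\s+(?P<dob>\S.*?)\s*$",                          # dob line
    re.MULTILINE,
)


def extract_records(text):
    """Return (rows, bad_pages): parsed tuples and 1-based pages that failed."""
    rows, bad_pages = [], []
    # One record per page, so split on the form feed first; this keeps a
    # record with a missing dob from swallowing the next page's dob.
    for page_no, page in enumerate(text.split("\f"), start=1):
        if not page.strip():
            continue  # ignore empty pages (e.g. after a trailing form feed)
        match = RECORD.search(page)
        if match:
            rows.append((match["name"], match["surname"], match["dob"]))
        else:
            bad_pages.append(page_no)
    return rows, bad_pages


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "surname", "dob"])
    writer.writerows(rows)
    return out.getvalue()


# Invented sample: three pages, the second one has no dob line.
sample = (
    "hi\n      my name John      Smith\n\n\n          i born in 1980-01-02\n\f"
    "hi\n      my name Mary Ann      Jones\n\n\n\f"
    "hi\n      my name Ada      Lovelace\n\n          i born in 1815-12-10\n"
)

rows, bad_pages = extract_records(sample)
print(to_csv(rows))
print("pages needing manual review:", bad_pages)
```

The pattern treats a run of two or more spaces as the separator between name and surname, so a first name containing a single space still comes out in one piece. If the real export separates them differently, the pattern will need adjusting, and the list of failed pages shows where to look.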