More pdf2kmymoney (overflows/wrapping lines)

Thu Dec 31 21:41:13 GMT 2020

Jack,

It is quite common in bank statement PDFs for transactions to be formatted like this (I hope the alignment works; I will format it as fixed-font to try to help):

MM/DD/YY   Payee Name                     Amount paid        Running balance
           Additional info about payment
           Can be multiple lines
MM/DD/YY   Next Payee Name                Amount paid        Running balance
MM/DD/YY   DEPOSIT                        Amount deposited   Running balance

So when the PDF is translated to text, those “additional info” line(s) appear as separate physical lines without the MM/DD/YY header or any money amounts following.
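
As a rough illustration of how such lines could be handled, here is a minimal Python sketch that folds the "additional info" lines into a memo on the preceding transaction. The regular expression, the field names, and the sample data are all assumptions made for this example; real statements will need their own pattern.

```python
import re

# Hypothetical pattern: a transaction line starts with MM/DD/YY and ends
# with two money amounts (amount, then running balance).
TXN_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2})\s+(?P<payee>.+?)\s{2,}"
    r"(?P<amount>-?[\d,]+\.\d{2})\s+(?P<balance>-?[\d,]+\.\d{2})\s*$"
)

def parse_statement(lines):
    """Group text lines into transactions, folding continuation lines into a memo."""
    transactions = []
    for raw in lines:
        line = raw.rstrip()
        if not line.strip():
            continue  # ignore blank lines
        match = TXN_RE.match(line)
        if match:
            txn = match.groupdict()
            txn["memo"] = []
            transactions.append(txn)
        elif transactions:
            # No date/amounts: treat as extra info for the previous transaction.
            transactions[-1]["memo"].append(line.strip())
    for txn in transactions:
        txn["memo"] = " ".join(txn["memo"])
    return transactions

# Made-up sample data in the layout shown above.
sample = [
    "12/01/20   ACME HARDWARE              45.10          1,954.90",
    "           Card purchase ref 1234",
    "           Store 17",
    "12/02/20   DEPOSIT                    500.00         2,454.90",
]
transactions = parse_statement(sample)
```

Lines that appear before the first recognized transaction (page headers, for instance) are simply dropped in this sketch; a real converter would probably want to log them instead.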

Depending heavily on how the PDF was constructed, I have also (though rarely) seen the money amounts (paid or deposited, and balance) show up on the SECOND line after conversion of the PDF to text. The pdftotext “-layout” switch has improved over time to the point where I seldom see this anymore, but it can happen.
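
One possible way to cope with that case is a pre-pass that glues an amounts-only line back onto a dated line that has no amounts of its own. The sketch below assumes the amounts land alone on the line directly after the dated line, which may not hold for every statement; the patterns and function name are hypothetical.

```python
import re

# Hypothetical patterns, for illustration only.
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{2}\s")
AMOUNTS_ONLY_RE = re.compile(r"^\s*-?[\d,]+\.\d{2}\s+-?[\d,]+\.\d{2}\s*$")
ENDS_WITH_AMOUNTS_RE = re.compile(r"-?[\d,]+\.\d{2}\s+-?[\d,]+\.\d{2}\s*$")

def rejoin_split_amounts(lines):
    """Append an amounts-only line to the dated line just before it."""
    out = []
    for line in lines:
        prev = out[-1] if out else ""
        if (AMOUNTS_ONLY_RE.match(line)
                and DATE_RE.match(prev)
                and not ENDS_WITH_AMOUNTS_RE.search(prev)):
            # Previous line has a date but no amounts: join the two.
            out[-1] = prev.rstrip() + "  " + line.strip()
        else:
            out.append(line)
    return out
```

Running this before the main transaction parsing keeps the main parser simple, since it then only has to deal with complete transaction lines and memo lines.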

Like I said, it can get complicated.

Peter

From: KMyMoney <kmymoney-bounces at kde.org> On Behalf Of Jack
Sent: Thursday, December 31, 2020 3:14 PM
To: kmymoney at kde.org
Subject: Re: More pdf2kmymoney (overflos/wrapping lines)

I started this yesterday, and I know there have been additional posts since, but I think this particular point hasn't been resolved.

On 12/30/20 8:59 PM, pjfarley3 at earthlink.net wrote:

In my experience pdftotext does not “overflow lines”. That is probably “extra information” (i.e., “Memo” field data) related to the transaction on the previous line. That is quite common in bank statements. You have to expect such lines and be prepared to attach them to the prior transaction. I put them in the “Memo” field of my output.

Aaron would have to confirm, but I suspect he refers to a case where a single table row as shown in the PDF has two rows of text in each cell, because there is just too much text for one line. Because PDF knows only exactly where on the page any text is, but not why it is there (no information about things like tables), the text output would have two lines. The first would have the first line of text from each cell, and the second would have the second line of text from each cell. Putting them back together is theoretically possible, but only if there is some way to know that the second line is not a new row (missing header info?), or as part of a manually controlled cleanup phase of the conversion.
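
A minimal sketch of that reassembly, assuming fixed column offsets and treating an empty first (date) column as the sign that a line continues the previous row. The offsets in COLUMNS and both function names are hypothetical and would have to be worked out for each statement layout.

```python
# Hypothetical column boundaries (character offsets) in the text output.
COLUMNS = [(0, 11), (11, 38), (38, 52), (52, None)]

def split_cells(line):
    """Cut one text line into cells by fixed character offsets."""
    return [line[start:end].strip() for start, end in COLUMNS]

def merge_wrapped_rows(lines):
    """Rebuild table rows whose cell text wrapped onto a second line."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        cells = split_cells(line)
        if rows and not cells[0]:
            # No date in the first column: treat this as the second text
            # line of the previous row and merge cell by cell.
            rows[-1] = [
                " ".join(part for part in (old, new) if part)
                for old, new in zip(rows[-1], cells)
            ]
        else:
            rows.append(cells)
    return rows
```

The empty-date test is only a heuristic. If the statement can also produce memo-style lines with no date, the two cases cannot be told apart this way, which is where a manual cleanup step would come in.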
