<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">I started this yesterday, and I know
there have been additional posts since, but I think this
particular point hasn't been resolved.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 12/30/20 8:59 PM, <a
class="moz-txt-link-abbreviated"
href="mailto:pjfarley3@earthlink.net">pjfarley3@earthlink.net</a>
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:000901d6df18$9749e5a0$c5ddb0e0$@earthlink.net">
<div class="WordSection1">In my experience pdftotext does not
“overflow lines”. That is probably “extra information” (i.e.,
“Memo” field data) related to the transaction on the previous
line. That is quite common in bank statements. You have to
expect such lines and be prepared to attach them to the prior
transaction. I do it as the “Memo” field in my output. </div>
</blockquote>
Aaron would have to confirm, but I suspect he is referring to a case
where a single table row as shown in the PDF has two lines of text in
each cell, because there is just too much text for one line. PDF
knows only exactly where on the page each piece of text is, not why
it is there (typically no information about things like tables), so
the text output would have two lines. The first would have the first
line of text from each cell, and the second would have the second
line of text from each cell. Putting them back together is
theoretically possible, but only if there is some way to know that
the second line is not a new row (missing header info?), or as part
of a manually controlled cleanup phase of the conversion.
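<div>For what it's worth, here is a rough sketch of how that
reassembly might look. It rests on two assumptions that are mine and
not something from Aaron's data: every real row begins with a date
like 12/30, and the columns start at fixed character offsets in the
text output. The names ROW_START, COLUMN_STARTS, split_cells and
merge_wrapped_rows are made up for illustration.</div>
<pre>
```python
import re

# Assumption: every real table row starts with a date such as 12/30.
ROW_START = re.compile(r"^\d{2}/\d{2}\b")

# Assumption: character offsets where each column begins in the text.
COLUMN_STARTS = [0, 8, 30]


def split_cells(line, starts=COLUMN_STARTS):
    """Cut one text line into cells at fixed character offsets."""
    ends = starts[1:] + [None]
    return [line[a:b].strip() for a, b in zip(starts, ends)]


def merge_wrapped_rows(lines):
    """Join continuation lines onto the row above, cell by cell."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        cells = split_cells(line)
        if ROW_START.match(line) or not rows:
            # Looks like the start of a new table row.
            rows.append(cells)
        else:
            # No leading date: treat it as wrapped text for the prior row.
            previous = rows[-1]
            rows[-1] = [" ".join(filter(None, pair))
                        for pair in zip(previous, cells)]
    return rows


# Made-up sample laid out on the assumed column offsets.
layout = "{:8}{:22}{}"
sample = [
    layout.format("12/30", "ACME HARDWARE", "PURCHASE AT"),
    layout.format("", "NUMBER 42", "TERMINAL 7"),
    layout.format("12/31", "PAYROLL", "DIRECT DEPOSIT"),
]

for row in merge_wrapped_rows(sample):
    print(row)
```
</pre>
<div>If the rows have no reliable marker such as a leading date, this
heuristic has nothing to go on, and a manual cleanup pass is probably
the only option.</div>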
</body>
</html>