<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

    <title></title>

  </head>

  <body bgcolor="#ffffff" text="#000000">

    On 12/22/2011 09:44 AM, Panks wrote:

    <blockquote

cite="mid:CAKi+3nrhRXF6HnDwJex-oaqOv9o1eK-_ZUSttzuZgtk1TyohoQ@mail.gmail.com"

      type="cite">

      <meta http-equiv="Content-Type" content="text/html;

        charset=ISO-8859-1">

      <div>

        <blockquote>

          <div>

            <div>

              <div><br>

                <br>

              </div>

            </div>

            Great, and many thanks for sharing your progress. For
            poppler you may want to have a look at <a

              moz-do-not-send="true"

              href="http://people.freedesktop.org/%7Eaacid/docs/qt4/">http://people.freedesktop.org/~aacid/docs/qt4/</a>

            and for implementations using it <a moz-do-not-send="true"

href="http://mail.kde.org/pipermail/okular-devel/2011-May/009429.html">http://mail.kde.org/pipermail/okular-devel/2011-May/009429.html</a>

            ( <a moz-do-not-send="true"

              href="http://quickgit.kde.org/index.php?p=okular.git&a=summary">http://quickgit.kde.org/index.php?p=okular.git&a=summary</a>

            ).<br>

            <br>

            For the initial skeleton, meaning the very first code to
            start a PDF importer with, I could lend a hand to get it
            done. We could start by creating a branch in our git
            repository and adding a calligra/filters/words/pdfimport
            directory, then copy over the ASCII filter, rename it, adapt
            the CMakeLists.txt, link against libpoppler, and write the
            first lines of code that use libpoppler to extract content
            from a PDF and write it into an ODT. You can ping me on IRC
            or write a mail to get started on this :-)<br>

            <br>

          </div>

        </blockquote>

      </div>

      <br>

      <div>Hello Sebastian,

        <div><br>

        </div>

        <div>I made a few modifications to the code on my system. I
          created a new directory, pdfimport, inside <span>calligra/filters/words. I
            copied the import files, the CMake file and the .desktop file from the ascii
            directory and renamed them to pdfimport.</span></div>

        <div> This is my CMakeLists.txt: <a moz-do-not-send="true"

            href="http://paste.kde.org/176486/">http://paste.kde.org/176486/</a>

        </div>

        <div> and this is the word_pdf_import.desktop file: <a

            moz-do-not-send="true" href="http://paste.kde.org/176498/">http://paste.kde.org/176498/</a>

        </div>

        <div>I added the line </div>

        <div>&gt; add_subdirectory( pdfimport ) </div>
        <div>to the CMakeLists.txt in the <span>calligra/filters/words
            directory. </span><span>I then tried building the code
            without modifying pdfimport.cpp and pdfimport.h much (their
            code was the same as in asciiimport.cpp and asciiimport.h).
            The build was successful, but I didn't see any change in the
            filters after launching Calligra Words: the 'Open Document'
            window still didn't show PDF documents, nor was there a PDF
            entry in the filter drop-down list. So, which changes do I
            need to make, and in which files, to at least make PDF files
            visible in the 'Open Document' dialog and have it accept
            them?</span></div>

        <div> <br>

        </div>

      </div>

    </blockquote>

    <br>

    That all looks correct. Did you run "kbuildsycoca4" so that the new
    desktop file is properly picked up?<br>
    <br>
    Back then it was also necessary to define the proper library name in
    PdfImport.cpp, with something like:<br>

    <br>

    K_PLUGIN_FACTORY(PdfImportFactory,
    registerPlugin&lt;PdfImport&gt;();)<br>
    K_EXPORT_PLUGIN(PdfImportFactory("wordspdfimport",
    "calligrafilters"))<br>

    <br>

    Not sure if that is needed any longer but it certainly cannot harm.<br>

    <br>

    <blockquote

cite="mid:CAKi+3nrhRXF6HnDwJex-oaqOv9o1eK-_ZUSttzuZgtk1TyohoQ@mail.gmail.com"

      type="cite">

      <div>

        <div> </div>

        <div> And a second thing: I was going through the code of
          asciiimport.cpp. In that code, <span>the input file is
            passed to a QTextStream object and the appropriate codec is set
            on the object.</span></div>

        <div>

          <div>    QTextStream stream(&in);</div>

          <div>    stream.setCodec(codec);</div>

          <div><br>

          </div>

          <div>After that, the lines are read into a QString and
            appended to the document:</div>

        </div>

        <blockquote>

          <div>

            <div>QString line = stream.readLine();</div>

            <div>bodyWriter->addTextSpan(line);</div>

          </div>

        </blockquote>

        <div><br>

        </div>

        With poppler, however, I think there is no such straightforward way to
        get the text line by line. </div>

    </blockquote>

    <br>

    Correct. Text files are simple compared to PDF files. The latter can
    have formatting (bold, italic, underline, different font sizes,
    font colors, etc.) and even images. Our target would be to carry
    all of that over, but step by step. We can start with simple things
    like the pure text and some basic formatting and later go on to e.g.
    images.<br>

    <blockquote

cite="mid:CAKi+3nrhRXF6HnDwJex-oaqOv9o1eK-_ZUSttzuZgtk1TyohoQ@mail.gmail.com"

      type="cite">

      <div>

        <div> One method I could think of was to go through the PDF pages one
          by one and use the
          <div> QString text(const QRectF &rect, TextLayout)  </div>
          <div> function to get the text within a rectangle into a
            QString. But in that case, what value of rect should I pass
            to the function? And apart from this, what other methods can I
            use to fetch the text out of a PDF using poppler? Please give
            some suggestions.

            <div> <br>

            </div>

          </div>

        </div>

      </div>

    </blockquote>

    <br>

    It looks as if poppler Qt is not enough for us to do anything more than
    extracting the pure plain text :-(<br>
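    <br>
    On the rect question: as far as I remember the Poppler Qt4 API docs,
    passing a null rectangle, i.e. QRectF(), to Poppler::Page::text()
    returns all text on the page. Once the text of each page is available
    as a string, the line-by-line loop of the ASCII filter can be
    mimicked. Below is a minimal sketch in plain C++ with no poppler or Qt
    calls; forEachLine and its callback are hypothetical names made up
    for illustration, and the callback stands in for something like
    bodyWriter-&gt;addTextSpan(line):<br>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of poppler or Calligra: hand the plain text
// of each page to a callback one line at a time, the way the ASCII filter
// loops over QTextStream::readLine().
// Returns the number of lines passed to the callback.
std::size_t forEachLine(const std::vector<std::string> &pageTexts,
                        const std::function<void(const std::string &)> &addLine)
{
    assert(static_cast<bool>(addLine)); // a callable must be supplied
    std::size_t count = 0;
    for (const std::string &pageText : pageTexts) {
        std::istringstream stream(pageText);
        std::string line;
        while (std::getline(stream, line)) {
            // Drop the carriage return left over from "\r\n" line endings.
            if (!line.empty() && line.back() == '\r')
                line.pop_back();
            addLine(line);
            ++count;
        }
    }
    return count;
}
```

    In the filter, the page strings would come from the PDF library; in
    this sketch the caller simply passes them in as a vector.<br>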

    <br>

    What we would ideally like to have is something like
    <a class="moz-txt-link-freetext" href="http://cgit.freedesktop.org/poppler/poppler/tree/poppler/TextOutputDev.h">http://cgit.freedesktop.org/poppler/poppler/tree/poppler/TextOutputDev.h</a>
    : our own OutputDev that, unlike the ArthurOutputDev, does not
    render by drawing with a QPainter but by producing proper ODF.<br>

    <br>

    poppler ships with
    <a class="moz-txt-link-freetext" href="http://cgit.freedesktop.org/poppler/poppler/tree/utils/">http://cgit.freedesktop.org/poppler/poppler/tree/utils/</a>, which includes a
    nice showcase of how to output to an HTML file. I guess that's a good
    starting point. We could first investigate what would be needed to
    create our own OdtOutputDevice and then just create it :-)<br>
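    <br>
    To make the idea a bit more concrete, here is a rough sketch of what
    such a device could collect and emit. It is plain C++ and does not
    derive from poppler's OutputDev; TextRun, OdtOutputDeviceSketch,
    addRun, endParagraph and the style names are all made up for
    illustration, and the real OutputDev callbacks look different:<br>

```cpp
#include <cassert>
#include <string>
#include <vector>

// One run of text with uniform formatting. Hypothetical type; this is not
// what poppler's OutputDev callbacks actually deliver.
struct TextRun {
    std::string text;
    bool bold;
    bool italic;
};

// Escape the characters that are special in XML character data.
std::string escapeXml(const std::string &text)
{
    std::string out;
    for (char c : text) {
        switch (c) {
        case '&': out += "&amp;"; break;
        case '<': out += "&lt;"; break;
        case '>': out += "&gt;"; break;
        default:  out += c;
        }
    }
    return out;
}

// Hypothetical device: instead of painting, it records runs per paragraph
// and serialises them as text:p / text:span elements.
class OdtOutputDeviceSketch {
public:
    void addRun(const TextRun &run) { m_current.push_back(run); }

    void endParagraph()
    {
        m_paragraphs.push_back(m_current);
        m_current.clear();
    }

    std::string toXml() const
    {
        assert(m_current.empty() && "call endParagraph() before toXml()");
        std::string xml;
        for (const auto &paragraph : m_paragraphs) {
            xml += "<text:p>";
            for (const TextRun &run : paragraph) {
                const std::string style = styleName(run);
                if (style.empty())
                    xml += escapeXml(run.text);
                else
                    xml += "<text:span text:style-name=\"" + style + "\">"
                         + escapeXml(run.text) + "</text:span>";
            }
            xml += "</text:p>\n";
        }
        return xml;
    }

private:
    // Placeholder style names; a real ODT needs matching automatic styles.
    static std::string styleName(const TextRun &run)
    {
        if (run.bold && run.italic) return "BoldItalic";
        if (run.bold) return "Bold";
        if (run.italic) return "Italic";
        return "";
    }

    std::vector<TextRun> m_current;
    std::vector<std::vector<TextRun>> m_paragraphs;
};
```

    A real device would additionally have to write the matching
    automatic styles into the ODT, and later handle images.<br>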

    <br>

    May I suggest committing early and often? It would really rock
    if you could create a branch for our work and commit what you have so
    far (it doesn't need to compile or work) with something like:<br>

    <br>

    # create the branch from master<br>
    git checkout -b filter-words-pdfimport-panks master<br>
    # add your new filter<br>
    git add filters/words/pdfimport<br>
    # commit everything<br>
    git commit -a<br>
    # and push the branch upstream (assuming the remote is named "origin")<br>
    git push -u origin filter-words-pdfimport-panks<br>

    <br>

    Hope the above steps work. git is rather tricky sometimes, if not all
    the time :-/<br>

    <br>

  </body>

</html>