Could you help me in parsing of .DOC files

Jaroslaw Staniek staniek at kde.org
Thu Apr 14 12:21:42 BST 2011


2011/4/14 Yuriy Kardapolov <clotofdarkness at hotmail.com>:

> Project background:
>
> I need to read .doc files in ASP.NET. This is needed for our project
> (a converter).
> I have downloaded Microsoft's documentation on the MS Word file format,
> but it is very tangled and contains only descriptions of the
> various MS Word structures.
> I can read the compound file format (OLE2) and get any stream from it, such as
> "WordDocument", "Table1", "Table0", etc.
> I can get text from the "WordDocument" stream. As far as I know, it holds all
> the text of the whole document.
> I have also downloaded wvWare 2 but can't compile it.
> What I want to know is how to parse .DOC files and get text formatting such
> as font name, color, size, boldness, etc.
>

Hi
You have native tools for that, e.g.
http://msdn.microsoft.com/en-us/library/15s06t57%28v=vs.80%29.aspx

If you reuse converters written in C/C++, you'll end up with a hybrid
solution, not a pure .NET solution.
At the same time, I am afraid you won't get as much support from
calligra-devel as developers who intend to convert (or read the
structures of) documents using C/C++ solutions (which are more
compatible with Calligra filters, for example). Developers here
typically have no access to the development environments you use, or
they simply do not work with those tools on a daily basis.

> Could you advise me how to read text formatting? What structures should I
> read for that in my .NET project?

You may want to look at the document format specifications to
understand how the formatting is encoded; then you'll have a chance of
finding code that does that for you.
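
As a rough starting point, here is my reading of the first step in the
[MS-DOC] specification, not something tested against your files. The
"WordDocument" stream begins with the File Information Block (FIB). Its
header carries a magic number, a version field, and a flag that says
which table stream holds the formatting tables. In the specification
those streams are named "0Table" and "1Table". The character properties
(font, size, bold, color) are then located through structures stored in
that table stream. The sketch below is in Python only because it is
compact; the function name parse_fib_base is made up for illustration,
and the same byte offsets should port directly to C#.

```python
import struct

WORD_MAGIC = 0xA5EC       # wIdent: identifies a Word binary file
FLAG_ENCRYPTED = 0x0100   # fEncrypted bit in the FIB flags word
FLAG_WHICH_TABLE = 0x0200 # fWhichTblStm bit in the FIB flags word


def parse_fib_base(word_document: bytes) -> dict:
    """Read a few header fields from the start of the WordDocument stream.

    Hypothetical helper: it only looks at the first 12 bytes of the FIB.
    """
    if len(word_document) < 12:
        raise ValueError("stream too short to contain a FIB header")

    # wIdent at offset 0 and nFib at offset 2, both little-endian uint16.
    w_ident, n_fib = struct.unpack_from("<HH", word_document, 0)
    if w_ident != WORD_MAGIC:
        raise ValueError("not a Word binary document (bad wIdent)")

    # The 16-bit flags word sits at offset 0x0A.
    (flags,) = struct.unpack_from("<H", word_document, 0x0A)

    return {
        "nFib": n_fib,
        "encrypted": bool(flags & FLAG_ENCRYPTED),
        # Formatting tables live in whichever table stream this bit selects.
        "table_stream": "1Table" if flags & FLAG_WHICH_TABLE else "0Table",
    }
```

You would pass in the bytes of the "WordDocument" stream that your OLE2
reader already returns, then open the stream named in "table_stream" to
continue with the formatting structures described in the specification.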

-- 
regards / pozdrawiam, Jaroslaw Staniek
 http://www.linkedin.com/in/jstaniek
 Kexi & Calligra (kexi-project.org, identi.ca/kexi, calligra-suite.org)
 KDE Software Development Platform on MS Windows (windows.kde.org)


