New Latex parser: Texla

Davide Valsecchi valsecchi.davide94 at gmail.com
Fri Feb 19 10:24:51 UTC 2016


Dear all,

Today I want to show you a new project in WikiToLearn.

Until now, the LaTeX-to-markdown conversion was based on plasTeX, a parsing
library, with further processing by my MediaWikiRenderer to handle pages,
links and index creation.

plasTeX is a beautiful library, but its real purpose is converting LaTeX to
HTML. It makes many small changes to the source, so most of my work ends up
being patching and pre-parsing what the library doesn't understand.
Moreover, it doesn't understand many packages, which makes automatic
conversion of documents without my manual adjustments a mirage.

So I finally decided to write a LaTeX parser of my own. I know, you are
probably thinking that this is reinventing the wheel and that there are
plenty of useful libraries out there. But I did not want a complex parser
that understands LaTeX and converts it into hundreds of formats. I wanted a
simple parser, able to parse *every* command and environment, simply by
tokenizing the tex.

In two weeks of effort I built the foundations of TeXLa. Basically, it
splits the tex into a tree of *Blocks*. Its main feature is that it can
handle every LaTeX command, even ones that do not exist. That's because the
main parser doesn't try to interpret the content of an environment or
command: it extracts the encountered command with all its options and
searches for a *parser hook*, a function that is declared to be able to
parse that specific command.
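
To illustrate the hook lookup described above, here is a minimal sketch. It
is not TeXLa's actual code: the names PARSER_HOOKS, parser_hook,
default_hook and parse_textbf are hypothetical, and blocks are plain dicts
to keep the example short. The point is only that an undeclared command
falls through to a default hook instead of breaking the parser.

```python
import re

# Hypothetical registry: command name -> parser hook.
PARSER_HOOKS = {}


def parser_hook(name):
    """Register a function as the hook for one command name."""
    def register(func):
        PARSER_HOOKS[name] = func
        return func
    return register


def default_hook(name, tex):
    """Fallback for undeclared commands: keep the name, consume nothing."""
    return {"type": "unknown", "name": name}, tex


@parser_hook("textbf")
def parse_textbf(name, tex):
    # Assumes a flat (non-nested) {...} argument, just for this sketch.
    match = re.match(r"\{([^{}]*)\}", tex)
    if match is None:
        return {"type": "bold", "content": ""}, tex
    return {"type": "bold", "content": match.group(1)}, tex[match.end():]


COMMAND = re.compile(r"\\([A-Za-z]+)")


def parse(tex):
    """Split tex into a flat list of blocks, dispatching commands to hooks."""
    blocks = []
    while tex:
        match = COMMAND.search(tex)
        if match is None:
            blocks.append({"type": "text", "content": tex})
            break
        if match.start() > 0:
            blocks.append({"type": "text", "content": tex[:match.start()]})
        name = match.group(1)
        hook = PARSER_HOOKS.get(name, default_hook)
        # The hook gets everything after the command name and hands back
        # the new block plus the tex it did not consume.
        block, tex = hook(name, tex[match.end():])
        blocks.append(block)
    return blocks
```

Calling parse(r"a \textbf{b} \foo c") yields a text block, a bold block, and
an "unknown" block for \foo, followed by the remaining text.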

When I want to handle a specific environment, I implement a new Block type,
inherited from the base Block class, and a parse function that receives
from the parser the tex to analyze.
If I want to analyze the content of the environment, I simply call the
parser recursively, passing it the tex to analyze, and add the resulting
blocks as children of the parsed environment block.
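
The recursion can be pictured with the following sketch. The class and
function names (Block, TextBlock, EnvironmentBlock, find_end, parse) are
assumptions made for illustration and do not mirror TeXLa's real classes;
only \begin/\end pairs are recognized here, and environment names are
assumed to be plain words.

```python
import re


class Block:
    """Base node of the tree; every block keeps its children."""
    def __init__(self, name, content=""):
        self.name = name
        self.content = content
        self.children = []


class TextBlock(Block):
    def __init__(self, text):
        super().__init__("text", text)


class EnvironmentBlock(Block):
    """Hypothetical block for a generic \\begin{name}...\\end{name}."""
    @staticmethod
    def parse(parser, name, body):
        block = EnvironmentBlock(name)
        # Recursive call: the body is parsed like any other tex.
        block.children = parser(body)
        return block


BEGIN = re.compile(r"\\begin\{(\w+)\}")


def find_end(tex, name, start):
    """Return (body_end, after_end) for the \\end closing an open env."""
    token = re.compile(r"\\(begin|end)\{" + re.escape(name) + r"\}")
    depth = 1
    for match in token.finditer(tex, start):
        depth += 1 if match.group(1) == "begin" else -1
        if depth == 0:
            return match.start(), match.end()
    raise ValueError("unclosed environment: " + name)


def parse(tex):
    """Turn tex into a list of blocks, nesting environments as children."""
    blocks = []
    pos = 0
    while pos < len(tex):
        match = BEGIN.search(tex, pos)
        if match is None:
            blocks.append(TextBlock(tex[pos:]))
            break
        if match.start() > pos:
            blocks.append(TextBlock(tex[pos:match.start()]))
        name = match.group(1)
        body_end, pos = find_end(tex, name, match.end())
        body = tex[match.end():body_end]
        blocks.append(EnvironmentBlock.parse(parse, name, body))
    return blocks
```

For r"x\begin{a}y\begin{b}z\end{b}\end{a}w" this gives three top-level
blocks, with environment b stored as a child of environment a.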

What happens if I have a command with a lot of options and nested
brackets? The parser loop identifies the type of the command and calls the
declared parser hook with all the tex after the command (it doesn't know
what to parse!). The designated Block uses some helper functions I wrote
to extract the options from the brackets, respecting nesting. Then the tex
left to parse is returned to the parser together with the new block.
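
A helper of this kind might look like the sketch below. extract_group and
parse_footnote are hypothetical names, not TeXLa's real helpers; the sketch
assumes the group starts at the very beginning of the tex handed to the
hook and that backslash-escaped brackets should not count towards nesting.

```python
def extract_group(tex, open_char="{", close_char="}"):
    """Return (content, rest) for a balanced group at the start of tex.

    Returns (None, tex) if tex does not start with open_char.
    """
    if not tex.startswith(open_char):
        return None, tex
    depth = 0
    i = 0
    while i < len(tex):
        char = tex[i]
        if char == "\\":
            i += 2  # skip escaped characters such as \{ or \}
            continue
        if char == open_char:
            depth += 1
        elif char == close_char:
            depth -= 1
            if depth == 0:
                # Strip the outer brackets, hand back the leftover tex.
                return tex[1:i], tex[i + 1:]
        i += 1
    raise ValueError("unbalanced " + open_char)


def parse_footnote(tex):
    """Hypothetical hook: receives everything after \\footnote."""
    content, rest = extract_group(tex)
    block = {"type": "footnote", "content": content}
    # The unparsed remainder goes back to the main parser with the block.
    return block, rest
```

The same helper covers optional arguments by passing "[" and "]" as the
delimiters, e.g. extract_group("[a[b]]{c}", "[", "]").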

Adding a command usually takes only about 20 lines. Have a look at
https://github.com/WikiToLearn/texla/blob/master/Blocks/NoteBlocks.py.

What is still missing is automatic macro expansion and the renderer part.
Most of MediaWikiRenderer will be reused.

Thanks for your attention!

Bye
Davide