REs to parse HTML (was: Re: creating a content system)

Thu Aug 11 22:29:27 CEST 2005

On Thu, 2005-08-11 at 07:55 -0600, Aaron J. Seigo wrote:
> > Just as a quick tip: I solved these things with regexps.  HTML parsing
> > is much faster and more accurate that way.  In theory, using KDom may seem to
> > be the correct route.  In practice, KDom will need to incorporate
> > intelligence to parse broken HTML files, a thing that's way simpler to
> > do with regular expressions.
> 
> if you truly did manage to do a good job of this with regular expressions, 
> i'll be impressed. did you simply remove all tags and fulltext the remaining 
> items?

Not exactly.

>  if so, we're losing a TON of important information (like titles and 
> headers =).

Well, I used REs to extract the full text from the <body> area, removing
JavaScript but preserving XML comments, and to pull <title> for the Title
property, although I was still working on an alpha version.  REs would
also have helped me a lot if I had wanted to extract all of the Dublin
Core metadata attributes present in the HTML files that had them.
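
A minimal Python sketch of that kind of extraction follows.  It is
illustrative only, not the original code: the pattern and function names
are hypothetical, and the Dublin Core pattern assumes the name attribute
comes before content and that values are quoted.

```python
import re
from html import unescape

# Hypothetical patterns, written to tolerate missing closing tags.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.I | re.S)
BODY_RE = re.compile(r"<body[^>]*>(.*?)(?:</body\s*>|\Z)", re.I | re.S)
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?(?:</script\s*>|\Z)", re.I | re.S)
# Group 1 captures a comment (kept); any other tag matches without a group.
COMMENT_OR_TAG_RE = re.compile(r"(<!--.*?-->)|<[^>]*>", re.S)
# Assumes name="DC.xxx" appears before content="..." inside the <meta> tag.
DC_META_RE = re.compile(
    r"<meta\s+[^>]*name\s*=\s*[\"'](DC\.[^\"']+)[\"'][^>]*"
    r"content\s*=\s*[\"']([^\"']*)[\"']",
    re.I | re.S,
)


def collapse(text):
    """Decode entities and squeeze runs of whitespace into single spaces."""
    return " ".join(unescape(text).split())


def extract_title(html):
    """Return the <title> text, or None if there is no title element."""
    match = TITLE_RE.search(html)
    return collapse(match.group(1)) if match else None


def extract_fulltext(html):
    """Return body text with scripts and tags removed, comments preserved."""
    match = BODY_RE.search(html)
    body = match.group(1) if match else html  # fall back to the whole file
    body = SCRIPT_RE.sub(" ", body)
    body = COMMENT_OR_TAG_RE.sub(lambda m: m.group(1) or " ", body)
    return collapse(body)


def extract_dublin_core(html):
    """Return a dict of DC.* meta names to their content values."""
    return dict(DC_META_RE.findall(html))


sample = (
    "<html><head><title> My  Page </title>"
    "<meta name=\"DC.creator\" content=\"Jane Doe\"></head>"
    "<body><h1>Hello</h1><script>var x = '<b>';</script>"
    "<!-- note --><p>world &amp; more"
)
print(extract_title(sample))
print(extract_fulltext(sample))
print(extract_dublin_core(sample))
```

Note that the sample never closes <p>, <body> or <html>; the patterns
simply do not care, which is the point of the approach.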

But since what I was building is mostly an FTS/metadata indexing/search
engine, I guess what I did was right on target for my problem domain.

>  and i'm not sure what sort of broken HTML KDom would have a 
> problem with exactly. you're not doing layout, just looking for user visible 
> text and some basic markup hints like headers and whatnot.

That's something you can easily do with REs as well.  REs aren't widely
used in this case only because complex regexps tend to be kind of slow
in real-world usage.  Slow as in "don't slow my computer down with text
processing, it's only 1980" style.
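
As a rough illustration of pulling header hints out with REs, here is a
small Python sketch.  The names are hypothetical, and it assumes each
header has a matching closing tag of the same level; headers that are
never closed are silently skipped.

```python
import re

# Matches <h1>..</h1> through <h6>..</h6>; \1 forces the same level to close.
HEADING_RE = re.compile(r"<h([1-6])\b[^>]*>(.*?)</h\1\s*>", re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")


def extract_headings(html):
    """Return a list of (level, text) pairs for every closed header."""
    headings = []
    for level, inner in HEADING_RE.findall(html):
        # Drop inline markup inside the header and normalize whitespace.
        text = " ".join(TAG_RE.sub(" ", inner).split())
        headings.append((int(level), text))
    return headings


sample = "<h1>Intro</h1><p>x<H2 class='a'>Sub <em>part</em></H2>"
print(extract_headings(sample))
```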

Anyway, I first toyed with the idea of passing each file through
xmllint and then loading it into one of Python's XML parsers, but that
went sour as soon as I threw a few HTML files from my own information
collection at it.  So I tried the RE way, and it worked pretty well.
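
To show the kind of failure involved, here is a small Python sketch.
It uses xml.etree.ElementTree from the standard library purely as a
stand-in for a strict XML parser (not necessarily the parser that was
tried), and the sample markup and function names are made up.

```python
import re
import xml.etree.ElementTree as ET

# Typical tag soup: unclosed <br> and <p>, an HTML-only entity, no </html>.
BROKEN_HTML = "<html><body><p>First<br><p>Second &nbsp; paragraph</body>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def strip_tags(markup):
    """Crude RE fallback: drop anything that looks like a tag."""
    return " ".join(re.sub(r"<[^>]*>", " ", markup).split())


print(parses_as_xml("<p>well-formed</p>"))
print(parses_as_xml(BROKEN_HTML))
print(strip_tags(BROKEN_HTML))
```

The strict parser rejects the broken file outright, while the RE still
yields usable text (entities are left undecoded in this sketch).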

> 
> _______________________________________________
> Klink mailing list
> Klink at kde.org
> https://mail.kde.org/mailman/listinfo/klink
-- 
Manuel Amador                   <rudd-o at amautacorp.com>
http://www.amautacorp.com/            +593 (4) 220-7010