REs to parse HTML (was: Re: creating a content system)

Aaron J. Seigo aseigo at kde.org
Thu Aug 11 23:09:38 CEST 2005


On Thursday 11 August 2005 02:29, Manuel Amador wrote:
> On Thu, 11-08-2005 at 07:55 -0600, Aaron J. Seigo wrote:

> But since what I was building is mostly a FTS/metadata indexing/search
> engine, I guess what I did was absolutely right on target for my problem
> domain.

probably =)

> >  and i'm not sure what sort of broken HTML KDom would have a
> > problem with exactly. you're not doing layout, just looking for user
> > visible text and some basic markup hints like headers and whatnot.
>
> that's something you can easily do with REs as well.  REs aren't widely
> used in this case only because complex regexps tend to be kind of slow
> in real-world usage.  Slow as in "don't slow my computer down with text
> processing, we're only in 1980" style.

heh.. well, i've seen REs slow down even real world systems. but beyond that,
you can certainly use an RE to grab all headings, but it's also useful to know
what text follows each one. you can, of course, also do this with REs, but at
some point it just becomes easier to use a decent parser that spits out a DOM
for traversal.
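
to make that concrete, here is a rough sketch in Python (picked only because
Python comes up later in this thread). it is not KDom and not a real DOM: the
standard library's event-based html.parser stands in for "a decent parser",
and the names headings_by_regex, SectionCollector and sections are made up
for illustration. the RE hands back the headings alone, while the parser
version keeps each heading together with the text that follows it.

```python
import re
from html.parser import HTMLParser

# Matches <h1>..</h1> through <h6>..</h6>; fine for grabbing headings alone.
HEADING_RE = re.compile(r"<h([1-6])[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)


def headings_by_regex(html):
    """Return heading texts only; says nothing about the text after them."""
    return [re.sub(r"<[^>]+>", "", m.group(2)).strip()
            for m in HEADING_RE.finditer(html)]


class SectionCollector(HTMLParser):
    """Collect [heading, following text] pairs (hypothetical helper)."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.sections = []        # each entry is [heading_text, body_text]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.sections.append(["", ""])
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return                # text before the first heading is ignored
        index = 0 if self._in_heading else 1
        self.sections[-1][index] += data


def sections(html):
    """Return (heading, following text) tuples with whitespace normalized."""
    collector = SectionCollector()
    collector.feed(html)
    collector.close()             # flush any text still buffered at the end
    return [(h.strip(), " ".join(b.split())) for h, b in collector.sections]
```

for input like "<h1>Intro</h1><p>Hello <b>world</b><h2>Next</h2>more text",
the RE only knows about "Intro" and "Next", while sections() also reports
which text belongs under which heading, even though the <p> is never closed.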

> Anyways, I first toyed with the idea of passing the file through
> xmllint, then loading it in one of Python's XML parsers, but that went
> sour as soon as I threw a few HTML files from my own information
> collection. 

sour in which way? as in the parsers barfed on them? this would not be a
surprise because HTML is not XML, which is to say HTML documents are not
guaranteed to be well formed the way XML documents are.
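
a small sketch of what i'd expect to go wrong, assuming Python's standard
xml.etree.ElementTree as the XML parser (i don't know which parser was
actually used) and html.parser as a tolerant alternative. the sample markup
and the helper names parses_as_xml, TextCollector and text_chunks are
invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Ordinary HTML that is not well-formed XML: <br> and <p> are never closed.
SLOPPY_HTML = "<html><body><p>first<br>second<p>third</body></html>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Gather non-empty text chunks, ignoring how the tags nest."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def text_chunks(text):
    """Return the text chunks found by the tolerant HTML parser."""
    collector = TextCollector()
    collector.feed(text)
    collector.close()             # flush any text still buffered at the end
    return collector.chunks
```

the strict XML parser rejects SLOPPY_HTML outright, while the HTML parser
still yields the three pieces of text. note that this collector also picks up
the contents of <script> and <style> elements, so a real indexer would want
to skip those.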
 
-- 
Aaron J. Seigo
GPG Fingerprint: 8B8B 2209 0C6F 7C47 B1EA  EE75 D6B7 2EB1 A7F1 DB43

Full time KDE developer sponsored by Trolltech (http://www.trolltech.com)