REs to parse HTML (was: Re: creating a content system)
Aaron J. Seigo
aseigo at kde.org
Thu Aug 11 23:09:38 CEST 2005
On Thursday 11 August 2005 02:29, Manuel Amador wrote:
> On Thu, 11-08-2005 at 07:55 -0600, Aaron J. Seigo wrote:
> But since what I was building is mostly a FTS/metadata indexing/search
> engine, I guess what I did was absolutely right on target for my problem
> domain.
probably =)
> > and i'm not sure what sort of broken HTML KDom would have a
> > problem with exactly. you're not doing layout, just looking for user
> > visible text and some basic markup hints like headers and whatnot.
>
> that's something you can easily do with REs as well. REs aren't widely
> used in this case only because complex regexps tend to be kind of slow
> in real-world usage. Slow as in "don't slow my computer down with text
> processing, we're only in 1980" style.
heh.. well, i've seen REs impact even real world systems. but beyond that, you
can certainly use an RE to grab all headings, but it's also useful to know
what text follows each one. you can, of course, also do this with REs, but at
some point it just becomes easier to use a decent parser that spits out a DOM
for traversal.
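a rough sketch of that difference, assuming python's standard html.parser as a stand-in (it is an event-based parser, not a DOM and not KDom; the sample markup and every name below are made up for illustration). the RE finds the headings easily enough, while the parser makes it straightforward to keep the text that follows each one:

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: sloppy HTML with unclosed <p> tags.
SAMPLE = "<h1>Intro</h1><p>First para.<h2>Details</h2><p>More <b>text</b> here."

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

# RE approach: finds heading text, but says nothing about what follows it.
HEADING_RE = re.compile(r"<h([1-6])[^>]*>(.*?)</h\1>", re.I | re.S)


def headings_by_regex(html):
    """Return the text of every heading matched by the regex."""
    return [m.group(2) for m in HEADING_RE.finditer(html)]


class SectionCollector(HTMLParser):
    """Group user-visible text under the heading that precedes it."""

    def __init__(self):
        super().__init__()
        self.sections = []        # list of [heading_text, body_text]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.sections.append(["", ""])
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return                # this sketch ignores text before the first heading
        slot = 0 if self._in_heading else 1
        self.sections[-1][slot] += data


def sections(html):
    """Return (heading, following_text) pairs in document order."""
    parser = SectionCollector()
    parser.feed(html)
    parser.close()                # flush any text still buffered at end of input
    return [(h.strip(), b.strip()) for h, b in parser.sections]
```

calling sections(SAMPLE) gives each heading paired with the text after it, which is the part that gets awkward to express as a single RE.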
> Anyways, I first toyed with the idea of passing the file through
> xmllint, then loading it in one of Python's XML parsers, but that went
> sour as soon as I threw a few HTML files from my own information
> collection.
sour in which way? as in the parsers barfed on them? this would not be a
surprise because HTML is not XML, which is to say HTML documents are not
guaranteed to be well formed the way XML is.
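to illustrate, a minimal sketch assuming python's standard xml.etree.ElementTree and html.parser modules (the tag-soup string and the helper names are invented for this example, and this is not necessarily the parser Manuel tried). a strict XML parser rejects HTML with unclosed tags, while an HTML-aware parser still hands back the visible text:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical input: ordinary HTML, but not well-formed XML
# (<br> and <p> are never closed).
TAG_SOUP = "<html><body><p>one<br><p>two</body></html>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Collect user-visible text, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(text):
    """Return the text content of an HTML string, whitespace-normalized."""
    parser = TextCollector()
    parser.feed(text)
    parser.close()
    return " ".join(c.strip() for c in parser.chunks if c.strip())
```

here parses_as_xml(TAG_SOUP) comes back False, but visible_text(TAG_SOUP) still recovers the text.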
--
Aaron J. Seigo
GPG Fingerprint: 8B8B 2209 0C6F 7C47 B1EA EE75 D6B7 2EB1 A7F1 DB43
Full time KDE developer sponsored by Trolltech (http://www.trolltech.com)