[Kget] Changes to LinkImporter

František Žiačik ziacik at gmail.com
Sun Aug 24 15:21:40 CEST 2008


Hello everyone.

Recently I was going through KGet's LinkImporter sources. I've been using this feature to download files from RapidShare and found it not quite functional when importing links from web pages.

Because of this, I have made some changes to the URL regex and the surrounding code. As I don't have the permissions to commit to SVN, I'd like someone who does to review this and consider committing my code.

I have replaced

static QString REGULAR_EXPRESSION = "((http|https|ftp|ftps)+([\\:\\w\\d:#@%/;$()~_?\\+-=\\\\.&])*)";

with

static QString REGULAR_EXPRESSION = "(\\w+[:]//)?(((([\\w-]+[.]){1,}(ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|com|cr|cs|cu|cv|cx|cy|cz|de|dj|dk|dm|do|dz|ec|edu|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gd|ge|gf|gg|gh|gi|gl|gm|gn|gov|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|int|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|mg|mh|mil|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|net|nf|ng|ni|nl|no|np|nr|nt|nu|nz|om|org|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|sv|st|sy|sz|tc|td|tf|tg|th|tj|tk|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|um|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw|aero|biz|coop|info|museum|name|pro|travel))|([0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))([:][0-9]*)?([?/][\\w~#\\-;%?@&=/.+]*)?(?!\\w)";

This regex now matches 99% of URLs, including those that do not start with http:// (e.g. rapidshare.com/files/something.rar) and those with a query string. Ideally there should also be a look-behind at the beginning, (?<![\\w@]), to avoid matching the host part of e-mail addresses; unfortunately Qt's QRegExp does not support look-behind, so if anybody knows how to work around this, please make the correction.

Also, I have changed the slotReadFile function to the following. Previously there was a readLine(200) call, which broke long lines in two (some HTML files contain no newlines at all). In addition, indexIn was called only once per line, so when a line contained multiple links, all but the first were ignored.

void LinkImporter::slotReadFile(const QUrl &url)
{
    QRegExp rx(REGULAR_EXPRESSION);

    QFile file(url.path());
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        return;

    QTextStream in(&file);
    quint64 size = file.size();
    quint64 position = 0;

    while (!in.atEnd()) {
        QString line = in.readLine();

        int regexPos = 0;
        quint64 lastPosition = position;

        while ((regexPos = rx.indexIn(line, regexPos)) > -1) {
            QString link = rx.capturedTexts()[0];

            if (!link.contains("://"))
                link = QString("http://") + link;

            QUrl auxUrl(link);

            if (!link.isEmpty() && auxUrl.isValid() && m_transfers.indexOf(link) < 0 &&
                !auxUrl.scheme().isEmpty() && !auxUrl.host().isEmpty()) {
                m_transfers << link;
            }

            regexPos += rx.matchedLength();
            position = lastPosition + regexPos;

            emit progress(position * 100 / size);
        }

        // Advance from the start of the line, not from the last match,
        // otherwise the offset of already-scanned matches is counted twice.
        position = lastPosition + line.size();

        emit progress(position * 100 / size);
    }

    if (!m_url.isLocalFile()) {
        file.remove();
    }
}

Regards,
Frantisek Ziacik