kde forum data

Gregor Leban gleban at gmail.com
Fri Aug 24 19:14:50 UTC 2012


Hi,
thank you all for helping with this.
I tried to retrieve all the posts by repeatedly requesting an RSS feed,
each time with a different start offset.
I start with
https://forum.kde.org/search.php?keywords=&terms=all&author=&tags=&sv=0&sc=1&sf=all&sk=t&sd=a&feed_type=RSS2.0&feed_style=BASIC&st=0&submit=Search&countlimit=100&start=0

After processing this feed I increase start to 100, then 200, etc. In
theory this should let me retrieve the whole history from 2004 to now.
When I tried to download the posts yesterday, I noticed that phpBB has a
limit on the start parameter: I can only go up to start = 20,000. After
that I get an error page. Since this is an issue only for the offline mode
(when we need to import the past data), I worked around it by passing the
parameter t to search.php. With this parameter I can get an RSS feed of all
posts in a particular topic. Topic ids currently go from 0 to 100,000, so I
just needed to make about 100,000 URL calls :) It's done now: I've
downloaded the whole history and I won't need to do it again. Since I
searched by topic id, I did get the whole history (even the posts with id
< 90,000).
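
For anyone who wants to reproduce the two approaches described above, here is a
minimal sketch that only builds the URLs (it makes no network requests). The
offset URL reproduces the one quoted in this message. The topic URL is an
assumption: the message only says that the parameter t is passed to
search.php, so the exact combination of the other parameters is a guess. The
function names and the constants MAX_START and LAST_TOPIC are hypothetical
and only mirror the numbers mentioned above.

```python
from urllib.parse import urlencode

BASE = "https://forum.kde.org/search.php"
PAGE_SIZE = 100        # countlimit used in the feed URL above
MAX_START = 20000      # largest start offset reported to work
LAST_TOPIC = 100000    # approximate highest topic id mentioned above

# Query parameters in the same order as the feed URL quoted in this message.
FEED_PARAMS = [
    ("keywords", ""), ("terms", "all"), ("author", ""), ("tags", ""),
    ("sv", "0"), ("sc", "1"), ("sf", "all"), ("sk", "t"), ("sd", "a"),
    ("feed_type", "RSS2.0"), ("feed_style", "BASIC"), ("st", "0"),
    ("submit", "Search"), ("countlimit", str(PAGE_SIZE)),
]


def offset_url(start):
    """Feed URL for one page of results beginning at the given offset."""
    return BASE + "?" + urlencode(FEED_PARAMS + [("start", str(start))])


def offset_urls(max_start=MAX_START):
    """Yield page URLs for start = 0, 100, 200, ... up to max_start."""
    for start in range(0, max_start + 1, PAGE_SIZE):
        yield offset_url(start)


def topic_url(topic_id):
    """Feed URL restricted to one topic (assumed: t added to the same query)."""
    return BASE + "?" + urlencode(FEED_PARAMS + [("t", str(topic_id))])


def topic_urls(first=0, last=LAST_TOPIC):
    """Yield one feed URL per topic id, covering ids that may not exist."""
    for topic_id in range(first, last + 1):
        yield topic_url(topic_id)
```

Offset paging stops at the start limit, so it cannot reach the full history on
its own; iterating over topic ids avoids that limit at the cost of one request
per id, including ids that may not correspond to an existing topic.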

Thanks again for looking into this.
Best,
gregor

On Fri, Aug 24, 2012 at 4:11 PM, Stuart Jarvis <jarvis at kde.org> wrote:

> On Friday 24 Aug 2012 22:41:05 Ben Cooksley wrote:
> > On Fri, Aug 24, 2012 at 2:51 AM, Stuart Jarvis <jarvis at kde.org> wrote:
> > > Hi everyone,
> >
> > Hi Stuart,
> >
> > > I guess kde-www is the right list to ask this. Please see the query
> > > below from one of our partners in the ALERT project*
> > >
> > > Any ideas why the RSS would be limited to post 89447?
> >
> > Can't think of any particular reason off the top of my head - there is
> > certainly no deliberate constraint on getting RSS feeds of older
> > material.
> > However I can't say it was specifically designed to return older content.
>
> Thanks for getting back to me. It is an unusual use case. The idea is to
> provide a non-invasive way for the ALERT system to collect the archives of
> a project with the added benefit that the same parser can be used for live
> updates.
>
> > It is likely they are striking a limit on the number of search results
> > returned (as the RSS feed is powered by our Sphinx search backend).
> > Could we have some details on how they are conducting the RSS feed
> > retrieval so I can debug why this is happening?
>
> Some more details from Gregor Leban (copied in):
>
> ---
> Yes, it is strange.
> Here, for example, is the feed that I have, in ascending time order, for
> the whole history:
>
> https://forum.kde.org/search.php?keywords=&terms=all&author=&tags=&sv=0&sc=1&sf=all&sk=t&sd=a&st=0&feed_type=RSS2.0&feed_style=HTML&countlimit=100&submit=Search
>
>
> As you can see, the oldest post is from Fri, 21 May 2004 03:02:50 GMT and
> has this URL:
> https://forum.kde.org/viewtopic.php?f=119&t=66734&p=89443#p89443
>
> and the p argument in the URL is the post id. Do you know when KDE started
> using forums - in 2004 or earlier?
> ---
>
> So in this case, the number of results is already limited to 100. Could
> this be an issue with changes to forum software in the past? (My memory on
> the forum history is a bit hazy.)
>
> Cheers,
> Stu
>


More information about the kde-www mailing list