DVCS crashes and KJob (bug 172309)

Sun Oct 12 21:40:30 UTC 2008

On 13.10.08 00:09:01, Evgeniy Ivanov wrote:
> Andreas Pakulat wrote:
> > On 11.10.08 17:15:46, Evgeniy Ivanov wrote:
> >> Hi all,
> >> I've found the following thing in KJob and I think it is causing most
> >> part of our DVCS crashes:
> >> http://api.kde.org/4.x-api/kdelibs-apidocs/kdecore/html/kjob_8cpp-source.html#l00186
> >> Look, start() is implemented in DVCSjob and it can emit result() (or
> >> connect KProcess to slots emitting it). But it can emit result() before
> >> loop will be execed and there will be no signals to stop it (deadlock or
> >> crash depending on deferred deletion).
> >> I have committed DVCS tests (vcs/dvcs/tests) which prove something
> >> really wrong with deferred deletion, see dvcsjobTest.cpp:
> >> First I have a deadlock in first job->exec(). I debugged and saw DVCSjob
> >> emitted result() before loop is execed. QProcess::waitForStarted()
> >> before starting a process fixed it.
> > 
> > Interesting that this doesn't seem to be a problem with the svn plugin. It
> > also has code that calls emitResult inside start() and AFAIK its also uses
> > with KJob::exec(). However it could be that svn is just more relaxed in the
> > case of wrong inputs, for example svn info just checks wether the given
> > location is valid (i.e. KUrl::isValid) and if its not it might still fail
> > further down the road...
> 
> QProcess::error() signal is delivered very fast (faster, then finished)
> The problem is with error() signal or when I set failed without running.

But that sounds like a problem in your code. There's no speed difference in
delivering signals to slots. A signal is always a direct function call
(unless you do signal slots across thread boundaries or force delayed
signal delivering via queued connections). So whats the problem? You start
a process and at some point the process errors, then a slot in your job is
called, which in turn should call emitResult after setting a failed
message. I can't see how that would pose a problem.

> >> Then I added another exec for the same job. Depending on
> >> "job->setAutoDelete(false)" from the dvcsjobTest.cpp, tests got crash or
> >> deadlock.
> >> I think the reason is really in KJob::exec(). IMHO we should connect
> >> result() to KJob::takeResult(){resultIsGot = true;} and do loop.exec()
> >> only if resultIsGot == false. #qt confirms it.
> > 
> > I think you're right. In theory the eventloop itself could actually check
> > that but that would need to wait until Qt4.5 is out. However I don't think
> > we need a new slot, simply put the flag-setting into emitResult, that one
> > is always called anyway. So the attached patch should work, can you please
> > test that?
> 
> Did. It looks like your patch (see below, what I added) solves the
> problem with tests.
> 
> +    if( !d->isFinished ) {
> +        loop.exec();
>          d->isFinished = false;    // the Job can be used for
> multiple 		//processes (not the same time), so we should restore initial
> state.
> +    }

A job cannot be re-used. Its a fire-and-forget/fetch result thing. You can
only start a job once, let it run to its end or error out. After that you
can fetch data (sometimes even before that) and then throw it away.

> But it's a heisenbug: we removed the problem in tests, but it's still in
> the code.

Huh? I thought the problem was that kjob doesn't quit the event loop, when
start() calls emitResult? So the patch should fix both the test and the
real code. If it doesn't, then the test is useless (for the described
problem, it still shows a potential deadlock in KJob).

> I spoke with Thiago in irc, he doesn't think such problem with KJob can
> occur, but tests confirm we're true.

Well, he's wrong. The code is pretty clear, the job runs start(), which
might emit a signal and then it runs the event loop. This can potentially
lead to a deadlock of the application because that event loop is never
ended due to the finished signal being delivered back to KJob before it
enters its loop. The signals are not queued-connected.

Anyway, before we post to kde-core-devel lets make sure we understand the
problem properly.

> It will be cool if someone try to run a test from vcs/dvcs/tests to
> confirm it fails without KJob's patch.
> I tried helgrind... But my hardware didn't like it...

Hellgrind is a multi-thread debugging tool, as far as I understood dvcs
doesn't use multiple threads. If it does thats pretty strange as QProcess
is already asynchronous and hence there's no need for threads.

> I don't have any idea what's wrong now and can't reproduce it in tests.
> Will continue next weekend. If you have any tips, what can be done you're
> welcome to share :)

I don't know the code. Maybe I can convince myself that this bug is more
important than playing games (I'm currently a bit addicted to a new one I
got) tomorrow and look deeper into this.

> >> But I don't understand how deleteLater kills an application. gdb shows
> >> we stay in loop.exec() from KJob, so we do not return to the main loop
> >> to activate deletion, right?
> > 
> > If you want to know the answer to that I suggest to look through Qt's event
> > loop code. I'm not sure wether the delete-event is scheduled only for the
> > main event loop, it might be that KJob's event loop also executed all
> > deferred-delete's.
> 
> I guess I will *have* to do that =(
> But docs say deferred deletion should be executed in the loop, where we
> used deleteLater() and I believe trolls have tests for it.

For our case that is correct. The deleteLater calls
QCoreApplication::postEvent which adds the event to the current
eventDispatcher. Which in this case should be the QApplication event
dispatcher.

Alltogether I still don't understand what exactly is happening. I'll try to
have a look at the tests tomorrow.

Andreas

-- 
Try to get all of your posthumous medals in advance.