scheduler issue

Wolfgang Reissenberger sterne-jaeger at t-online.de
Tue Dec 8 07:52:40 GMT 2020


I’m fully with you: up to a certain extent, each module should manage its own failures and at least try to restart its last procedure a certain number of times.

On the other hand, we need an orchestration layer that coordinates actions, as we currently have inside Capture and Scheduler. They are far from perfect, since the states are spread across the code. But the general logic, that a failure in a module is handed back to the orchestration layer, is OK.

What I would try to avoid is one module actively controlling others. What we have (and that’s good from my perspective) is that modules listen to state changes from others and draw their own conclusions.
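
To illustrate the listening pattern described above, here is a minimal sketch in plain C++ (no Qt). All names in it (GuidePublisher, CaptureListener, the GuideState values) are hypothetical and only loosely modelled on the states mentioned in this thread; it is not the Ekos code. The publisher changes its state, then notifies, and knows nothing about what its listeners do with the information.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical guide states, loosely modelled on those named in this thread.
enum class GuideState { Idle, Calibrating, CalibrationError, Guiding };

// Hypothetical publisher: announces state changes, never controls listeners.
class GuidePublisher
{
public:
    using Listener = std::function<void(GuideState)>;

    void subscribe(Listener listener)
    {
        assert(listener != nullptr); // a listener must be callable
        listeners.push_back(std::move(listener));
    }

    void setState(GuideState newState)
    {
        if (newState == state)
            return;                          // no change, nothing to announce
        state = newState;                    // change the state first ...
        for (const auto &notify : listeners) // ... then tell everyone
            notify(state);
    }

private:
    GuideState state = GuideState::Idle;
    std::vector<Listener> listeners;
};

// Hypothetical listener: draws its own conclusion from a guide state change.
struct CaptureListener
{
    bool suspended = false;

    void onGuideState(GuideState s)
    {
        suspended = (s == GuideState::CalibrationError);
    }
};
```

The point of the sketch is the direction of the dependency: Capture decides to suspend itself, Guide never tells it to.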

So in general, I think our architecture is appropriate, but I would really appreciate it if we cleaned up the orchestration layer, i.e. Capture and Scheduler, and introduced dedicated state machines there.
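
As a rough idea of what a dedicated state machine could look like, here is a minimal sketch in plain C++. The states, events and actions are hypothetical names chosen for illustration, not the existing Ekos enums, and the retry limit is an assumed parameter. The sketch keeps all transitions in one place and makes every event produce an explicit action, so a calibration error after a meridian flip leads either to a retry or to an abort, never to silent idling.

```cpp
#include <cassert>

// Hypothetical states, events and actions (illustrative, not the Ekos enums).
enum class State { Capturing, FlipGuiding, Aborted };
enum class Event { FlipDone, GuidingStarted, CalibrationError };
enum class Action { None, StartGuiding, ResumeCapture, AbortJob };

class FlipOrchestrator
{
public:
    // limit: assumed maximum number of guiding restarts before aborting.
    explicit FlipOrchestrator(int limit) : maxRetries(limit)
    {
        assert(limit >= 0);
    }

    // Single entry point: each event yields exactly one explicit action.
    Action handle(Event e)
    {
        switch (state)
        {
            case State::Capturing:
                if (e == Event::FlipDone)
                {
                    state = State::FlipGuiding;
                    retries = 0;
                    return Action::StartGuiding;
                }
                return Action::None;

            case State::FlipGuiding:
                if (e == Event::GuidingStarted)
                {
                    state = State::Capturing;
                    return Action::ResumeCapture;
                }
                if (e == Event::CalibrationError)
                {
                    if (retries < maxRetries)
                    {
                        ++retries;
                        return Action::StartGuiding; // retry instead of idling
                    }
                    state = State::Aborted;
                    return Action::AbortJob;
                }
                return Action::None;

            case State::Aborted:
                return Action::None;
        }
        return Action::None;
    }

    State current() const { return state; }

private:
    State state = State::Capturing;
    int retries = 0;
    const int maxRetries;
};
```

With a limit of 2, two calibration errors each return StartGuiding and the third returns AbortJob, which the orchestration layer can then hand on to the Scheduler.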

Wolfgang

> On 08.12.2020 at 08:28, Eric Dejouhanet <eric.dejouhanet at gmail.com> wrote:
> 
> Thanks Wolfgang, 
> 
> I agree with your analysis.
> 
> I added a test to verify that guiding would always be restarted if it was required (and disabled) at the time Focus was starting its procedure. But that makes Focus responsible for restarting Guide. So we are saying that it should be the same for Capture, with the risk that requests get mixed up. Doesn't that mean that Guide should integrate that task? Should we distinguish between a failure, managed by Scheduler, and a procedure follow-up, managed by the module itself?
> eric.dejouhanet at gmail.com - https://astronomy.dejouha.net
> From: sterne-jaeger at t-online.de
> Sent: 8 December 2020 07:51
> To: hy at murveit.com
> Cc: eric.dejouhanet at gmail.com; mutlaqja at ikarustech.com
> Subject: Re: scheduler issue
> 
> Dear all,
> I’m not 100% sure, but to me it looks like the problem is located in the Capture module, which (only in the case of a meridian flip, MF) is responsible for restarting the guiding procedure.
> 
> As we can see from the logs, the Capture module receives a GUIDE_CALIBRATION_ERROR event. For this event it calls processGuidingFailed(). Here is the critical part:
> 
>     else if (meridianFlipStage == MF_GUIDING)
>     {
>         if (++retries >= 3)
>         {
>             appendLogText(i18n("Post meridian flip calibration error. Aborting..."));
>             abort();
>         }
>     }
>     autoGuideReady = false;
> 
> 
> The condition meridianFlipStage == MF_GUIDING is true, but this is the first failure, so the retry counter has not yet reached 3 and Capture does not abort.
> 
> My explanation (untested) is that Capture is idling. The Scheduler also does nothing, since it only reacts to guiding problems when it starts a job.
> 
> As a result, simply nothing happens - as we could see from the logs.
> 
> If somebody wants to fix it in the next two weeks, I could assist and advise. As a first step, I would warmly recommend creating a new test case that covers calibration error handling after a meridian flip.
> 
> All the best
> Wolfgang
> 
>> Wolfgang Reissenberger
> 
> www.sterne-jaeger.de
> TSA-120 + FSQ-85 | Avalon Linear + M-zero | Moravian G2-8300 + ASI 1600mm pro
> 
>> On 08.12.2020 at 07:08, Hy Murveit <murveit at gmail.com> wrote:
>> 
>> Just to be clear, at this point I'm not taking ownership of this, but really just acting as a bug reporter.
>> If one of you wants to jump in and fix, or put it on a future TODO list, that would be great from my perspective.
>> On the other hand, if someone thinks I should fix it, then I'd need coaching. 
>> I'd prefer the former, but would do the latter if necessary.
>> 
>> Hy
>> 
>> On Mon, Dec 7, 2020 at 9:47 PM Eric Dejouhanet <eric.dejouhanet at gmail.com> wrote:
>> Hello Hy, 
>> 
>> From the code you pasted, the state change is done before the state is emitted, so that seems proper. I think Jasem nailed it: state management is missing either in Capture, to report an abort when guiding recovery fails, or in Scheduler, to manage the guiding failure. The fact that it follows the meridian flip means, in my opinion, that we are progressively approaching the root cause of the remaining stability issues, and that we really need to isolate that flip process as much as we can.
>> eric.dejouhanet at gmail.com - https://astronomy.dejouha.net
>> From: murveit at gmail.com
>> Sent: 7 December 2020 22:56
>> To: eric.dejouhanet at gmail.com
>> Reply-To: hy at murveit.com
>> Cc: hy at murveit.com; mutlaqja at ikarustech.com; sterne-jaeger at t-online.de
>> Subject: Re: scheduler issue
>> 
>> Eric et al,
>> 
>> Not clear on how this works, but the code is copied below.
>> Note the emit newStatus().
>> What is the state change you are referring to?
>> 
>> FWIW, I've included the code snippet below (pretty complicated code to follow--there's a state machine in guide.cpp -- I guess that's either controlling the internal guider or PHD2), and there's a calibration state machine in internalguider.cpp. From what I can tell, I don't think it's the issue you described, but I'm not too familiar with these state machines, nor the scheduler.
>> 
>> Hy
>> 
>> 
>> From internalguider.cpp, line 505
>> void InternalGuider::processCalibration()
>> {
>>     pmath->performProcessing();
>> 
>>     if (pmath->isStarLost())
>>     {
>>         emit newLog(i18n("Lost track of the guide star. Try increasing the square size or reducing pulse duration."));
>>         reset();
>> 
>>         calibrationStage = CAL_ERROR;
>>         emit newStatus(Ekos::GUIDE_CALIBRATION_ERROR);
>>         emit calibrationUpdate(GuideInterface::CALIBRATION_MESSAGE_ONLY, i18n("Calibration Failed: Lost guide star."));
>>         return;
>>     }
>>     ...
>> 
>> From internalguider.cpp, line 542
>> void InternalGuider::reset()
>> {
>>     state = GUIDE_IDLE;
>>     //calibrationStage = CAL_IDLE;
>>     connect(guideFrame, SIGNAL(trackingStarSelected(int, int)), this, SLOT(trackingStarSelected(int, int)),
>>             Qt::UniqueConnection);
>> }
>> 
>> guide.cpp:2381
>>         connect(guider, &Ekos::GuideInterface::newStatus, this, &Ekos::Guide::setStatus);
>> 
>> guide.cpp:1935
>> void Guide::setStatus(Ekos::GuideState newState)
>> {
>>     if (newState == state)
>>     {
>>         // pass through the aborted state
>>         if (newState == GUIDE_ABORTED)
>>             emit newStatus(state);
>>         return;
>>     }
>> 
>>     GuideState previousState = state;
>> 
>>     state = newState;
>>     emit newStatus(state);
>> 
>>     switch (state)
>>     {
>>         ...        
>> 
>>         case GUIDE_IDLE:
>>         case GUIDE_CALIBRATION_ERROR:
>>             setBusy(false);
>>             manualDitherB->setEnabled(false);
>>             break;
>> 
>> 
>> On Mon, Dec 7, 2020 at 12:31 PM Eric Dejouhanet <eric.dejouhanet at gmail.com> wrote:
>> Could it be that the emission of the guide failure is done before Guide's state is changed? That was the case for Focus, and Scheduler's immediate reaction for a new autofocus was thus rejected.
>> 
>> Unfortunately I haven't had time to check the log yet. 
>> eric.dejouhanet at gmail.com - https://astronomy.dejouha.net
>> From: murveit at gmail.com
>> Sent: 7 December 2020 21:04
>> To: mutlaqja at ikarustech.com
>> Reply-To: hy at murveit.com
>> Cc: hy at murveit.com; sterne-jaeger at t-online.de; eric.dejouhanet at gmail.com
>> Subject: Re: scheduler issue
>> 
>> It was indeed after a meridian flip.
>> Jo also sent the .analyze file to me (attached) and here's a zoomed-in screenshot from the time of the issue.
>> 
>> 
>> 
>> Hy
>> 
>> On Mon, Dec 7, 2020 at 11:43 AM Jasem Mutlaq <mutlaqja at ikarustech.com> wrote:
>> Hello Hy,
>> 
>> Do you know why it was calibrating? This wasn't after a meridian flip, correct? What's happening is that the scheduler handles calibration failures only IF they occur within its startup steps, i.e.
>> 
>> Track --> Focus --> Align --> Guide --> Capture. If calibration fails at "Guide", then the scheduler handles that. Right now, after capturing has started, the scheduler just LOGS the guide calibration results but does not handle them. The Capture module should have been aborted if calibration fails, and then that would have been handled by the scheduler... but again, what would cause calibration in the middle of capture? A meridian flip?
>> 
>> --
>> Best Regards,
>> Jasem Mutlaq
>> 
>> 
>> 
>> On Sun, Dec 6, 2020 at 11:30 PM Hy Murveit <murveit at gmail.com> wrote:
>> Eric, Jasem,
>> 
>> Reporting a possible scheduler bug.
>> 
>> Jo (@ElCorazon) sent me a log
>> https://www.dropbox.com/s/n8icvn90fhunjfl/log_20-53-07.txt.gz?dl=0
>> which I analyzed and my conclusion is that star detection caused guider calibration to fail at 01:15:54 (see below)
>> 
>> [2020-12-05T01:15:54.614 CST INFO ][     org.kde.kstars.ekos.guide] - "Lost track of the guide star. Try increasing the square size or reducing pulse duration."
>> [2020-12-05T01:15:54.617 CST DEBG ][   org.kde.kstars.ekos.capture] - Guiding state changed from "Calibrating" to "Calibration error"
>> 
>> and (ignoring that issue) the scheduler recognized, I suppose, that guider failed
>> 
>> [2020-12-05T01:15:54.624 CST DEBG ][ org.kde.kstars.ekos.scheduler] - Guide State "Calibration error"
>> 
>> but the scheduler didn't seem to restart the guiding calibration. Basically nothing happens until 1:57:56 when I assume Jo restarted things.
>> 
>> [2020-12-05T01:57:56.852 CST INFO ][ org.kde.kstars.ekos.scheduler] - Scheduler is stopping...
>> 
>> I assume the scheduler should try and restart the guider, but there are no .guide nor .scheduler messages between 1:15:54 and 1:57:56
>> 
>> Hy
> 
> 
