heartwood · every commit a ring

Stop wedge-reset from failing healthy queued backlog

70cd1641 by Isaac Bythewood · 24 days ago

Stop wedge-reset from failing healthy queued backlog

Manual recrawl/rerun-lighthouse fan-outs across multiple tabs queued more
work than the 2-worker slow pool could drain in one cycle. The wedge
reset then saw queued rows with NULL started_at and immediately marked
them "Crawl timed out or was interrupted" — even though they were just
waiting their turn. Now only running rows past the cutoff are reset, and
manual re-triggers also clear any prior error so the UI doesn't show a
stale failure while the new run is pending.
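Condensed from the scheduler diff below (model and field names are taken from that diff; the import path is assumed from the app layout), the corrected reset touches only rows that actually started and then overran their deadline:

    from django.utils import timezone

    from properties.models import Property  # import path assumed

    def reset_wedged_states():
        now = timezone.now()
        crawl_cutoff = now - timezone.timedelta(seconds=900)

        # Only "running" rows can be wedged. "queued" rows still have a NULL
        # crawl_started_at because no slow-pool worker has picked them up yet,
        # so they are deliberately left alone.
        Property.objects.filter(
            crawl_state="running",
            crawl_started_at__lt=crawl_cutoff,
        ).update(
            crawl_state="idle",
            last_crawl_error="Crawl timed out or was interrupted",
        )

The Lighthouse reset follows the same pattern with a 300-second cutoff.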
modified CLAUDE.md
@@ -73,4 +73,4 @@ Split settings: `status/settings/__init__.py` (shared), `development.py`, `produ
 
 ### Production
 
-Docker Compose runs three services: `web` (Gunicorn+Uvicorn), `worker` (scheduler), `email` (Exim relay). Deploys via `git push server master` triggering a post-receive hook.
+Docker Compose runs a single `web` service. `entrypoint.py` spawns Gunicorn (Uvicorn workers) and the `scheduler` management command side-by-side in that container — if either process exits, the container stops and Docker restarts it. Deploys via `git push server master` triggering a post-receive hook.
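The `entrypoint.py` referenced above is not part of this diff. A minimal sketch of the supervision pattern it describes, with hypothetical command lines, ASGI path, and port, could look like this:

    import subprocess
    import sys
    import time

    # Hypothetical commands; the real entrypoint.py is not shown in this commit.
    web = subprocess.Popen([
        "gunicorn", "status.asgi:application",
        "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000",
    ])
    scheduler = subprocess.Popen(["python", "manage.py", "scheduler"])

    procs = {web, scheduler}
    while True:
        # If either process dies, stop the sibling and exit so the container
        # stops and Docker's restart policy brings both back up together.
        for proc in list(procs):
            if proc.poll() is not None:
                for other in procs - {proc}:
                    other.terminate()
                    other.wait()
                sys.exit(proc.returncode or 1)
        time.sleep(1)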
modified properties/management/commands/scheduler.py
@@ -72,30 +72,31 @@ class Command(BaseCommand):
         self._stop.set()
 
     def reset_wedged_states(self):
-        """Flip stale running/queued states back to idle.
+        """Flip running rows back to idle once they've overrun their deadline.
 
-        Runs every cycle to catch threads that overran their deadline.
-        The startup path in handle() also wipes state unconditionally to
-        cover crashes.
+        Only "running" rows count as wedged. "queued" rows are waiting their
+        turn in the thread pool and will be picked up when a worker frees up;
+        flipping them here would mark healthy backlog as failed whenever the
+        user fans out manual re-triggers.
+
+        The startup path in handle() also wipes any leftover queued/running
+        state unconditionally to cover crashes.
         """
         now = timezone.now()
         crawl_cutoff = now - timezone.timedelta(seconds=900)
         lh_cutoff = now - timezone.timedelta(seconds=300)
         Property.objects.filter(
-            crawl_state__in=["queued", "running"],
-        ).filter(
-            Q(crawl_started_at__isnull=True) | Q(crawl_started_at__lt=crawl_cutoff)
+            crawl_state="running",
+            crawl_started_at__lt=crawl_cutoff,
         ).update(
             crawl_state="idle",
             last_crawl_error="Crawl timed out or was interrupted",
         )
         Property.objects.filter(
-            lighthouse_state__in=["queued", "running"],
-        ).filter(
-            Q(lighthouse_started_at__isnull=True)
-            | Q(lighthouse_started_at__lt=lh_cutoff)
+            lighthouse_state="running",
+            lighthouse_started_at__lt=lh_cutoff,
         ).update(
             lighthouse_state="idle",
             last_lighthouse_error="Lighthouse run timed out or was interrupted",
modified properties/views.py
@@ -260,7 +260,8 @@ def property_recrawl(request, property_id):
         )
 
     property_obj.next_run_at_crawler = timezone.now()
-    property_obj.save(update_fields=["next_run_at_crawler"])
+    property_obj.last_crawl_error = None
+    property_obj.save(update_fields=["next_run_at_crawler", "last_crawl_error"])
 
     return JsonResponse({"ok": True, **_serialize_status(property_obj)})
@@ -286,7 +287,10 @@ def property_rerun_lighthouse(request, property_id):
         )
 
    property_obj.next_lighthouse_run_at = timezone.now()
-    property_obj.save(update_fields=["next_lighthouse_run_at"])
+    property_obj.last_lighthouse_error = None
+    property_obj.save(
+        update_fields=["next_lighthouse_run_at", "last_lighthouse_error"]
+    )
 
     return JsonResponse({"ok": True, **_serialize_status(property_obj)})