Self-checks won't catch objective shifts, because from the agent's local perspective the current subgoal looks fine. A sort of "upper management" LLM working off summaries might do better at detecting rabbit holes, since it judges the trajectory rather than any single step. A minimal sketch of that supervisor loop is below; `call_model` is a placeholder for whatever chat-completion API you use, and the prompt wording is purely illustrative:
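
```python
def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call (swap in your provider's API)."""
    raise NotImplementedError


def supervise(objective: str, step_summaries: list[str]) -> bool:
    """Return True if recent work still serves the stated objective.

    The supervisor never sees the full transcript, only short summaries,
    so it can't be dragged along by locally-reasonable-looking subgoals.
    """
    recent = "\n".join(f"- {s}" for s in step_summaries[-10:])
    prompt = (
        f"Objective: {objective}\n"
        f"Recent work (summaries):\n{recent}\n\n"
        "Is this work still on track toward the objective, or has it "
        "drifted into a rabbit hole? Answer ON_TRACK or DRIFTED."
    )
    return "ON_TRACK" in call_model(prompt)
```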

Regular bad behavior, by contrast, is easier to notice.

My model is that once the chat gets long enough, LLMs just don't care much about the original instructions.