The approach needs to work without relying on detection, and it has to work generally for any divergence from the instructions, not just specific failure modes.
Could stop the LLM from seeing its own past behavior, but that sounds insane.
Could have the LLM work off of summaries instead of its direct output, but it seems like the summarizer would have the same issue (rough sketch below).
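Roughly what the summaries version would look like, as a sketch only. `call_llm` is a made-up stand-in for whatever client is actually in use, and `INSTRUCTIONS` is a placeholder for the real task instructions:

```python
# Rough sketch; call_llm is a hypothetical stub, not a real provider API.
def call_llm(model: str, system: str, user: str) -> str:
    return f"[{model} response]"  # swap in the real API call here

INSTRUCTIONS = "...the original task instructions..."  # placeholder

def run_turn(task_input: str, rolling_summary: str) -> tuple[str, str]:
    # The worker never sees its raw past output, only a summary of it.
    output = call_llm(
        model="worker-model",
        system=INSTRUCTIONS,
        user=f"Summary of work so far:\n{rolling_summary}\n\nNext input:\n{task_input}",
    )
    # The summarizer is just another LLM call reading the worker's output,
    # so it can drift in exactly the same way.
    new_summary = call_llm(
        model="summarizer-model",
        system="Summarize the work done so far in a few sentences.",
        user=f"Previous summary:\n{rolling_summary}\n\nNew output:\n{output}",
    )
    return output, new_summary
```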
Could tell the LLM it's someone else being dropped in to take over, and that it should correct any poor behaviors it sees.
Could have an LLM validator check the output and edit or redo it. In the past this sort of thing has caused good behavior to be 'corrected' into worse and usually more complicated behavior.
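The validator idea, sketched with the same hypothetical `call_llm` stub and `INSTRUCTIONS` placeholder from above. It redoes rather than edits, capped at a couple of retries, specifically because of the 'corrected into something worse' problem:

```python
# Reuses the call_llm stub and INSTRUCTIONS placeholder from the sketch above.
def validate_and_redo(task_input: str, output: str, max_retries: int = 2) -> str:
    for _ in range(max_retries):
        # A second model judges the output against the original instructions.
        verdict = call_llm(
            model="validator-model",
            system="Answer PASS or FAIL: does the output follow the instructions?",
            user=f"Instructions:\n{INSTRUCTIONS}\n\nOutput:\n{output}",
        )
        if verdict.strip().upper().startswith("PASS"):
            return output
        # Redo from scratch instead of editing, to limit the failure mode where
        # good behavior gets 'corrected' into worse, more complicated behavior.
        output = call_llm(
            model="worker-model",
            system=INSTRUCTIONS,
            user=f"Redo this task; the previous attempt diverged from the instructions:\n{task_input}",
        )
    return output
```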
Could make the instructions louder / closer to the front of the prompt (doesn't seem to work).
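For reference, that version is just prompt assembly (same placeholders as above), repeating the instruction block at the top and again right before the newest input:

```python
# Same INSTRUCTIONS placeholder as above; only showing the prompt assembly.
def build_prompt(history: str, task_input: str) -> str:
    # Instructions at the top, then repeated right before the new input.
    return (
        f"INSTRUCTIONS:\n{INSTRUCTIONS}\n\n"
        f"History:\n{history}\n\n"
        f"REMINDER (follow the instructions above):\n{INSTRUCTIONS}\n\n"
        f"New input:\n{task_input}"
    )
```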
Could rotate models to maybe break up the unwanted consistency. This would also reduce the consistency we do want.
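The rotation idea is just round-robin model selection per turn. The model names here are hypothetical, and it reuses the `call_llm` stub and placeholders from the first sketch:

```python
from itertools import cycle

# Hypothetical model names; the point is only the round-robin rotation.
MODEL_ROTATION = cycle(["model-a", "model-b", "model-c"])

def run_turn_rotated(task_input: str, rolling_summary: str) -> str:
    # A different model each turn, so no single model's habits compound,
    # at the cost of the consistency we actually want.
    return call_llm(
        model=next(MODEL_ROTATION),
        system=INSTRUCTIONS,
        user=f"Summary so far:\n{rolling_summary}\n\nNext input:\n{task_input}",
    )
```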