💭 any time an llm fails that should be saved as a test case. Run the llm multiple times with that history to look for failure rate. Make modifications to the context/system message and rerun to quantify improvement