💭 gradually losing my mind trying to figure out why the AI performs well in one eval but badly in another when i can't see any difference between them that would cause it