I wish I had a crystal ball to know how this will play out in the future. Will a different AI create it from the SOP? How will humans fit in?
For me, with any new release of my winery production software, I re-ran every job ever put into my clients' production systems, starting from job #1, about 200,000 jobs. Before going into production we checked that all the balances, product compositions and inventory from the new software matched what the old system currently said. It takes about an hour to re-run fifteen years of production, and even a milliliter or milligram of difference was enough to trigger a stop/check. We also had an extensive set of input data I could feed in, to ensure mistakes were caught there too.
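For anyone who wants to picture that kind of replay check, here is a minimal sketch in Python. Everything in it is hypothetical: the job shape `(tank, liters)`, the names `replay` and `diff_balances`, and the two `apply` functions are made up for illustration and are not the actual winery software. It only shows the idea of replaying the full job history through both versions and stopping on any difference at all.

```python
from decimal import Decimal

ZERO = Decimal(0)

def replay(jobs, apply_job):
    """Apply every job in order, starting from empty balances."""
    balances = {}
    for job in jobs:
        apply_job(balances, job)
    return balances

def diff_balances(expected, actual):
    """Return {tank: (expected, actual)} for every balance that differs at all."""
    tanks = set(expected) | set(actual)
    return {
        t: (expected.get(t, ZERO), actual.get(t, ZERO))
        for t in tanks
        if expected.get(t, ZERO) != actual.get(t, ZERO)
    }

# Hypothetical "old" release: exact decimal arithmetic on liters.
def old_apply(balances, job):
    tank, liters = job
    balances[tank] = balances.get(tank, ZERO) + liters

# Hypothetical "new" release with a simulated regression:
# it rounds each movement to 0.01 L, so milliliters get lost.
def new_apply_buggy(balances, job):
    tank, liters = job
    balances[tank] = balances.get(tank, ZERO) + liters.quantize(Decimal("0.01"))

# Made-up job history; a real run would load every job since job #1.
jobs = [
    ("T1", Decimal("1000.000")),
    ("T1", Decimal("-250.125")),
    ("T2", Decimal("250.125")),
]

expected = replay(jobs, old_apply)
same = diff_balances(expected, replay(jobs, old_apply))
mismatches = diff_balances(expected, replay(jobs, new_apply_buggy))
```

In a release gate along these lines, any non-empty `mismatches` would block the release until someone has explained the difference. `Decimal` is used here instead of floats so that the comparison can be exact rather than relying on a tolerance.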
I expect other people do it their own way, but as a business this is the low bar of testing I would expect to be done.