I realize that this is essentially a ridiculous question, but has anyone offered a qualitative evaluation of these benchmarks? Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
>Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?
It doesn't mean that at all because Gemini Turbo isn't available in Bard yet.
I can't give any anecdotal evidence on ChatGPT/Gemini/Bard, but I've been running small LLMs locally over the past few months and have amazing experience with these two models:
Thank you for the suggestions – really helpful for my hobby project. Can't run anything bigger than 7B on my local setup, which is a fun constraint to play with.