Hacker News

I realize that this is essentially a ridiculous question, but has anyone offered a qualitative evaluation of these benchmarks? Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?


>Like, I feel that GPT-4 (pre-turbo) was an extremely powerful model for almost anything I wanted help with. Whereas I feel like Bard is not great. So does this mean that my experience aligns with "HellaSwag"?

It doesn't mean that at all because Gemini Turbo isn't available in Bard yet.


I am not sure what Gemini Turbo is. Perhaps you meant Gemini Ultra? Because Gemini Pro (which is in this table) is currently accessible in Bard.


Yes, that's what I meant.


I get what you mean, but what would such "qualitative evaluation" look like?


I think my ideal might be as simple as a few people who spend a lot of time with various models describing their experiences in separate blog posts.


I see.

I can't give any anecdotal evidence on ChatGPT/Gemini/Bard, but I've been running small LLMs locally over the past few months and have had an amazing experience with these two models:

- https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B (general usage)

- https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instr... (coding)

OpenChat 3.5 is also very good for general usage, but IMO NeuralHermes surpassed it significantly, so I switched a few days ago.
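One practical detail when running these models locally: NeuralHermes-2.5 (like the OpenHermes-2.5 model it was fine-tuned from) expects the ChatML prompt format, and responses degrade noticeably if you send raw text instead. A minimal sketch of building such a prompt by hand (the helper name `chatml_prompt` is just illustrative; most runtimes can also apply the model's chat template for you):

```python
def chatml_prompt(system: str, user: str) -> str:
    """Build a ChatML-style prompt of the kind OpenHermes/NeuralHermes-2.5
    is trained on: each turn is wrapped in <|im_start|>role ... <|im_end|>,
    and the prompt ends with an open assistant turn for the model to complete."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = chatml_prompt("You are a helpful assistant.", "Hello!")
print(prompt)
```

The resulting string is what you'd pass to your local runtime's completion call; the model then generates until it emits its own `<|im_end|>` token.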


Thank you for the suggestions – really helpful for my hobby project. Can't run anything bigger than 7B on my local setup, which is a fun constraint to play with.


Thanks! I’ve had a good experience with the deepseek-coder:33b so maybe they’re on to something.



