Christmas is here for LocalLLM fellas!
As the festive season approaches, the LocalLLM community is buzzing with the release of Nous Research's latest creation - Nous Hermes 2 on Yi 34B. This cutting-edge AI model isn't just an upgrade; it's a leap into the future of artificial intelligence. Let's dive into what makes Nous Hermes 2 so special.
What is Nous-Hermes-2-Yi-34B?
Nous-Hermes-2-Yi-34B is the latest model from Nous Research, built on Yi 34B. On the benchmarks reported below, it surpasses its predecessors in the Hermes series.
Released just before Christmas, Nous Hermes 2 on Yi 34B is more than a simple tweak to an existing model. In this detailed look, we're going to walk through its benchmark results, suite by suite, and discuss what they could mean for the future of AI.
How Well Does Nous-Hermes-2-Yi-34B Perform?
Nous Hermes 2 isn't just a step ahead of the earlier versions in the Hermes series; its benchmark scores also hold up well against models from the broader AI community.
GPT4All Benchmarks for Nous-Hermes-2-Yi-34B
The GPT4All benchmark tests AI models across a wide variety of tasks, and the performance of Nous Hermes 2 here is quite eye-opening. It's not just good at one thing; it excels across the board. Let's break down some of the key results:
- ARC Challenge: Here, the model scored an accuracy of 60.67% and a normalized accuracy of 64.16%. These numbers show that it's got a strong grip on complex reasoning tasks.
- BoolQ: It achieved an accuracy of 88.59%, which highlights its ability to read a passage and answer yes/no questions about it.
- OpenbookQA: This was a tougher challenge, with the model scoring 35.20% accuracy (47.60% normalized). The model does well overall, but there's still room for it to grow here.
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 0.6067 | ± | 0.0143 |
| | | acc_norm | 0.6416 | ± | 0.0140 |
| arc_easy | 0 | acc | 0.8594 | ± | 0.0071 |
| | | acc_norm | 0.8569 | ± | 0.0072 |
| boolq | 1 | acc | 0.8859 | ± | 0.0056 |
| hellaswag | 0 | acc | 0.6407 | ± | 0.0048 |
| | | acc_norm | 0.8388 | ± | 0.0037 |
| openbookqa | 0 | acc | 0.3520 | ± | 0.0214 |
| | | acc_norm | 0.4760 | ± | 0.0224 |
| piqa | 0 | acc | 0.8215 | ± | 0.0089 |
| | | acc_norm | 0.8303 | ± | 0.0088 |
| winogrande | 0 | acc | 0.7908 | ± | 0.0114 |

Average: 76.00%
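The 76.00% average can be reproduced from the table. The sketch below assumes that each task contributes its normalized accuracy (acc_norm) when one is reported and its plain accuracy (acc) otherwise. That convention is inferred from the numbers, not stated in the source, and the function names are made up for illustration.

```python
# Scores copied from the GPT4All table above.
# Tasks without an acc_norm row (boolq, winogrande) only carry "acc".
gpt4all_scores = {
    "arc_challenge": {"acc": 0.6067, "acc_norm": 0.6416},
    "arc_easy": {"acc": 0.8594, "acc_norm": 0.8569},
    "boolq": {"acc": 0.8859},
    "hellaswag": {"acc": 0.6407, "acc_norm": 0.8388},
    "openbookqa": {"acc": 0.3520, "acc_norm": 0.4760},
    "piqa": {"acc": 0.8215, "acc_norm": 0.8303},
    "winogrande": {"acc": 0.7908},
}


def headline_score(metrics):
    """Pick acc_norm when reported, else acc (assumed convention)."""
    return metrics.get("acc_norm", metrics["acc"])


def suite_average(scores):
    """Unweighted mean of the per-task headline scores, as a percentage."""
    values = [headline_score(m) for m in scores.values()]
    return 100 * sum(values) / len(values)


print(f"GPT4All average: {suite_average(gpt4all_scores):.2f}%")
```

Under that assumption the result matches the reported figure to two decimal places. This also explains why the average looks higher than scores such as OpenbookQA's 35.20% accuracy would suggest.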
AGIEval Benchmarks for Nous-Hermes-2-Yi-34B
The AGIEval benchmark focuses on higher-level intelligence and reasoning capabilities. In these tests, Nous Hermes 2 continued to shine:
- AGIEval AQuA-RAT: The model scored 31.89% on this set of algebra word problems, pointing to an area where it could develop further.
- AGIEval LSAT LR: It really showed off its logical reasoning skills here, with a high score of 70.78%.
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 0.3189 | ± | 0.0293 |
| | | acc_norm | 0.2953 | ± | 0.0287 |
| agieval_logiqa_en | 0 | acc | 0.5438 | ± | 0.0195 |
| | | acc_norm | 0.4977 | ± | 0.0196 |
| agieval_lsat_ar | 0 | acc | 0.2696 | ± | 0.0293 |
| | | acc_norm | 0.2087 | ± | 0.0269 |
| agieval_lsat_lr | 0 | acc | 0.7078 | ± | 0.0202 |
| | | acc_norm | 0.6255 | ± | 0.0215 |
| agieval_lsat_rc | 0 | acc | 0.7807 | ± | 0.0253 |
| | | acc_norm | 0.7063 | ± | 0.0278 |
| agieval_sat_en | 0 | acc | 0.8689 | ± | 0.0236 |
| | | acc_norm | 0.8447 | ± | 0.0253 |
| agieval_sat_en_without_passage | 0 | acc | 0.5194 | ± | 0.0349 |
| | | acc_norm | 0.4612 | ± | 0.0348 |
| agieval_sat_math | 0 | acc | 0.4409 | ± | 0.0336 |
| | | acc_norm | 0.3818 | ± | 0.0328 |

Average: 50.27%
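The Stderr column is worth a second look, because it shows how much each score could shift from sampling alone. As a rough sketch, a 95% interval can be taken as the value plus or minus 1.96 standard errors. The normal approximation and the helper name here are assumptions for illustration, not something the benchmark report itself specifies.

```python
Z_95 = 1.96  # z-score for a two-sided 95% interval (normal approximation)


def confidence_interval(value, stderr, z=Z_95):
    """Return (low, high) bounds for a score given its standard error."""
    return value - z * stderr, value + z * stderr


# (acc, stderr) pairs copied from the AGIEval table above.
agieval_acc = {
    "agieval_aqua_rat": (0.3189, 0.0293),
    "agieval_lsat_lr": (0.7078, 0.0202),
}

for task, (value, stderr) in agieval_acc.items():
    low, high = confidence_interval(value, stderr)
    print(f"{task}: {value:.2%} (roughly {low:.2%} to {high:.2%})")
```

For AQuA-RAT this gives a range of about 26% to 38%, so differences of a point or two between models on a single AGIEval task should be read with caution.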
BigBench Benchmarks for Nous-Hermes-2-Yi-34B
BigBench is all about putting AI models to the test with some really tough reasoning challenges. In these tests, Nous Hermes 2 proved why it's considered a top-tier AI model:
- BigBench Causal Judgement: It scored 57.37%, demonstrating a solid ability to make sense of cause-and-effect relationships.
- BigBench Movie Recommendation: Here, it got a score of 52.00%. This test is more about understanding personal tastes and preferences, and the score suggests the model has a reasonable handle on these more subjective areas.
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 0.5737 | ± | 0.0360 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 0.7263 | ± | 0.0232 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 0.3953 | ± | 0.0305 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 0.4457 | ± | 0.0263 |
| | | exact_str_match | 0.0000 | ± | 0.0000 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 0.2820 | ± | 0.0201 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 0.2186 | ± | 0.0156 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 0.4733 | ± | 0.0289 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 0.5200 | ± | 0.0224 |
| bigbench_navigate | 0 | multiple_choice_grade | 0.4910 | ± | 0.0158 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 0.7495 | ± | 0.0097 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 0.5938 | ± | 0.0232 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 0.3808 | ± | 0.0154 |
| bigbench_snarks | 0 | multiple_choice_grade | 0.8066 | ± | 0.0294 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 0.5101 | ± | 0.0159 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 0.3850 | ± | 0.0154 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 0.2160 | ± | 0.0116 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 0.1634 | ± | 0.0088 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 0.4733 | ± | 0.0289 |

Average: 46.69%
TruthfulQA Benchmarks for Nous-Hermes-2-Yi-34B
The TruthfulQA benchmark tests whether a model gives truthful answers to questions that people often get wrong because of common misconceptions. Nous Hermes 2 scored 43.33% on mc1 (multiple choice with a single correct answer) and 60.34% on mc2 (multiple choice with several correct answers). These results suggest the model often avoids repeating popular falsehoods, though there is clear room to improve.
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 0.4333 | ± | 0.0173 |
| | | mc2 | 0.6034 | ± | 0.0149 |
Why Do These Results Matter?
So, what do these scores and numbers mean for us? First off, they tell us that Nous Hermes 2 is not just good at one type of task. It's versatile and can adapt to a wide range of challenges. This versatility is crucial for AI to be useful in real-world situations, where it needs to handle all kinds of different problems and questions.
The scores in areas like the ARC Challenge and BoolQ also highlight the model's strong reasoning and reading comprehension. This kind of understanding is key for tasks like problem-solving, decision-making, and even creative work.
But perhaps what's most exciting about Nous Hermes 2 is the potential it shows. Even in areas where it didn't score as high, like OpenbookQA, we see opportunities for growth and improvement. AI technology is still evolving, and models like Nous Hermes 2 are leading the charge. As it continues to learn and improve, there's no telling what kind of tasks it might be able to handle in the future.
Hugging Face model card for Nous-Hermes-2-Yi-34B-GGUF.
Conclusion: Looking to the Future
The success of Nous Hermes 2 on Yi 34B isn't just about the model itself. It's a sign of things to come in the field of AI. As we continue to develop and refine AI technology, we can expect to see models that are even more intelligent, versatile, and useful in our everyday lives. The possibilities are endless, and with models like Nous Hermes 2 leading the way, the future of AI looks brighter than ever.