Nous-Hermes-2 on Yi-34B: Breaking New Ground in AI Performance

Christmas is here for LocalLLM fellas!

This festive season, the LocalLLM community is buzzing about Nous Research's latest release: Nous Hermes 2 on Yi 34B. The model is trained over the Yi 34B base model, and Nous Research presents it as its strongest Hermes model so far. Let's dive into what makes Nous Hermes 2 special.

What is Nous-Hermes-2-Yi-34B?

Nous-Hermes-2-Yi-34B is the latest model developed by Nous Research. According to the announcement, it surpasses all earlier Open Hermes and Nous Hermes models.

Announcing Nous Hermes 2 on Yi 34B for Christmas!

This is version 2 of @NousResearch's line of Hermes models, and Nous Hermes 2 builds on the Open Hermes 2.5 dataset, surpassing all Open Hermes and Nous Hermes models of the past, trained over Yi 34B with others to come!…
— Teknium (e/λ) (@Teknium1) December 26, 2023

The debut of Nous Research's Nous Hermes 2 on Yi 34B drew a lot of attention in the open-source AI community. Announced over Christmas 2023, it is more than a small tweak to an existing model: it is version 2 of the Hermes line, trained over Yi 34B on the Open Hermes 2.5 dataset. In this detailed look, we're going to go through its benchmark results and discuss what they could mean for the future of AI.

How Well Does Nous-Hermes-2-Yi-34B Perform?

Nous Hermes 2 isn't just a step ahead of earlier versions in the Hermes series; it also posts solid numbers across four benchmark suites: GPT4All, AGIEval, BigBench, and TruthfulQA.

GPT4All Benchmarks for Nous-Hermes-2-Yi-34B

The GPT4All benchmark suite tests AI models on seven question-answering and commonsense-reasoning tasks, and Nous Hermes 2 scores well on most of them. Let's break down some of the key results:

Arc Challenge: Here, the model scored an accuracy of 60.67% and a normalized accuracy of 64.16%. These numbers show a good grip on science questions that need multi-step reasoning.
BoolQ: It achieved an accuracy of 88.59%, its best result in this suite, on yes/no questions about a short passage.
OpenbookQA: This was a tougher challenge, with the model scoring 35.20% accuracy (47.60% normalized). It is the lowest result in the suite and the clearest room for improvement.

Task	Version	Metric	Value		Stderr
arc_challenge	0	acc	0.6067	±	0.0143
		acc_norm	0.6416	±	0.0140
arc_easy	0	acc	0.8594	±	0.0071
		acc_norm	0.8569	±	0.0072
boolq	1	acc	0.8859	±	0.0056
hellaswag	0	acc	0.6407	±	0.0048
		acc_norm	0.8388	±	0.0037
openbookqa	0	acc	0.3520	±	0.0214
		acc_norm	0.4760	±	0.0224
piqa	0	acc	0.8215	±	0.0089
		acc_norm	0.8303	±	0.0088
winogrande	0	acc	0.7908	±	0.0114

Average: 76.00%
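
The 76.00% average can be reproduced from the table if each task contributes its acc_norm score where one is reported and plain acc otherwise. That rule is inferred from the arithmetic rather than stated anywhere in the benchmark output, so treat this Python sketch as a sanity check. The names `GPT4ALL_SCORES`, `headline_score`, and `suite_average` are made up for illustration.

```python
# Scores copied from the GPT4All table above (fractions, not percentages).
# Each task maps metric name -> value; acc_norm is not reported for every task.
GPT4ALL_SCORES = {
    "arc_challenge": {"acc": 0.6067, "acc_norm": 0.6416},
    "arc_easy": {"acc": 0.8594, "acc_norm": 0.8569},
    "boolq": {"acc": 0.8859},
    "hellaswag": {"acc": 0.6407, "acc_norm": 0.8388},
    "openbookqa": {"acc": 0.3520, "acc_norm": 0.4760},
    "piqa": {"acc": 0.8215, "acc_norm": 0.8303},
    "winogrande": {"acc": 0.7908},
}


def headline_score(metrics):
    """Assumed rule: prefer acc_norm, fall back to acc when it is missing."""
    return metrics.get("acc_norm", metrics["acc"])


def suite_average(scores):
    """Unweighted mean of one headline score per task."""
    values = [headline_score(metrics) for metrics in scores.values()]
    return sum(values) / len(values)


gpt4all_average = suite_average(GPT4ALL_SCORES)
print(f"GPT4All average: {gpt4all_average:.2%}")  # prints "GPT4All average: 76.00%"
```

Averaging the plain acc column instead gives a lower figure, so it matters which metric a leaderboard number is built from.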

AGIEval Benchmarks for Nous-Hermes-2-Yi-34B

The AGIEval benchmark draws on human standardized exams such as the SAT and LSAT to probe higher-level reasoning. Nous Hermes 2's results here are more mixed, with strong reading and logical reasoning scores but weaker math:

AGIEval Aqua Rat: The model scored 31.89% accuracy, pointing to math word problems as an area where it could develop further.
AGIEval LSAT LR: It showed off its logical reasoning skills here, with an accuracy of 70.78%.

Task	Version	Metric	Value		Stderr
agieval_aqua_rat	0	acc	0.3189	±	0.0293
		acc_norm	0.2953	±	0.0287
agieval_logiqa_en	0	acc	0.5438	±	0.0195
		acc_norm	0.4977	±	0.0196
agieval_lsat_ar	0	acc	0.2696	±	0.0293
		acc_norm	0.2087	±	0.0269
agieval_lsat_lr	0	acc	0.7078	±	0.0202
		acc_norm	0.6255	±	0.0215
agieval_lsat_rc	0	acc	0.7807	±	0.0253
		acc_norm	0.7063	±	0.0278
agieval_sat_en	0	acc	0.8689	±	0.0236
		acc_norm	0.8447	±	0.0253
agieval_sat_en_without_passage	0	acc	0.5194	±	0.0349
		acc_norm	0.4612	±	0.0348
agieval_sat_math	0	acc	0.4409	±	0.0336
		acc_norm	0.3818	±	0.0328

Average: 50.27%
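
One detail worth noticing: the per-task AGIEval scores quoted above (31.89% and 70.78%) are plain accuracy, while the 50.27% average only matches the table when it is computed from acc_norm. Here is a quick sketch, assuming an unweighted mean over the eight tasks; the `AGIEVAL` dictionary and the `mean` helper are illustrative names, not part of any benchmark tool.

```python
# (acc, acc_norm) pairs copied from the AGIEval table above.
AGIEVAL = {
    "aqua_rat": (0.3189, 0.2953),
    "logiqa_en": (0.5438, 0.4977),
    "lsat_ar": (0.2696, 0.2087),
    "lsat_lr": (0.7078, 0.6255),
    "lsat_rc": (0.7807, 0.7063),
    "sat_en": (0.8689, 0.8447),
    "sat_en_without_passage": (0.5194, 0.4612),
    "sat_math": (0.4409, 0.3818),
}


def mean(values):
    """Unweighted arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


mean_acc = mean([acc for acc, _ in AGIEVAL.values()])
mean_acc_norm = mean([norm for _, norm in AGIEVAL.values()])

# The acc_norm mean lands on the reported average; the acc mean is
# roughly five points higher.
print(f"mean acc:      {mean_acc:.1%}")       # about 55.6%
print(f"mean acc_norm: {mean_acc_norm:.1%}")  # about 50.3%
```

In this table acc_norm is lower than acc on every task, so the headline average is the more conservative of the two readings.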

BigBench Benchmarks for Nous-Hermes-2-Yi-34B

BigBench is all about putting AI models to the test with some really tough reasoning challenges. Nous Hermes 2's results here vary widely from task to task:

Bigbench Causal Judgement: It scored 57.37%, demonstrating a reasonable ability to make sense of cause-and-effect relationships.
Bigbench Movie Recommendation: Here, it got a score of 52.00%. This test is more about understanding personal tastes and preferences, and the score suggests the model has a fair handle on these more subjective areas.

Task	Version	Metric	Value		Stderr
bigbench_causal_judgement	0	multiple_choice_grade	0.5737	±	0.0360
bigbench_date_understanding	0	multiple_choice_grade	0.7263	±	0.0232
bigbench_disambiguation_qa	0	multiple_choice_grade	0.3953	±	0.0305
bigbench_geometric_shapes	0	multiple_choice_grade	0.4457	±	0.0263
		exact_str_match	0.0000	±	0.0000
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	0.2820	±	0.0201
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	0.2186	±	0.0156
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	0.4733	±	0.0289
bigbench_movie_recommendation	0	multiple_choice_grade	0.5200	±	0.0224
bigbench_navigate	0	multiple_choice_grade	0.4910	±	0.0158
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	0.7495	±	0.0097
bigbench_ruin_names	0	multiple_choice_grade	0.5938	±	0.0232
bigbench_salient_translation_error_detection	0	multiple_choice_grade	0.3808	±	0.0154
bigbench_snarks	0	multiple_choice_grade	0.8066	±	0.0294
bigbench_sports_understanding	0	multiple_choice_grade	0.5101	±	0.0159
bigbench_temporal_sequences	0	multiple_choice_grade	0.3850	±	0.0154
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	0.2160	±	0.0116
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	0.1634	±	0.0088
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	0.4733	±	0.0289

Average: 46.69%
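
With eighteen tasks in the table, it helps to sort them to see where the model is strongest and weakest. The sketch below ranks the multiple_choice_grade column and recomputes the average as an unweighted mean, which is an assumption that happens to match the reported 46.69%. The `BIGBENCH` dictionary and variable names are illustrative.

```python
# multiple_choice_grade values copied from the BigBench table above
# (task names shown without the "bigbench_" prefix).
BIGBENCH = {
    "causal_judgement": 0.5737,
    "date_understanding": 0.7263,
    "disambiguation_qa": 0.3953,
    "geometric_shapes": 0.4457,
    "logical_deduction_five_objects": 0.2820,
    "logical_deduction_seven_objects": 0.2186,
    "logical_deduction_three_objects": 0.4733,
    "movie_recommendation": 0.5200,
    "navigate": 0.4910,
    "reasoning_about_colored_objects": 0.7495,
    "ruin_names": 0.5938,
    "salient_translation_error_detection": 0.3808,
    "snarks": 0.8066,
    "sports_understanding": 0.5101,
    "temporal_sequences": 0.3850,
    "tracking_shuffled_objects_five_objects": 0.2160,
    "tracking_shuffled_objects_seven_objects": 0.1634,
    "tracking_shuffled_objects_three_objects": 0.4733,
}

# Highest score first.
ranked = sorted(BIGBENCH.items(), key=lambda item: item[1], reverse=True)
bigbench_average = sum(BIGBENCH.values()) / len(BIGBENCH)

print("Strongest:", [name for name, _ in ranked[:3]])
print("Weakest:  ", [name for name, _ in ranked[-3:]])
print(f"Average:   {bigbench_average:.2%}")  # prints "Average:   46.69%"
```

Snarks, colored-object reasoning, and date understanding come out on top, while the object-tracking and logical-deduction tasks with five or seven objects sit at the bottom.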

TruthfulQA Benchmarks for Nous-Hermes-2-Yi-34B

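The TruthfulQA benchmark tests whether an AI model gives truthful answers to questions that people often get wrong because of common misconceptions. Nous Hermes 2 scored 43.33% on mc1 and 60.34% on mc2 in this benchmark. Both are multiple-choice scores: mc1 has a single correct option per question, while mc2 allows several. The results show the model avoids many popular falsehoods, but still leaves room to improve.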

Task	Version	Metric	Value		Stderr
truthfulqa_mc	1	mc1	0.4333	±	0.0173
		mc2	0.6034	±	0.0149
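
The Stderr column gives a sense of how much each score could move on a different sample of questions. Here is a minimal sketch, assuming a normal approximation (value plus or minus 1.96 standard errors for a rough 95% interval). The function name `confidence_interval` and the constant `Z_95` are illustrative, and the benchmark output itself reports only the value and the standard error.

```python
Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def confidence_interval(value, stderr, z=Z_95):
    """Rough interval around a score, assuming normally distributed error."""
    half_width = z * stderr
    return value - half_width, value + half_width


# Values and standard errors copied from the TruthfulQA table above.
mc1_low, mc1_high = confidence_interval(0.4333, 0.0173)
mc2_low, mc2_high = confidence_interval(0.6034, 0.0149)

print(f"mc1: {mc1_low:.1%} to {mc1_high:.1%}")  # prints "mc1: 39.9% to 46.7%"
print(f"mc2: {mc2_low:.1%} to {mc2_high:.1%}")  # prints "mc2: 57.4% to 63.3%"
```

Differences of a point or two between models on this benchmark fall well inside these intervals, so they are best read as ties.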

Why Do These Results Matter?

So, what do these scores and numbers mean for us? First off, they tell us that Nous Hermes 2 is not just good at one type of task. It's versatile and can adapt to a wide range of challenges. This versatility is crucial for AI to be useful in real-world situations, where it needs to handle all kinds of different problems and questions.

The scores in areas like the Arc Challenge (64.16% normalized accuracy) and BoolQ (88.59%) also highlight the model's reading comprehension and reasoning. It's not just retrieving facts; it has to work out an answer from a question or a passage. This kind of understanding is key for tasks like problem-solving, decision-making, and even creative work.

But perhaps what's most exciting about Nous Hermes 2 is the potential it shows. Even in areas where it didn't score as high, like OpenbookQA, we see opportunities for growth and improvement. AI technology is still evolving, and models like Nous Hermes 2 are leading the charge. The announcement already promises Hermes 2 releases on other base models "to come", so later versions may well close some of these gaps.

Hugging Face card for Nous-Hermes-2-Yi-34B-GGUF.

Conclusion: Looking to the Future

The success of Nous Hermes 2 on Yi 34B isn't just about the model itself. It's a sign of things to come in the field of AI. As we continue to develop and refine AI technology, we can expect to see models that are even more intelligent, versatile, and useful in our everyday lives. The possibilities are endless, and with models like Nous Hermes 2 leading the way, the future of AI looks brighter than ever.