Open source LLMs // hack, deploy and build applications

sbagency
5 min read · Nov 29, 2023


https://arxiv.org/pdf/2311.16989.pdf

Open-Source LLMs vs. ChatGPT:
1. General Capabilities: the Llama-2-chat-70B variant exhibits enhanced capabilities in general conversational tasks, surpassing GPT-3.5-turbo; UltraLlama matches GPT-3.5-turbo’s performance on its proposed benchmark.

2. Agent Capabilities (using tools, self-debugging, following natural language feedback, exploring environment): Lemur-70B-chat surpasses the performance of GPT-3.5-turbo when exploring the environment or following natural language feedback on coding tasks. AgentLlama-70B achieves comparable performance to GPT-3.5-turbo on unseen agent tasks. Gorilla outperforms GPT-4 on writing API calls.

3. Logical Reasoning Capabilities: fine-tuned models (e.g., WizardCoder, WizardMath) and models pre-trained on higher-quality data (e.g., Lemur-70B-chat, Phi-1, Phi-1.5) show stronger performance than GPT-3.5-turbo.

4. Modeling Long-Context Capabilities: Llama-2-long-chat-70B outperforms GPT-3.5-turbo-16k on ZeroSCROLLS.

5. Application-specific Capabilities:
- query-focused summarization (fine-tuning on training data is better)
- open-ended QA (InstructRetro shows improvement over GPT3)
- medical (MentalLlama-chat-13B and Radiology-Llama-2 outperform ChatGPT)
- generate structured responses (Struc-Bench outperforms ChatGPT)
- generate critiques (Shepherd is almost on-par with ChatGPT)

6. Trust-worthy AI:
- hallucination: during fine-tuning — improving data quality; during inference — specific decoding strategies, external knowledge augmentation (Chain-of-Knowledge, LLM-AUGMENTER, Knowledge Solver, CRITIC, Parametric Knowledge Guiding), and multi-agent dialogue.
- safety: GPT-3.5-turbo and GPT-4 remain at the top in safety evaluations, largely attributable to Reinforcement Learning from Human Feedback (RLHF). RL from AI Feedback (RLAIF) could help reduce the cost of RLHF. [post link]
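The external knowledge augmentation idea above can be sketched in a few lines: retrieve evidence for a query, then constrain the model to answer only from it. Everything here (the toy keyword retriever, the sample corpus, and the prompt wording) is a hypothetical illustration, not the API of any of the frameworks named in the list.

```python
# Hedged sketch of external knowledge augmentation to curb hallucination:
# ground the model's answer in retrieved evidence. The retriever, corpus,
# and prompt template below are toy stand-ins, not a specific framework.

def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Toy retriever: rank passages by word overlap with the query."""
    terms = set(query.lower().split())
    ranked = sorted(
        corpus.items(),
        key=lambda item: len(terms & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]

def grounded_prompt(query: str, evidence: list[str]) -> str:
    """Build a prompt that restricts the model to the retrieved evidence."""
    bullets = "\n".join(f"- {passage}" for passage in evidence)
    return (
        "Answer using only the evidence below. "
        "If the answer is not in the evidence, reply 'unknown'.\n"
        f"Evidence:\n{bullets}\n"
        f"Question: {query}\nAnswer:"
    )

corpus = {
    "doc1": "Llama 2 was released by Meta in July 2023",
    "doc2": "The A10 GPU has 24 GiB of memory",
}
evidence = retrieve("When was Llama 2 released", corpus)
print(grounded_prompt("When was Llama 2 released?", evidence))
```

The grounded prompt would then be sent to whichever open-source model is being evaluated; the point is that the answer is anchored to retrieved text rather than the model's parametric memory.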

https://bdtechtalks.com/2023/11/29/open-source-llm-vs-chatgpt/
https://www.linkedin.com/posts/alexwang2911_besides-gpt-from-openai-there-are-many-powerful-activity-7135573039412367361-yRJ4
https://haven.run/
https://www.linkedin.com/posts/justus-mattern-a04230184_im-super-excited-to-share-the-public-launch-ugcPost-7135275148395319296-Z8Vp

Over the last months, we’ve identified two big pain points that make it hard to work with open source models:
First, open source models work best when they are trained for specific use cases, but the fine-tuning process with existing tools is dreadful. We have found that most of our time is spent setting up infrastructure to get from a finished training run to actually testing our models, rather than writing code and improving them.
Secondly, hosting custom models is expensive. Running a single Llama-7B model in float16 requires at minimum an A10 GPU, which costs $700+ per month. To run ten or a hundred specialized models for common tasks, this would lead to a monthly AWS bill of $7,000 or $70,000, respectively.
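The arithmetic behind those numbers is easy to check. A minimal sketch, assuming the post's ~$700/month A10 price, 2 bytes per parameter in float16, and one dedicated GPU per hosted model (all figures illustrative):

```python
# Back-of-the-envelope GPU cost estimate for self-hosting fine-tuned models.
# Assumptions (from the post, for illustration): ~$700/month for one A10,
# 2 bytes per parameter in float16, one dedicated GPU per model.

BYTES_PER_PARAM_FP16 = 2
GPU_MONTHLY_USD = 700  # assumed on-demand A10 price

def weights_gib(n_params: float) -> float:
    """Approximate VRAM needed just for the weights, in GiB."""
    return n_params * BYTES_PER_PARAM_FP16 / 1024**3

def monthly_cost(n_models: int) -> int:
    """Monthly bill if every specialized model gets its own GPU."""
    return n_models * GPU_MONTHLY_USD

print(f"Llama-7B weights: ~{weights_gib(7e9):.1f} GiB")  # ~13.0 GiB
print(f"10 models:  ${monthly_cost(10):,}/month")   # $7,000/month
print(f"100 models: ${monthly_cost(100):,}/month")  # $70,000/month
```

The ~13 GiB of weights alone is why a 24 GiB A10 is about the smallest single GPU that fits Llama-7B in float16 once activations and KV cache are added on top.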

We built Haven to solve these problems: Our platform offers a super simple way to fine-tune models without managing infrastructure or writing code, and to test and run them with low costs and without any additional work.

https://www.linkedin.com/posts/mary-anne-hartley-324832210_meditron-opensource-openaccess-activity-7135408165017243648-fvzN
https://actu.epfl.ch/news/epfl-s-new-large-language-model-for-medical-knowle/
https://arxiv.org/pdf/2311.16079.pdf

Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs’ medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (≤ 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through our adaptation of Nvidia’s Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines. Evaluations using four major medical benchmarks show significant performance gains over several state-of-the-art baselines before and after task-specific finetuning. Overall, MEDITRON achieves a 6% absolute performance gain over the best public baseline in its parameter class and 3% over the strongest baseline we finetuned from Llama-2. Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within 5% of GPT-4 and 10% of Med-PaLM-2. We release our code for curating the medical pretraining corpus and the MEDITRON model weights to drive open-source development of more capable medical LLMs.

https://github.com/epfLLM/meditron
https://medium.com/p/9c0f5abc296c
https://www.linkedin.com/posts/sophiamyang_building-ai-chatbots-with-mistral-and-llama2-activity-7135575208228225024-KX53
https://blog.perplexity.ai/blog/introducing-pplx-online-llms
https://simonwillison.net/2023/Nov/29/llamafile/
https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B
https://colab.research.google.com/drive/15iFBr1xWgztXvhrj5I9fBv20c7CFOPBE
https://arxiv.org/pdf/2311.16867.pdf
https://argilla.io/blog/notus7b/
https://starling.cs.berkeley.edu/
https://github.com/deepseek-ai/DeepSeek-LLM
https://github.com/bionic-gpt/bionic-gpt#try-it-out
https://www.linkedin.com/posts/alexwang2911_%3F%3F%3F%3F-%3F%3F%3F%3F%3F%3F-%3F%3F%3F-is-free-and-activity-7136692671061884929-iWBW


Written by sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.