Small models vs. large ones? Synthetic data is used to train SLMs
SLMs can address privacy, latency, and on-device inference challenges
This blog post introduces SmolLM, a family of high-performance small language models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset. It covers data curation, model evaluation, and usage. Small language models are gaining interest because they can run on local devices, cut inference costs, and improve user privacy. Common approaches to building them include distilling or quantizing larger models and training small models from scratch on large datasets. SmolLM models, trained on a meticulously curated corpus called SmolLM-Corpus, outperform other models in their size categories on benchmarks testing common-sense reasoning and world knowledge.
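For usage, here is a minimal sketch of loading one of the checkpoints with the transformers library; the repository id under the HuggingFaceTB organization is an assumption based on how the models appear to be published on the Hugging Face Hub, not a detail from the summary above.

```python
# Minimal sketch: load a SmolLM checkpoint and generate text with transformers.
# The checkpoint id is an assumption (HuggingFaceTB org on the Hugging Face Hub).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM-360M"  # 135M and 1.7B variants assumed analogous
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("Gravity is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```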
The dataset includes the following subsets (a loading sketch follows the list):
- Cosmopedia v2: 28B tokens of synthetic textbooks and stories
- Python-Edu: 4B tokens of educational Python samples
- FineWeb-Edu: 220B tokens of educational web samples
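As a sketch of how these subsets might be accessed with the datasets library; the repository id and config name below are assumptions, not details confirmed by the post.

```python
# Sketch: stream a SmolLM-Corpus subset instead of downloading it in full.
# The repo id and config name are assumptions about how the corpus is hosted.
from datasets import load_dataset

cosmopedia = load_dataset(
    "HuggingFaceTB/smollm-corpus",  # assumed Hub repo id
    "cosmopedia-v2",                # assumed config; Python-Edu / FineWeb-Edu analogous
    split="train",
    streaming=True,                 # avoids materializing tens of billions of tokens locally
)
print(next(iter(cosmopedia)))
```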
The blog post details the curation process for each dataset subset, the improvements from Cosmopedia v1 to v2, and the strategies used for better data generation. Cosmopedia v2 relied on predefined topics, web page retrieval using search tools, and generation styles tailored to different audiences. The generation pipeline produced roughly 39 million synthetic documents spanning diverse topics and target audiences.
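To make the topic/audience/style combination concrete, here is a purely illustrative sketch of how Cosmopedia-style prompts could be assembled; the topic lists, audiences, and helper names are hypothetical and not taken from the post.

```python
# Illustrative sketch of Cosmopedia-style prompt building; every name below is
# hypothetical and does not reflect the actual pipeline described in the post.
from itertools import product

topics = ["photosynthesis", "supply and demand"]            # hypothetical topics
audiences = ["middle school students", "college students"]  # hypothetical audiences
styles = {"textbook": "Write a textbook section", "story": "Write an engaging story"}

def build_prompt(topic: str, extract: str, audience: str, style_key: str) -> str:
    """Compose a generation prompt seeded with a retrieved web extract for grounding."""
    return (
        f"{styles[style_key]} about {topic} for {audience}.\n"
        f"Use the following web extract as background material:\n{extract}"
    )

for topic, audience in product(topics, audiences):
    prompt = build_prompt(topic, "<retrieved web extract>", audience, "textbook")
    # prompt would then be sent to the generator model
```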
The FineWeb-Edu dataset consists of 1.3T tokens of educational web pages filtered from FineWeb using an educational quality classifier trained on synthetic annotations; SmolLM-Corpus uses a deduplicated 220B-token subset of it. Python-Edu was curated in the same way, applying an educational classifier to Python samples from The Stack dataset and retaining 4B tokens out of 40B.
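A minimal sketch of this classifier-based filtering step follows; the classifier id, score scale, and keep-threshold are assumptions rather than details confirmed in the summary above.

```python
# Sketch of classifier-based educational filtering. The model id, the 0-5 score
# scale, and the keep-threshold of 3 are assumptions about the released classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

clf_id = "HuggingFaceTB/fineweb-edu-classifier"  # assumed Hub id of the classifier
tokenizer = AutoTokenizer.from_pretrained(clf_id)
classifier = AutoModelForSequenceClassification.from_pretrained(clf_id)

def educational_score(text: str) -> float:
    """Return the classifier's educational-quality score for a document."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = classifier(**inputs).logits
    return logits.squeeze().item()

def keep(text: str, threshold: float = 3.0) -> bool:
    """Retain only documents scoring at or above the (assumed) threshold."""
    return educational_score(text) >= threshold
```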
SmolLM models were trained on these data mixtures: the 135M and 360M models on 600B tokens, and the 1.7B model on 1T tokens. Hyperparameters and model architectures were optimized for efficiency, and evaluations show the SmolLM models outperforming other models in their parameter categories across diverse benchmarks. The models also excelled at Python coding tasks.
Instruction tuning with publicly available datasets further enhanced model performance. SmolLM models are designed to run efficiently on local hardware, including smartphones, with memory requirements suitable for a wide range of devices. The post concludes by emphasizing that SmolLM models achieve high performance with efficient training on high-quality datasets, balancing size and performance effectively.
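As a rough illustration of why these sizes fit on local hardware, here is a back-of-envelope estimate of weight memory at common precisions; these are not the post's measured numbers, and activations and KV cache add overhead on top.

```python
# Back-of-envelope weight-memory estimate per precision. Rough figures only:
# activations, KV cache, and runtime overhead are not included.
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3

for name, params in [("SmolLM-135M", 135e6), ("SmolLM-360M", 360e6), ("SmolLM-1.7B", 1.7e9)]:
    for precision, nbytes in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name:12s} {precision}: {weight_memory_gb(params, nbytes):.2f} GiB")
```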
This press release announces Aporia’s 2024 Guardrails Benchmark report, highlighting the company’s AI control platform performance. Key points include:
1. Aporia’s Multi-SLM (Small Language Model) detection engine outperforms competitors such as NVIDIA NeMo, GPT-4o, and GPT-3.5 in accuracy and latency for AI hallucination detection.
2. Performance metrics:
   - Average latency: 0.34 seconds
   - 90th percentile latency: 0.43 seconds
   - Hallucination detection rate: 98% (compared to NeMo’s 91% and GPT-4o’s 94%)
3. Aporia’s decentralized strategy uses multiple SLMs instead of a single LLM, distributing the workload and improving reliability (a hypothetical sketch of this layout follows the list).
4. The company focuses on empowering engineers and organizations to deploy secure and reliable AI applications without compromising performance.
5. Aporia’s Guardrails also address other AI safety concerns, including handling sensitive data, preventing prompt injections, and maintaining conversation relevance.
6. The full report is available on Aporia’s website, along with a 14-day free trial offer.
7. Aporia is recognized as a Technology Pioneer by the World Economic Forum and is trusted by major companies like Bosch, Lemonade, Levi’s, Munich RE, and Sixt.
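For the decentralized multi-SLM strategy in point 3, here is a purely illustrative sketch of fanning a response out to several small, specialized detectors in parallel and aggregating their verdicts; none of the function names or logic reflect Aporia’s actual implementation or APIs.

```python
# Purely illustrative multi-SLM guardrail layout; every detector below is a
# hypothetical placeholder, not Aporia's implementation. Each checks one concern.
from concurrent.futures import ThreadPoolExecutor

def hallucination_detector(answer: str, context: str) -> bool:
    """Hypothetical SLM call: flag claims unsupported by the context."""
    return "unsupported" in answer.lower()  # placeholder logic

def pii_detector(answer: str, context: str) -> bool:
    """Hypothetical SLM call: flag leaked sensitive data."""
    return "ssn" in answer.lower()  # placeholder logic

DETECTORS = [hallucination_detector, pii_detector]

def run_guardrails(answer: str, context: str) -> dict:
    """Fan the answer out to all detectors in parallel and collect verdicts."""
    with ThreadPoolExecutor() as pool:
        futures = {d.__name__: pool.submit(d, answer, context) for d in DETECTORS}
    return {name: future.result() for name, future in futures.items()}
```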
The press release emphasizes Aporia’s commitment to advancing AI deployment standards and providing a solution for responsive and secure AI applications.