Glossary term
Agentic Systems
Open-access code LLM family (3B to 15B parameters) from the BigCode project, led by Hugging Face and ServiceNow. The StarCoder2 generation is trained on The Stack v2 dataset, which covers 619 programming languages.
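StarCoder2 (2024) is released in 3B, 7B, and 15B sizes, trained on 3.3 to 4.3 trillion tokens drawn from The Stack v2, which spans 619 programming languages. The Stack v2 is reported to be about 7x larger than The Stack v1 used for the original StarCoder, and StarCoder2-15B scores close to CodeLlama-34B, a model more than twice its size, on HumanEval.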
StarCoder is integrated into VS Code via Hugging Face's code-completion extension and is reportedly used by 100,000+ developers. Because the weights are openly available, the model can be run locally, so code completion does not require sending code to external APIs, which matters for regulated industries.
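As an illustration of the local-completion use case, the sketch below runs a StarCoder2 checkpoint on the local machine with the Hugging Face `transformers` library. It is a minimal sketch under stated assumptions, not the editor extension's implementation: the checkpoint name `bigcode/starcoder2-3b`, the helper name `complete_locally`, and the default of 32 new tokens are chosen for illustration. The weights are downloaded from the Hugging Face Hub on first use and cached, after which generation runs without any network call.

```python
# Assumption: the smallest StarCoder2 checkpoint on the Hugging Face Hub.
MODEL_ID = "bigcode/starcoder2-3b"


def complete_locally(prefix: str, max_new_tokens: int = 32) -> str:
    """Return a locally generated continuation of `prefix` (hypothetical helper)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Weights are fetched once and cached; inference itself is local.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(prefix, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    prompt_length = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `complete_locally("def fibonacci(n):\n")` would return the model's suggested continuation of the function body. Loading the model on every call is wasteful, so a real tool would keep the tokenizer and model in memory between requests.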
BigCode (the collaborative project behind StarCoder, involving Hugging Face, ServiceNow, and 600+ contributors) publishes the full training data, model weights, and governance documents, with the models released under the BigCode OpenRAIL-M licence. StarCoder is described as the first large code model released with this level of transparency.