Hugging Face and ServiceNow’s BigCode partnership is making notable progress on large language models for code (Code LLMs) with the release of StarCoder and StarCoderBase, both developed with an emphasis on ethical principles.
StarCoder and StarCoderBase were trained on permissively licensed data from GitHub, covering more than 80 programming languages as well as Git commits, GitHub issues, and Jupyter notebooks.
StarCoder was trained on 1 trillion tokens and has an 8,192-token context window. It generates realistic code and works with a variety of programming languages. It is distributed under the OpenRAIL-M license, which places restrictions on how the model may be used and modified. Furthermore, like other LLMs, StarCoder can generate inaccurate or biased output, and it is critical to recognize these limitations and work toward overcoming them.
The StarCoderBase model outperforms other open Code LLMs on several prominent programming benchmarks, and it is on par with, if not better than, closed models such as OpenAI’s code-cushman-001. Its context length, which exceeds 8,000 tokens, enables it to process more input than any other open LLM available at the time of its release.
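To illustrate what an 8,192-token window means in practice, the following is a minimal sketch of budgeting a prompt against it. It assumes the prompt has already been converted to a list of token IDs by some tokenizer; the function name `fit_to_context` and the policy of dropping the oldest tokens are hypothetical choices for illustration, not part of the StarCoder release.

```python
CONTEXT_WINDOW = 8192  # StarCoder's context length, as stated in the article


def fit_to_context(prompt_ids, max_new_tokens, context_window=CONTEXT_WINDOW):
    """Trim a prompt so that prompt + generated tokens fit in the window.

    Hypothetical helper: keeps the most recent tokens and drops the oldest,
    which is one common policy for code completion (the text nearest the
    cursor usually matters most).
    """
    if not 0 < max_new_tokens < context_window:
        raise ValueError("max_new_tokens must be between 1 and context_window - 1")
    budget = context_window - max_new_tokens  # tokens left for the prompt
    return list(prompt_ids[-budget:])


# Example: a 10,000-token prompt, reserving 256 tokens for the completion.
ids = list(range(10_000))
kept = fit_to_context(ids, max_new_tokens=256)
print(len(kept))  # 7936
```

Reserving part of the window for the completion matters because the prompt and the generated tokens share the same 8,192-token budget.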
The researchers also released the model weights, including intermediate checkpoints, under an OpenRAIL license. Furthermore, all training and preprocessing code is released under the Apache 2.0 license. Additional materials made available include a comprehensive evaluation harness for code models, a new dataset for training and evaluating PII-removal methods, and a tool for tracing generated code back to its source in the training dataset.
The sources for this piece include an article in MarkTechPost.