Scale AI to develop testing framework for Pentagon's LLMs
The Pentagon's Chief Digital and Artificial Intelligence Office (CDAO) has teamed up with San Francisco-based Scale AI to develop a reliable testing and evaluation (T&E) framework for large language models (LLMs), which could play a significant role in military planning and decision-making. Under the one-year contract, Scale AI will build a comprehensive T&E system for generative AI inside the Defense Department, intended to ensure safe deployment by measuring model performance and providing real-time feedback to warfighters.
Addressing the complexities of generative AI testing
Generative AI, which includes LLMs that can produce text, images, software code, and other media from human prompts, poses unique challenges for T&E. Unlike traditional systems with established safety standards, generative AI has no universally accepted guidelines. To address this, Scale AI will develop "holdout datasets" with help from Department of Defense (DOD) insiders, who will supply response pairs and vet them through multiple layers of review.
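Neither CDAO nor Scale AI has published the dataset format, so the sketch below is purely illustrative: it shows, in Python, one plausible way to represent a holdout example as a response pair that must clear several review layers before it is used for evaluation. The field names and layer names are assumptions, not the actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class HoldoutExample:
    """Hypothetical holdout-dataset entry: a prompt/response pair plus a
    record of which review layers have signed off on it."""
    prompt: str
    reference_response: str
    # Illustrative review layers; the DOD's actual process is not public.
    review_layers: tuple = ("subject_matter_expert", "policy_review", "final_approval")
    approvals: set = field(default_factory=set)

    def approve(self, layer: str) -> None:
        """Record sign-off from one review layer."""
        if layer not in self.review_layers:
            raise ValueError(f"Unknown review layer: {layer}")
        self.approvals.add(layer)

    @property
    def fully_reviewed(self) -> bool:
        """Entries join the holdout set only after every layer approves."""
        return self.approvals == set(self.review_layers)

example = HoldoutExample(
    prompt="Summarize the logistics constraints in the attached plan.",
    reference_response="(vetted reference answer)",
)
for layer in example.review_layers:
    example.approve(layer)
assert example.fully_reviewed
```

Because holdout examples never appear in training data, scores against them reflect how a model generalizes rather than what it has memorized, which is why multi-layer review matters: a leaked or low-quality pair would skew the baseline.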
Iterative process to refine datasets and evaluate models
The T&E process for LLMs will be iterative: datasets relevant to the DOD's needs are created and refined, and experts then evaluate existing LLMs against them. Once holdout datasets are established, evaluations can feed into model cards, short documents that describe the contexts in which a model is best used and how its performance is measured. This approach will help establish a baseline understanding of each model's performance, strengths, and limitations.
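Model cards are an established documentation practice in machine learning, though the DOD's exact template is not public. As a minimal sketch, a card might capture intended use, evaluation results, and known limitations. Every field name, identifier, and number below is invented for illustration.

```python
# Illustrative model card record; none of these values are real.
model_card = {
    "model_name": "example-llm-v1",                    # hypothetical model
    "intended_use": "Drafting and summarizing unclassified planning documents",
    "out_of_scope_use": "Autonomous decision-making without human review",
    "evaluation": {
        "holdout_dataset": "dod-planning-holdout-v3",  # hypothetical dataset ID
        "metrics": {                                   # placeholder scores
            "factual_accuracy": 0.91,
            "instruction_following": 0.87,
        },
    },
    "known_limitations": [
        "Performance degrades on prompts outside the tested domains",
        "Not evaluated on classified material",
    ],
}
```

A record like this provides the baseline the article describes: anyone deploying the model can see at a glance where it was tested and where it was not.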
Automating model evaluation and feedback
The framework is designed to automate as much of the evaluation process as possible, so that new models can be assessed quickly as they emerge. The goal is for models to signal CDAO officials when they deviate from the domains they have been tested against. According to Scale AI, this work will allow the DOD to mature its T&E policies for generative AI by "measuring and assessing quantitative data" through benchmarking and by gathering qualitative feedback from users.
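The pipeline itself is not public, but the two ideas in this paragraph, automated benchmarking and an out-of-domain signal, can be sketched together. In the hedged Python example below, the scorer, domain labels, and datasets are all stand-ins for whatever the real system uses.

```python
from statistics import mean
from typing import Callable

def score_response(response: str, reference: str) -> float:
    """Toy scorer: word overlap with a reference answer (a stand-in for
    real metrics such as expert grading or a learned judge)."""
    resp, ref = set(response.lower().split()), set(reference.lower().split())
    return len(resp & ref) / max(len(ref), 1)

def evaluate(model: Callable[[str], str], benchmarks: dict) -> dict:
    """Score a model on each domain's (prompt, reference) pairs."""
    return {
        domain: mean(score_response(model(p), ref) for p, ref in pairs)
        for domain, pairs in benchmarks.items()
    }

def check_coverage(prompt_domain: str, tested_domains: set) -> None:
    """Emit the kind of signal the article describes: warn operators when
    a live prompt falls outside the domains the model was tested against."""
    if prompt_domain not in tested_domains:
        print(f"ALERT: domain '{prompt_domain}' has no T&E coverage")

# Usage with a trivial echo "model" and one hypothetical benchmark domain.
benchmarks = {"logistics": [("List the supply constraints.", "supply constraints list")]}
scores = evaluate(lambda prompt: prompt, benchmarks)
check_coverage("medical_triage", set(scores))  # triggers the alert
```

Automating the loop this way is what makes quick assessment of new models feasible: rerunning `evaluate` against the same holdout benchmarks yields directly comparable scores.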
Collaboration with industry leaders
Scale AI has previously partnered with Microsoft, Meta, OpenAI, the US Army, the Defense Innovation Unit, General Motors, and NVIDIA. "Testing and evaluating generative AI will help the DoD understand the strengths and limitations of the technology, so it can be deployed responsibly," Scale AI CEO Alexandr Wang said in a statement. The partnership also aims to increase the resilience and robustness of AI systems in classified environments, helping ensure LLM technology can be adopted "in secure settings."