Stanford and MosaicML Researchers Announce the Release of PubMed GPT, a Purpose-Built AI Model Trained to Interpret Biomedical Language

The development of large language models (LLMs) has significantly advanced artificial intelligence (AI). Leveraging the knowledge they acquire during training, these models hold astounding potential to revolutionize fields such as general-purpose natural language generation, image generation, and speech synthesis. However, more work is needed to understand domain-specific models that can be employed for specialized industrial purposes in areas like medicine or law. Domain-specific models are trained primarily on data from a particular subdomain to create more precise and effective systems.

To understand how data composition affects domain-specific models on particular downstream tasks, Stanford’s Center for Research on Foundation Models (CRFM) recently investigated such models. As part of this research, a team from CRFM worked with MosaicML to create PubMed GPT, an AI model that showcases the capabilities of industry-specific LLMs, specifically for the field of biomedicine. The CRFM researchers trained a 2.7B-parameter GPT on biomedical papers from PubMed using the MosaicML Cloud platform. This GPT-style model performs well on several biomedical NLP tasks, including state-of-the-art performance on the MedQA biomedical question-answering benchmark.

PubMed GPT uses a HuggingFace GPT model as its foundation and employs a custom biomedical tokenizer trained on the PubMed Abstracts and PubMed Central sections of the Pile dataset. The intention behind the model design was to keep things as straightforward as possible, both to highlight the effectiveness of off-the-shelf LLM training recipes and to make it possible to train cutting-edge GPT models for other domain-specific applications, such as legal text, using the same components.
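
To make the tokenizer step concrete, here is a minimal sketch of training a domain-specific byte-level BPE tokenizer with the HuggingFace tokenizers library. The file names and vocabulary size below are illustrative assumptions, not the project's actual configuration:

```python
# A minimal sketch of training a custom biomedical tokenizer with the
# HuggingFace `tokenizers` library. File names and vocab_size are
# illustrative assumptions, not the values used for PubMed GPT.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Hypothetical local dumps of the Pile's PubMed Abstracts and PubMed Central sections.
tokenizer.train(
    files=["pubmed_abstracts.txt", "pubmed_central.txt"],
    vocab_size=32000,        # assumed size, on the order of GPT-2's vocabulary
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

tokenizer.save_model("biomed_tokenizer")
# Domain terms like "thrombocytopenia" now tokenize into far fewer pieces
# than they would under a general-purpose vocabulary.
```

The payoff of a domain-specific vocabulary is that common biomedical terms become single tokens rather than long subword sequences, which effectively lengthens the context window and eases learning.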

PubMed GPT utilizes MosaicML Cloud infrastructure for fast and efficient training. The model is built on the PyTorch framework and trained with the MosaicML Composer and StreamingDataset libraries. Composer, an open-source library for training LLMs more accurately and at a lower cost, places no limitations on the model code and makes it simple to train large custom models in parallel over hundreds of GPUs. It also allows simple testing adjustments, considerably enhancing PubMed GPT’s training efficiency. The custom 100GB training dataset was managed by MosaicML’s new StreamingDataset package. Thanks to the library’s exceptional performance and versatility, the team was able to test several PubMed GPT tokenization strategies without having to regenerate the dataset.
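
As a rough illustration of how these pieces fit together, the sketch below wraps a HuggingFace GPT-2-style model in Composer's Trainer and streams pre-sharded data with StreamingDataset. The bucket path, batch size, sample format, and training duration are placeholder assumptions, not PubMed GPT's real settings:

```python
# A hedged sketch of training with Composer + StreamingDataset; paths and
# hyperparameters are illustrative, not PubMed GPT's actual configuration.
from composer import Trainer
from composer.models import HuggingFaceModel
from streaming import StreamingDataset
from torch.utils.data import DataLoader
from transformers import GPT2Config, GPT2LMHeadModel

# GPT-2-style causal LM; PubMed GPT is a 2.7B-parameter model of this family.
# A default GPT2Config is used here purely to keep the sketch small.
model = HuggingFaceModel(GPT2LMHeadModel(GPT2Config()))

# StreamingDataset reads pre-sharded data from object storage (or local disk),
# which is what lets tokenization experiments run without regenerating the
# corpus. The remote path is hypothetical; each sample is assumed to be a dict
# with 'input_ids' and 'labels' tensors so the default collator can batch it.
train_dataset = StreamingDataset(
    remote="s3://my-bucket/pubmed-shards",  # hypothetical shard location
    local="/tmp/pubmed-cache",
    shuffle=True,
)
train_loader = DataLoader(train_dataset, batch_size=8)

trainer = Trainer(
    model=model,
    train_dataloader=train_loader,
    max_duration="100000ba",  # illustrative duration, measured in batches
)
trainer.fit()
```

Because the dataset is stored as pre-tokenized shards that stream on demand, swapping tokenization strategies only requires re-sharding once per experiment rather than rebuilding the raw 100GB corpus.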

PubMed GPT was evaluated on several question-answering benchmarks, the key one being MedQA-USMLE, which consists of question-answer pairs derived from prior United States Medical Licensing Exams. The researchers also manually evaluated the model’s generations on a question-summarization task. They compared their results against several prior CRFM and biomedical models, including DRAGON, GPT-Neo, Galactica, and PubMedBERT.
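
To make the multiple-choice setup concrete, the sketch below scores each candidate answer by the log-likelihood a causal LM assigns to it. This zero-shot scoring only illustrates the task format, not necessarily the evaluation procedure the researchers actually used, and the checkpoint name and question are placeholders:

```python
# Illustrative scoring of one MedQA-style multiple-choice item by comparing
# per-answer log-likelihoods under a causal LM. Model name and question are
# placeholders, not the benchmark's actual data or the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; a real run would load the released PubMed GPT weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Hypothetical MedQA-style item (not taken from the benchmark).
question = ("A patient presents with polyuria, polydipsia, and weight loss. "
            "Which drug is most appropriate?")
choices = ["Metformin", "Amoxicillin", "Ibuprofen", "Warfarin"]

def answer_logprob(question: str, answer: str) -> float:
    """Sum of the log-probabilities the model assigns to the answer tokens."""
    prompt_ids = tokenizer(question + " Answer:", return_tensors="pt").input_ids
    answer_ids = tokenizer(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits[:, i] predicts token i+1, so shift by one when indexing.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    n_prompt = prompt_ids.shape[1]
    return sum(
        log_probs[pos - 1, input_ids[0, pos]].item()
        for pos in range(n_prompt, input_ids.shape[1])
    )

best = max(choices, key=lambda c: answer_logprob(question, c))
print("Predicted answer:", best)
```

Scoring all choices with the same prompt and picking the highest-likelihood answer gives a simple accuracy metric that can be compared across model checkpoints.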

The researchers concluded that LLMs are very flexible when trained on domain-specific data and can deliver considerable improvements. However, because of PubMed GPT’s large parameter count, this performance comes at a cost: there are trade-offs between raw model scale on the one hand and specialized architectures with built-in domain knowledge on the other. The researchers also concluded that domain-specific data is superior to general-purpose data for pre-training LLMs, and that targeted models can achieve higher quality with fewer resources. Thanks to the careful selection of domain-specific data, PubMed GPT outperforms some models even when trained on a smaller dataset. Although LLMs can produce higher-quality results with less data and compute than previously thought, model size and training costs remain significant issues. The researchers nevertheless offer a more practical and economical approach by training models efficiently on the MosaicML Cloud.

The main takeaway from their research is that even basic LLMs trained on domain-specific data can compete with and surpass expert-designed model architectures. Future work will focus on broadening the scope of downstream tasks, improving the model, and assessing it against a larger collection of biomedical NLP tasks. Although the results from PubMed GPT are an exciting first step toward models that could advance biomedical research, the model should be used for research purposes only, as it is not fit for production. It was made public to support biomedical NLP applications and to outline best practices for developing and using domain-specific language models. The insights gained while training this biomedical model will be useful in pursuing state-of-the-art performance in other fields, including law and finance. The ultimate goal is to build interactive AI systems that deliver trustworthy results while encouraging collaboration with human experts.


Check out the Stanford Blog and GitHub. All credit for this research goes to the researchers on this project.


Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing, and Web Development. She enjoys learning more about the technical field by participating in several challenges.

