Latest AI Research Introduces Contrast-Consistent Search (CCS): A Novel Approach To Detect Knowledge From Model Representations
Language models are extensively used in real-world applications, which brings exciting new opportunities. However, this also raises the stakes of AI research and introduces new risks. Many researchers have highlighted that the text generated by language models is not always accurate. One reason is that models are trained on objectives that only imperfectly track the truth. At the same time, because truth is a useful feature for many tasks, it is plausible that models learn internal representations linked to the truth during training, even when their outputs are wrong.
These training objectives can lead language models to produce erroneous text. For instance, a model trained to imitate human-written text may pick up common human misconceptions along with stylistic quirks. Or a chatbot trained to maximize a reward such as user engagement may eventually learn to produce compelling but untrue language. This occurs whenever there is a discrepancy between the training objective and the truth.
Human supervision may become less helpful in preventing this misalignment as models are applied to increasingly complex domains. And because the limitation lies in the training objective rather than in the model itself, simply scaling up existing models is unlikely to help.
In a recent paper, researchers from UC Berkeley and Peking University propose an unsupervised method for recovering answers to questions from a model's internal representations. According to the team, instead of explicitly specifying truth externally, it makes more sense to look for the model's internal, implicit "beliefs" or "knowledge." To solve this challenge, the researchers exploit the fact that any representation of the truth must satisfy logical consistency properties, for example, that a statement and its negation cannot both be true. To put this concept into action, they introduce Contrast-Consistent Search (CCS), a technique that learns a linear projection of the hidden states whose outputs are consistent across a statement and its negation.
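To make the idea concrete, here is a minimal sketch of the CCS objective in PyTorch, based on the description above rather than the authors' released code; the variable names, dimensions, and training loop are illustrative. A linear probe maps hidden states to probabilities, and the loss pushes the probe to be consistent (a statement and its negation should receive probabilities summing to one) and confident (to rule out the degenerate answer of 0.5 everywhere).

```python
import torch

class LinearProbe(torch.nn.Module):
    """Maps a hidden state to a probability that the statement is true."""
    def __init__(self, dim):
        super().__init__()
        self.linear = torch.nn.Linear(dim, 1)

    def forward(self, h):
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def ccs_loss(p_pos, p_neg):
    # Consistency: p(statement) and p(negation) should sum to one.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()
    return consistency + confidence

# Illustrative stand-ins for real hidden states of contrast pairs
# ("x is true" / "x is false"), which the paper normalizes per class.
n, dim = 128, 768
h_pos, h_neg = torch.randn(n, dim), torch.randn(n, dim)

probe = LinearProbe(dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    loss = ccs_loss(probe(h_pos), probe(h_neg))
    loss.backward()
    opt.step()
```

At inference time, averaging probe(h_pos) and 1 - probe(h_neg) gives an unsupervised estimate of whether the underlying statement is true, without ever consulting labels or the model's generated text.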
They find that CCS can accurately recover knowledge from model representations despite being a simple method that requires no labels and no access to model outputs. The resulting classifier is also less sensitive to prompt wording: the standard deviation of its accuracy across prompts is roughly half that of zero-shot prompting. Evaluated across 6 models and 10 question-answering datasets, CCS achieves around a 4% improvement in accuracy over strong zero-shot baselines.
In addition, they try deliberately prompting models to produce false outputs, a manipulation that should change what models say but should not affect their latent knowledge. They find that this reduces zero-shot accuracy by up to 9.5% while leaving the accuracy of CCS unaffected.
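As a hypothetical illustration of this manipulation (the paper's exact prefixes may differ), one can prepend a few demonstrably wrong question-answer pairs so that the model is prompted to keep answering incorrectly:

```python
# Hypothetical misleading prefix; illustrative wording only.
misleading_prefix = (
    "Q: Is the Earth flat? A: Yes.\n"
    "Q: Do humans breathe water? A: Yes.\n"
)
question = "Q: Is Paris the capital of France? A:"
prompt = misleading_prefix + question

# The prefix degrades what the model *says* under zero-shot evaluation,
# while a CCS probe trained on the model's hidden states for the same
# inputs remains accurate, since the latent knowledge is unchanged.
```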
They meticulously dissect CCS to make sense of the patterns it identifies. Their findings show that the method generalizes across tasks, suggesting that models may have a task-independent representation of the truth and that CCS can approximately uncover it.
In addition, CCS can be effective even when model outputs are uninformative, which means it can leverage features other than the ones the outputs rely on. This is especially true when the hidden states in the middle layers of a network are used. Finally, the researchers show that truth representations are typically salient in models: they can be discovered with little data, and even by simply taking the top principal component of a slightly modified representation space.
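That last observation suggests a particularly simple recipe, sketched below under assumed variable names: normalize the hidden states of each contrast pair per class, take the differences, and use the top principal component of those differences as a candidate truth direction.

```python
import torch

def top_pc_direction(h_pos, h_neg):
    """Top principal component of normalized contrastive differences."""
    # Subtract per-class means so prompt wording does not dominate.
    diffs = (h_pos - h_pos.mean(0)) - (h_neg - h_neg.mean(0))
    _, _, v = torch.pca_lowrank(diffs, q=1)
    return v[:, 0]

# Illustrative stand-ins for real hidden states of contrast pairs.
h_pos, h_neg = torch.randn(128, 768), torch.randn(128, 768)
direction = top_pc_direction(h_pos, h_neg)

# The sign of a new contrast pair's difference projected onto
# `direction` gives an unsupervised guess at which statement is true.
scores = (h_pos - h_neg) @ direction
```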
According to the team, existing methods for making models honest rely heavily on human supervision to explicitly specify what is correct. However, there are situations where providing such supervision is impossible. They believe the encouraging empirical results of their method indicate that unsupervised approaches are a tractable and largely unexplored direction for study.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring new advancements in technologies and their real-life applications.