A Closer Look at una-cybertron-7B-v2 for Summarization
There is some drama going on in the HuggingFace community boards around the possibility of data contamination in some of the top-performing open-source LLMs, which would affect the reliability of their leaderboard. One of the models under discussion is una-cybertron-7B-v2, a Mistral fine-tune that I have been using for summarization tasks with pretty good results. According to the author, the model is tamed using Uniform Neural Alignment, an undisclosed technique on which he may publish a paper later. He also states the following:
UNA : Uniform Neural Alignment. It goes in the Attention and the multiperceptrons, and it does what it says. There are multiple phases… Juanako = UNAv1 only implemented at perceptron level. Cybertron = UNAv2 .. applied both MLP and Attention.. Xaberius = UNAv1.. meaning I can release a much more powerful version of it. Its based on platypus-34b and if u compare the performance.. its not too distant from it. And if u compare what UNA increases (Rationale/Logic capacity).. u’ll see the pattern across them.
A somewhat cryptic message, but an interesting one nonetheless. The underlying situation is that the community is divided: some accuse the author of intentionally gaming the leaderboard, while others vouch for the superior performance of his models, whatever the underlying fine-tuning technique happens to be.
I originally selected the model for my use case (summarization) based on vibe checks: I had the impression that it produced better summaries than the base Mistral or even the OpenHermes-2.5-neural-chat-7B-v3-1-7B blend, but maybe I was deceived by chance and flashy numbers? In this blog post I aim to sort this out with a more formal comparison that will hopefully settle the matter.
Summarization Task
To test the models against each other, we will perform a recursive summarization-by-parts task. This involves iteratively summarizing a long document into a small set of notes: first we summarize the paper into notes, then those notes into distilled notes, then into extra-distilled notes, and so on. We will use the following prompt, applying it iteratively over small chunks of the document until the entire text is processed. Then we will recursively apply this procedure over the results until we reach a set of summary notes with a total token length of less than 500 (a minimal sketch of this loop follows the prompt below).
SUMMARIZE_BY_PARTS_TEMPLATE = """You are an applied AI researcher specialized in the field of Large Language Models (LLMs), and you are currently reviewing the academic paper "{paper_title}". Your goal is to analyze the paper, identify the main contributions and most interesting findings, and write a bullet point list summary of it in your own words. This summary will serve as reference for future LLM researchers within your organization, so it is very important that you are able to convey the main ideas in a clear, complete and concise manner.
Read over the following section and take notes. Use a numbered list to summarize the main ideas.
[...]
{content}
[...]
## Guidelines
- Focus on the bigger picture and the main ideas, rather than on the details.
- Be sure to explain any new concept or term you introduce. Explain how things work clearly.
- Take notes of the most important numeric results and metrics (i.e. 30% accuracy, 4.5 times faster, etc.).
- If a table is presented just report back the main findings.
- Include examples in your notes that help clarify the main ideas.
- Highlight any practical applications or benefits of the paper's findings.
- Highlight unusual or unexpected findings.
- Take notes in the form of a numbered list. Do not include headers or any other elements.
- Do not include more than 10 items in your list.
- Your summary must be shorter than the original text. Remove any filler or duplicate content.
"""
Finally, to make the output notes easier to digest, we will run them through a final LLM call, applying the following copy-editor prompt:
NARRATIVE_SUMMARY_PROMPT = """You are an expert New York Times technology writer tasked with writing a summary of "{paper_title}". Your task is to read the following set of notes and convert them into an engaging paragraph.
{previous_notes}
## Guidelines
- You can reorganize and rephrase the notes in order to improve the summary's flow.
- Do not alter the meaning of the notes.
- Avoid repetition and filler content.
- Abstain from making unwarranted inferences.
- Avoid bombastic language.
- Include metrics and statistics in your report (i.e. 30% accuracy, 4.5 times faster, etc.).
- Include descriptions and explanations of any new concepts or terms. Describe how new models or methodologies work.
- Highlight any practical applications or benefits of the paper's findings.
- Highlight unusual or unexpected findings.
"""
Data
To keep things short, we will focus on summarizing three arXiv papers, the last of which is highly technical. We will manually inspect the results to decide which model does the best job. The papers under analysis are the following:
Chain-of-Verification Reduces Hallucination in Large Language Models
Abstract: Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.
Large Language Models Cannot Self-Correct Reasoning Yet
Abstract: Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities across various applications. Nevertheless, concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues. Building upon this premise, this paper critically examines the role and efficacy of self-correction within LLMs, shedding light on its true potential and limitations. Central to our investigation is the notion of intrinsic self-correction, whereby an LLM attempts to correct its initial responses based solely on its inherent capabilities, without the crutch of external feedback. In the context of reasoning, our research indicates that LLMs struggle to self-correct their responses without external feedback, and at times, their performance might even degrade post self-correction. Drawing from these insights, we offer suggestions for future research and practical applications in this field.
White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
Abstract: In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression.
Results
Below we look at the summaries produced by the different models and analyze their quality.
Chain-of-Verification Reduces Hallucination in Large Language Models
Analysis
In this first paper summary, we observe that Mistral Instruct forgot to emit the <|endoftext|> token, resulting in a never-ending stream of garbage. The other two models perform decently at summarization, although una-cybertron does a better job: it adhered to my instructions to reply in a single paragraph and include metrics, and produced a more readable text.
| Model | Score |
|---|---|
| Mistral-7B-Instruct-v0.2 | 0 |
| OpenHermes-2.5-neural-chat-7B-v3-1-7B | 0.7 |
| una-cybertron-7B-v2 | 0.9 |
Large Language Models Cannot Self-Correct Reasoning Yet
Analysis
Mistral-Instruct is able to produce a complete and coherent summary for this second paper. Good! Although it’s a bit longer than the one paragraph I requested and maintains a bullet-point-like format, the content seems to be of pretty high quality. OpenHermes-2.5-neural-chat and una-cybertron both do a good job, and I don’t think either summary is superior to the other.
| Model | Score |
|---|---|
| Mistral-7B-Instruct-v0.2 | 0.8 |
| OpenHermes-2.5-neural-chat-7B-v3-1-7B | 0.9 |
| una-cybertron-7B-v2 | 0.9 |
White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
Analysis
For the third paper, we again observe that Mistral Instruct fails to emit the termination token and produces a senseless stream of text. Both OpenHermes-2.5-neural-chat and una-cybertron do a decent job, although the latter earns additional points for including numerical metrics, as requested, and for producing a summary that better captures the bigger picture and is more enjoyable to read.
| Model | Score |
|---|---|
| Mistral-7B-Instruct-v0.2 | 0 |
| OpenHermes-2.5-neural-chat-7B-v3-1-7B | 0.7 |
| una-cybertron-7B-v2 | 0.9 |
Concluding Remarks
Looking at the aggregate results, it does seem like there is something noteworthy about the una-cybertron model. It appears to be at least on par with the state-of-the-art OpenHermes-2.5-neural-chat-7B-v3-1-7B, and possibly even follows instructions slightly better.
| Model | Average Score |
|---|---|
| Mistral-7B-Instruct-v0.2 | 0.27 |
| OpenHermes-2.5-neural-chat-7B-v3-1-7B | 0.77 |
| una-cybertron-7B-v2 | 0.9 |