Recursive Summarization by Parts: Efficient Long-Document Summarization
Note: The following was written by GPT-4 after I provided it with my `recursive_summarize_by_parts` function and asked it to turn it into a blog post. I have made minor edits to the text, including adding the images and formatting.
In the field of Natural Language Processing (NLP), efficiently summarizing extensive texts poses a unique challenge. Long-context Large Language Models (LLMs) exist for such tasks, but they often come with limitations, including high cost and potential inadequacies in capturing essential details across vast texts. In this article we take a look at a new technique, Recursive Summarization by Parts, which offers a more efficient and comprehensive approach to summarizing long-form documents. The technique can be implemented using small (~7B parameter) local LLMs, making it an economical and privacy-preserving alternative to proprietary solutions.
*Essence of Summarization, 1925*
Understanding Recursive Summarization by Parts
Recursive Summarization by Parts is an innovative method characterized by its iterative approach. Initially, the method segments a long document into manageable parts, summarizing each separately. Crucially, this process is not a one-off; these summaries are compiled into a shorter document, which is then segmented and summarized again. This recursive procedure continues until the document is condensed to the desired token length, ensuring a comprehensive and concise summarization of the entire text.
Key Components
- Text Splitting and Summarization Iteration: The process begins with the document being divided into smaller segments using a specialized text splitter, and each segment is then summarized individually. For the text splitting, we use the `RecursiveCharacterTextSplitter` from the langchain utilities:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split on token counts so each chunk fits comfortably in a small model's context.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=750, chunk_overlap=22
)
```
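As a quick sanity check, you can inspect what the splitter produces; `paper.txt` below is a hypothetical file holding the document text:

```python
with open("paper.txt") as f:  # hypothetical input file
    long_text = f.read()

chunks = text_splitter.create_documents([long_text])
print(f"Split into {len(chunks)} chunks")
print(chunks[0].page_content[:200])  # preview the start of the first chunk
```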
The `summarize_doc_chunk` function plays a critical role in this process. It takes each document segment and uses a langchain `LLMChain` call to generate a summary, essentially serving as the engine driving the summarization of individual document segments.

```python
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate


def summarize_doc_chunk(paper_title: str, document: str, model="local"):
    """Summarize a document chunk."""
    # Wrap the summarization directive (defined below) in a chat prompt.
    summarizer_prompt = ChatPromptTemplate.from_messages(
        [("system", SUMMARIZE_BY_PARTS_TEMPLATE)]
    )
    chain = LLMChain(llm=model, prompt=summarizer_prompt, verbose=False)
    summary = chain.run({"paper_title": paper_title, "content": document})
    return summary
```
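The `model` argument is expected to be a langchain-compatible LLM object (the `"local"` default in these snippets is just a placeholder). As an illustration only, assuming a ~7B model served through Ollama, it could be wired up like this:

```python
from langchain.llms import Ollama

# Hypothetical local model; any langchain-compatible LLM works here.
local_llm = Ollama(model="mistral")

notes = summarize_doc_chunk(
    paper_title="Attention Is All You Need",
    document=chunks[0].page_content,
    model=local_llm,
)
```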
These summarized segments are then compiled and subjected to further rounds of segmentation and summarization, as demonstrated in the following functions.
The `recursive_summarize_by_parts` function is central to achieving the desired document length. It applies the `summarize_by_parts` function iteratively, continually compressing the document until it reaches the specified token limit.

```python
import pandas as pd
import tiktoken

# Token counter; assumed to match the encoding used by the text splitter.
token_encoder = tiktoken.get_encoding("cl100k_base")


def summarize_by_parts(paper_title: str, document: str, model="local", verbose=False):
    """Summarize a paper by segments."""
    doc_chunks = text_splitter.create_documents([document])
    summary_notes = ""
    st_time = pd.Timestamp.now()
    for idx, current_chunk in enumerate(doc_chunks):
        # Pass the chunk text itself, not the Document wrapper.
        summary_notes += summarize_doc_chunk(paper_title, current_chunk.page_content, model) + "\n"
        if verbose:
            time_elapsed = pd.Timestamp.now() - st_time
            print(f"{idx+1}/{len(doc_chunks)}: {time_elapsed.total_seconds():.2f} seconds")
            st_time = pd.Timestamp.now()
    return summary_notes


def recursive_summarize_by_parts(
    paper_title: str, document: str, max_tokens=400, model="local", verbose=False
):
    """Recursively apply the summarize_by_parts function to a document."""
    ori_token_count = len(token_encoder.encode(document))
    token_count = ori_token_count
    if verbose:
        print(f"Starting tokens: {ori_token_count}")
    summaries_dict = {}
    token_dict = {}
    i = 1
    while token_count > max_tokens:
        if verbose:
            print("------------------------")
            print(f"Summarization iteration {i}...")
        document = summarize_by_parts(paper_title, document, model, verbose)
        token_diff = token_count - len(token_encoder.encode(document))
        token_count = len(token_encoder.encode(document))
        frac = token_count / ori_token_count
        summaries_dict[i] = document
        token_dict[i] = token_count
        i += 1
        if verbose:
            print(f"Total tokens: {token_count}")
            print(f"Compression: {frac:.2f}")
        if token_diff < 50:
            # The last pass barely shrank the text; further passes won't help.
            print("Cannot compress further. Stopping.")
            break
    return summaries_dict, token_dict
```
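Putting it all together, a typical call looks like the sketch below (the title and input are placeholders); the entry for the final iteration is the most compressed summary:

```python
summaries, token_counts = recursive_summarize_by_parts(
    paper_title="Attention Is All You Need",
    document=long_text,
    max_tokens=400,
    model=local_llm,
    verbose=True,
)

final_summary = summaries[max(summaries)]  # last iteration = shortest version
print(final_summary)
```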
- Summarization Directive (`SUMMARIZE_BY_PARTS_TEMPLATE`): A critical component, this directive guides the summarization model to focus on the main ideas, technical aspects, and key findings in a bullet-point format. In it we set the LLM persona, provide the instructions, and define the expected properties of the summary.
```python
SUMMARIZE_BY_PARTS_TEMPLATE = """You are an applied AI researcher specialized in the field of Large Language Models (LLMs), and you are currently reviewing the whitepaper "{paper_title}". Your goal is to analyze the paper, identify the main contributions and most interesting technical findings, and write a bullet point list summary of it in your own words. This summary will serve as reference for future LLM researchers within your organization, so it is very important that you are able to convey the main ideas in a clear, complete and concise manner.

Read over the following section and take notes. Use a numbered list to summarize the main ideas.

[...]
{content}
[...]

## Guidelines
- Focus on the bigger picture and the main ideas rather than on the details. Focus on technical descriptions and precise explanations.
- Be sure to clearly explain any new concept or term you introduce.
- Take notes of the most important numeric results and metrics.
- If a table is presented, report back the main findings.
- Include examples in your notes that help clarify the main ideas.
- Highlight any practical applications or benefits of the paper's findings.
- Highlight unusual or unexpected findings. Do not miss any important detail.
- Take note of important formulas, theorems, algorithms and equations.
- Take notes in the form of a numbered list, each item an information-rich paragraph. Do not include headers or any other elements.
- DO NOT include more than ten (10) items in your list. Any element beyond the tenth (10) will be discarded.
- Your summary must be shorter than the original text. Remove any filler or duplicate content.
- Adhere as closely as possible to the original text. Do not alter the meaning of the notes.
- Ignore and skip any bibliography or references sections.

## Summary
"""
```
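To check what the model will actually receive, you can render the directive without calling an LLM; the title and content below are placeholders:

```python
prompt = ChatPromptTemplate.from_messages([("system", SUMMARIZE_BY_PARTS_TEMPLATE)])
messages = prompt.format_messages(
    paper_title="Attention Is All You Need",
    content="We propose the Transformer, an architecture based solely on attention...",
)
print(messages[0].content[:300])  # preview the rendered system message
```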
Applications and Benefits
The Recursive Summarization by Parts technique is not confined to any specific type of text, making it broadly applicable to various long-form documents. Its iterative nature ensures that even the most detailed and extensive texts are summarized effectively.
Practical Benefits
- Comprehensiveness: Ensures all vital aspects of a document are captured, minimizing the risk of missing critical information.
- Scalability: Capable of handling documents of various lengths and complexities.
- Resource Efficiency: More economical compared to using large-context LLMs, especially beneficial for lengthy documents.
- Privacy: Can be implemented using local LLMs, eliminating the need to share sensitive data with third parties.
Real-World Applications
- Versatile Text Analysis: Can be applied to diverse texts, from technical reports to comprehensive narratives.
- Detailed Information Processing: Ideal for contexts where deep understanding and concise summarization of extensive information are required.
In summary, the Recursive Summarization by Parts approach marks a significant advancement in NLP's capability to handle long-form texts. It offers a meticulous yet efficient method to distill vast amounts of information into concise summaries, making it an invaluable tool in the arsenal of anyone dealing with extensive textual data.