Financial News Sentiment Classification with GPT Models
Last updated: Jul 8, 2023
In this short study I examine the performance of OpenAI’s GPT-3.5 and GPT-4 models on financial news sentiment classification. The models are evaluated in a few-shot setting on the Financial PhraseBank (FPB) dataset, which contains 4846 financial news headlines labeled as neutral, positive, or negative by 4 annotators. The aim of the analysis is to determine whether GPT-4 is needed to extract reliable information in this domain-specific task, or whether the lighter GPT-3.5 model is sufficient. This has important implications for both runtime and total cost, as GPT-4 is >10x as expensive and significantly slower than its smaller counterpart.
I analyze two variants of the FPB dataset: one that includes only the observations where 4/4 annotators agree, and a more ambiguous one where only 3/4 annotators reach consensus. I expect GPT-4 to shine in the latter case, as it should better capture the subtleties of language in these more ambiguous instances. But let’s see what the data tells us.
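As a rough illustration of the cost gap, here is a back-of-the-envelope estimate for classifying the ~3,450 headlines analyzed below. The per-token prices are my assumptions based on OpenAI’s mid-2023 list prices, and the token counts per request are rough guesses for the few-shot prompt used later, so treat the totals as indicative only.
## Back-of-the-envelope cost estimate (assumed mid-2023 list prices).
N_HEADLINES = 2264 + 1189  ## both dataset variants combined
PROMPT_TOKENS, COMPLETION_TOKENS = 350, 5  ## rough guesses per request
prices = {  ## (input, output) USD per 1K tokens -- assumed values
    "gpt-3.5-turbo": (0.0015, 0.002),
    "gpt-4": (0.03, 0.06),
}
for model, (p_in, p_out) in prices.items():
    cost = N_HEADLINES * (PROMPT_TOKENS * p_in + COMPLETION_TOKENS * p_out) / 1000
    print(f"{model}: ~${cost:.2f}")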
from langchain.chat_models import ChatOpenAI
from langchain import LLMChain
from langchain.callbacks import get_openai_callback
import time
import pandas as pd
import numpy as np
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from dotenv import load_dotenv
from IPython.display import display, IFrame
from tqdm.notebook import tqdm
from datasets import load_dataset, logging
import warnings
pd.set_option('display.max_colwidth', 200)
warnings.filterwarnings("ignore")
logging.set_verbosity_error()
logging.disable_progress_bar()
## Load env variables (.env file).
_ = load_dotenv()
Financial PhraseBank Dataset
We start by loading the Financial PhraseBank dataset and performing some basic preprocessing steps. We also take a look at the distribution of the labels we will be predicting, along with a few examples.
## More sensible sentiment labels.
sentiment_map = {
0 : -1, # negative
1 : 0, # neutral
2 : 1 # positive
}
## Load data.
df_all = load_dataset("financial_phrasebank", "sentences_allagree")["train"].to_pandas()
df_75 = load_dataset("financial_phrasebank", "sentences_75agree")["train"].to_pandas()
df_75 = df_75[~df_75["sentence"].isin(df_all["sentence"])]
## Augment.
df_all["label"] = df_all["label"].apply(sentiment_map.get)
df_75["label"] = df_75["label"].apply(sentiment_map.get)
df_all[["gpt35", "gpt4"]] = np.nan
df_75[["gpt35", "gpt4"]] = np.nan
# Add a column to each dataframe to identify them.
df_all["dataset"] = "4/4 Annotators Agree"
df_75["dataset"] = "3/4 Annotators Agree"
df_combined = pd.concat([df_all, df_75])
## Print size.
print("Dataset sizes:")
print(f"- 4/4 annotators agree: {len(df_all)}")
print(f"- 3/4 annotators agree: {len(df_75)}")
Dataset sizes:
- 4/4 annotators agree: 2264
- 3/4 annotators agree: 1189
## Quick view at the dataset.
print("SAMPLE HEADLINES:\n")
for idx, row in df_combined[["sentence", "label", "dataset"]].sample(3).iterrows():
    print(f"Dataset: {row['dataset']}")
    print(f"Headline: {row['sentence']}")
    print(f"Label: {row['label']}")
    print("==")
SAMPLE HEADLINES:
Dataset: 4/4 Annotators Agree
Headline: The continued operations mean the structure after the restructuring of the Aspocomp group including Aspocomp Oulu and the headquarter operations .
Label: 0
==
Dataset: 4/4 Annotators Agree
Headline: Viking Line 's cargo revenue increased by 5.4 % to EUR 21.46 mn , and cargo volume increased by 2.4 % to 70,116 cargo units .
Label: 1
==
Dataset: 3/4 Annotators Agree
Headline: Russian Media Ventures ' minority shareholder Peter Hervy denied the plans to sell OVA Press , the daily said .
Label: 0
==
# Show distribution of labels.
fig = px.histogram(df_combined, x="label", color="label", facet_col="dataset", title="Label Distributions")
pio.write_html(fig, file="label_distributions.html", auto_open=False)
IFrame(src="label_distributions.html", width=800, height=400)
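Since the interactive chart may not render for every reader, here is the same distribution as a plain crosstab:
## Text view of the label counts per dataset.
print(pd.crosstab(df_combined["dataset"], df_combined["label"]))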
LLM Chain Setup
Now we set up the prompt templates and the GPT-3.5 and GPT-4 LLMChains that perform the sentiment analysis. Obviously we use langchain 🖤 for this.
## OpenAI LLM models.
llm35 = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm4 = ChatOpenAI(model="gpt-4", temperature=0)
## Template config for LLMChains.
def create_template(outputs, output_format):
    """ Create template with I/O instructions. """
    template = (
        f"""You are an advanced AI model specialized in analyzing news headlines and extracting a sentiment score for them, where sentiment is defined in the context of expected market impact.
{output_format} There are 3 allowed values for sentiment scores: [-1, 0, 1]. -1 means negative sentiment, 0 means neutral sentiment, and 1 means positive sentiment.
EXAMPLE
-------
INPUT:"According to Gran, the company has no plans to move all production to Russia, although that is where the company is growing."
OUTPUT:{outputs[0]}
INPUT:"Tiimari operates 194 stores in six countries -- including its core Finnish market -- and generated a turnover of 76.5mln eur in 2005."
OUTPUT:{outputs[1]}
INPUT:"However , Biohit estimates its total net sales will continue to grow in 2009, and that favourable trends in net sales will lead to a profit in 2009."
OUTPUT:{outputs[2]}
INPUT:"Softbank incorporated applied for 4 new patents."
OUTPUT:{outputs[3]}
YOUR TURN
---------
""" + """
INPUT:{news}
OUTPUT:"""
    )
    return template
## Template config parameters.
output_base = [
"0",
"0",
"1",
"0",
]
output_cot = [
"""""Despite growth in Russia the company's decision not to relocate all production there suggests a neutral business strategy and carries no neutral sentiment."|0""",
"""""The company's extensive operations and significant turnover are reported as facts without any strong positive or negative implications, reflecting a neutral sentiment."|0""",
"""""Increasing sales trends are likely to make Biohit's stock appreciate, resulting in a positive sentiment."|1""",
"""""The headline does not provide enough information to assign a sentiment score."|0""",
]
output_instructions = [
"Your output is a single number, which is the sentiment score.",
"Your output is a one-sentence explanation of the expected market impact, followed by the numerical score. These outputs are separated by a '|' (i.e. explanation|score).",
]
template_base = create_template(output_base, output_instructions[0])
template_cot = create_template(output_cot, output_instructions[1])
## Create LLMChains.
llm_chain35 = LLMChain.from_string(llm=llm35, template=template_base)
llm_chain4 = LLMChain.from_string(llm=llm4, template=template_base)
llm_chain35_cot = LLMChain.from_string(llm=llm35, template=template_cot)
llm_chain4_cot = LLMChain.from_string(llm=llm4, template=template_cot)
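Before looping over thousands of headlines, a quick sanity check on a single made-up headline (assuming a valid OPENAI_API_KEY in the .env file) confirms that both output formats come back as expected:
## Smoke test on a single (made-up) headline.
sample = "Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in 2007."
print("base:", llm_chain35.run(news=sample))      ## expected: a bare score, e.g. "1"
print("cot: ", llm_chain35_cot.run(news=sample))  ## expected: "explanation|score"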
Helper Functions
Finally, we define a couple of helper functions to run the classification loop and visualize the results.
def iterate_news_df(df, llm_chain35, llm_chain4):
    """ Iterate over news dataframe and run LLMChains. """
    def parse_score(res):
        ## The CoT chains return 'explanation|score', the base chains a bare
        ## score; taking the part after the last '|' handles both formats.
        return int(res.split("|")[-1].strip())
    ## get_openai_callback tracks token usage and cost for the run.
    with get_openai_callback() as cb:
        ## Iterate with a time delay to stay under the API rate limits.
        ## (A token cap, if desired, belongs in the ChatOpenAI constructor, not in run().)
        for idx, row in tqdm(df.iterrows(), total=len(df)):
            news = row["sentence"]
            if np.isnan(row["gpt4"]):
                res35 = llm_chain35.run(news=news)
                time.sleep(1)
                res4 = llm_chain4.run(news=news)
                time.sleep(1)
                ## Store results.
                df.loc[idx, "gpt35"] = parse_score(res35)
                df.loc[idx, "gpt4"] = parse_score(res4)
    df["label"] = df["label"].astype(int)
    df["gpt35"] = df["gpt35"].astype(int)
    df["gpt4"] = df["gpt4"].astype(int)
    return df
def create_confusion_matrix(df, labels, models):
    """ Create confusion matrix and calculate accuracy. """
    matrix_dict = {}
    accuracy_dict = {}
    for model in models:
        # Create confusion matrix (rows: true labels, columns: predictions).
        matrix = pd.crosstab(df[labels], df[model])
        matrix.columns = ["Negative", "Neutral", "Positive"]
        matrix.index = ["Negative", "Neutral", "Positive"]
        matrix_dict[model] = matrix
        # Calculate accuracy.
        accuracy = np.trace(matrix.values) / np.sum(matrix.values)
        accuracy_dict[model] = accuracy
        # Normalize each row to percentages and round to whole numbers.
        matrix_percentage = matrix.div(matrix.sum(axis=1), axis=0) * 100
        matrix_dict[model] = matrix_percentage.round()
    return matrix_dict, accuracy_dict
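To illustrate what this helper returns, here is a toy call on a few hand-made rows (hypothetical predictions, purely to show the output shape):
## Toy example with hypothetical predictions covering all three classes.
toy = pd.DataFrame({
    "label": [-1, -1, 0, 0, 1, 1],
    "gpt35": [-1,  0, 0, 0, 1, 1],
    "gpt4":  [-1, -1, 0, 1, 1, 1],
})
m, acc = create_confusion_matrix(toy, "label", ["gpt35", "gpt4"])
print(acc)        ## per-model accuracy, here 5/6 for both
print(m["gpt4"])  ## row-normalized confusion matrix in percent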
def create_confusion_matrix_plot(matrix_dict, accuracy_dict):
    """ Create confusion matrix plot. """
    fig = make_subplots(rows=1, cols=len(matrix_dict), subplot_titles=list(matrix_dict.keys()))
    title = "Confusion Matrix"

    # Function to add annotations
    def add_annotations(fig, values, row, col):
        for i, y in enumerate(values.index):
            for j, x in enumerate(values.columns):
                fig.add_annotation(
                    text=str(int(values.iloc[i, j])) + '%',
                    xref='x' + str(col), yref='y' + str(col),
                    x=x, y=y, showarrow=False,
                    font=dict(size=12, color='white' if values.iloc[i, j] > 50 else 'black'),
                    row=row, col=col
                )

    for i, model in enumerate(matrix_dict.keys(), start=1):
        fig.add_trace(go.Heatmap(
            z=matrix_dict[model].values,
            x=matrix_dict[model].columns,
            y=matrix_dict[model].index,
            colorscale="Blues",
            showscale=False,
        ), row=1, col=i)
        # Add values annotations to heatmaps
        add_annotations(fig, matrix_dict[model], row=1, col=i)
        # Add accuracy annotations
        fig.add_annotation(
            text=f'Accuracy: {accuracy_dict[model]*100:.2f}%',
            xref='paper', yref='paper',
            x=i - 1, y=-0.30, showarrow=False,
            font=dict(size=17)
        )

    # Final layout
    fig.update_layout(
        title=title,
        yaxis=dict(title="True Label"),
        width=800,
        height=400,
        margin=dict(l=50, r=50, b=75, t=100, pad=4),
        paper_bgcolor="white",
        plot_bgcolor="white",
        font=dict(size=16),
    )
    fig.layout.yaxis2.showticklabels = False
    return fig
Dataset 1: All annotators agree
We start by analyzing the first dataset, where all 4 annotators agree.
Few-shot Performance
df_all = iterate_news_df(df_all, llm_chain35, llm_chain4)
matrix_dict_all, accuracy_dict_all = create_confusion_matrix(df_all, "label", ["gpt35", "gpt4"])
fig_all = create_confusion_matrix_plot(matrix_dict_all, accuracy_dict_all)
pio.write_html(fig_all, file="scores_all_agree.html", auto_open=False)
IFrame(src="scores_all_agree.html", width=1000, height=500)
Both models achieve roughly 78% accuracy on this dataset, with a negligible difference between them. We also see that most errors occur in the neutral category, where the models tend to predict positive sentiment instead.
Dataset 2: 3/4 annotators agree
Now let’s see how the models perform on the more ambiguous dataset, where only 3/4 annotators agree on the sentiment label.
Few-shot Performance
df_75 = iterate_news_df(df_75, llm_chain35, llm_chain4)
matrix_dict_75, accuracy_dict_75 = create_confusion_matrix(df_75, "label", ["gpt35", "gpt4"])
fig_75 = create_confusion_matrix_plot(matrix_dict_75, accuracy_dict_75)
pio.write_html(fig_75, file="scores_75_agree.html", auto_open=False)
IFrame(src="scores_75_agree.html", width=1000, height=500)
Again, both models deliver surprisingly similar scores, despite my expectation that GPT-4 would excel in these more ambiguous cases.
Few-shot with CoT Reasoning Performance
As a final test we run the same analysis on the 3/4 agreement dataset, but now allow the models to produce an intermediate chain-of-thought style reasoning step before settling on an answer. This approach has been shown to boost performance across a range of analytical tasks, with GPT-4 usually seeing the larger gains thanks to its stronger reasoning capabilities.
## Reset the predictions on a copy, so the CoT chains actually run.
df_75_cot = df_75.copy()
df_75_cot[["gpt35", "gpt4"]] = np.nan
df_75_cot = iterate_news_df(df_75_cot, llm_chain35_cot, llm_chain4_cot)
matrix_dict_75_cot, accuracy_dict_75_cot = create_confusion_matrix(df_75_cot, "label", ["gpt35", "gpt4"])
fig_75_cot = create_confusion_matrix_plot(matrix_dict_75_cot, accuracy_dict_75_cot)
pio.write_html(fig_75_cot, file="scores_75_agree_cot.html", auto_open=False)
IFrame(src="scores_75_agree_cot.html", width=1000, height=500)
Under this specific setup GPT-4 indeed delivers a non-negligible performance boost, although the difference may not be large enough to justify the extra monetary cost. Interestingly, we also observe that the CoT reasoning step improves performance on the neutral category but slightly degrades it on the other two.
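Since each row of the percentage matrices is normalized by the true label, the diagonals correspond to per-class recall; comparing them directly makes the per-class effect of CoT concrete:
## Per-class recall (matrix diagonals) for GPT-4, with and without CoT.
recall = pd.DataFrame({
    "gpt4":     np.diag(matrix_dict_75["gpt4"].values),
    "gpt4_cot": np.diag(matrix_dict_75_cot["gpt4"].values),
}, index=["Negative", "Neutral", "Positive"])
print(recall)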
Error Analysis
As a final exercise, we look at the errors made by the models.
Sample Errors by GPT-3.5 (4/4 agreement dataset)
for idx, row in df_all[df_all["gpt35"] != df_all["label"]].sample(3).iterrows():
print(f"{row['sentence']}\n\tGPT-3.5: {row['gpt35']}\n\tCorrect: {row['label']}\n")
For 2009 , net profit was EUR3m and the company paid a dividend of EUR1 .30 apiece .
GPT-3.5: 1
Correct: 0
The company said that 80 % of the shares of the holding company will be sold to Meadville Holdings Limited , a Hong Kong listed parent company of the Meadville Group .
GPT-3.5: 1
Correct: 0
Mr. Doring has been with Eaton since 1989 and acted as the Business Unit Manager for Eaton 's Power Controls Business Unit since January 2007 .
GPT-3.5: 1
Correct: 0
Sample Errors by GPT-4 (4/4 agreement dataset)
for idx, row in df_all[df_all["gpt4"] != df_all["label"]].sample(3).iterrows():
print(f"{row['sentence']}\n\tGPT-4: {row['gpt4']}\n\tCorrect: {row['label']}\n")
The EA Reng group posted sales of approximately 84 million kroons for 2007 .
GPT-4: 1
Correct: 0
The adjustments concern staff in both the Specialty Papers and the Fiber Composites segments .
GPT-4: -1
Correct: 0
Cencorp would focus on the development , manufacture and marketing of standard products for production automation .
GPT-4: 1
Correct: 0
Sample Errors by GPT-4 (with CoT)
for idx, row in df_75_cot[df_75_cot["gpt4"] != df_75_cot["label"]].sample(3).iterrows():
print(f"{row['sentence']}\n\tGPT-4: {row['gpt4']}\n\tCorrect: {row['label']}\n")
`` Our approach is very much to only use raw materials that are produced in line with the principles of sustainable development .
GPT-4: 1
Correct: 0
The company 's board of directors would propose a dividend of EUR0 .15 per share for 2005 .
GPT-4: 1
Correct: 0
The commission found evidence of several meetings to discuss the cartel , including one in a Brussels restaurant in November 1997 at which the companies discussed price increases .
GPT-4: -1
Correct: 0
Inspecting this small sample of data, it is not clear to me that one model makes dumber mistakes than the other.
Conclusion
In this short study we have seen that GPT-4 does not seem to offer a significant advantage over GPT-3.5 for financial news sentiment classification. This is good news for those of us looking to perform related tasks, as it means we can use the cheaper, faster GPT-3.5 model with little or no loss in performance.
Limitations
The main limitation of this study is that it looks at a dataset of one-line financial headlines. In practice, researchers will often analyze full news articles, where GPT-4’s reasoning capabilities might really deliver the expected performance boost. However, the results here still suggest that GPT-3.5 is reasonably well-equipped for sentiment classification tasks, and one may rely on it if willing to sacrifice a small fraction of performance (assuming CoT is implemented). It will also be interesting to see similar comparisons for open-source models.
Acknowledgements
Thanks to Fettuccine Gonzales for the inspiration for this study. In a parallel universe, you could find her original analysis here.