Are LLMs Rational Investors?
A Study on Detecting and Reducing the Financial Bias in LLMs

Yuhang Zhou1,2, Yuchen Ni3, Yunhui Gan1,2, Zhangyue Yin1,
Xiang Liu4, Jian Zhang5, Sen Liu1,2, Xipeng Qiu1, Guangnan Ye1,2, Hongfeng Chai1,2

1School of Computer Science, Fudan University
2Institute of Fintech, Fudan University
3School of Electronics and Information Engineering, Tongji University
4Tandon School of Engineering, New York University
5DataGrand Inc
Yuhang Zhou. Email: yuhangzhou22@m.fudan.edu.cn
Guangnan Ye (Corresponding Author). Email: yegn@fudan.edu.cn
Abstract

Large Language Models (LLMs) are increasingly adopted in financial analysis for interpreting complex market data and trends. However, their use is challenged by intrinsic biases (e.g., risk-preference bias) and a superficial understanding of market intricacies, necessitating a thorough assessment of their financial insight. To address these issues, we introduce Financial Bias Indicators (FBI), a framework with components like Bias Unveiler, Bias Detective, Bias Tracker, and Bias Antidote to identify, detect, analyze, and eliminate irrational biases in LLMs. By combining behavioral finance principles with bias examination, we evaluate 23 leading LLMs and propose a de-biasing method based on financial causal knowledge. Results show varying degrees of financial irrationality among models, influenced by their design and training. Models trained specifically on financial datasets may exhibit more irrationality, and even larger financial language models (FinLLMs) can show more bias than smaller, general models. We utilize four prompt-based methods incorporating causal debiasing, effectively reducing financial biases in these models. This work enhances the understanding of LLMs’ bias in financial applications, laying the foundation for developing more reliable and rational financial analysis tools.

1 Introduction

Recent advancements in LLMs, such as GPT-4 OpenAI (2023) and LLaMA-2 Touvron et al. (2023), have shown their prowess across a spectrum of natural language processing (NLP) tasks Zhao et al. (2023), extending into specialized domains such as finance Zhang et al. (2023), law Cui et al. (2023), and healthcare Wang et al. (2023). Despite their versatility, these models grapple with inherent biases Gallegos et al. (2023), Rutinowski et al. (2023), encompassing gender, race, and socioeconomic disparities, which could compromise their reliability and entail significant consequences Jeoung et al. (2023). Efforts to mitigate such biases, especially within the social sciences, have produced benchmark datasets such as StereoSet Nadeem et al. (2020) for stereotype identification and GenderCare Tang et al. for gender bias detection, as well as OpinionGPT Haller et al. (2023) for generating bias-neutral content, which employs techniques such as LoRA Hu et al. (2021) for model fine-tuning. As these models are applied in specific domains, it becomes imperative to address such biases to ensure fair and unbiased outcomes.

Figure 1: An example of model irrationality: the model gives inconsistent expectations for the same event when the subject changes, producing different sentiments and reflecting a financial bias towards the company.

However, in the realm of financial LLMs (FinLLMs), research has predominantly concentrated on enhancing model performance through continued pre-training or fine-tuning, with evaluation metrics focused on NLP tasks and financial applications. Moreover, LLMs have been employed in financial analytics, acting as advisors by leveraging news Lopez-Lira and Tang (2023) and fundamental analyses Fatouros et al. (2024) for investment decisions across diverse securities Romanko et al. (2023). The efficacy of these models, however, is contingent upon their own rationality as participants in the market. A lack of rationality in LLMs could lead to misinterpretations and misapplications of market dynamics, adversely impacting not only the users of these models but also the broader economy. Figure 1 shows an example of model irrationality.

Hence, it is essential to assess the rationality of LLMs before incorporating them into financial advisory roles. Limited research has addressed the financial biases present in pre-trained models, exploring methods such as probabilistic detection Chuang and Yang (2022) and consistency checks Yang et al. (2023). However, these methods mainly focus on pre-trained embedding models like BERT Devlin et al. (2019), limiting their application to LLM bias detection. The absence of evaluation standards for financial biases hinders thorough oversight, compromising assessment realism and objectivity. Consequently, there is an immediate need for a holistic framework to gauge LLMs’ rationality in finance.

Meanwhile, the detection of financial rationality and biases in LLMs faces four main challenges:

  • Q1: How to define the financial rationality and biases of LLMs? A theoretical framework is required to support the detection of LLMs.

  • Q2: How to detect and reveal financial biases in LLMs? A method needs to be developed to quantify theoretical indicators and construct relevant datasets.

  • Q3: How to investigate the origins of financial biases in LLMs? It is necessary to research whether these financial biases stem from the model’s capabilities or its robustness.

  • Q4: How to mitigate financial biases in LLMs? Methods need to be found that mitigate biases without compromising the original capabilities of the model.

To work towards this goal, our study conducts an examination of the financial rationality of LLMs based on the theory of behavioral finance Barberis and Thaler (2003). We believe that the enduring theory of behavioral finance, grounded in psychology and finance, can provide a more comprehensive perspective to support research findings. In the current scenario where it is challenging to quantify behavioral finance, we propose the Financial Bias Indicators (FBI) framework to comprehensively assess financial rationality in LLMs. The FBI framework consists of four components: Bias Unveiler, Bias Detective, Bias Tracker, and Bias Antidote, covering the definition, detection, cause analysis, and mitigation of financial biases.

Our research shows that almost all LLMs exhibit financial irrationality. These biases, which may be exacerbated by continuous pre-training or fine-tuning with financial data, could lead to market anomalies in real-world applications. While prompt-based mitigation methods show promise, the persistent biases in LLMs highlight the necessity for further research to improve model robustness, fairness, and rationality, ensuring financial market stability and asset protection.

Figure 2: The framework of FBI consists of the Bias Unveiler, Bias Detective, Bias Tracker, and Bias Antidote. The Bias Unveiler defines financial biases in LLMs based on behavioral finance. The Bias Detective constructs relevant data and detects biases in 23 leading LLMs. The Bias Tracker traces biases using System 2 slow thinking analysis and attention mechanism visualization. The Bias Antidote attempts to debias the models using four methods.

Our key contributions to this field are summarized as follows:

  • The FBI framework, based on behavioral finance, defines, detects, analyzes, and mitigates financial biases in LLMs, offering a novel approach to evaluate their financial rationality. It is the first study to quantify behavioral finance indicators in LLMs, paving the way for more reliable LLMs.

  • The framework extensively analyzes 23 leading LLMs, assessing how model parameters, training data, and input formats affect financial rationality. Our study deepens the understanding of the varying levels of financial irrationality among models and their behavior in financial contexts.

  • Utilizing the FBI framework, we explore the origins of financial irrationality in LLMs, identify methods to mitigate bias, and develop a dataset of 200,000 financial causal texts named FinCausal to address biases with causal knowledge. Ultimately, we experimented with four prompt-based methods for bias mitigation, yielding encouraging results.

2 Background

Behavioral finance explores how psychological factors and cognitive biases influence the financial behaviors of individuals, institutions, and markets, differing from traditional finance theories that assume rational market participants. The field aims to uncover the psychological roots of various market phenomena, dissecting financial decision-making processes to build a realistic framework of market dynamics that includes cognitive errors and the constraints on arbitrage. We structure our investigation using the classification from  Barberis and Thaler (2003), which divides behavioral finance into Cognitive Bias and Limits to Arbitrage.

2.1 Cognitive Bias

Cognitive biases represent systematic departures from normative decision-making, influencing how investors form beliefs and assess risks. These biases are multifaceted, manifesting as erroneous beliefs or inconsistent risk preferences. For instance, belief biases, such as attentional neglect, representativeness bias, anchoring effect, and overconfidence, skew investors’ expectations, while risk-preference biases, such as loss aversion and reference dependence, lead to irregularities in risk assessment and decision-making under uncertainty. A comprehensive taxonomy of these biases, alongside their definitions, is detailed in Appendix C.

2.2 Limits to Arbitrage

Contrary to the Efficient Markets Hypothesis (EMH), which posits that asset prices fully reflect all available information and that market participants behave rationally, behavioral finance identifies scenarios where irrationality prevails, leading to anomalies like market bubbles and systemic crises. These phenomena are often attributed to the collective impact of cognitive biases on investor expectations, which can cause significant deviations from asset fundamentals.

3 FBI: A Framework for Assessing LLMs Financial Rationality

We propose the FBI framework illustrated in Figure 2. This framework is divided into four parts: Bias Unveiler defines financial biases in LLMs based on behavioral finance; Bias Detective constructs detection data and evaluates current leading LLMs for biases; Bias Tracker analyzes the causes of biases based on detection results and attention mechanisms; Bias Antidote builds a financial causal dataset and employs a series of prompt-based methods to mitigate bias phenomena.

4 Bias Unveiler

Based on the definitions from behavioral finance, we categorize biases in LLMs within financial contexts into Belief Bias and Risk-preference Bias. Within these categories, we define six related psychological biases.

4.1 Belief Bias

In today’s information-rich environment, constant updates require adjusting our predictions and beliefs. This framework investigates three cognitive biases—Anchoring, Representativeness, and Overconfidence—using real-world market data like news and shareholder discussions to test LLMs’ ability to maintain rationality in market conditions.

Anchoring Effects

We test LLMs for Anchoring Effects by checking if they show different views on the same event or give consistent responses under different company settings. This bias, derived from past data, can introduce sentiment analysis biases and potentially disrupt markets when used in finance.

Representativeness Bias

We investigate Representativeness Bias in LLMs by analyzing their outputs in relation to company size and sector. This bias towards size and industry can concentrate investment risks and cause problems.

Overconfidence

To measure overconfidence, we track score fluctuations for the same events with different subjects in FinLLMs and corresponding base LLMs. Aggressive scores with high deviation suggest overconfidence in these models’ event assessments or responses.

4.2 Risk-preference Bias

Asset returns are uncertain, affecting investor behavior based on risk-return preferences. Our study of Risk-preference Bias explains Situational Dependence Bias, loss aversion, and framing effect in various decision contexts, assessing LLMs’ risk preferences in different scenarios.

Situational Dependence Bias

Recognizing decision-making as a process shaped by previous experiences and contextual factors, we delve into the Situational Dependence Bias by examining if LLMs exhibit variable risk preferences across different scenarios.

Loss Aversion

In the context of Loss Aversion, we scrutinize LLM responses within loss-framed scenarios, aiming to uncover any predominant risk-averse or risk-loving tendencies.

Framing Effect

We investigate the Framing Effect by restating scenarios in various languages or expressions to track changes in LLM preferences, aiming to determine if linguistic framing influences LLM outputs, indicating bias in option presentation.

5 Bias Detective

5.1 Belief Bias

5.1.1 Data Design

To comprehensively assess the rationality of LLMs in financial markets, we scrutinized their responses to emergent information, distinguishing between event news and interactions, amid the prevalent noise in online investor dialogues in China. We analyzed historical events impacting company stock prices and adopted a refined classification of financial events into four primary categories: Corporate Governance and Equity Changes (CGEC), Financial Reports and Earnings Expectations (FREE), Market Behavior and Announcements (MBA), and Negative Events and Risk Management (NERM), detailed in Appendix C.

Throughout 2023, we compiled 300 news articles, including a subset of 24 emotionally nuanced pieces $N' = \{n_1, n_2, \ldots, n_{24}\}$ to enhance the bias detection in LLMs. This subset, detailed in Appendix C, contained articles categorized into nine positive, nine negative, and six mixed emotions.

Additionally, we assembled 10 neutral interaction pairs $I = \{(q_1, r_1), (q_2, r_2), \ldots, (q_{10}, r_{10})\}$ to analyze LLM comprehension in a controlled environment.

For each $n \in N'$ and $(q, r) \in I$, numerical details were abstracted to proportional figures to standardize data across varying company market caps, as documented in Appendix F.

We sampled 600 companies from the Chinese A-share market, excluding delisting entities, distributed across three tiers of market capitalization: top, middle, and bottom, each containing 200 companies. This selection method aimed to test Belief Bias in LLMs, with classifications by industry outlined in Appendix B.

5.1.2 Analysis Logic

To investigate Anchoring Effects, we altered the subject company in each news item $n \in N'$ and assessed variability in LLM evaluations using Analysis of Variance (ANOVA) St et al. (1989), expressed as $F(n, C)$, examining the variance in scores across companies $c_i \in C$ for each news item $n$. Representativeness Bias was analyzed by correlating LLM outputs with company size and industry sector, using the Spearman correlation coefficient $\rho(S, M)$, where $S$ and $M$ represent LLM scores and market capitalizations of companies $c_i$, respectively. ANOVA was employed for industry correlation $F(S, I)$, with box plots to display score distributions across industries $I$, reflecting first-level industry classifications. To measure Overconfidence, we used the standard deviation $\sigma(S)$ of scores $S$, representing the variability in LLM evaluations of the same event across different company contexts, comparing results from FinLLMs with base LLMs.
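The three statistics named above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the industry names and scores are made-up toy data, the helper names (`anova_f`, `ranks`, `spearman_rho`) are hypothetical, and the grouping shown is the industry case $F(S, I)$; applying the same function per news item with companies as groups would additionally assume several repeated scores per company.

```python
from statistics import mean, pstdev

def anova_f(groups):
    """One-way ANOVA F statistic for a list of score groups."""
    all_scores = [s for g in groups for s in g]
    grand = mean(all_scores)
    k, n = len(groups), len(all_scores)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((s - mean(g)) ** 2 for g in groups for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy data (hypothetical): LLM scores for one event, grouped by industry.
scores_by_industry = {
    "Banking": [2.0, 3.0, 4.0],
    "Steel": [4.0, 5.0, 6.0],
    "Media": [6.0, 7.0, 8.0],
}
f_industry = anova_f(list(scores_by_industry.values()))  # 12.0

# Toy data (hypothetical): scores S and market capitalizations M.
rho = spearman_rho([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])  # 1.0

# Overconfidence proxy: spread of scores for the same event.
sigma = pstdev([2, 4, 4, 4, 5, 5, 7, 9])  # 2.0
```

A large F value points to score differences between groups that exceed the spread within groups, a $\rho$ near zero indicates little monotone relation between score and market capitalization, and a larger $\sigma$ for a FinLLM than for its base model corresponds to the Overconfidence reading used in this section.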

5.1.3 Result

Figure 3: The score variance between models in event news detection. A higher variance indicates a more severe Anchoring Effect.

The evaluation of Belief Bias is principally conducted through the examination of event news and interactions. This analysis reveals a widespread Anchoring Effect across the majority of LLMs when the subjects of events and interactions are modified, with slight variations observed among different models. The average variance index, detailed in Figure 3, sheds light on the rationality levels of LLMs with respect to representativeness bias. Specifically, LLMs with a focus on the Chinese language, such as the GLM and Qwen families, exhibit commendable financial rationality, whereas the Xuanyuan and Baichuan families are more susceptible to irrational behavior.

Figure 4: The score distribution between FinQwen and Qwen-14B across 24 news events shows that FinQwen is less stable and more aggressive in scoring compared to Qwen-14B, exhibiting stronger Overconfidence.

In terms of Overconfidence, violin plots, presented in Appendix H.1, illustrate the score distributions of various texts across all models. A notable disparity is observed in the models’ responses to composite texts of positive and negative emotional content. As per Table 5, the GPT and InternLM models display a marked optimism, in contrast to the pronounced pessimism of the Qwen and GLM families. Furthermore, Figure 4 highlights that models trained on financial corpora show greater score variability than their base counterparts.

In terms of Representativeness Bias, all LLMs exhibited a correlation coefficient below 10% between model scores and market capitalization, indicating a weak correlation. However, certain LLMs showed clear biases towards specific industries, as documented in Appendix  H.1. For example, the FinQwen model consistently allocated lower scores to the Steel and Banking sectors. A comprehensive analysis reveals that the Media, Steel, Banking, and Non-Banking Finance sectors frequently occupy the extreme ends of the scoring spectrum across different models, whereas the Computer Science and Automobile sectors generally maintain a middle ground, exhibiting relative stability.

5.2 Risk-preference Bias

5.2.1 Data Design

To simulate real-life decision-making, we designed 40 scenarios with 200 multiple-choice questions, divided into 200 gain-framed and 100 loss-framed scenarios. Each question $Q_i$ presents three decision alternatives $A_{i,j}$, where $j$ represents risk preferences: Risk-loving, Risk-neutral, Risk-averse. The alternatives are randomized to minimize bias Zheng et al. (2023).

The alternatives are constructed based on expected utility theory Simon et al. (1994), represented by:

$$E[u(x)] = \sum_{x(\omega)} u(x(\omega))\, p(x(\omega)), \tag{1}$$

where $u(x)$ is the utility function, $x(\omega)$ the outcome, and $p(x(\omega))$ the outcome’s probability.

The concavity of $u(x)$, reflecting risk preferences, is indirectly assessed via second-order Taylor expansion:

$$E[u(x)] \approx u(E[x]) + \frac{1}{2} u''(E[x])\, \text{Var}(x), \tag{2}$$

This approximation adjusts risk preferences by modulating variance, with technical details in Appendix E.
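As a rough numerical illustration of Equations (1) and (2), the sketch below compares the exact expected utility with its second-order approximation for three toy alternatives. The payoffs, the choice of logarithmic utility, and the assumption that the alternatives share one expected payoff and differ only in variance are ours for illustration; the actual construction is described in Appendix E.

```python
import math

def expected_utility(outcomes, probs, u):
    """Equation (1): sum of u(x) * p(x) over all outcomes."""
    return sum(p * u(x) for x, p in zip(outcomes, probs))

def taylor_expected_utility(outcomes, probs, u, u2):
    """Equation (2): u(E[x]) + 0.5 * u''(E[x]) * Var(x)."""
    mu = sum(p * x for x, p in zip(outcomes, probs))
    var = sum(p * (x - mu) ** 2 for x, p in zip(outcomes, probs))
    return u(mu) + 0.5 * u2(mu) * var

# Concave (risk-averse) utility and its second derivative (assumed).
u = math.log
u2 = lambda x: -1.0 / x ** 2

# Hypothetical alternatives: same expected payoff (100), rising variance.
alternatives = {
    "risk-averse": ([100.0], [1.0]),
    "risk-neutral": ([90.0, 110.0], [0.5, 0.5]),
    "risk-loving": ([50.0, 150.0], [0.5, 0.5]),
}

exact = {k: expected_utility(x, p, u) for k, (x, p) in alternatives.items()}
approx = {k: taylor_expected_utility(x, p, u, u2)
          for k, (x, p) in alternatives.items()}

for name in alternatives:
    print(f"{name}: exact={exact[name]:.4f} approx={approx[name]:.4f}")
```

With a concave $u$, the term $\frac{1}{2}u''(E[x])\,\text{Var}(x)$ is negative, so the higher-variance alternative has lower expected utility even though the expected payoff is unchanged. This is the sense in which variance can be used to separate the three risk-preference options.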

5.2.2 Analysis Logic

We investigated the Situational Dependence Bias in LLMs by analyzing their risk preferences across different scenarios $S_i$. We examined choices in gain-framed scenarios $S_{i,\text{gain}}$ to detect any situational bias in risk preferences. In the context of Loss Aversion, we scrutinized LLM responses in loss-framed scenarios $S_{i,\text{loss}}$ to assess tendencies towards risk-aversion or risk-loving behaviors. To explore the Framing Effect, we translated scenarios from Chinese to English and monitored shifts in LLM preferences $P_{\text{LLM}}$, observing how linguistic framing affects decisions. This approach aims to reveal if LLM outputs are biased by the linguistic construction of scenarios and choices.
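One simple way to operationalize this analysis is to tabulate the share of each risk label among a model's answers and to count how often the chosen label changes after translation. The sketch below does this on made-up answers; the function names and the "share of changed answers" measure are illustrative assumptions, not the exact metric reported in the tables.

```python
from collections import Counter

LABELS = ("risk-loving", "risk-neutral", "risk-averse")

def preference_distribution(choices):
    """Share of each risk label among a model's answers."""
    counts = Counter(choices)
    total = len(choices)
    return {label: counts[label] / total for label in LABELS}

def framing_shift(choices_zh, choices_en):
    """Share of questions whose chosen label changes after translation."""
    changed = sum(a != b for a, b in zip(choices_zh, choices_en))
    return changed / len(choices_zh)

# Hypothetical answers of one model to the same four questions,
# asked in Chinese and in English translation.
answers_zh = ["risk-averse", "risk-averse", "risk-neutral", "risk-loving"]
answers_en = ["risk-averse", "risk-loving", "risk-neutral", "risk-loving"]

dist_zh = preference_distribution(answers_zh)
dist_en = preference_distribution(answers_en)
shift = framing_shift(answers_zh, answers_en)  # 0.25
```

A distribution that changes markedly between scenario groups would correspond to Situational Dependence Bias, and a non-zero shift between the two language versions would correspond to the Framing Effect.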

5.2.3 Result

Figure 5: Comparison of risk-preference distribution of three models under different prompt methods.

The exploration of Risk-preference Bias entails the examination of LLM decisions across varied scenarios. The compiled results, particularly for gain-framed queries, are tabulated in Table 11. Most models exhibit distinct risk preferences in different scenarios, indicating a pronounced Situational Dependence Bias. Nonetheless, prefacing prompts with an instruction that the model is risk-averse significantly attenuates this bias. For loss-framed queries, some models, such as GPT-4, exhibit a pronounced Loss Aversion Bias, as shown in LABEL:tab:Loss_Aversion_Bias. Moreover, as shown in LABEL:tab:translation_difference, translating all queries into English produced notable discrepancies between the models’ responses to the Chinese and English versions, underscoring a pronounced Framing Effect.

In particular, we selected several representative cases for analysis, as shown in Figure 5. For Xuanyuan-13B, inducing Risk-Aversion does not alter its original preference distribution, but the model exhibits a stronger Framing Effect. GLM-4 shows little Framing Effect bias and can effectively switch preferences based on instructions. Qwen-14B shifts its preference distribution to some extent according to instructions, but also exhibits a significant Framing Effect.

6 Bias Tracker

After detecting Belief Bias and Risk-Preference Bias in LLMs, it was evident that Belief Bias had a more significant impact compared to Risk-Preference Bias. Risk-Preference Bias manifests primarily as the model’s inherent decision-making tendency, whereas Belief Bias results from the model’s misinterpretation of information, leading us to focus on analyzing Belief Bias formation.

To further investigate, we utilized Chain-of-Thought (CoT) methods Wei et al. (2022) to engage the model’s System 2 thinking, aiming to enhance its in-depth analysis capabilities. We also scrutinized the model’s output, particularly the attention importance distribution across input tokens, enabling precise diagnosis of bias causes and informing subsequent bias correction strategies.

6.1 Slow Thinking

6.1.1 Methodology

To investigate the roots of score instability, whether due to inadequate reasoning or compromised rationality, we first employ a slow-thinking approach similar to CoT, prompting the model to generate reasoning before providing scores. This lets us study the causes of Belief Bias, since the model produces reasoning for its evaluations before the actual scoring. We utilize BERTopic Grootendorst (2022) for topic word extraction within the reasoning texts provided by the models. By clustering reasoning texts and extracting keywords, we analyze score discrepancies across different clusters, denoted by $\Delta S_{\text{clusters}}$, to identify if certain thematic focuses lead to inconsistent evaluations. This detailed analysis helps us pinpoint specific causes of bias, thus informing future model optimizations and bias corrections.

Figure 6: Attention importance of each input token for the outputs of the two models. The red-boxed sections mark the financial entity tokens that may cause bias. MiniCPM-2B shows a better ability to block irrelevant information than Baichuan2-7B.

6.1.2 Result

The application of the CoT methodology, aimed at exploring the causes of financial bias in the model, yielded the results shown in Appendix H.2. The clustering and keyword analysis indicate that this intentional cognitive strategy improves the logical consistency and financial acumen of the model outputs. A comparison of the reasoning keywords for the top-performing GLM-4 model and the underperforming Baichuan2-7B model, illustrated in Figure 39 and Figure 39 respectively, reveals that GLM-4 exhibits stronger logical coherence, with primary keywords such as "termination", "transaction restructuring", and "withdrawal". In contrast, the Baichuan2-7B model’s logic is weaker, with primary keywords including "change", "decision", and "information". Moreover, the clustering of the models’ inferential outputs reveals substantial variations in the ratings assigned to each category. This divergence underscores that the irrationality exhibited by models in financial contexts stems more from their inherent cognitive processes than from their computational capabilities.

6.2 Attention Importance

6.2.1 Methodology

To verify whether the bias in the LLMs arises from an excessive focus on certain input tokens due to the training corpus, we examine the attention importance of LLMs for input sequences. Inspired by Wu et al. (2023), we define the importance $I_{n,m}$ of input token $x_n$ to output token $y_m$ as:

$$I_{n,m} = p(y_m \mid Z_m) - p(y_m \mid Z_{m,/n}) \tag{3}$$

where $Z_m$ is the context to generate $y_m$ by concatenating the prompt $X$ and the first $m-1$ tokens of response $Y$. $Z_{m,/n}$ omits token $x_n$ from $Z_m$, and $p(\cdot \mid \cdot)$ is the conditional probability computed by the language model $f$. We accelerate it with the first-order approximation:

$$I_{n,m} \approx \frac{\partial f(y_m \mid Z_m)}{\partial E_i[x_n]} \cdot E_i[x_n]^{\top} \tag{4}$$

where $E_i[x_n]$ is the input word embedding of token $x_n$. This approach helps us determine whether the LLM excessively focuses on specific financial entities, thereby more accurately diagnosing the specific causes of bias.
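To make the relation between Equations (3) and (4) concrete, the sketch below compares the exact leave-one-out importance with the gradient-times-embedding approximation on a deliberately tiny stand-in for a language model. The toy "model" (a sigmoid over the summed projections of token embeddings), the weight vector, and the embeddings are all hypothetical; a real LLM would obtain the gradient by backpropagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob(embeddings, w):
    """Toy stand-in for p(y_m | Z_m): sigmoid of summed projections."""
    z = sum(sum(wi * ei for wi, ei in zip(w, e)) for e in embeddings)
    return sigmoid(z)

def exact_importance(embeddings, w, n):
    """Equation (3): probability with the full context minus the
    probability with token n omitted."""
    reduced = embeddings[:n] + embeddings[n + 1:]
    return prob(embeddings, w) - prob(reduced, w)

def approx_importance(embeddings, w, n):
    """Equation (4): gradient of the probability with respect to the
    embedding of token n, dotted with that embedding."""
    p = prob(embeddings, w)
    grad = [p * (1.0 - p) * wi for wi in w]  # analytic gradient of the toy model
    return sum(g * e for g, e in zip(grad, embeddings[n]))

# Hypothetical 2-dimensional embeddings for a three-token prompt.
w = [1.0, -0.5]
tokens = [[0.2, 0.1], [0.1, 0.0], [-0.1, 0.2]]

exact = [exact_importance(tokens, w, n) for n in range(len(tokens))]
approx = [approx_importance(tokens, w, n) for n in range(len(tokens))]

for n, (a, b) in enumerate(zip(exact, approx)):
    print(f"token {n}: exact={a:+.4f} approx={b:+.4f}")
```

The exact form needs one extra forward pass per input token, whereas the approximation needs a single gradient computation for all tokens, which is why the first-order form is the cheaper choice for long prompts. The two agree closely here because the toy embeddings are small; the approximation degrades as the omitted token's contribution grows.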

6.2.2 Result

Due to the limitations of Chinese tokens, we chose to use LLMs with BPE tokenizers, focusing on the well-performing MiniCPM-2B and the more biased Baichuan2-7B models. The results are shown in Figure 6. We analyzed the attention each model pays to each input token in its outputs and found that Baichuan2-7B tends to focus more on financial company entities, industries, and their surrounding tokens. This excessive attention to irrelevant information further exacerbates the generation of financial biases.

7 Bias Antidote

7.1 Methodology

To eliminate Belief Bias in LLMs while preserving their original general capabilities, we employed four prompt-based methods for bias mitigation. 1) We utilized a CoT approach to enable the model to engage in slow thinking, thereby producing scores based on logical reasoning. 2) We implemented the S2A method Weston and Sukhbaatar (2023) to shield the model from irrelevant context, allowing for secondary reasoning before scoring. 3) We used Entity Replace to stabilize the model’s input. 4) We applied FinCausal, a knowledge-based understanding of financial causal relationships, to remove biases. For the last method, we extracted 200,000 pieces of financial causal knowledge about industries and individual stocks from past research reports and used a retrieval-augmented generation (RAG) approach to recall relevant causal information.
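The Entity Replace idea can be illustrated with a short preprocessing step. The sketch below assumes that the method substitutes the subject company's name with a neutral placeholder before the text is scored; the function name, the placeholder string, and the example company are hypothetical, and the exact substitution used in our experiments may differ.

```python
def entity_replace(text, entity_names, placeholder="Company A"):
    """Replace every listed entity name with a neutral placeholder.

    Longer names are replaced first so that a short alias does not
    break up a longer full name that contains it.
    """
    for name in sorted(entity_names, key=len, reverse=True):
        text = text.replace(name, placeholder)
    return text

# Hypothetical news snippet and entity aliases.
news = "Acme Steel Co. announced a buyback. Acme Steel shares rose."
neutral = entity_replace(news, ["Acme Steel", "Acme Steel Co."])
print(neutral)  # Company A announced a buyback. Company A shares rose.
```

Because every company is mapped to the same placeholder, the model sees an identical input regardless of the original subject, which is one way to read "stabilizing the model’s input".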

Table 1: Bias scores under the four prompt-based methods, with smaller values indicating lower levels of bias. Values in parentheses give the change relative to Direct.
| Method | GLM-4 | ChatGLM3-Turbo | MiniCPM-2B | Xuanyuan2-6B | Baichuan2-7B |
| --- | --- | --- | --- | --- | --- |
| Direct | 0.598 | 1.067 | 1.409 | 13.999 | 28.106 |
| COT | 5.382 (+4.784) | 5.671 (+4.604) | 7.039 (+5.630) | 6.668 (-7.331) | 12.660 (-15.446) |
| S2A | 3.230 (+2.632) | 4.406 (+3.339) | 8.380 (+6.971) | 9.756 (-4.243) | 25.958 (-2.148) |
| Entity Replace | 0.710 (+0.112) | 1.012 (-0.055) | 1.385 (-0.024) | 1.236 (-12.763) | 12.688 (-15.418) |
| FinCausal | 0.769 (+0.171) | 2.200 (+1.133) | 1.195 (-0.214) | 10.111 (-3.888) | 8.763 (-19.343) |
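The changes relative to the Direct baseline can be recomputed from the scores in Table 1. The following is a minimal sketch; the names `SCORES`, `change_vs_direct`, and `best_method` are illustrative and not part of the original framework.

```python
# Bias scores copied from Table 1 (smaller means less bias).
MODELS = ["GLM-4", "ChatGLM3-Turbo", "MiniCPM-2B", "Xuanyuan2-6B", "Baichuan2-7B"]
SCORES = {
    "Direct": [0.598, 1.067, 1.409, 13.999, 28.106],
    "COT": [5.382, 5.671, 7.039, 6.668, 12.660],
    "S2A": [3.230, 4.406, 8.380, 9.756, 25.958],
    "Entity Replace": [0.710, 1.012, 1.385, 1.236, 12.688],
    "FinCausal": [0.769, 2.200, 1.195, 10.111, 8.763],
}


def change_vs_direct(method):
    """Signed change of each model's score relative to the Direct baseline."""
    return [round(s - d, 3) for s, d in zip(SCORES[method], SCORES["Direct"])]


def best_method(model):
    """Debiasing method with the lowest score for the given model."""
    idx = MODELS.index(model)
    candidates = {m: v[idx] for m, v in SCORES.items() if m != "Direct"}
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    for method in SCORES:
        if method != "Direct":
            print(method, change_vs_direct(method))
    for model in MODELS:
        print(model, "->", best_method(model))
```

A negative change means the method reduced bias for that model; a positive change means it increased bias.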

7.2 Result

For Belief Bias, we applied the four prompt-based methods to five representative LLMs; the results are shown in Table 1. CoT performed well on the originally more biased models (Xuanyuan2-6B and Baichuan2-7B), as this reasoning approach enhances logical consistency in responses, thereby improving robustness and reducing bias. However, it performed poorly on models that were originally less biased: the longer outputs, generated auto-regressively, amplified bias. For S2A, which has the model produce additional output to filter out irrelevant context, effectiveness depends on the model’s original capability; weaker models tend to exhibit greater bias. Entity Replace reduced bias on four of the five models and left GLM-4 nearly unchanged, which we attribute to the substitution of financial topics. For FinCausal, four related causal knowledge entries were retrieved for each test item, further enhancing the model’s reasoning ability to mitigate bias.
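The retrieval step of the FinCausal method can be sketched as follows. The paper does not specify the retriever, so this sketch swaps in a simple word-overlap (Jaccard) ranking as an assumption; `retrieve` and `build_prompt` are hypothetical names, and only the choice of four retrieved entries per test item comes from the text.

```python
def tokenize(text):
    """Lower-cased whitespace tokens as a set (a stand-in for a real tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query, knowledge, k=4):
    """Return the k knowledge entries most similar to the query."""
    q = tokenize(query)
    ranked = sorted(knowledge, key=lambda entry: jaccard(q, tokenize(entry)), reverse=True)
    return ranked[:k]


def build_prompt(news, knowledge, k=4):
    """Prepend the retrieved causal knowledge to the news item (assumed layout)."""
    lines = ["Relevant causal knowledge:"]
    lines += [f"- {fact}" for fact in retrieve(news, knowledge, k)]
    lines += ["", "News:", news]
    return "\n".join(lines)
```

In practice a dense embedding retriever would likely replace the word-overlap score; the surrounding structure stays the same.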

8 Discussion

Our study using the FBI framework aids in identifying and reducing irrationalities in finance sector models, enhancing understanding of LLM biases and logic, and applying methods to mitigate financial biases.

8.1 Model Size

The analysis in Section 5.1.3 reveals that within a specific model family, the degree of bias tends to decrease as the number of model parameters grows, in line with the scaling law Kaplan et al. (2020). However, this trend is not consistent across different model families, where bias levels are also shaped by factors such as model design and training approaches.

8.2 Training Data

The FBI framework assessed financial rationality in general and financial LLMs. Financially-trained models, as seen in Figure 4, may exhibit higher score variability and risk inclination, potentially increasing financial irrationality. Models like ChatGLM2-6B and Qwen-7B show opposite industry biases (Figure 15-Figure 32), suggesting temporal biases in training data align with industry cycle rotation. Financial data, including potentially embellished research reports, can enhance LLMs for finance-specific NLP tasks but may also embed financial irrationality. Using these models in financial quantification could lead to unintended disruptions, demanding immediate attention.

8.3 Input Forms

The FBI framework used four prompt-based methods to mitigate biases. Each reduced bias for at least some models, and the reduction was more pronounced when the model’s initial bias was more severe. As observed in Table 1, methods that increase output length, such as COT and S2A, can exacerbate biases in models that start with low bias. The Entity Replace method, which reduces input, often yields better results, and the FinCausal method, which increases input, can further enhance the model’s capacity for financial causal reasoning and thereby mitigate bias.

9 Conclusion

Our research introduces the FBI framework, a novel method for evaluating the financial rationality of LLMs in the intricate field of financial analysis. We rigorously examined 23 leading LLMs and revealed substantial differences in their financial rationality. By defining, detecting, analyzing, and mitigating financial biases, our study validates the capabilities and limitations of LLMs in financial contexts, offering reliable insights for their application in the finance sector. The advancement of LLMs towards greater financial acumen and reduced bias is essential for their dependable use in financial analysis, paving the way for a future where AI-generated insights are confidently and precisely applied in the financial industry.

Limitation

Our analysis of financial rationality focuses on biases in LLMs’ awareness of the Chinese A-share market. Such biases may vary with culture and region, so our findings may not generalize to other markets. In addition, we used other models during model evaluation, which may introduce further biases.

Ethics Statement

This paper honors the EMNLP Code of Ethics. The dataset used in the paper does not contain any private information. All annotators were compensated fairly in proportion to the number of instances they annotated. The code and data are open-sourced under the MIT license.

References

  • Barberis and Thaler (2003) Nicholas Barberis and Richard Thaler. 2003. A survey of behavioral finance. Handbook of the Economics of Finance, 1:1053–1128.
  • Chuang and Yang (2022) Chengyu Chuang and Yi Yang. 2022. Buy Tesla, sell Ford: Assessing implicit stock market preference in pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 100–105.
  • Cui et al. (2023) Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. 2023. ChatLaw: Open-source legal large language model with integrated external knowledge bases.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.
  • Fatouros et al. (2024) Georgios Fatouros, Konstantinos Metaxas, John Soldatos, and Dimosthenis Kyriazis. 2024. Can large language models beat Wall Street? Unveiling the potential of AI in stock selection. arXiv preprint arXiv:2401.03737.
  • Gallegos et al. (2023) Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. 2023. Bias and fairness in large language models: A survey.
  • Grootendorst (2022) Maarten Grootendorst. 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
  • Haller et al. (2023) Patrick Haller, Ansar Aynetdinov, and Alan Akbik. 2023. OpinionGPT: Modelling explicit biases in instruction-tuned LLMs.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Jeoung et al. (2023) Sullam Jeoung, Yubin Ge, and Jana Diesner. 2023. StereoMap: Quantifying the awareness of human-like stereotypes in large language models. arXiv preprint arXiv:2310.13673.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  • Lopez-Lira and Tang (2023) Alejandro Lopez-Lira and Yuehua Tang. 2023. Can ChatGPT forecast stock price movements? Return predictability and large language models. arXiv preprint arXiv:2304.07619.
  • Nadeem et al. (2020) Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring stereotypical bias in pretrained language models.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report.
  • Romanko et al. (2023) Oleksandr Romanko, Akhilesh Narayan, and Roy H Kwon. 2023. ChatGPT-based investment portfolio selection. In Operations Research Forum, volume 4, page 91. Springer.
  • Rutinowski et al. (2023) Jérôme Rutinowski, Sven Franke, Jan Endendyk, Ina Dormuth, Moritz Roidl, Markus Pauly, et al. 2023. The self-perception and political biases of ChatGPT. Human Behavior and Emerging Technologies, 2024.
  • Simon et al. (1994) Carl P Simon, Lawrence Blume, et al. 1994. Mathematics for economists, volume 7. Norton New York.
  • St et al. (1989) Lars Ståhle and Svante Wold. 1989. Analysis of variance (ANOVA). Chemometrics and Intelligent Laboratory Systems, 6(4):259–272.
  • (19) Kunsheng Tang, Wenbo Zhou, Jie Zhang, Aishan Liu, Gelei Deng, Shuai Li, Peigui Qi, Weiming Zhang, Tianwei Zhang, and Nenghai Yu. GenderCARE: A comprehensive framework for assessing and reducing gender bias in large language models.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Wang et al. (2023) Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023. HuaTuo: Tuning LLaMA model with Chinese medical knowledge.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  • Weston and Sukhbaatar (2023) Jason Weston and Sainbayar Sukhbaatar. 2023. System 2 attention (is something you might need too).
  • Wu et al. (2023) Xuansheng Wu, Wenlin Yao, Jianshu Chen, Xiaoman Pan, Xiaoyang Wang, Ninghao Liu, and Dong Yu. 2023. From language modeling to instruction following: Understanding the behavior shift in LLMs after instruction tuning. arXiv preprint arXiv:2310.00492.
  • Yang et al. (2023) Linyi Yang, Yingpeng Ma, and Yue Zhang. 2023. Measuring consistency in text-based financial forecasting models. arXiv preprint arXiv:2305.08524.
  • Zhang et al. (2023) Xuanyu Zhang, Qing Yang, and Dongliang Xu. 2023. Xuanyuan 2.0: A large Chinese financial chat model with hundreds of billions parameters.
  • Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models.
  • Zheng et al. (2023) Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2023. Large language models are not robust multiple choice selectors. arXiv e-prints, pages arXiv–2309.

Appendix A Cognitive Bias

Based on previous research, we classified cognitive bias into Belief Bias and Risk-Preference Bias, and studied seven specific biases within the FBI framework. Refer to Table 2 for details.

Table 2: Summary of Cognitive Biases
| Cognitive Bias | Bias Type | Definition |
| --- | --- | --- |
| Belief Bias | Limited Attention | The brain has two systems when working: fast thinking and slow thinking. It uses intuition to deal with things quickly. |
| Belief Bias | Representativeness bias | When making probability estimates, people tend to focus on certain representative features, ignoring environmental probabilities and sample size. |
| Belief Bias | Anchoring effect | Decision-making is often influenced by the first information received, like an anchor sinking to the bottom of the sea. |
| Belief Bias | Overconfidence | Belief that one’s knowledge is more accurate than the facts; one’s information is given more weight. |
| Risk-Preference Bias | Situational dependence bias | The effect of a stimulus depends largely on the context in which it occurs. |
| Risk-Preference Bias | Loss aversion | Sensitivity to losses exceeds gains of equal value. |
| Risk-Preference Bias | Framing effect | Different descriptions of an objectively identical problem lead to different decision-making judgments. |

Appendix B Company Profile

To avoid bias caused by market value, we did not draw our sample from the CSI 300 or CSI 500. Instead, we compiled all listed companies in China. After removing ST-type stocks, we selected the top, middle, and bottom 200 stocks by market value, for a total of 600 stocks. The industry distribution of the selected stocks is shown in Figure 7.
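The selection procedure can be sketched as follows. This is a minimal illustration under two assumptions not stated in the text: ST-type stocks are recognized by an "ST" or "*ST" name prefix, and the "middle" group is the block of stocks centered on the median rank. The function name `select_by_market_value` is hypothetical.

```python
def select_by_market_value(stocks, n=200):
    """Pick the top, middle, and bottom n stocks by market value.

    stocks: list of (name, market_value) tuples.
    Assumption: ST-type stocks carry an "ST" or "*ST" name prefix.
    Assumption: the middle group is centered on the median rank.
    """
    eligible = [s for s in stocks if not s[0].startswith(("ST", "*ST"))]
    ranked = sorted(eligible, key=lambda s: s[1], reverse=True)
    if len(ranked) < 3 * n:
        raise ValueError("not enough stocks for three non-overlapping groups")
    mid_start = (len(ranked) - n) // 2
    top = ranked[:n]
    middle = ranked[mid_start:mid_start + n]
    bottom = ranked[-n:]
    return top, middle, bottom
```

With n=200 this yields the 600 stocks described above, provided at least 600 non-ST stocks are available.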

Figure 7: Distribution of the selected companies’ industry types.

Appendix C Event Type

Based on the regular patterns of the Chinese A-share stock market, we identified the types of events that can affect a company’s stock price and grouped them into four categories comprising 16 event types. The details are shown in Table 3.

Table 3: Event Types and Definitions
| Event Type | Subdivision Type | Definition |
| --- | --- | --- |
| Corporate Governance and Equity Changes | Major Asset Restructuring | The process of recombining, adjusting, and allocating the distribution status of enterprise assets among the owners, controllers, and external economic entities. |
| Corporate Governance and Equity Changes | Equity Incentive | By conditionally granting employees partial shareholder rights, a sense of ownership is fostered, forming a community of interests with the company. |
| Corporate Governance and Equity Changes | Increase or Decrease in Shareholder Holdings | Changes in the shareholder holdings of company stocks. |
| Corporate Governance and Equity Changes | Buy-back | The act of a listed company using cash or other means to repurchase its shares from the stock market. |
| Corporate Governance and Equity Changes | Circulation of Restricted Stock | Restricted shares become freely tradable in the secondary market after the commitment period. |
| Financial Reports and Earnings Expectations | Performance Report | Regular preparation by each responsibility center to evaluate and assess performance, serving as the basis for future budget preparation. |
| Market Behavior and Announcements | Private Placement | Targeted issuance of bonds or stocks to a select group of senior institutional or individual investors. |
| Market Behavior and Announcements | Transfer of Shares | Listed companies transfer their provident fund to share capital in proportion or issue bonus shares accordingly. |
| Market Behavior and Announcements | Stock Price Fluctuations | Sudden large inflows and outflows of funds lead to increased volatility in stock prices. |
| Market Behavior and Announcements | Business Dynamics | Updates on enterprises and their surroundings, using major production and sales information to promote corporate brand and image. |
| Negative Events and Risk Management | Dispute | Disputes between companies or between companies and individuals. |
| Negative Events and Risk Management | Investigation | Filing an investigation signifies a basic determination of illegal facts, allowing for compulsory measures and official initiation of investigation procedures. |
| Negative Events and Risk Management | Violation Penalties | Punishments for enterprises violating regulations of regulatory bodies. |
| Negative Events and Risk Management | Litigation and Arbitration | Litigation and arbitration for contract disputes and other property rights disputes between enterprises. |
| Negative Events and Risk Management | Security | Enterprises providing guarantees for loans and other matters for other enterprises. |

Appendix D Models

We selected a total of 23 financial and general LLMs, both Chinese-oriented and English-oriented, with specific details shown in Table 4.

Table 4: Models in our Framework
Model Name Chinese-oriented Model size FinLLM Deployment method
MiniCPM-2B True 2B False local
Baichuan-13B True 13B False local
DISC-FinLLM True 13B True local
Baichuan2-7B True 7B False local
Baichuan2-13B True 13B False local
ChatGLM2-6B True 6B False local
ChatGLM3-6B True 6B False local
ChatGLM3-Turbo True 33B False API
GLM-4 True Unknown False API
InternLM2-7B True 7B False local
InternLM2-20B True 20B False local
LLaMA2-7B False 7B False local
LLaMA2-13B False 13B False local
Qwen-7B True 7B False local
Qwen-14B True 14B False local
FinQwen True 14B True local
Qwen-72B True 72B False local
Qwen-max True 72B False API
Xuanyuan-13B True 13B True local
Xuanyuan-70B True 70B True local
Xuanyuan2-6B True 6B True local
GPT-3.5 False Unknown False API
GPT-4 False Unknown False API

Appendix E Formula Proof

Invoking the fundamental principles of expected utility theory, we recognize that a utility function’s curvature reflects an individual’s risk preference. Specifically, a concave utility function ($u''(x) < 0$) is indicative of risk aversion, while a convex utility function ($u''(x) > 0$) signifies risk-seeking behavior. A linear utility function ($u''(x) = 0$), on the other hand, corresponds to risk neutrality.

The expected utility $E[u(x)]$ can be formally represented as:

$$E[u(x)] = \sum_{x(\omega)} u(x(\omega))\, p(x(\omega)) \tag{5}$$

Here, $u(x)$ denotes the utility function, $x(\omega)$ symbolizes the outcome under state $\omega$, and $p(x(\omega))$ is the probability of outcome $x(\omega)$ occurring.

Furthermore, we articulate the variance of outcomes $x$, $\text{Var}(x)$, as the expected squared deviation from the expected value $E[x]$:

$$\text{Var}(x) = E[(x - E[x])^{2}] \tag{6}$$

Applying the second-order Taylor expansion to the utility function $u(x)$ around the expected value $E[x]$ furnishes us with:

$$u(x) \approx u(E[x]) + u'(E[x])(x - E[x]) + \frac{1}{2} u''(E[x])(x - E[x])^{2} \tag{7}$$

Imposing expectations on the approximated function, we derive the expected utility approximation:

$$E[u(x)] \approx u(E[x]) + u'(E[x])\, E[x - E[x]] + \frac{1}{2} u''(E[x])\, E[(x - E[x])^{2}] \tag{8}$$

Since $E[x - E[x]] = 0$, the middle term vanishes, simplifying our expression to:

$$E[u(x)] \approx u(E[x]) + \frac{1}{2} u''(E[x])\, \text{Var}(x) \tag{9}$$

Consequently, under the assertion of utility function concavity or convexity, the sign of the second derivative $u''(E[x])$ establishes the nature of risk preference. For a negative second derivative ($u''(x) < 0$), indicative of risk aversion, a smaller variance is required to enhance the expected utility. Conversely, for a positive second derivative ($u''(x) > 0$), characteristic of risk-seeking behavior, a larger variance is preferred. Risk-neutral individuals ($u''(x) = 0$) show indifference to the variance level.

Through this analytical framework, we delineate how the variance of outcomes in conjunction with the utility function’s concavity or convexity guides the determination of an individual’s risk preference.
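As a numerical check of the approximation in Equation (9), the sketch below compares the exact expected utility with the mean-variance approximation for two lotteries that share the same mean but differ in variance. The quadratic utility $u(x) = x - 0.1x^{2}$ and the lotteries are illustrative assumptions; for a quadratic utility the second-order expansion is exact, so the two values coincide.

```python
def mean(outcomes, probs):
    """Expected value E[x] of a discrete lottery."""
    return sum(x * p for x, p in zip(outcomes, probs))


def variance(outcomes, probs):
    """Var(x) = E[(x - E[x])^2], as in Equation (6)."""
    m = mean(outcomes, probs)
    return sum(p * (x - m) ** 2 for x, p in zip(outcomes, probs))


def expected_utility(u, outcomes, probs):
    """Exact expected utility, as in Equation (5)."""
    return sum(u(x) * p for x, p in zip(outcomes, probs))


def approx_expected_utility(u, d2u, outcomes, probs):
    """Mean-variance approximation, as in Equation (9)."""
    m = mean(outcomes, probs)
    return u(m) + 0.5 * d2u(m) * variance(outcomes, probs)


# Illustrative concave (risk-averse) utility and its second derivative.
def u(x):
    return x - 0.1 * x ** 2


def d2u(x):
    return -0.2


low_var = ([4.0, 6.0], [0.5, 0.5])   # mean 5, variance 1
high_var = ([2.0, 8.0], [0.5, 0.5])  # mean 5, variance 9

if __name__ == "__main__":
    for name, (xs, ps) in {"low": low_var, "high": high_var}.items():
        print(name, expected_utility(u, xs, ps), approx_expected_utility(u, d2u, xs, ps))
```

Because this utility is concave, the low-variance lottery has the higher expected utility, matching the risk-averse case described above.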

Appendix F Framework for Data Construction

This section delineates the structured approach employed in the study to formulate datasets incorporating event news, interactive elements, and risk preference inquiries. Each category of information is meticulously crafted using a distinct template, which is elucidated below.

F.1 Event News Template

The construction of the event news dataset leverages prompt engineering techniques to embed real-world events within a framework that facilitates evaluation, simulating the analytical capabilities of financial experts. The evaluation process involves the model assigning a score to each event based on its potential positive or negative impact on the financial landscape. Initially, the model is instructed to provide an immediate, intuitive score reflecting a ’fast thinking’ approach.

To augment the depth of analysis and ensure the robustness of the evaluation, the model is further tasked with adopting a ’slow thinking’ strategy. This entails a comprehensive articulation of the rationale behind the score, encouraging a deliberate and reasoned assessment. The detailed format of this template is illustrated in Figure 8, which guides the model in delivering both the quantitative score and the qualitative reasoning underpinning it.

Figure 8: Template of event news.

F.2 Interactions

Interaction inputs are rewritten using an input method similar to that for event news; the specific template is shown in Figure 9.

Figure 9: Template of interactions.

F.3 Risk-Preference Questionnaire Template

The methodology for assessing risk preferences through structured questions is twofold, designed to discern the inherent risk orientation of the AI model under different conditions. Initially, the model is presented with a set of scenarios where it must select an option that best aligns with its assessed risk profile, simulating an introspective decision-making process. This setup aims to capture the model’s spontaneous risk preferences without external biases.

Subsequently, the experiment introduces a predefined constraint by explicitly characterizing the model as risk-averse within the instructions. This manipulation is intended to observe the adaptability of the model’s responses to altered risk parameters, thereby evaluating its capacity for contextual behavioral adjustment. The layout and content of these questions are encapsulated in the template depicted in Figure 10, which systematically guides the model through the decision-making process under varying risk conditions.

Figure 10: Template of risk-preference questions.

Appendix G FinCausal Dataset

In this section, we introduce the construction process of the FinCausal dataset, which requires the acquisition of relevant causal knowledge from past financial text materials. This process can be mainly divided into data collection, deduplication, segmentation, and knowledge extraction. The main flowchart is shown in Figure 11.

First, we crawled 500,000 research reports from the internet, dated from 2020 to 2023, including individual stock research, industry analysis, and macro analysis. We used regular-expression matching and the FastText language filter for classification, mainly retaining Chinese A-share individual stock research and industry analysis. Since individual stock reports rarely describe causal relationships for negative events, we further crawled some news analyses and stock forum comments to enrich the description of individual stock causal knowledge.

After filtering the research report data, we used the MinHash algorithm to perform document-level deduplication on all content. To extract sentences with causal expressions, we meticulously categorized the content of the research reports into seven distinct types, which included ordinary sentences, causal sentences, news-related content, recommendation ratings, investment advice, risk warnings, and researcher information. We then manually annotated a comprehensive dataset comprising 3,000 pieces of data to train a sophisticated BGE+TextCNN classification model. This model was specifically designed to discern and categorize the various types of sentences present in the financial reports, with a particular focus on identifying those that convey causal relationships.
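The document-level deduplication step can be illustrated with a small MinHash sketch. The paper names the MinHash algorithm but gives no parameters, so the character 5-gram shingles, the 64 hash functions, the 0.8 similarity threshold, and all function names below are assumptions. The pairwise comparison is quadratic and only suitable for illustration; a large corpus would typically add locality-sensitive hashing.

```python
import hashlib


def char_shingles(text, size=5):
    """Set of overlapping character n-grams; short texts yield one shingle."""
    if len(text) <= size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash derived from SHA-1."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=64):
    """MinHash signature: the minimum hash value per seeded hash function."""
    shingles = char_shingles(text)
    return [min(_hash(seed, s) for s in shingles) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(documents, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for doc in documents:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

Character shingles are used here because whitespace tokenization does not apply to Chinese text.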

For each piece of extracted knowledge from a research report or comment, we concatenated one sentence before and after it into a paragraph. Subsequently, we aggregated all relevant paragraphs from a report to form a comprehensive context. Utilizing our meticulously chosen LLM, we conducted causal knowledge extraction from these contexts. Through this process, we successfully obtained 200,000 pieces of industry causal knowledge and 2,000 pieces of individual stock causal knowledge, thereby enriching our dataset with valuable insights into the causal relationships within the financial domain. Here are some examples of FinCausal:

  • The company may consider conducting targeted issuance in order to expand its business scale, make capital expenditures, or invest in research and development.

  • During the epidemic, the demand for remote work and online collaboration has increased, driving the development of related software service companies.

Figure 11: The construction process of FinCausal dataset.
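The context-building step described above, in which each extracted sentence is joined with one sentence before and after it and the resulting paragraphs are aggregated per report, can be sketched as follows. The function name `build_context` and the use of a newline to separate paragraphs are assumptions for illustration.

```python
def build_context(sentences, causal_indices, window=1):
    """Join each causal sentence with its neighbors, then aggregate per report.

    sentences: the report split into sentences, in order.
    causal_indices: positions of sentences classified as causal.
    window: number of neighboring sentences on each side (1 in the paper).
    """
    paragraphs = []
    for idx in sorted(causal_indices):
        start = max(0, idx - window)
        end = min(len(sentences), idx + window + 1)
        paragraphs.append(" ".join(sentences[start:end]))
    return "\n".join(paragraphs)
```

The aggregated context would then be passed to the LLM for causal knowledge extraction.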

Appendix H Result

H.1 Analysis of Direct News Events

The examination of news events involves a detailed statistical analysis of the responses generated by the various LLMs to specific news items. This analysis focuses on the distribution of scores that the LLMs assign to each news event, covering the mean, variance, highest score, and lowest score. These measures are used to assess the consistency and rationality of the LLMs’ interpretations of financial news.
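The summary statistics named above can be computed per news event as in the sketch below. Whether the paper uses population or sample variance is not stated; population variance is assumed here, and the function name `summarize` is hypothetical.

```python
from statistics import mean, pvariance


def summarize(scores):
    """Mean, population variance, highest, and lowest score for one news event."""
    return {
        "mean": mean(scores),
        "variance": pvariance(scores),  # assumption: population variance
        "max": max(scores),
        "min": min(scores),
    }
```

Applying `summarize` to the scores one model assigns to a single event across repeated or varied prompts gives one row of the per-event statistics.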

To facilitate a comprehensive understanding of these scoring distributions, this section will present violin plots for each news event. Violin plots offer a more nuanced visualization compared to traditional box plots by showing the probability density of the data at different values. This graphical representation will thus provide insights into the spread and skewness of LLMs’ ratings across various news events, enabling a deeper analysis of their evaluative patterns and potential biases.

Figure 12: Distribution of the score for news i+1 among 23 large models, for i = 0, 2, 4, …, 22.
Figure 13: Distribution of the score for news i+2 among 23 large models, for i = 0, 2, 4, …, 22.

The analysis of the first five events focuses on news items that cover both positive and negative performance reports as well as stock price fluctuations. This range of content allows each LLM’s scoring tendencies to be examined from several angles. Notably, the LLMs differ in their scoring preferences when confronted with this spectrum of financial news.

A systematic statistical analysis is conducted on the scoring outcomes attributed to the positive and negative aspects of these events. This entails a detailed examination of how each LLM assesses the same news piece, shedding light on the variance in their interpretations and the potential implications of their biases. The findings from this analysis are meticulously compiled and presented in Table 5, offering a clear, quantified insight into the LLMs’ evaluative patterns across the selected news events.

Table 5: Model Positive Times
Model Positive Times
GPT-4 5
InternLM2-20B 5
LLaMA2-13B 5
Qwen-72B 5
FinQwen 4
InternLM2-7B 4
LLaMA2-7B 4
Qwen-max 4
Xuanyuan-13B 4
Baichuan2-13B 3
Baichuan2-7B 3
ChatGLM3-Turbo 3
GLM-4 3
Xuanyuan-70B 3
ChatGLM2-6B 2
ChatGLM3-6B 2
Qwen-14B 2
Qwen-7B 2
GPT-3.5 1

Our analysis involves aggregating the scoring variances observed across all event news for the various Large Language Models (LLMs) under consideration. This comprehensive synthesis not only highlights the diversity in LLM responses but also provides a macroscopic view of their evaluative consistency and potential discrepancies. The aggregated data, which encapsulate the variance in scoring for each news event by different LLMs, are systematically presented in Table 6. This table serves as a pivotal reference point for understanding the range and distribution of LLM evaluations, offering valuable insights into their interpretative frameworks and the reliability of their analyses.

Table 6: Model Variance Comparison
Model Variance
GLM-4 0.59798884
ChatGLM3-6B 0.707638507
Qwen-72B 0.784471341
Qwen-7B 0.788077699
ChatGLM3-Turbo 1.067120654
Qwen-14B 1.211226324
Qwen-max 1.363195393
MiniCPM-2B 1.409
GPT-4 1.909388332
InternLM2-20B 4.893324616
GPT-3.5 5.277003518
Baichuan-13B 5.998
DISC-FinLLM 6.096
Baichuan2-13B 6.157081681
LLaMA2-13B 6.881628177
InternLM2-7B 7.466014898
FinQwen 9.363445905
ChatGLM2-6B 10.03005785
LLaMA2-7B 10.61274671
Xuanyuan-70B 10.83743438
Xuanyuan2-6B 13.9988
Xuanyuan-13B 19.18007393
Baichuan2-7B 28.10579705

After examining the inherent biases of individual models, we consolidate the findings for each LLM to explore their shared or differing biases towards various industries. This step is important both for understanding the predispositions of individual models and for identifying overarching trends or anomalies in their assessments of industry-related news events. By aggregating these results, we aim to show the extent to which the models exhibit favorable or adverse biases towards certain industries, and how these biases may affect the models’ analytical outputs and reliability.

Figure 14: Distribution of the industry scores of Baichuan2-7B.
Figure 15: Distribution of the industry scores of Baichuan2-13B.
Figure 16: Distribution of the industry scores of ChatGLM2-6B.
Figure 17: Distribution of the industry scores of ChatGLM3-6B.
Figure 18: Distribution of the industry scores of ChatGLM3-Turbo.
Figure 19: Distribution of the industry scores of GLM-4.
Figure 20: Distribution of the industry scores of InternLM2-7B.
Figure 21: Distribution of the industry scores of InternLM2-20B.
Figure 22: Distribution of the industry scores of LLaMA2-7B.
Figure 23: Distribution of the industry scores of LLaMA2-13B.
Figure 24: Distribution of the industry scores of Qwen-7B.
Figure 25: Distribution of the industry scores of Qwen-14B.
Figure 26: Distribution of the industry scores of FinQwen.
Figure 27: Distribution of the industry scores of Qwen-72B.
Figure 28: Distribution of the industry scores of Qwen-max.
Figure 29: Distribution of the industry scores of Xuanyuan-13B.
Figure 30: Distribution of the industry scores of Xuanyuan-70B.
Figure 31: Distribution of the industry scores of GPT-3.5.
Figure 32: Distribution of the industry scores of GPT-4.

In parallel with the examination of model biases, we also study how LLMs evolve within a model family, attributing changes to factors such as model size or version updates. To this end, we use box plot comparison charts to show the developmental trajectories of models within the Baichuan, GLM, and LLaMA families. These charts highlight variations in model performance or bias across family members, giving a clear visual representation of shifts in model behavior. The comparisons for the Baichuan, GLM, and LLaMA families are shown in Figure 33, Figure 34, and Figure 35, respectively. Through these comparisons, we aim to show how changes in model architecture and capabilities influence analytical outcomes and biases.

Figure 33: Box plot comparison charts of the Baichuan family.

Figure 34: Box plot comparison charts of the GLM family.

Figure 35: Box plot comparison charts of the LLaMA family.

H.2 Analysis of COT News

To probe the factors behind potential irrationality in Large Language Models (LLMs), we extend the investigation to the reasoning outcomes produced with chain-of-thought (COT) prompting. The underlying hypothesis is that the way LLMs form and use intermediate reasoning steps may reveal their logical inconsistencies or biases. For this purpose, we select the LLMs with the highest, second highest, and lowest levels of performance under direct prompts, so that the analysis covers a broad spectrum of reasoning capabilities.

The analysis focuses on the ‘slow thinking’ aspect of model reasoning, which emphasizes deliberate and methodical processing. By examining the variance in reasoning outcomes among these models, we aim to identify patterns or anomalies that might indicate a propensity for irrational decision-making. The variances under direct and COT prompting for the selected LLMs are presented in Table 7.

Table 7: Model Variance Comparison after COT
Model Direct COT
GLM-4 0.597988840 5.381799977
Qwen-7B 0.788077699 7.685484073
ChatGLM3-Turbo 1.067120654 5.670563704
MiniCPM-2B 1.409476674 7.038961848
Xuanyuan2-6B 13.99883011 6.668694967
Xuanyuan-13B 19.18007393 17.83545943
Baichuan2-7B 28.10579705 12.65975644
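
One way to read Table 7 is to compare, for each model, the variance under COT with the variance under direct prompting. The minimal sketch below does this with the values copied from the table; the ratio itself and the helper name `cot_to_direct_ratio` are illustrative assumptions, not a metric defined in this study.

```python
# Variances copied from Table 7: (direct prompt, COT prompt).
TABLE_7 = {
    "GLM-4": (0.597988840, 5.381799977),
    "Qwen-7B": (0.788077699, 7.685484073),
    "ChatGLM3-Turbo": (1.067120654, 5.670563704),
    "MiniCPM-2B": (1.409476674, 7.038961848),
    "Xuanyuan2-6B": (13.99883011, 6.668694967),
    "Xuanyuan-13B": (19.18007393, 17.83545943),
    "Baichuan2-7B": (28.10579705, 12.65975644),
}


def cot_to_direct_ratio(table):
    """Return {model: COT variance / direct variance} (hypothetical helper)."""
    return {model: cot / direct for model, (direct, cot) in table.items()}


ratios = cot_to_direct_ratio(TABLE_7)

# A ratio above 1 means the variance grew under COT; below 1 means it shrank.
increased = [model for model, ratio in ratios.items() if ratio > 1]
decreased = [model for model, ratio in ratios.items() if ratio < 1]

for model, ratio in ratios.items():
    print(f"{model:15s} COT/Direct = {ratio:.2f}")
```

Read this way, the four models with the lowest direct-prompt variance in Table 7 show a higher variance under COT, while the three models with the highest direct-prompt variance show a lower one.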

We further analyzed the new5 data, for which ratings differ significantly across LLMs, and used the topic modeling method BERTopic to cluster the reasoning results of the models. Before clustering, the model scores and company-specific information were removed from the reasoning results. The reasoning texts of each model are clustered into 10 categories, and the score distributions for each category are shown in Figure 36 and Figure 37.

Figure 36: Distribution of cluster scores of Baichuan2-7B.
Figure 37: Distribution of cluster scores of GLM-4.
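
The removal of scores and company-specific information before clustering could look roughly like the sketch below. The exact procedure is not specified here, so the regular expression, the function name `mask_reasoning`, and the example company name are all hypothetical; the masked texts would then be passed to the topic model.

```python
import re

# Hypothetical pattern: how score mentions were actually detected is not
# stated, so this regex is an illustrative assumption only.
SCORE_PATTERN = re.compile(
    r"\b(?:score|rating)\s*(?:is|of|:)?\s*-?\d+(?:\.\d+)?",
    re.IGNORECASE,
)


def mask_reasoning(text, company_terms):
    """Strip score mentions and company-specific terms from one reasoning text."""
    masked = SCORE_PATTERN.sub(" ", text)
    for term in company_terms:
        # re.escape keeps names containing punctuation from being read as regex.
        masked = re.sub(re.escape(term), " ", masked, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", masked).strip()


# "ExampleCorp" is a made-up name used only for illustration.
example = "ExampleCorp faces weak demand, so the score is 35."
print(mask_reasoning(example, ["ExampleCorp"]))
```

Masking of this kind is meant to keep the clusters driven by the reasoning vocabulary rather than by the numeric score or the identity of the company.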

We further analyze the reasoning texts of the best-performing model, GLM-4, and the worst-performing model, Baichuan2-7B, clustering each into 10 groups with 10 keywords per group. The keywords of the two models are summarized and drawn as word clouds, shown in Figure 38 and Figure 39.

Figure 38: The word cloud of GLM-4.
Figure 39: The word cloud of Baichuan2-7B.

H.3 Analysis of Interactions

We process and analyze the information on interactions between companies and shareholders in a similar way.

Figure 40: Distribution of the score in interaction0(i+1), for i in {0, 2, 4, …, 8}.
Figure 41: Distribution of the score in interaction0(i+2), for i in {0, 2, 4, …, 8}.

H.4 Analysis of Risk-preference Questions

To detect risk-preference bias in Large Language Models (LLMs), we first query each model with three input methods (Direct, Instruct, and Translation). The resulting counts of risk-averse, risk-neutral, and risk-loving responses are reported in Table 11. This baseline sets the stage for the more specific analyses that follow.

Next, we examine how well the models follow an explicit instruction regarding risk aversion. By inputting the directive "You are a risk averse person," we quantify each model’s compliance through the risk-averse ratio, reported in Table 8. This part of the study gives insight into the models’ context-based adaptability and their interpretation of subjective instructions.

Further, to assess the impact of the framing effect on model responses, we translate a set of questions and examine any discrepancies that arise from the linguistic variation. The findings, which show how translation influences model outputs, are presented in Table 9. This analysis helps to show how framing effects can skew model perception and decision-making.

Lastly, we examine the models’ susceptibility to loss aversion by introducing scenarios framed around loss. The extent of loss-aversion bias in the models’ responses is summarized in Table 10. Together, these analyses aim to reveal the multifaceted nature of biases in LLMs, particularly in risk assessment and decision-making under uncertainty.

Table 8: Model Instruct Risk-aversion Performance Comparison
Model Risk-aversion (%)
GPT-4 89.5
Qwen-max 83.5
GLM-4 88.0
Qwen-72B 79.0
ChatGLM3-Turbo 62.5
Xuanyuan-70B 59.5
Qwen-14B 66.0
InternLM2-7B 42.5
Baichuan2-13B 44.0
FinQwen 45.0
Xuanyuan-13B 43.0
ChatGLM3-6B 53.0
InternLM2-20B 53.0
Qwen-7B 52.0
Baichuan2-7B 38.0
GPT-3.5 37.5
ChatGLM2-6B 35.0
Table 9: Models Translation Prompt Differences Comparison
Model Difference (%)
GPT-4 23.5
ChatGLM3-6B 25.0
Qwen-max 28.5
GLM-4 28.0
Xuanyuan-70B 36.0
GPT-3.5 33.0
Qwen-7B 42.0
InternLM2-7B 43.5
ChatGLM3-Turbo 46.5
FinQwen 49.0
ChatGLM2-6B 51.0
Xuanyuan-13B 51.5
Baichuan2-13B 54.5
InternLM2-20B 48.5
Qwen-72B 56.0
Baichuan2-7B 59.0
Qwen-14B 65.0
Table 10: Model Loss Aversion Bias Comparison
Model Risk-aversion (%)
Qwen-14B 51.0
GPT-3.5 52.0
FinQwen 57.5
Baichuan2-7B 58.5
InternLM2-7B 60.5
Qwen-max 62.0
Baichuan2-13B 63.0
InternLM2-20B 63.0
ChatGLM2-6B 61.5
Qwen-72B 65.5
Xuanyuan-13B 66.5
GLM-4 69.0
ChatGLM3-Turbo 74.0
Xuanyuan-70B 74.5
Qwen-7B 72.5
ChatGLM3-6B 75.0
GPT-4 84.0
Table 11: LLMs risk preference statistics
Model Method Risk Averter Risk Neutral Risk Lover
Baichuan2-7B Direct 67 35 98
Baichuan2-7B Instruct 76 24 100
Baichuan2-7B Translation 67 58 75
Qwen-72B Direct 81 79 40
Qwen-72B Instruct 160 32 8
Qwen-72B Translation 85 45 70
Qwen-14B Direct 52 46 102
Qwen-14B Instruct 134 27 39
Qwen-14B Translation 112 15 73
GLM-4 Direct 88 32 80
GLM-4 Instruct 178 12 10
GLM-4 Translation 92 39 69
ChatGLM2-6B Direct 73 13 114
ChatGLM2-6B Instruct 70 17 113
ChatGLM2-6B Translation 101 23 76
ChatGLM3-6B Direct 100 18 82
ChatGLM3-6B Instruct 107 21 72
ChatGLM3-6B Translation 99 28 73
Xuanyuan-70B Direct 99 24 77
Xuanyuan-70B Instruct 120 24 56
Xuanyuan-70B Translation 90 20 90
InternLM2-7B Direct 71 70 59
InternLM2-7B Instruct 86 69 45
InternLM2-7B Translation 81 63 56
Baichuan2-13B Direct 76 16 108
Baichuan2-13B Instruct 89 32 79
Baichuan2-13B Translation 61 39 100
Qwen-7B Direct 95 27 78
Qwen-7B Instruct 105 28 67
Qwen-7B Translation 106 24 70
InternLM2-20B Direct 76 76 48
InternLM2-20B Instruct 108 63 29
InternLM2-20B Translation 73 78 49
ChatGLM3-Turbo Direct 98 45 57
ChatGLM3-Turbo Instruct 127 48 25
ChatGLM3-Turbo Translation 62 85 53
GPT-3.5 Direct 54 35 111
GPT-3.5 Instruct 75 24 101
GPT-3.5 Translation 56 27 117
FinQwen Direct 65 42 93
FinQwen Instruct 91 34 75
FinQwen Translation 101 46 53
GPT-4 Direct 118 41 41
GPT-4 Instruct 181 11 8
GPT-4 Translation 128 39 33
Xuanyuan-13B Direct 83 26 91
Xuanyuan-13B Instruct 86 24 90
Xuanyuan-13B Translation 106 13 81
Qwen-max Direct 74 89 37
Qwen-max Instruct 169 26 5
Qwen-max Translation 99 62 39
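
The counts in Table 11 can be turned into percentage shares to compare how a model's risk profile shifts between input methods. The sketch below does this for a few rows copied from the table; the helper names `shares` and `averter_shift` are hypothetical, and since the computation behind the risk-averse ratio in Table 8 is not spelled out here, the shares obtained this way need not coincide with it.

```python
# Counts copied from Table 11 for a few models:
# (risk averter, risk neutral, risk lover).
TABLE_11_SUBSET = {
    ("GPT-4", "Direct"): (118, 41, 41),
    ("GPT-4", "Instruct"): (181, 11, 8),
    ("GPT-3.5", "Direct"): (54, 35, 111),
    ("GPT-3.5", "Instruct"): (75, 24, 101),
    ("Baichuan2-7B", "Direct"): (67, 35, 98),
    ("Baichuan2-7B", "Instruct"): (76, 24, 100),
}


def shares(counts):
    """Convert raw counts into percentage shares, rounded to one decimal."""
    total = sum(counts)
    return tuple(round(100 * count / total, 1) for count in counts)


def averter_shift(table, model):
    """Change in risk-averter share (percentage points), Direct to Instruct."""
    direct = shares(table[(model, "Direct")])[0]
    instruct = shares(table[(model, "Instruct")])[0]
    return round(instruct - direct, 1)


for model in ("GPT-4", "GPT-3.5", "Baichuan2-7B"):
    print(model, averter_shift(TABLE_11_SUBSET, model))
```

For these rows, the shift in the risk-averter share under the instruction is much larger for GPT-4 than for GPT-3.5 or Baichuan2-7B.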