Are LLMs Rational Investors?
A Study on Detecting and Reducing the Financial Bias in LLMs

Yuhang Zhou^1,2 Yuchen Ni³ Yunhui Gan^1,2 Zhangyue Yin¹
Xiang Liu⁴ Jian Zhang⁵ Sen Liu^1,2 Xipeng Qiu¹ Guangnan Ye^1,2 Hongfeng Chai^1,2

¹School of Computer Science, Fudan University
²Institute of Fintech, Fudan University
³School of Electronics and Information Engineering, Tongji University
⁴Tandon School of Engineering, New York University
⁵DataGrand Inc.
Email: yuhangzhou22@m.fudan.edu.cn (Yuhang Zhou)
Corresponding Author: Guangnan Ye. Email: yegn@fudan.edu.cn

Abstract

Large Language Models (LLMs) are increasingly adopted in financial analysis for interpreting complex market data and trends. However, their use is challenged by intrinsic biases (e.g., risk-preference bias) and a superficial understanding of market intricacies, necessitating a thorough assessment of their financial insight. To address these issues, we introduce Financial Bias Indicators (FBI), a framework with components like Bias Unveiler, Bias Detective, Bias Tracker, and Bias Antidote to identify, detect, analyze, and eliminate irrational biases in LLMs. By combining behavioral finance principles with bias examination, we evaluate 23 leading LLMs and propose a de-biasing method based on financial causal knowledge. Results show varying degrees of financial irrationality among models, influenced by their design and training. Models trained specifically on financial datasets may exhibit more irrationality, and even larger financial language models (FinLLMs) can show more bias than smaller, general models. We utilize four prompt-based methods incorporating causal debiasing, effectively reducing financial biases in these models. This work enhances the understanding of LLMs’ bias in financial applications, laying the foundation for developing more reliable and rational financial analysis tools.

1 Introduction

Recent advancements in LLMs, such as GPT-4 OpenAI (2023) and LLaMA-2 Touvron et al. (2023), have shown their prowess across a spectrum of natural language processing (NLP) tasks Zhao et al. (2023), extending into specialized domains such as finance Zhang et al. (2023), law Cui et al. (2023), and healthcare Wang et al. (2023). Despite their versatility, these models grapple with inherent biases Gallegos et al. (2023), Rutinowski et al. (2023), encompassing gender, race, and socioeconomic disparities, which could compromise their reliability and entail significant consequences Jeoung et al. (2023). Efforts to mitigate such biases have led to benchmark datasets such as StereoSet Nadeem et al. (2020) for stereotype identification, GenderCare Tang et al. for gender bias detection, and OpinionGPT Haller et al. (2023) for generating bias-neutral content, alongside techniques such as LoRA Hu et al. (2021) for model fine-tuning, especially within the social sciences. As these models are applied in specific domains, it becomes imperative to address their biases to ensure fair outcomes.

Figure 1: An example of model irrationality, where the model gives inconsistent expectations for the same event with different subjects, resulting in different emotions and reflecting its financial bias towards the company.

However, in the realm of financial LLMs (FinLLMs), research has predominantly concentrated on enhancing model performance through continued pre-training or fine-tuning, with evaluation metrics focused on NLP tasks and financial applications. Moreover, LLMs have been employed in financial analytics, acting as advisors by leveraging news Lopez-Lira and Tang (2023) and fundamental analyses Fatouros et al. (2024) for investment decisions across diverse securities Romanko et al. (2023). However, the efficacy of these models is contingent upon the models’ own rationality as participants in the market. A lack of rationality in LLMs could lead to misinterpretations and misapplications of market dynamics, adversely impacting not only the users of these models but also the broader economy. Figure 1 is an example of model irrationality.

Hence, it is essential to assess the rationality of LLMs before incorporating them into financial advisory roles. Limited research has addressed the financial biases present in pre-trained models, exploring methods such as probabilistic detection Chuang and Yang (2022) and consistency checks Yang et al. (2023). However, these methods mainly focus on pre-trained embedding models like BERT Devlin et al. (2019), limiting their application to LLM bias detection. The absence of evaluation standards for financial biases hinders thorough oversight, compromising assessment realism and objectivity. Consequently, there is an immediate need for a holistic framework to gauge LLMs’ rationality in finance.

Meanwhile, the detection of financial rationality and biases in LLMs faces four main challenges:

• Q1: How to define the financial rationality and biases of LLMs? A theoretical framework is required to support the detection of LLMs.
• Q2: How to detect and reveal financial biases in LLMs? A method needs to be developed to quantify theoretical indicators and construct relevant datasets.
• Q3: How to investigate the origins of financial biases in LLMs? It is necessary to research whether these financial biases stem from the model’s capabilities or its robustness.
• Q4: How to mitigate financial biases in LLMs? Methods need to be found that mitigate biases without compromising the original capabilities of the model.

To work towards this goal, our study conducts an examination of the financial rationality of LLMs based on the theory of behavioral finance Barberis and Thaler (2003). We believe that the enduring theory of behavioral finance, grounded in psychology and finance, can provide a more comprehensive perspective to support research findings. In the current scenario where it is challenging to quantify behavioral finance, we propose the Financial Bias Indicators (FBI) framework to comprehensively assess financial rationality in LLMs. The FBI framework consists of four components: Bias Unveiler, Bias Detective, Bias Tracker, and Bias Antidote, covering the definition, detection, cause analysis, and mitigation of financial biases.

Our research shows that almost all LLMs exhibit financial irrationality. These biases, which may be exacerbated by continuous pre-training or fine-tuning with financial data, could lead to market anomalies in real-world applications. While prompt-based mitigation methods show promise, the persistent biases in LLMs highlight the necessity for further research to improve model robustness, fairness, and rationality, ensuring financial market stability and asset protection.

Our key contributions to this field are summarized as follows:

• The FBI framework, based on behavioral finance, defines, detects, analyzes, and mitigates financial biases in LLMs, offering a novel approach to evaluate their financial rationality. It is the first study to quantify behavioral finance indicators in LLMs, paving the way for more reliable LLMs.
• The framework extensively analyzes 23 leading LLMs, assessing how model parameters, training data, and input formats affect financial rationality. Our study deepens the understanding of the varying levels of financial irrationality among models and their behavior in financial contexts.
• Utilizing the FBI framework, we explore the origins of financial irrationality in LLMs, identify methods to mitigate bias, and develop a dataset of 200,000 financial causal texts named FinCausal to address biases with causal knowledge. Ultimately, we experimented with four prompt-based methods for bias mitigation, yielding encouraging results.

2 Background

Behavioral finance explores how psychological factors and cognitive biases influence the financial behaviors of individuals, institutions, and markets, differing from traditional finance theories that assume rational market participants. The field aims to uncover the psychological roots of various market phenomena, dissecting financial decision-making processes to build a realistic framework of market dynamics that includes cognitive errors and the constraints on arbitrage. We structure our investigation using the classification from Barberis and Thaler (2003), which divides behavioral finance into Cognitive Bias and Limits to Arbitrage.

2.1 Cognitive Bias

Cognitive biases represent systematic departures from normative decision-making, influencing how investors form beliefs and assess risks. These biases are multifaceted, manifesting as erroneous beliefs or inconsistent risk preferences. For instance, belief biases, such as attentional neglect, representativeness bias, anchoring effect, and overconfidence, skew investors’ expectations, while risk-preference biases, such as loss aversion and reference dependence, lead to irregularities in risk assessment and decision-making under uncertainty. A comprehensive taxonomy of these biases, alongside their definitions, is detailed in Appendix C.

2.2 Limits to Arbitrage

Contrary to the Efficient Markets Hypothesis (EMH), which posits that asset prices fully reflect all available information and that market participants behave rationally, behavioral finance identifies scenarios where irrationality prevails, leading to anomalies like market bubbles and systemic crises. These phenomena are often attributed to the collective impact of cognitive biases on investor expectations, which can cause significant deviations from asset fundamentals.

3 FBI: A Framework for Assessing LLMs Financial Rationality

We propose the FBI framework illustrated in Figure 2. This framework is divided into four parts: Bias Unveiler defines financial biases in LLMs based on behavioral finance; Bias Detective constructs detection data and evaluates current leading LLMs for biases; Bias Tracker analyzes the causes of biases based on detection results and attention mechanisms; Bias Antidote builds a financial causal dataset and employs a series of prompt-based methods to mitigate bias phenomena.

4 Bias Unveiler

Based on the definitions from behavioral finance, we categorize biases in LLMs within financial contexts into Belief Bias and Risk-preference Bias. Within these categories, we define six related psychological biases.

4.1 Belief Bias

In today’s information-rich environment, constant updates require adjusting our predictions and beliefs. This framework investigates three cognitive biases—Anchoring, Representativeness, and Overconfidence—using real-world market data like news and shareholder discussions to test LLMs’ ability to maintain rationality in market conditions.

Anchoring Effects

We test LLMs for Anchoring Effects by checking if they show different views on the same event or give consistent responses under different company settings. This bias, derived from past data, can introduce sentiment analysis biases and potentially disrupt markets when used in finance.

Representativeness Bias

We investigate Representativeness Bias in LLMs by analyzing their outputs in relation to company size and sector. This bias towards size and industry can concentrate investment risks and cause problems.

Overconfidence

To measure overconfidence, we track score fluctuations for the same events with different subjects in FinLLMs and corresponding base LLMs. Aggressive scores with high deviation suggest overconfidence in these models’ event assessments or responses.

4.2 Risk-preference Bias

Asset returns are uncertain, affecting investor behavior based on risk-return preferences. Our study of Risk-preference Bias explains Situational Dependence Bias, loss aversion, and framing effect in various decision contexts, assessing LLMs’ risk preferences in different scenarios.

Situational Dependence Bias

Recognizing decision-making as a process shaped by previous experiences and contextual factors, we delve into the Situational Dependence Bias by examining if LLMs exhibit variable risk preferences across different scenarios.

Loss Aversion

In the context of Loss Aversion, we scrutinize LLM responses within loss-framed scenarios, aiming to uncover any predominant risk-averse or risk-loving tendencies.

Framing Effect

We investigate the Framing Effect by restating scenarios in various languages or expressions to track changes in LLM preferences, aiming to determine if linguistic framing influences LLM outputs, indicating bias in option presentation.

5 Bias Detective

5.1 Belief Bias

5.1.1 Data Design

To comprehensively assess the rationality of LLMs in financial markets, we scrutinized their responses to emergent information, distinguishing between event news and interactions, amid the prevalent noise in online investor dialogues in China. We analyzed historical events impacting company stock prices and adopted a refined classification of financial events into four primary categories: Corporate Governance and Equity Changes (CGEC), Financial Reports and Earnings Expectations (FREE), Market Behavior and Announcements (MBA), and Negative Events and Risk Management (NERM), detailed in Appendix C.

Throughout 2023, we compiled 300 news articles, including a subset of 24 emotionally nuanced pieces $N^{\prime}=\{n_{1},n_{2},\ldots,n_{24}\}$ to enhance the bias detection in LLMs. This subset, detailed in Appendix C, contained articles categorized into nine positive, nine negative, and six mixed emotions.

Additionally, we assembled 10 neutral interaction pairs $I=\{(q_{1},r_{1}),(q_{2},r_{2}),\ldots,(q_{10},r_{10})\}$ to analyze LLM comprehension in a controlled environment.

For each $n\in N^{\prime}$ and $(q,r)\in I$ , numerical details were abstracted to proportional figures to standardize data across varying company market caps, as documented in Appendix F.

We sampled 600 companies from the Chinese A-share market, excluding delisting entities, distributed across three tiers of market capitalization: top, middle, and bottom, each containing 200 companies. This selection method aimed to test Belief Bias in LLMs, with classifications by industry outlined in Appendix B.

5.1.2 Analysis Logic

To investigate Anchoring Effects, we altered the subject company in each news item $n\in N^{\prime}$ and assessed variability in LLM evaluations using Analysis of Variance (ANOVA) St et al. (1989), expressed as $F(n,C)$ , examining the variance in scores across companies $c_{i}\in C$ for each news item $n$ . Representativeness Bias was analyzed by correlating LLM outputs with company size and industry sector, using the Spearman correlation coefficient $\rho(S,M)$ , where $S$ and $M$ represent LLM scores and market capitalizations of companies $c_{i}$ , respectively. ANOVA was employed for industry correlation $F(S,I)$ , with box plots to display score distributions across industries $I$ , reflecting first-level industry classifications. To measure Overconfidence, we used the standard deviation $\sigma(S)$ of scores $S$ , representing the spread of LLM evaluations of the same event across different company contexts, and compared results from FinLLMs with those of their base LLMs.
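The three statistics named above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' evaluation code: the function names, the toy scores, and the assumption that each company receives several scores for the same news item (for example from repeated sampling) are all hypothetical.

```python
from statistics import mean, pstdev

def anova_f(groups):
    """One-way ANOVA F statistic for a list of score groups."""
    all_scores = [s for g in groups for s in g]
    grand = mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((s - mean(g)) ** 2 for g in groups for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            out[order[t]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores for ONE news item with the subject company swapped.
scores_by_company = {"A": [6, 7, 6], "B": [3, 4, 3], "C": [5, 5, 6]}
market_cap = {"A": 100.0, "B": 50.0, "C": 10.0}  # made-up market caps

f_stat = anova_f(list(scores_by_company.values()))          # F(n, C)
mean_scores = [mean(v) for v in scores_by_company.values()]
rho = spearman(mean_scores, [market_cap[c] for c in scores_by_company])  # rho(S, M)
sigma = pstdev([s for v in scores_by_company.values() for s in v])       # sigma(S)
print(f_stat, rho, sigma)
```

A large F statistic here would mean that scores differ more between companies than within a company, which is the pattern the Anchoring Effect test looks for.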

5.1.3 Result

The evaluation of Belief Bias is principally conducted through the examination of event news and interactions. This analysis reveals a widespread Anchoring Effect across the majority of LLMs when the subjects of events and interactions are modified, with slight variations observed among different models. The average variance index, detailed in Figure 3, sheds light on the rationality levels of LLMs with respect to representativeness bias. Specifically, LLMs with a focus on the Chinese language, such as the GLM and Qwen families, exhibit commendable financial rationality, whereas the Xuanyuan and Baichuan families are more susceptible to irrational behavior.

In terms of Overconfidence, violin plots, presented in Appendix H.1, illustrate the score distributions of various texts across all models. A notable disparity is observed in the models’ responses to composite texts of positive and negative emotional content. As per Table 5, the GPT and InternLM models display a marked optimism, in contrast to the pronounced pessimism of the Qwen and GLM families. Furthermore, Figure 4 highlights that models trained on financial corpora show higher score variability than their base counterparts.

In terms of Representativeness Bias, all LLMs exhibited a correlation coefficient below 10% between model scores and market capitalization, indicating a weak correlation. However, certain LLMs showed clear biases towards specific industries, as documented in Appendix H.1. For example, the FinQwen model consistently allocated lower scores to the Steel and Banking sectors. A comprehensive analysis reveals that the Media, Steel, Banking, and Non-Banking Finance sectors frequently occupy the extreme ends of the scoring spectrum across different models, whereas the Computer Science and Automobile sectors generally maintain a middle ground, exhibiting relative stability.

5.2 Risk-preference Bias

5.2.1 Data Design

To simulate real-life decision-making, we designed 40 scenarios with 200 multiple-choice questions, divided into 200 gain-framed and 100 loss-framed scenarios. Each question $Q_{i}$ presents three decision alternatives $A_{i,j}$ , where $j$ represents risk preferences: Risk-loving, Risk-neutral, Risk-averse. The alternatives are randomized to minimize bias Zheng et al. (2023).

The alternatives are constructed based on expected utility theory Simon et al. (1994), represented by:

$$E[u(x)]=\sum_{x(\omega)}u(x(\omega))\,p(x(\omega)), \tag{1}$$

where $u(x)$ is the utility function, $x(\omega)$ the outcome, and $p(x(\omega))$ the outcome’s probability.

The concavity of $u(x)$ , reflecting risk preferences, is indirectly assessed via second-order Taylor expansion:

$$E[u(x)]\approx u(E[x])+\frac{1}{2}u^{\prime\prime}(E[x])\,\text{Var}(x). \tag{2}$$

This approximation adjusts risk preferences by modulating variance, with technical details in Appendix E.
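To make the role of variance concrete, the following sketch compares the exact expected utility of Eq. (1) with the second-order approximation of Eq. (2) for three alternatives that share the same mean but differ in variance. The logarithmic utility and the payoffs are assumptions chosen purely for illustration; the paper's own construction is in Appendix E.

```python
import math

def expected_utility(outcomes, probs, u):
    """Exact E[u(x)] as in Eq. (1)."""
    return sum(p * u(x) for x, p in zip(outcomes, probs))

def taylor_expected_utility(outcomes, probs, u, u2):
    """Second-order approximation as in Eq. (2); u2 is the second derivative of u."""
    m = sum(p * x for x, p in zip(outcomes, probs))
    var = sum(p * (x - m) ** 2 for x, p in zip(outcomes, probs))
    return u(m) + 0.5 * u2(m) * var

# Assumed concave (risk-averse) utility: u(x) = ln x, so u''(x) = -1 / x^2.
u = math.log
u2 = lambda x: -1.0 / x ** 2

# Three hypothetical alternatives with the same expected payoff of 100.
safe = ([100.0], [1.0])
mild = ([90.0, 110.0], [0.5, 0.5])
risky = ([50.0, 150.0], [0.5, 0.5])

for name, (xs, ps) in {"safe": safe, "mild": mild, "risky": risky}.items():
    exact = expected_utility(xs, ps, u)
    approx = taylor_expected_utility(xs, ps, u, u2)
    print(f"{name:5s} exact={exact:.5f} approx={approx:.5f}")
```

With a concave utility, expected utility falls as variance grows while the mean is held fixed, which is why varying only the variance separates risk-averse, risk-neutral, and risk-loving choices.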

5.2.2 Analysis Logic

We investigated the Situational Dependence Bias in LLMs by analyzing their risk preferences across different scenarios $S_{i}$ . We examined choices in gain-framed scenarios $S_{i,\text{gain}}$ to detect any situational bias in risk preferences. In the context of Loss Aversion, we scrutinized LLM responses in loss-framed scenarios $S_{i,\text{loss}}$ to assess tendencies towards risk-aversion or risk-loving behaviors. To explore the Framing Effect, we translated scenarios from Chinese to English and monitored shifts in LLM preferences $P_{\text{LLM}}$ , observing how linguistic framing affects decisions. This approach aims to reveal if LLM outputs are biased by the linguistic construction of scenarios and choices.
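One simple way to quantify such a shift is sketched below: compute the preference distribution under each language and the share of questions whose chosen risk preference changes after translation. The metric, the function names, and the toy answers are assumptions for illustration only, not necessarily the measure used in the paper.

```python
from collections import Counter

LABELS = ("risk-loving", "risk-neutral", "risk-averse")

def preference_distribution(choices):
    """Share of answers falling into each risk-preference label."""
    counts = Counter(choices)
    return {label: counts.get(label, 0) / len(choices) for label in LABELS}

def framing_shift(choices_zh, choices_en):
    """Fraction of questions whose chosen preference differs across languages."""
    changed = sum(a != b for a, b in zip(choices_zh, choices_en))
    return changed / len(choices_zh)

# Hypothetical answers of one model to the same four questions.
zh = ["risk-averse", "risk-averse", "risk-neutral", "risk-loving"]
en = ["risk-averse", "risk-neutral", "risk-neutral", "risk-averse"]

print(preference_distribution(zh))
print(preference_distribution(en))
print(framing_shift(zh, en))
```

A model free of the Framing Effect would give a shift close to zero, since translation changes the wording but not the underlying payoffs.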

5.2.3 Result

The exploration of Risk-preference Bias entails the examination of LLM decisions across varied scenarios. The compiled results, particularly for gain-framed queries, are tabulated in Table 11. Most models exhibit distinct risk preferences in different scenarios, indicating a pronounced Situational Dependence Bias. Nonetheless, prefacing prompts with an instruction stating that the model is risk-averse significantly attenuates this bias. For loss-framed queries, some models, such as GPT-4, exhibit a pronounced Loss Aversion Bias, as shown in the Loss Aversion Bias table. Moreover, as the translation-difference table shows, translating all queries into English produced notable discrepancies between the models’ responses to the Chinese and English versions, underscoring a pronounced Framing Effect.

In particular, we have selected several representative cases for analysis, as shown in Figure 5. For Xuanyuan-13B, inducing Risk-Aversion does not alter its original preference distribution, but it exhibits a stronger Framing Effect. For GLM4, it performs well in terms of Framing Effect bias and can effectively switch preferences based on instructions. For Qwen-14B, it is capable of some preference distribution shift according to instructions, but also exhibits a significant Framing Effect.

6 Bias Tracker

After detecting Belief Bias and Risk-Preference Bias in LLMs, it was evident that Belief Bias had a more significant impact compared to Risk-Preference Bias. Risk-Preference Bias manifests primarily as the model’s inherent decision-making tendency, whereas Belief Bias results from the model’s misinterpretation of information, leading us to focus on analyzing Belief Bias formation.

To further investigate, we utilized Chain-of-Thought (COT) methods Wei et al. (2022) to engage the model’s System 2 thinking, aiming to enhance its in-depth analysis capabilities. We also scrutinized the model’s output, particularly the attention importance distribution across input tokens, enabling precise diagnosis of bias causes and informing subsequent bias correction strategies.

6.1 Slow Thinking

6.1.1 Methodology

To investigate the roots of score instability, whether due to inadequate reasoning or compromised rationality, we first employ a slow-thinking approach similar to CoT, prompting the model to generate reasoning before providing scores. This approach helps us study the causes of Belief Bias by enabling the model to produce reasoning for its evaluations, followed by the actual scoring. We utilize Bertopic Grootendorst (2022) for topic word extraction within the reasoning texts provided by the models. By clustering reasoning texts and extracting keywords, we analyze score discrepancies across different clusters, denoted by $\Delta S_{\text{clusters}}$ , to identify if certain thematic focuses lead to inconsistent evaluations. This detailed analysis helps us pinpoint specific causes of bias, thus informing future model optimizations and bias corrections.
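Once each reasoning text has a cluster assignment (from BERTopic or any other clustering method), the cross-cluster comparison reduces to grouping scores by cluster. The sketch below assumes that $\Delta S_{\text{clusters}}$ is summarized as the gap between the highest and lowest cluster mean; that definition, the function name, and the toy data are hypothetical, since the text does not spell out the formula.

```python
from collections import defaultdict
from statistics import mean

def cluster_score_gap(cluster_ids, scores):
    """Mean score per cluster, plus the gap between the extreme cluster means."""
    by_cluster = defaultdict(list)
    for cid, score in zip(cluster_ids, scores):
        by_cluster[cid].append(score)
    means = {cid: mean(vals) for cid, vals in by_cluster.items()}
    return means, max(means.values()) - min(means.values())

# Hypothetical cluster assignments and scores for five reasoning texts
# that all describe the same event.
cluster_ids = [0, 0, 1, 1, 2]
scores = [6, 8, 3, 5, 7]

cluster_means, delta_s = cluster_score_gap(cluster_ids, scores)
print(cluster_means, delta_s)
```

A large gap for a single event suggests that the score depends on which theme the reasoning happened to emphasize rather than on the event itself.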

6.1.2 Result

The application of the CoT methodology, aimed at exploring the causes of financial bias in the model, yielded the results shown in Appendix H.2. The clustering and keyword analysis indicate that this intentional cognitive strategy improves the logical consistency and financial acumen of the model outputs. A comparison of the reasoning keywords for the top-performing GLM-4 model and the underperforming Baichuan2-7B model, illustrated in Figure 39 and Figure 39 respectively, reveals that GLM-4 exhibits stronger logical coherence, with primary keywords such as "termination", "transaction restructuring", and "withdrawal". In contrast, the Baichuan2-7B model’s logic is weaker, with primary keywords including "change", "decision", and "information". Moreover, the clustering of the models’ inferential outputs reveals substantial variations in the ratings assigned to each category. This divergence underscores that the irrationality exhibited by models in financial contexts stems more from their inherent cognitive processes than from their computational capabilities.

6.2 Attention Importance

6.2.1 Methodology

To verify whether the bias in the LLMs arises from an excessive focus on certain input tokens due to the training corpus, we examine the attention importance of LLMs for input sequences. Inspired by Wu et al. (2023), we define the importance $I_{n,m}$ of input token $x_{n}$ to output token $y_{m}$ as:

$$I_{n,m}=p(y_{m}|Z_{m})-p(y_{m}|Z_{m,/n}), \tag{3}$$

where $Z_{m}$ is the context to generate $y_{m}$ by concatenating the prompt $X$ and the first $m-1$ tokens of response $Y$ . $Z_{m,/n}$ omits token $x_{n}$ from $Z_{m}$ , and $p(\cdot|\cdot)$ is the conditional probability computed by the language model $f$ . We accelerate it with the first-order approximation:

$$I_{n,m}\approx\frac{\partial f(y_{m}|Z_{m})}{\partial E_{i}[x_{n}]}\cdot E_{i}[x_{n}]^{\top}, \tag{4}$$

where $E_{i}[x_{n}]$ is the input word embedding of token $x_{n}$ . This approach helps us determine whether the LLM excessively focuses on specific financial entities, thereby more accurately diagnosing the specific causes of bias.
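The relationship between the leave-one-out importance of Eq. (3) and the gradient-times-embedding approximation of Eq. (4) can be checked on a toy model. In the sketch below, the "language model" is a hypothetical stand-in, a sigmoid over the summed token embeddings, chosen so that the gradient has a closed form; the weights and embeddings are invented, and none of this is the authors' implementation.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def prob(embeddings, w):
    """Toy stand-in for p(y_m | Z_m): sigmoid of w . (sum of token embeddings)."""
    pooled = [sum(col) for col in zip(*embeddings)]
    return sigmoid(sum(wi * pi for wi, pi in zip(w, pooled)))

def exact_importance(embeddings, w, n):
    """Eq. (3): probability with the full context minus probability without token n."""
    without = embeddings[:n] + embeddings[n + 1:]
    return prob(embeddings, w) - prob(without, w)

def first_order_importance(embeddings, w, n):
    """Eq. (4): gradient of the probability w.r.t. E[x_n], dotted with E[x_n]."""
    p = prob(embeddings, w)
    grad = [p * (1.0 - p) * wi for wi in w]  # closed-form gradient of the toy model
    return sum(g * e for g, e in zip(grad, embeddings[n]))

# Hypothetical 2-dimensional embeddings for a three-token prompt.
w = [1.0, -0.5]
embeddings = [[0.2, 0.1], [0.05, 0.0], [-0.1, 0.3]]

for n in range(len(embeddings)):
    print(n, exact_importance(embeddings, w, n), first_order_importance(embeddings, w, n))
```

The approximation needs one backward pass for all tokens instead of one forward pass per removed token, which is where the speed-up comes from; in this toy setting the two values agree closely because the embeddings are small.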

6.2.2 Result

Due to the limitations of Chinese tokens, we chose to use LLMs with BPE tokenizers, focusing on the well-performing MiniCPM-2B and the more biased Baichuan2-7B models; the results are shown in Figure 6. We analyzed the attention each model pays to each input token in their outputs and found that Baichuan2-7B tends to focus more on financial company entities, industries, and their surrounding tokens. This excessive attention to irrelevant information further exacerbates the generation of financial biases.

7 Bias Antidote

7.1 Methodology

To eliminate Belief Bias in LLMs while preserving their original general capabilities, we employed four prompt-based methods for bias mitigation. 1) We utilized a CoT approach to enable the model to engage in slow thinking, thereby producing scores based on logical reasoning. 2) We implemented the S2A method Weston and Sukhbaatar (2023) to shield the model from irrelevant context, allowing for secondary reasoning before scoring. 3) We used Entity Replace to stabilize the model’s input. 4) We applied FinCausal, a knowledge-based understanding of financial causal relationships, to remove biases. For the last method, we extracted 200,000 pieces of financial causal knowledge about industries and individual stocks from past research reports and used a retrieval-augmented generation (RAG) approach to recall relevant causal information.
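As an illustration of the third method, the sketch below shows one plausible form of entity replacement: every known company name in the input is swapped for a neutral placeholder before the text is sent to the model. The function name, the placeholder string, and the example names are hypothetical; the text does not specify how Entity Replace is implemented, so this should be read as an assumption.

```python
import re

def replace_entities(text, entity_names, placeholder="Company A"):
    """Swap every listed company name for a neutral placeholder."""
    if not entity_names:
        return text
    # Longest names first, so a short name never matches inside a longer one.
    ordered = sorted(entity_names, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in ordered)
    return re.sub(pattern, placeholder, text)

# Hypothetical news snippet and company aliases.
news = "Acme Steel Group announced a buyback; Acme Steel shares rose."
masked = replace_entities(news, ["Acme Steel", "Acme Steel Group"])
print(masked)
```

Because the model no longer sees which company is involved, any score difference that was driven by the company name alone cannot arise.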

Table 1: Debiasing results of the four prompt-based methods; smaller values indicate lower levels of bias. Arrows and parenthesized values give the change relative to Direct.

Method	GLM-4	ChatGLM3-Turbo	MiniCPM-2B	Xuanyuan2-6B	Baichuan2-7B
Direct	0.598	1.067	1.409	13.999	28.106
COT	5.382 ↑ (+4.784)	5.671 ↑ (+4.604)	7.039 ↑ (+5.630)	6.668 ↓ (-7.331)	12.660 ↓ (-15.446)
S2A	3.230 ↑ (+2.632)	4.406 ↑ (+3.339)	8.380 ↑ (+6.971)	9.756 ↓ (-4.243)	25.958 ↓ (-2.148)
Entity Replace	0.710 ↑ (+0.112)	1.012 ↓ (-0.055)	1.385 ↓ (-0.024)	1.236 ↓ (-12.763)	12.688 ↓ (-15.418)
FinCausal	0.769 ↑ (+0.171)	2.200 ↑ (+1.133)	1.195 ↓ (-0.214)	10.111 ↓ (-3.888)	8.763 ↓ (-19.343)

7.2 Result

For Belief Bias, we employed four prompt-based elimination methods and selected five representative LLMs for bias elimination experiments, as shown in Table 1. The CoT method performed well on the originally more biased models, as this reasoning approach enhances logical consistency in responses, thereby improving robustness and reducing bias. However, it performed poorly on models that were originally well-performing, as the increased output length due to the auto-regressive nature of these models resulted in amplified bias. For the S2A method, the effectiveness of increasing model output to reduce irrelevant attention depends on the model’s original capability; weaker models tend to exhibit greater bias. The Entity Replace method showed superior performance due to the substitution of financial topics. For the FinCausal method, each test item retrieved four related causal knowledge entries, further enhancing the model’s reasoning ability to mitigate bias.

8 Discussion

Our study using the FBI framework aids in identifying and reducing irrationalities in finance sector models, enhancing understanding of LLM biases and logic, and applying methods to mitigate financial biases.

8.1 Model Size

The analysis in Section 5.1.3 reveals that within a specific model family, the degree of bias tends to decrease as the number of model parameters grows, in line with the scaling law Kaplan et al. (2020). However, this trend is not consistent across different model families, where bias levels are also shaped by factors such as model design and training approaches.

8.2 Training Data

The FBI framework assessed financial rationality in general and financial LLMs. Financially-trained models, as seen in Figure 4, may exhibit higher score variability and risk inclination, potentially increasing financial irrationality. Models like ChatGLM2-6B and Qwen-7B show opposite industry biases (Figure 15-Figure 32), suggesting temporal biases in training data align with industry cycle rotation. Financial data, including potentially embellished research reports, can enhance LLMs for finance-specific NLP tasks but may also embed financial irrationality. Using these models in financial quantification could lead to unintended disruptions, demanding immediate attention.

8.3 Input Forms

The FBI framework utilized four prompt-based methods to eliminate biases, each of which produced certain effects, with the elimination being more pronounced when the model’s initial bias was more severe. As observed in Table 1, methods that increase output length, such as COT and S2A, can potentially exacerbate biases. The Entity Replace method, which reduces input, often yields better results, and the FinCausal method, which increases input, can further enhance the model’s ability for financial causal reasoning, achieving the effect of bias mitigation.

9 Conclusion

Our research introduces the FBI framework, a novel method for evaluating the financial rationality of LLMs in the intricate field of financial analysis. We rigorously examined 23 leading LLMs and revealed substantial differences in their financial rationality. By defining, detecting, analyzing, and mitigating financial biases, our study validates the capabilities and limitations of LLMs in financial contexts, offering reliable insights for their application in the finance sector. The advancement of LLMs towards greater financial acumen and reduced bias is essential for their dependable use in financial analysis, paving the way for a future where AI-generated insights are confidently and precisely applied in the financial industry.

Limitation

Our financial rationality analysis focuses on biases concerning the Chinese A-share market, which may vary with culture and region, so the findings may not generalize to other settings. In addition, we used other models during model evaluation, which may introduce further biases.

Ethics Statement

This paper honors the EMNLP Code of Ethics. The dataset used in the paper does not contain any private information. All annotators have received enough labor fees corresponding to their amount of annotated instances. The code and data are open-sourced under the MIT license.

References

Barberis and Thaler (2003) Nicholas Barberis and Richard Thaler. 2003. A survey of behavioral finance. Handbook of the Economics of Finance, 1:1053–1128.
Chuang and Yang (2022) Chengyu Chuang and Yi Yang. 2022. Buy tesla, sell ford: Assessing implicit stock market preference in pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 100–105.
Cui et al. (2023) Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. 2023. Chatlaw: Open-source legal large language model with integrated external knowledge bases.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding.
Fatouros et al. (2024) Georgios Fatouros, Konstantinos Metaxas, John Soldatos, and Dimosthenis Kyriazis. 2024. Can large language models beat wall street? unveiling the potential of ai in stock selection. arXiv preprint arXiv:2401.03737.
Gallegos et al. (2023) Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. 2023. Bias and fairness in large language models: A survey.
Grootendorst (2022) Maarten Grootendorst. 2022. Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794.
Haller et al. (2023) Patrick Haller, Ansar Aynetdinov, and Alan Akbik. 2023. Opiniongpt: Modelling explicit biases in instruction-tuned llms.
Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Jeoung et al. (2023) Sullam Jeoung, Yubin Ge, and Jana Diesner. 2023. Stereomap: Quantifying the awareness of human-like stereotypes in large language models. arXiv preprint arXiv:2310.13673.
Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
Lopez-Lira and Tang (2023) Alejandro Lopez-Lira and Yuehua Tang. 2023. Can chatgpt forecast stock price movements? return predictability and large language models. arXiv preprint arXiv:2304.07619.
Nadeem et al. (2020) Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. Stereoset: Measuring stereotypical bias in pretrained language models.
OpenAI (2023) OpenAI. 2023. Gpt-4 technical report.
Romanko et al. (2023) Oleksandr Romanko, Akhilesh Narayan, and Roy H Kwon. 2023. Chatgpt-based investment portfolio selection. In Operations Research Forum, volume 4, page 91. Springer.
Rutinowski et al. (2023) Jérôme Rutinowski, Sven Franke, Jan Endendyk, Ina Dormuth, Moritz Roidl, Markus Pauly, et al. 2023. The self-perception and political biases of chatgpt. Human Behavior and Emerging Technologies, 2024.
Simon et al. (1994) Carl P Simon, Lawrence Blume, et al. 1994. Mathematics for economists, volume 7. Norton New York.
St et al. (1989) Lars St, Svante Wold, et al. 1989. Analysis of variance (anova). Chemometrics and intelligent laboratory systems, 6(4):259–272.
(19) Kunsheng Tang, Wenbo Zhou, Jie Zhang, Aishan Liu, Gelei Deng, Shuai Li, Peigui Qi, Weiming Zhang, Tianwei Zhang, and Nenghai Yu. Gendercare: A comprehensive framework for assessing and reducing gender bias in large language models.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wang et al. (2023) Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023. Huatuo: Tuning llama model with chinese medical knowledge.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
Weston and Sukhbaatar (2023) Jason Weston and Sainbayar Sukhbaatar. 2023. System 2 attention (is something you might need too).
Wu et al. (2023) Xuansheng Wu, Wenlin Yao, Jianshu Chen, Xiaoman Pan, Xiaoyang Wang, Ninghao Liu, and Dong Yu. 2023. From language modeling to instruction following: Understanding the behavior shift in llms after instruction tuning. arXiv preprint arXiv:2310.00492.
Yang et al. (2023) Linyi Yang, Yingpeng Ma, and Yue Zhang. 2023. Measuring consistency in text-based financial forecasting models. arXiv preprint arXiv:2305.08524.
Zhang et al. (2023) Xuanyu Zhang, Qing Yang, and Dongliang Xu. 2023. Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters.
Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models.
Zheng et al. (2023) Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2023. Large language models are not robust multiple choice selectors. arXiv e-prints, pages arXiv–2309.

Appendix A Cognitive Bias

Following previous research, we classify cognitive biases into Belief Bias and Risk-Preference Bias, and study seven of these biases within the FBI framework. See Table 2 for details.

Table 2: Summary of Cognitive Biases

Cognitive Bias	Bias Type	Definition
Belief Bias	Limited Attention	The brain has two systems when working: fast thinking and slow thinking. It uses intuition to deal with things quickly.
	Representativeness bias	When making probability estimates, people tend to focus on certain representative features, ignoring environmental probabilities and sample size.
	Anchoring effect	Decision-making is often influenced by the first information received, like an anchor sinking to the bottom of the sea.
	Overconfidence	Belief that one’s knowledge is more accurate than the facts; one’s information is given more weight.
Risk-Preference Bias	Situational dependence bias	The effect of a stimulus depends largely on the context in which it occurs.
	Loss aversion	Sensitivity to losses exceeds gains of equal value.
	Framing effect	Different descriptions of an objectively identical problem lead to different decision-making judgments.

Appendix B Company Profile

To avoid bias caused by market value, we did not draw stocks from the CSI 300 or CSI 500. Instead, we started from all listed companies in China. After removing ST stocks, we selected the top, middle, and bottom 200 stocks by market value, for a total of 600 stocks. The industry distribution of these stocks is shown in Figure 7.
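The selection rule described above can be illustrated with a minimal sketch. It assumes that ST stocks can be recognised by an "ST" or "*ST" prefix in the company name and that the "middle" group is the slice centred in the market-value ranking; neither detail is specified in the paper, and the function names and toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def is_st(name: str) -> bool:
    """Assumption: ST stocks carry an 'ST' or '*ST' prefix in their name."""
    return name.upper().lstrip("*").startswith("ST")


def select_by_market_value(
    stocks: Dict[str, Tuple[str, float]], n_per_group: int = 200
) -> Tuple[List[str], List[str], List[str]]:
    """Return (top, middle, bottom) tickers ranked by market value.

    `stocks` maps ticker -> (company name, market value).
    """
    # Drop ST stocks before ranking.
    eligible = {t: mv for t, (name, mv) in stocks.items() if not is_st(name)}
    ranked = sorted(eligible, key=eligible.get, reverse=True)
    if len(ranked) < 3 * n_per_group:
        raise ValueError("not enough stocks for three disjoint groups")
    top = ranked[:n_per_group]
    bottom = ranked[-n_per_group:]
    # Assumption: the middle group is centred in the ranking.
    mid_start = (len(ranked) - n_per_group) // 2
    middle = ranked[mid_start:mid_start + n_per_group]
    return top, middle, bottom


# Toy universe: ten hypothetical tickers, one of them an ST stock.
toy = {f"S{i}": (f"Company {i}", float(100 - i)) for i in range(9)}
toy["S9"] = ("*ST Example", 500.0)
top, middle, bottom = select_by_market_value(toy, n_per_group=2)
print(top, middle, bottom)
```

With `n_per_group=200` and the full list of listed companies, the same function would yield the three groups of 200 stocks, provided at least 600 non-ST stocks are available.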

Appendix C Event Type

Based on regular patterns in the Chinese A-share stock market, we compiled the types of events that can affect a company’s stock price, arriving at four categories comprising 16 event types. Details are shown in Table 3.

Table 3: Event Types and Definitions

Event Type	Subdivision Type	Definition
Corporate Governance and Equity Changes	Major Asset Restructuring	The process of recombining, adjusting, and allocating the distribution status of enterprise assets among the owners, controllers, and external economic entities.
	Equity Incentive	By conditionally granting employees partial shareholder rights, a sense of ownership is fostered, forming a community of interests with the company.
	Increase or Decrease in Shareholder Holdings	Changes in the shareholder holdings of company stocks.
	Buy-back	The act of a listed company using cash or other means to repurchase its shares from the stock market.
	Circulation of Restricted Stock	Restricted shares become freely tradable in the secondary market after the commitment period.
Financial Reports and Earnings Expectations	Performance Report	Regular preparation by each responsibility center to evaluate and assess performance, serving as the basis for future budget preparation.
Market Behavior and Announcements	Private Placement	Targeted issuance of bonds or stocks to a select group of senior institutional or individual investors.
	Transfer of Shares	Listed companies transfer their provident fund to share capital in proportion or issue bonus shares accordingly.
	Stock Price Fluctuations	Sudden large inflows and outflows of funds lead to increased volatility in stock prices.
	Business Dynamics	Updates on enterprises and their surroundings, using major production and sales information to promote corporate brand and image.
Negative Events and Risk Management	Dispute	Disputes between companies or between companies and individuals.
	Investigation	Filing an investigation signifies a basic determination of illegal facts, allowing for compulsory measures and official initiation of investigation procedures.
	Violation Penalties	Punishments for enterprises violating regulations of regulatory bodies.
	Litigation and Arbitration	Litigation and arbitration for contract disputes and other property rights disputes between enterprises.
	Security	Enterprises providing guarantees for loans and other matters for other enterprises.

Appendix D Models

We selected a total of 23 LLMs, covering financial and general-purpose models oriented toward Chinese and English. Specific details are shown in Table 4.

Table 4: Models in our Framework

Model Name	Chinese-oriented	Model size	FinLLM	Deployment method
MiniCPM-2B	True	2B	False	local
Baichuan-13B	True	13B	False	local
DISC-FinLLM	True	13B	True	local
Baichuan2-7B	True	7B	False	local
Baichuan2-13B	True	13B	False	local
ChatGLM2-6B	True	6B	False	local
ChatGLM3-6B	True	6B	False	local
ChatGLM3-Turbo	True	33B	False	API
GLM-4	True	Unknown	False	API
InternLM2-7B	True	7B	False	local
InternLM2-20B	True	20B	False	local
LLaMA2-7B	False	7B	False	local
LLaMA2-13B	False	13B	False	local
Qwen-7B	True	7B	False	local
Qwen-14B	True	14B	False	local
FinQwen	True	14B	True	local
Qwen-72B	True	72B	False	local
Qwen-max	True	72B	False	API
Xuanyuan-13B	True	13B	True	local
Xuanyuan-70B	True	70B	True	local
Xuanyuan2-6B	True	6B	True	local
GPT-3.5	False	Unknown	False	API
GPT-4	False	Unknown	False	API

Appendix E Formula Proof

Invoking the fundamental principles of expected utility theory, we recognize that a utility function’s curvature reflects an individual’s risk preference. Specifically, a concave utility function ($u^{\prime\prime}(x)<0$) is indicative of risk aversion, while a convex utility function ($u^{\prime\prime}(x)>0$) signifies risk-seeking behavior. A linear utility function ($u^{\prime\prime}(x)=0$), on the other hand, corresponds to risk neutrality.

The expected utility $E[u(x)]$ can be formally represented as:

$$E[u(x)]=\sum_{x(\omega)}u(x(\omega))\,p(x(\omega)) \tag{5}$$

Here, $u(x)$ denotes the utility function, $x(\omega)$ symbolizes the outcome under state $\omega$, and $p(x(\omega))$ is the probability of outcome $x(\omega)$ occurring.

Furthermore, we define the variance of the outcome $x$, $\text{Var}(x)$, as the expected squared deviation from the expected value $E[x]$:

$$\text{Var}(x)=E[(x-E[x])^{2}] \tag{6}$$

Applying the second-order Taylor expansion to the utility function $u(x)$ around the expected value $E[x]$ furnishes us with:

$$u(x)\approx u(E[x])+u^{\prime}(E[x])(x-E[x])+\frac{1}{2}u^{\prime\prime}(E[x])(x-E[x])^{2} \tag{7}$$

Taking expectations on both sides of this approximation, we derive the expected utility approximation:

$$E[u(x)]\approx u(E[x])+u^{\prime}(E[x])E[x-E[x]]+\frac{1}{2}u^{\prime\prime}(E[x])E[(x-E[x])^{2}] \tag{8}$$

Since $E[x-E[x]]=0$, the middle term vanishes, simplifying our expression to:

$$E[u(x)]\approx u(E[x])+\frac{1}{2}u^{\prime\prime}(E[x])\text{Var}(x) \tag{9}$$

Consequently, given the concavity or convexity of the utility function, the sign of the second derivative $u^{\prime\prime}(E[x])$ establishes the nature of risk preference. For a negative second derivative ($u^{\prime\prime}(x)<0$), indicative of risk aversion, the variance term lowers the expected utility, so a smaller variance is preferred. Conversely, for a positive second derivative ($u^{\prime\prime}(x)>0$), characteristic of risk-seeking behavior, the variance term raises the expected utility, so a larger variance is preferred. Risk-neutral individuals ($u^{\prime\prime}(x)=0$) are indifferent to the variance level.

Through this analytical framework, we delineate how the variance of outcomes in conjunction with the utility function’s concavity or convexity guides the determination of an individual’s risk preference.
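As a numerical sanity check on the mean-variance approximation derived in this appendix, the sketch below compares the exact expected utility of a two-outcome gamble with the second-order approximation. The logarithmic utility, the gamble values, and the helper names are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Callable, Sequence


def expected(values: Sequence[float], probs: Sequence[float],
             f: Callable[[float], float] = lambda v: v) -> float:
    """Expectation of f(x) for a discrete distribution."""
    return sum(p * f(v) for v, p in zip(values, probs))


def approx_expected_utility(u: Callable[[float], float],
                            u2: Callable[[float], float],
                            values: Sequence[float],
                            probs: Sequence[float]) -> float:
    """Second-order approximation: u(E[x]) + 0.5 * u''(E[x]) * Var(x)."""
    mean = expected(values, probs)
    var = expected(values, probs, lambda v: (v - mean) ** 2)
    return u(mean) + 0.5 * u2(mean) * var


def log_second_derivative(x: float) -> float:
    """u''(x) for u(x) = ln(x); negative everywhere, i.e. risk-averse."""
    return -1.0 / x ** 2


probs = [0.5, 0.5]
narrow = [90.0, 110.0]   # same mean (100), small variance
wide = [50.0, 150.0]     # same mean (100), large variance

exact_narrow = expected(narrow, probs, math.log)
approx_narrow = approx_expected_utility(math.log, log_second_derivative, narrow, probs)
approx_wide = approx_expected_utility(math.log, log_second_derivative, wide, probs)

print(exact_narrow, approx_narrow, approx_wide)
```

For this concave utility, both gambles have the same mean, yet the wider gamble receives the lower approximate expected utility, which is the risk-averse behavior the derivation predicts.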

Appendix F Framework for Data Construction

This section delineates the structured approach employed in the study to formulate datasets incorporating event news, interactive elements, and risk preference inquiries. Each category of information is meticulously crafted using a distinct template, which is elucidated below.

F.1 Event News Template

The construction of the event news dataset leverages prompt engineering techniques to embed real-world events within a framework that facilitates evaluation, simulating the analytical capabilities of financial experts. The evaluation process involves the model assigning a score to each event based on its potential positive or negative impact on the financial landscape. Initially, the model is instructed to provide an immediate, intuitive score reflecting a ‘fast thinking’ approach.

To augment the depth of analysis and ensure the robustness of the evaluation, the model is further tasked with adopting a ‘slow thinking’ strategy. This entails a comprehensive articulation of the rationale behind the score, encouraging a deliberate and reasoned assessment. The detailed format of this template is illustrated in Figure 8, which guides the model in delivering both the quantitative score and the qualitative reasoning underpinning it.

F.2 Interactions

The interaction data are rewritten using an input method similar to the one used for news events. The specific template is shown in Figure 9.

F.3 Risk-Preference Questionnaire Template

The methodology for assessing risk preferences through structured questions is twofold, designed to discern the inherent risk orientation of the AI model under different conditions. Initially, the model is presented with a set of scenarios where it must select an option that best aligns with its assessed risk profile, simulating an introspective decision-making process. This setup aims to capture the model’s spontaneous risk preferences without external biases.

Subsequently, the experiment introduces a predefined constraint by explicitly characterizing the model as risk-averse within the instructions. This manipulation is intended to observe the adaptability of the model’s responses to altered risk parameters, thereby evaluating its capacity for contextual behavioral adjustment. The layout and content of these questions are encapsulated in the template depicted in Figure 10, which systematically guides the model through the decision-making process under varying risk conditions.
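The two questioning conditions described above can be sketched as a small prompt builder. The instruction wording follows the directive quoted in Appendix H.4; the question text, the options, the answer-format line, and the function name are hypothetical placeholders, since the actual template is the one shown in Figure 10.

```python
import string
from typing import List

# Directive quoted in Appendix H.4 for the instructed condition.
RISK_AVERSE_INSTRUCTION = "You are a risk averse person."


def build_prompt(question: str, options: List[str], risk_averse: bool = False) -> str:
    """Assemble a multiple-choice prompt, optionally prefixed with the persona."""
    lines = []
    if risk_averse:
        lines.append(RISK_AVERSE_INSTRUCTION)
    lines.append(question)
    for label, option in zip(string.ascii_uppercase, options):
        lines.append(f"{label}. {option}")
    # Hypothetical answer-format line, not taken from the paper's template.
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)


# Hypothetical question used only to illustrate the two conditions.
question = "Which investment would you choose?"
options = [
    "A guaranteed return of 5%",
    "A 50% chance of a 12% return and a 50% chance of a 2% loss",
]
direct_prompt = build_prompt(question, options)
instruct_prompt = build_prompt(question, options, risk_averse=True)
print(instruct_prompt)
```

Keeping the question and options identical across both prompts means any change in the chosen option can be attributed to the added instruction.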

Appendix G FinCausal Dataset

In this section, we introduce the construction process of the FinCausal dataset, which requires the acquisition of relevant causal knowledge from past financial text materials. This process can be mainly divided into data collection, deduplication, segmentation, and knowledge extraction. The main flowchart is shown in Figure 11.

First, we crawled 500,000 research reports from the internet, dated from 2020 to 2023, covering individual stock research, industry analysis, and macro analysis. We used regular-expression matching and the FastText language filter for classification, mainly retaining Chinese A-share individual stock research and industry analysis. Since individual stock reports rarely describe causal relationships for negative events, we additionally crawled news analyses and stock forum comments to enrich the individual-stock causal knowledge.

After filtering the research report data, we used the MinHash algorithm to perform document-level deduplication on all content. To extract sentences with causal expressions, we meticulously categorized the content of the research reports into seven distinct types, which included ordinary sentences, causal sentences, news-related content, recommendation ratings, investment advice, risk warnings, and researcher information. We then manually annotated a comprehensive dataset comprising 3,000 pieces of data to train a sophisticated BGE+TextCNN classification model. This model was specifically designed to discern and categorize the various types of sentences present in the financial reports, with a particular focus on identifying those that convey causal relationships.
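The document-level deduplication step can be illustrated with a minimal, standard-library MinHash sketch. The shingle size, the number of hash functions, the similarity threshold, and the pairwise comparison loop are all assumptions made for illustration; the paper does not state its settings, and a pipeline at the scale of 500,000 reports would typically add locality-sensitive hashing rather than compare every pair.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Character k-grams of the whitespace-normalised text."""
    norm = " ".join(text.split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def minhash_signature(text: str, num_perm: int = 64, k: int = 5) -> List[int]:
    """One minimum hash value per seeded hash function."""
    grams = [g.encode("utf-8") for g in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: List[str], threshold: float = 0.8) -> List[int]:
    """Return indices of documents kept after near-duplicate removal."""
    kept, kept_sigs = [], []
    for idx, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(idx)
            kept_sigs.append(sig)
    return kept


docs = [
    "Revenue grew because overseas demand recovered in the second quarter.",
    "Revenue grew  because overseas demand\nrecovered in the second quarter.",
    "The regulator opened an investigation into the company's disclosures.",
]
kept = deduplicate(docs)
print(kept)
```

Character-level shingles are used here because they need no tokenizer, which also makes the sketch applicable to Chinese text.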

For each piece of extracted knowledge from a research report or comment, we concatenated one sentence before and after it into a paragraph. Subsequently, we aggregated all relevant paragraphs from a report to form a comprehensive context. Utilizing our meticulously chosen LLM, we conducted causal knowledge extraction from these contexts. Through this process, we successfully obtained 200,000 pieces of industry causal knowledge and 2,000 pieces of individual stock causal knowledge, thereby enriching our dataset with valuable insights into the causal relationships within the financial domain. Here are some examples of FinCausal:

• The company may consider conducting targeted issuance in order to expand its business scale, make capital expenditures, or invest in research and development.

• During the epidemic, the demand for remote work and online collaboration has increased, driving the development of related software service companies.
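The context-building step described in this appendix (one sentence before and after each causal sentence, then aggregation per report) might look like the following sketch. The function names, the `window` parameter, and the choice to join paragraphs with newlines are assumptions; the indices of causal sentences are taken as given, standing in for the output of the sentence classifier.

```python
from typing import List


def causal_paragraphs(sentences: List[str], causal_idx: List[int],
                      window: int = 1) -> List[str]:
    """Build one paragraph per causal sentence, with `window` neighbours each side."""
    paragraphs = []
    for i in causal_idx:
        lo = max(0, i - window)                   # clip at the start of the report
        hi = min(len(sentences), i + window + 1)  # clip at the end of the report
        paragraphs.append(" ".join(sentences[lo:hi]))
    return paragraphs


def build_context(sentences: List[str], causal_idx: List[int],
                  window: int = 1) -> str:
    """Aggregate all paragraphs of one report into a single context string."""
    return "\n".join(causal_paragraphs(sentences, causal_idx, window))


# Placeholder sentences; indices 1 and 4 are assumed to be classified as causal.
report = ["s0", "s1", "s2", "s3", "s4"]
context = build_context(report, causal_idx=[1, 4])
print(context)
```

The resulting context string is what would then be passed to the LLM for causal knowledge extraction.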

Appendix H Result

H.1 Analysis of Direct News Events

The examination of news events involves a detailed statistical analysis of the responses generated by various Large Language Models (LLMs) to specific news items. This analysis primarily focuses on the distribution of scores assigned by LLMs to each news event, encompassing key statistical measures such as the mean, variance, highest, and lowest scores. Such an approach is instrumental in assessing the consistency and rationality of LLMs’ interpretations of financial news.

To facilitate a comprehensive understanding of these scoring distributions, this section will present violin plots for each news event. Violin plots offer a more nuanced visualization compared to traditional box plots by showing the probability density of the data at different values. This graphical representation will thus provide insights into the spread and skewness of LLMs’ ratings across various news events, enabling a deeper analysis of their evaluative patterns and potential biases.
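The per-event summary statistics mentioned above can be computed with Python's standard library, as in the sketch below. The scores are hypothetical, and the use of population variance is an assumption, since the text does not say whether population or sample variance is reported.

```python
from statistics import mean, pvariance
from typing import Dict, Sequence


def summarize_scores(scores: Sequence[float]) -> Dict[str, float]:
    """Mean, variance, highest, and lowest score for one model on one event."""
    return {
        "mean": mean(scores),
        "variance": pvariance(scores),  # assumption: population variance
        "max": max(scores),
        "min": min(scores),
    }


# Hypothetical scores one model assigns to the same event across companies.
scores = [2, 4, 4, 6]
summary = summarize_scores(scores)
print(summary)
```

A model whose scores for the same event barely move across companies would show a variance close to zero under this summary.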


In the analysis of the initial five events, the focus is placed on news items that encompass both positive and negative performance reports, alongside fluctuations in stock prices. This diverse array of news content allows for a multifaceted examination of each Large Language Model’s (LLM’s) scoring tendencies. Notably, discrepancies in scoring preferences among different LLMs emerge when confronted with this spectrum of financial news.

A systematic statistical analysis is conducted on the scoring outcomes attributed to the positive and negative aspects of these events. This entails a detailed examination of how each LLM assesses the same news piece, shedding light on the variance in their interpretations and the potential implications of their biases. The findings from this analysis are meticulously compiled and presented in Table 5, offering a clear, quantified insight into the LLMs’ evaluative patterns across the selected news events.

Table 5: Model Positive Times

Model	Positive Times
GPT-4	5
InternLM2-20B	5
LLaMA2-13B	5
Qwen-72B	5
FinQwen	4
InternLM2-7B	4
LLaMA2-7B	4
Qwen-max	4
Xuanyuan-13B	4
Baichuan2-13B	3
Baichuan2-7B	3
ChatGLM3-Turbo	3
GLM-4	3
Xuanyuan-70B	3
ChatGLM2-6B	2
ChatGLM3-6B	2
Qwen-14B	2
Qwen-7B	2
GPT-3.5	1

Our analysis involves aggregating the scoring variances observed across all event news for the various Large Language Models (LLMs) under consideration. This comprehensive synthesis not only highlights the diversity in LLM responses but also provides a macroscopic view of their evaluative consistency and potential discrepancies. The aggregated data, which encapsulate the variance in scoring for each news event by different LLMs, are systematically presented in Table 6. This table serves as a pivotal reference point for understanding the range and distribution of LLM evaluations, offering valuable insights into their interpretative frameworks and the reliability of their analyses.

Table 6: Model Variance Comparison

Model	Variance
GLM-4	0.59798884
ChatGLM3-6B	0.707638507
Qwen-72B	0.784471341
Qwen-7B	0.788077699
ChatGLM3-Turbo	1.067120654
Qwen-14B	1.211226324
Qwen-max	1.363195393
MiniCPM-2B	1.409
GPT-4	1.909388332
InternLM2-20B	4.893324616
GPT-3.5	5.277003518
Baichuan-13B	5.998
DISC-FinLLM	6.096
Baichuan2-13B	6.157081681
LLaMA2-13B	6.881628177
InternLM2-7B	7.466014898
FinQwen	9.363445905
ChatGLM2-6B	10.03005785
LLaMA2-7B	10.61274671
Xuanyuan-70B	10.83743438
Xuanyuan2-6B	13.9988
Xuanyuan-13B	19.18007393
Baichuan2-7B	28.10579705

After examining the inherent biases within individual models, we consolidate the findings from each Large Language Model (LLM) to explore their collective or differential biases towards various industries. This step is crucial both for understanding the predispositions of individual models and for discerning any overarching trends or anomalies in their assessments of industry-related news events. By aggregating these results, we aim to delineate the extent to which these models exhibit favorable or adverse biases towards certain industries, thereby shedding light on the potential influence of these biases on the models’ analytical outputs and reliability.

In parallel with the examination of model biases, our study also examines how Large Language Models evolve within distinct families, attributing changes to factors such as model size or version updates. To this end, we use box plots to compare models within the Baichuan, GLM, and LLaMA families. These charts highlight variations in model performance or bias across versions, providing a clear visual representation of progression or shifts in model behavior. The comparative analyses for the Baichuan family, GLM family, and LLaMA family are depicted in Figure 33, Figure 34, and Figure 35, respectively. Through these visual comparisons, we aim to offer insights into how advancements or modifications in model architecture and capabilities influence their analytical outcomes and biases.

H.2 Analysis of COT News

To delve deeper into the underlying factors contributing to potential irrationalities in Large Language Models (LLMs), our investigation extends to the analysis of reasoning outcomes produced with chain-of-thought (COT) prompting. This approach is predicated on the hypothesis that the way LLMs build and use chains of thought during reasoning may shed light on their logical inconsistencies or biases. For this purpose, we selected LLMs that demonstrated the highest, second highest, and lowest levels of performance in response to direct prompts. This selection criterion ensures a comprehensive overview, encompassing a broad spectrum of reasoning capabilities within LLMs.

The focus of this analysis is on the ‘slow thinking’ aspect of model reasoning, where deliberate and methodical processing is emphasized. By examining the variance in reasoning outcomes among these models, we aim to identify patterns or anomalies that might indicate a propensity for irrational decision-making. The results of this analysis, highlighting the variance in cognitive reasoning among the selected LLMs, are systematically presented in Table 7. Through this examination, we seek to uncover the intricacies of cognitive processing in LLMs and their implications for model reliability and rationality.

Table 7: Model Variance Comparison after COT

Model	Direct	COT
GLM-4	0.597988840	5.381799977
Qwen-7B	0.788077699	7.685484073
ChatGLM3-Turbo	1.067120654	5.670563704
MiniCPM-2B	1.409476674	7.038961848
Xuanyuan2-6B	13.99883011	6.668694967
Xuanyuan-13B	19.18007393	17.83545943
Baichuan2-7B	28.10579705	12.65975644
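To make the direction of change in Table 7 easier to read, the sketch below computes the ratio of COT variance to direct variance for each model, using the values from the table. The ratio itself and the names used in the code are an illustrative choice, not a metric defined in the paper.

```python
from typing import Dict, Tuple

# (direct variance, COT variance) copied from Table 7.
TABLE_7: Dict[str, Tuple[float, float]] = {
    "GLM-4": (0.597988840, 5.381799977),
    "Qwen-7B": (0.788077699, 7.685484073),
    "ChatGLM3-Turbo": (1.067120654, 5.670563704),
    "MiniCPM-2B": (1.409476674, 7.038961848),
    "Xuanyuan2-6B": (13.99883011, 6.668694967),
    "Xuanyuan-13B": (19.18007393, 17.83545943),
    "Baichuan2-7B": (28.10579705, 12.65975644),
}

# Ratio > 1: variance grew under COT; ratio < 1: variance shrank.
ratios = {model: cot / direct for model, (direct, cot) in TABLE_7.items()}
increased = sorted(m for m, r in ratios.items() if r > 1)
decreased = sorted(m for m, r in ratios.items() if r < 1)

for model, ratio in ratios.items():
    print(f"{model}: COT/direct variance ratio = {ratio:.2f}")
```

On these figures, the four models with the lowest direct variance show higher variance under COT, while the three models with the highest direct variance show lower variance under COT.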

In addition, we further analyzed news event 5, for which ratings differed significantly among LLMs, and used the topic-modeling method BERTopic to cluster and analyze the reasoning results of the models. Before clustering, the model scores and company-specific information were removed from the reasoning results. The reasoning texts of each model are clustered into 10 categories, and the distribution of scores for each category is as follows.

We further analyze the reasoning texts of the best performing GLM-4 and the worst performing Baichuan2-7B, clustering them into 10 groups with 10 keywords in each group. We then summarize the key vocabulary of the two models and draw word clouds. The results are shown in Figure 39 and Figure 39.

H.3 Analysis of Interactions

We process and analyze the information related to the interaction between the company and shareholders in a similar way.


H.4 Analysis of Risk-preference Questions

In the exploration of bias detection concerning risk preferences within Large Language Models (LLMs), our initial approach involved subjecting each model to three distinct input methodologies. The outcomes of these initial tests, aimed at gauging the models’ inherent risk preferences, are meticulously documented in Table 11. This foundational analysis sets the stage for more nuanced investigations into model behaviors under specific conditions.

Subsequently, our focus shifted to the models’ adherence to explicit instructions regarding risk aversion. By inputting the directive "You are a risk averse person," we were able to quantify each model’s compliance through the risk-averse ratio, the details of which are given in Table 8. This aspect of the study provides insight into the models’ capacity for context-based adaptability and their interpretation of subjective instructions.

Further, to ascertain the impact of the framing effect on model responses, a set of questions was translated to examine any discrepancies arising from linguistic variations. The findings from this segment of the study, highlighting the influence of translation on model outputs, are presented in Table 9. This analysis contributes to understanding the potential for framing effects to skew model perception and decision-making processes.

Lastly, we examined the models’ susceptibility to loss aversion by introducing scenarios framed around loss. The extent of loss aversion bias manifesting in the models’ responses was analyzed, with the summarized results shown in Table 10. Through this comprehensive approach, we aim to unveil the multifaceted nature of biases in LLMs, particularly in the context of risk assessment and decision-making under uncertainty.

Table 8: Model Instruct Risk-aversion Performance Comparison

Model	Risk-aversion (%)
GPT-4	89.5
Qwen-max	83.5
GLM-4	88.0
Qwen-72B	79.0
ChatGLM3-Turbo	62.5
Xuanyuan-70B	59.5
Qwen-14B	66.0
InternLM2-7B	42.5
Baichuan2-13B	44.0
FinQwen	45.0
Xuanyuan-13B	43.0
ChatGLM3-6B	53.0
InternLM2-20B	53.0
Qwen-7B	52.0
Baichuan2-7B	38.0
GPT-3.5	37.5
ChatGLM2-6B	35.0

Table 9: Models Translation Prompt Differences Comparison

Model	Difference (%)
GPT-4	23.5
ChatGLM3-6B	25.0
Qwen-max	28.5
GLM-4	28.0
Xuanyuan-70B	36.0
GPT-3.5	33.0
Qwen-7B	42.0
InternLM2-7B	43.5
ChatGLM3-Turbo	46.5
FinQwen	49.0
ChatGLM2-6B	51.0
Xuanyuan-13B	51.5
Baichuan2-13B	54.5
InternLM2-20B	48.5
Qwen-72B	56.0
Baichuan2-7B	59.0
Qwen-14B	65.0

Table 10: Model Loss Aversion Bias Comparison

Model	Risk-aversion (%)
Qwen-14B	51.0
GPT-3.5	52.0
FinQwen	57.5
Baichuan2-7B	58.5
InternLM2-7B	60.5
Qwen-max	62.0
Baichuan2-13B	63.0
InternLM2-20B	63.0
ChatGLM2-6B	61.5
Qwen-72B	65.5
Xuanyuan-13B	66.5
GLM-4	69.0
ChatGLM3-Turbo	74.0
Xuanyuan-70B	74.5
Qwen-7B	72.5
ChatGLM3-6B	75.0
GPT-4	84.0

Table 11: LLMs risk preference statistics

Model	Method	Risk Averter	Risk Neutral	Risk Lover
Baichuan2-7B	Direct	67	35	98
Baichuan2-7B	Instruct	76	24	100
Baichuan2-7B	Translation	67	58	75
Qwen-72B	Direct	81	79	40
Qwen-72B	Instruct	160	32	8
Qwen-72B	Translation	85	45	70
Qwen-14B	Direct	52	46	102
Qwen-14B	Instruct	134	27	39
Qwen-14B	Translation	112	15	73
GLM-4	Direct	88	32	80
GLM-4	Instruct	178	12	10
GLM-4	Translation	92	39	69
ChatGLM2-6B	Direct	73	13	114
ChatGLM2-6B	Instruct	70	17	113
ChatGLM2-6B	Translation	101	23	76
ChatGLM3-6B	Direct	100	18	82
ChatGLM3-6B	Instruct	107	21	72
ChatGLM3-6B	Translation	99	28	73
Xuanyuan-70B	Direct	99	24	77
Xuanyuan-70B	Instruct	120	24	56
Xuanyuan-70B	Translation	90	20	90
InternLM2-7B	Direct	71	70	59
InternLM2-7B	Instruct	86	69	45
InternLM2-7B	Translation	81	63	56
Baichuan2-13B	Direct	76	16	108
Baichuan2-13B	Instruct	89	32	79
Baichuan2-13B	Translation	61	39	100
Qwen-7B	Direct	95	27	78
Qwen-7B	Instruct	105	28	67
Qwen-7B	Translation	106	24	70
InternLM2-20B	Direct	76	76	48
InternLM2-20B	Instruct	108	63	29
InternLM2-20B	Translation	73	78	49
ChatGLM3-Turbo	Direct	98	45	57
ChatGLM3-Turbo	Instruct	127	48	25
ChatGLM3-Turbo	Translation	62	85	53
GPT-3.5	Direct	54	35	111
GPT-3.5	Instruct	75	24	101
GPT-3.5	Translation	56	27	117
FinQwen	Direct	65	42	93
FinQwen	Instruct	91	34	75
FinQwen	Translation	101	46	53
GPT-4	Direct	118	41	41
GPT-4	Instruct	181	11	8
GPT-4	Translation	128	39	33
Xuanyuan-13B	Direct	83	26	91
Xuanyuan-13B	Instruct	86	24	90
Xuanyuan-13B	Translation	106	13	81
Qwen-max	Direct	74	89	37
Qwen-max	Instruct	169	26	5
Qwen-max	Translation	99	62	39
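The raw counts in Table 11 can be converted into shares of risk-averse, risk-neutral, and risk-seeking answers, as in the sketch below. Only a few rows are copied from the table for illustration, and the helper name is hypothetical. These shares are computed solely from the Table 11 counts and are not meant to reproduce the percentages in Tables 8 to 10, whose exact computation is not specified here.

```python
from typing import Dict, Tuple

# (risk averter, risk neutral, risk lover) counts copied from Table 11.
TABLE_11_SAMPLE: Dict[Tuple[str, str], Tuple[int, int, int]] = {
    ("GPT-4", "Direct"): (118, 41, 41),
    ("GPT-4", "Instruct"): (181, 11, 8),
    ("ChatGLM2-6B", "Direct"): (73, 13, 114),
    ("ChatGLM2-6B", "Instruct"): (70, 17, 113),
}


def shares(counts: Tuple[int, int, int]) -> Tuple[float, float, float]:
    """Convert answer counts into proportions of all answers."""
    total = sum(counts)
    return tuple(c / total for c in counts)


for (model, method), counts in TABLE_11_SAMPLE.items():
    averse, neutral, lover = shares(counts)
    print(f"{model} ({method}): averse={averse:.3f}, "
          f"neutral={neutral:.3f}, lover={lover:.3f}")
```

In these sample rows, the risk-averse share of GPT-4 rises sharply once the instruction is added, whereas ChatGLM2-6B's distribution stays almost unchanged.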
