
xFinance: How to outperform BloombergGPT without breaking the bank


Published: May 3, 2023

10 min read

Despite being 4x smaller and budget-friendly, xFinance has outperformed BloombergGPT on financial NLP tasks by a significant margin.


We have recently developed an analog of BloombergGPT using our very own xTuring library and data scraped from the internet. Our aim was to create a cost-effective solution for financial NLP tasks without sacrificing performance.


Introduction

At Stochastic, we're always looking for ways to push the boundaries of what's possible in the field of artificial intelligence. With the recent buzz around BloombergGPT, a 50-billion parameter large language model designed specifically for the finance industry, we set out to see if we could achieve similar results using an open-source model and a modest budget of $1000.

What we found surprised us: not only did we achieve better results on finance tasks than BloombergGPT, but we also discovered a formula for fine-tuning open-source models for specific domains at a fraction of the cost.

With xFinance, we've developed an approach for fine-tuning LLMs that allows for continual or incremental learning, without the models losing their prior training. This technique empowers us to construct automated pipelines for fine-tuning open-source models with domain-specific data as it becomes available. By leveraging this, we're able to ensure that the models remain up-to-date and perform exceptionally, delivering unparalleled results to our clients.

xFinance

We created xFinance, a 13-billion parameter model fine-tuned on an open-source model using LoRA. Our goal was to show that it is possible to achieve impressive results in financial NLP tasks without breaking the bank. We put xFinance to the test on popular open-source finance tasks like Financial Phrasebank (FPB), FiQA SA, and Headline, and the results speak for themselves. We were able to achieve a better F1 score than BloombergGPT on these tasks, despite the significant difference in model size and cost.
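Much of the cost advantage comes from LoRA itself: instead of updating a full weight matrix, only two small low-rank matrices are trained while the pretrained weights stay frozen. A minimal, self-contained sketch of the idea follows; the dimensions are toy values for illustration, not xFinance's actual layer shapes.

```python
# LoRA sketch: for a frozen weight W of shape (d_out, d_in), train only
# A (rank, d_in) and B (d_out, rank); the adapted weight is W + B @ A.
# All dimensions below are illustrative, not the real model's.
d_in, d_out, rank = 4096, 4096, 8  # rank << d_in, d_out

full_params = d_in * d_out                # parameters updated by full fine-tuning
lora_params = rank * d_in + d_out * rank  # parameters updated by LoRA
print(f"full fine-tune params: {full_params:,}")
print(f"LoRA trainable params: {lora_params:,}")
print(f"reduction factor:      {full_params // lora_params}x")

# Tiny concrete example (2x2 weight, rank 1) of the adapted weight W + B @ A.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight
A = [[0.5, 0.5]]               # trainable down-projection, shape (1, 2)
B = [[2.0], [0.0]]             # trainable up-projection,   shape (2, 1)

def adapted(i, j):
    """Entry (i, j) of the effective weight W + B @ A."""
    return W[i][j] + sum(B[i][r] * A[r][j] for r in range(len(A)))

print([[adapted(i, j) for j in range(2)] for i in range(2)])
# -> [[2.0, 1.0], [0.0, 1.0]]
```

At rank 8 on a 4096×4096 layer, LoRA trains roughly 256x fewer parameters than full fine-tuning, which is why a 13B model fits a four-figure budget.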

Dataset

In training our xFinance model, we utilized two distinct datasets: the text dataset and the instruction dataset. The text dataset consists of raw financial text data for unsupervised fine-tuning. The instruction dataset, on the other hand, was generated using our dataset generation flow in the xTuring library for the purpose of instruction fine-tuning. By combining these two datasets, we aimed to provide our model with a robust training corpus that can capture both domain-specific and general-purpose text. In the following section, we provide detailed information about the datasets and the methodology used to construct them.
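To make the two-dataset split concrete, here is what the record shapes might look like. The field names ("instruction", "input", "output") follow the common Alpaca-style convention and are illustrative assumptions, as is the sample headline — not xTuring's exact schema or our actual data.

```python
# Hypothetical examples of the two dataset types described above.
# Unsupervised fine-tuning consumes raw text; instruction fine-tuning
# consumes (instruction, input, output) triples rendered into prompts.
text_dataset = [
    "Acme Corp reported Q4 revenue of $1.2B, beating analyst estimates.",
]

instruction_dataset = [
    {
        "instruction": "Classify the sentiment of this financial headline.",
        "input": "Acme Corp reported Q4 revenue of $1.2B, beating analyst estimates.",
        "output": "positive",
    },
]

def render_prompt(rec):
    """Render one instruction record into a training prompt (assumed template)."""
    return (
        f"### Instruction:\n{rec['instruction']}\n"
        f"### Input:\n{rec['input']}\n"
        f"### Response:\n{rec['output']}"
    )

print(render_prompt(instruction_dataset[0]))
```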

Text dataset

The text dataset contains various types of English-language financial documents from 2022 and 2023, such as news articles, technical reports, filings, press releases, financial documents scraped from the internet, and social media posts. We divided the data into three sets for continual learning. The first set consists of data up to January 2023. The second set combines a sample of data from before February 2023 with February 2023 data at a ratio of 5:1, and the third set combines a sample of data from before March 2023 with March 2023 data at the same 5:1 ratio. Our training corpus is designed for both domain-specific and general-purpose text, and we improved its quality by removing duplicates. We present a detailed breakdown of the entire training set in Table 1; the figures may differ from those in other papers due to the de-duplication process.
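The rolling 5:1 mix described above can be sketched as follows: each monthly update pairs a sample of all earlier data with the newest month's data at a ratio of 5:1 (older : new), so the model sees fresh data without forgetting the old. Document counts here are made up for illustration.

```python
import random

random.seed(42)

def build_update_set(older_docs, new_month_docs, ratio=5):
    """Sample up to ratio * len(new_month_docs) documents from the older
    corpus and combine them with the new month's documents (5:1 by default)."""
    k = min(len(older_docs), ratio * len(new_month_docs))
    return random.sample(older_docs, k) + list(new_month_docs)

# Hypothetical corpus sizes, purely for illustration.
older = [f"doc_pre_feb_{i}" for i in range(10_000)]
february = [f"doc_feb_{i}" for i in range(1_000)]

update = build_update_set(older, february)
print(len(update))  # -> 6000: 5,000 sampled older docs + 1,000 new docs
```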


Harvard Innovation Labs


125 Western Ave


Boston, MA 02163

© Copyright 2024 Stochastic. All rights reserved.
