LLM Data Cost Breakdown: All You Need to Know About Data Costs for Training an LLM

1. Introduction

With the rapid development of Large Language Models (LLMs), more and more enterprises are considering applying LLMs to their actual businesses. However, in the process of implementing LLMs, data cost is often an important factor that cannot be ignored. As decision-makers, understanding the data requirements and related costs of LLM training at various stages is crucial for the successful implementation of the project.

ABAKA AI will take you on an in-depth exploration of the three key stages of LLM training: Pre-training, Supervised Fine-tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF), analyzing the data requirement characteristics of each stage and their impact on costs. We will provide a detailed interpretation of the composition of LLM data costs from multiple dimensions such as data volume, data quality, and data diversity, as well as how to optimize data investment while ensuring model performance.

Whether you are a corporate executive just starting to explore LLM applications or a technical leader with an established track record in AI, we will draw on our past experience to provide a comprehensive, practical framework for assessing LLM data costs, helping you navigate AI implementation decisions with confidence.

2. Pre-training Stage

2.1. Dataset Scale Estimation

Estimating the required pre-training dataset size given a computational budget C is the first step in implementing an LLM project. This process involves different Scaling Laws, the most famous of which are OpenAI’s Scaling Law and DeepMind’s Chinchilla Law.

OpenAI’s research published in 2020 proposed the initial Scaling Laws, indicating a power-law relationship between model performance and model parameter count, dataset size, and computational resources. However, the Chinchilla Law proposed by DeepMind in 2022 revised this, arguing that model size and training data volume should be scaled in roughly equal proportion, which works out to roughly 20 training tokens per model parameter.

OpenAI Scaling Law:

$L ( N, D )=\left[ \left( \frac{N_{c}} {N} \right)^{\frac{\alpha_{N}} {\alpha_{D}}}+\frac{D_{c}} {D} \right]^{\alpha_{D}}$

DeepMind Scaling Law:

$\hat{L} ( N, D ) \triangleq E+\frac{A} {N^{\alpha}}+\frac{B} {D^{\beta}}$

The formulas represent the relationship between model performance ($L$ or $\hat{L}$) and model parameter count ($N$) and dataset size ($D$).

These two formulas represent different understandings and modeling methods of LLM scaling behavior. In practical applications, we often need to balance between model size and data volume. For example, to reduce inference costs, we can consider using smaller models with more data. Research by Hoffmann et al. [1] shows that under a fixed computational budget, a well-trained small model may perform better than an undertrained large model. Specifically, if we originally planned to train an 8B parameter model but want to reduce inference costs, we can consider replacing it with a model with fewer parameters (such as 7B) while increasing the amount of training data. This approach may not only maintain or even improve model performance but also significantly reduce deployment and operational costs.
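To make the trade-off concrete, here is a minimal Python sketch of compute-optimal budgeting under the Chinchilla parametric fit, using the constants published by Hoffmann et al. [1] ($E=1.69$, $A=406.4$, $B=410.7$, $\alpha=0.34$, $\beta=0.28$) and the common approximation $C \approx 6ND$. The compute budget in the example is an arbitrary illustrative figure; your own architecture and tokenizer may fit different constants.

```python
# A minimal sketch of Chinchilla-style budgeting (Hoffmann et al. [1], parametric fit).
# The fitted constants below are the published estimates; treat them as indicative,
# not as a guarantee for your own setup.
E, A, B = 1.69, 406.4, 410.7      # irreducible loss and scale coefficients
alpha, beta = 0.34, 0.28          # exponents for parameters (N) and tokens (D)

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss L(N, D) under the Chinchilla parametric fit."""
    return E + A / n_params**alpha + B / n_tokens**beta

def compute_optimal(flops_budget: float) -> tuple[float, float]:
    """Split a FLOPs budget C ~ 6*N*D into compute-optimal N and D."""
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    a, b = beta / (alpha + beta), alpha / (alpha + beta)
    n_opt = G * (flops_budget / 6) ** a
    d_opt = (1 / G) * (flops_budget / 6) ** b
    return n_opt, d_opt

if __name__ == "__main__":
    C = 1e23  # example compute budget in FLOPs (assumed figure for illustration)
    n_opt, d_opt = compute_optimal(C)
    print(f"~{n_opt/1e9:.1f}B params, ~{d_opt/1e9:.0f}B tokens, "
          f"predicted loss {loss(n_opt, d_opt):.3f}")
    # Trading size for data: a smaller model trained on more tokens is cheaper to serve.
    d_7b = C / (6 * 7e9)
    print(f"7B model on the same budget -> {d_7b/1e9:.0f}B tokens, "
          f"predicted loss {loss(7e9, d_7b):.3f}")
```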

The first step, data budgeting, is crucial: it determines both the model size and the size of the pre-training dataset you need. ABAKA AI can build high-quality datasets for you and, drawing on a large inventory of stock data, match the most suitable data to your specific needs.

Data scraping capabilities of ABAKA AI

2.2. Multi-domain Data Ratio

The pre-training corpus can include various types of text data, such as web pages, academic materials, books, and relevant texts from different fields, such as legal documents, annual financial reports, medical textbooks, and other domain-specific data. In the pre-training stage, LLMs learn broad knowledge from massive unlabeled text data and store it in model parameters, thus acquiring a certain level of language understanding and generation capabilities.

A general pre-training corpus is a large-scale dataset composed of a large amount of text from different domains and sources. Research by Liu, Yang et al. [2] divides general data into eight major categories: web pages, language text, books, academic materials, code, parallel corpora, social media, and encyclopedias. In the pre-training process of the model, the diversity and quality of data are crucial, so careful design of the ratio of these different categories of data is needed when constructing the pre-training dataset.

  1. Web data: Web data is one of the most widely used sources of pre-training data. The data usually exists in Hypertext Markup Language (HTML) format, showing certain structural features, and is rich in topics, covering content from different fields and disciplines. However, web data may also contain noise and low-quality content, so careful screening and cleaning are required.

  2. Language text: Language text data mainly consists of two parts. The first part is electronic text data built from a wide range of written and spoken language sources, usually presented as large corpora of specific languages; the second part is electronic text data built from written materials in particular fields or on particular topics. For example, FinGLM covers annual reports of some listed companies from 2019 to 2021; this type of data belongs to language text materials in the financial field.

  3. Books: Book data is also one of the common data types in pre-training corpora. Compared with web pages, books have longer text content and higher data quality, both of which help improve the performance of large language models. Book data provides knowledge with both depth and breadth, allowing models to improve understanding ability and knowledge reserve while learning deeper contextual information.

  4. Academic materials: Academic material data refers to text data related to academic fields, including but not limited to academic papers, journal articles, conference papers, research reports, patents, etc. These data are written and published by experts and scholars in academia, with high professionalism and academic rigor. Including them in pre-training corpora can provide more accurate and professional information, helping models understand terminology and knowledge within academic fields. Academic literature, papers, and textbooks provide examples of professional and technical language use, as well as the latest scientific discoveries. This type of data is particularly important for improving model performance in professional fields.

  5. Code: The code data category refers to text information containing programming languages, such as Python, Java, C++, and other code snippets. Its purpose is to help models better understand programming languages and code structures. Code datasets can not only enhance programming capabilities but may also improve logical reasoning abilities. This type of data enables LLMs to understand and generate code in various programming languages, providing support for software development and code analysis tasks.

  6. Parallel corpora: Parallel corpus data refers to a collection of text or sentence pairs in different languages. These text pairs are translations of each other, where one text is in the source language (e.g., English) and the corresponding text is in the target language (e.g., Chinese). The introduction of parallel corpus data is crucial for improving the machine translation capabilities and cross-lingual task performance of large language models.

  7. Social media: Social media data refers to text content collected from various media platforms, mainly including user-generated posts, comments, and conversations between users, reflecting informal, colloquial language use. It contains a large amount of slang, new words, and diverse expressions. Although social media data may contain harmful information such as bias, discrimination, and violence, it is still crucial for the pre-training of large language models. This is because social media data is beneficial for models to learn expressive abilities in conversational communication and capture social trends, user behavior patterns, etc.

  8. Encyclopedia: Encyclopedia data refers to text information extracted from encyclopedias, online encyclopedia websites, or other knowledge databases. Data from online encyclopedia websites is written and edited by experts, volunteers, or community contributors, with a certain degree of authority and reliability. Due to its easy accessibility, it is included in pre-training corpora at a higher frequency, becoming a cornerstone for enhancing the knowledge base of large language models.

Reasonably configuring this pre-training data can significantly improve the performance and applicability of LLMs. The quality and diversity of data are often more important than the sheer volume of data. Based on the need for high-quality, multi-domain data ratios, ABAKA AI carefully considers the characteristics and value of each type of data when designing pre-training datasets, adjusting the ratio according to your specific needs to help you achieve high-quality and precise pre-training dataset ratios, reducing model training costs.
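As a back-of-the-envelope illustration of ratio design, the sketch below turns a target token budget into per-category token counts; the category weights and the 1T-token budget are illustrative assumptions, not a recommended recipe.

```python
# A minimal sketch of turning a target token budget into per-category token counts.
# The mixture weights below are illustrative, not a recommended recipe.
mixture = {           # fraction of the pre-training corpus per data category
    "web pages": 0.50,
    "books": 0.12,
    "academic": 0.10,
    "code": 0.10,
    "encyclopedia": 0.05,
    "social media": 0.05,
    "parallel corpora": 0.04,
    "language text": 0.04,
}
assert abs(sum(mixture.values()) - 1.0) < 1e-9  # weights must sum to 1

target_tokens = 1_000_000_000_000   # e.g. a 1T-token pre-training budget (assumed)
for category, weight in mixture.items():
    print(f"{category:>16}: {weight * target_tokens / 1e9:,.0f}B tokens")
```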

Distribution of data types in the corpora used by different models during pre-training

This image shows the distribution of data types in the corpora used by different models during pre-training. Each pie chart represents a model and indicates the proportions of various data types. Different data types are distinguished by different colors, including web pages, code, encyclopedias, books, academic materials, social media, language text, and diverse data.

2.3. Training Data Acquisition

Although open-source datasets provide a foundation for model training, much of the truly valuable and unique data never appears in public channels. Therefore, targeted crawling of data from specific domains or sources has become a key strategy for improving model performance and competitiveness, and acquiring this data is well worth the effort. In terms of high-quality training data acquisition, ABAKA AI can provide deeper insights and more timely, more unique data through targeted acquisition, helping you improve model performance and accuracy in vertical domains and strengthen the model’s grasp of the latest information and trends.

Channels for targeted data acquisition usually include data crawling, commercial database subscriptions, and data cooperation and exchange. Aside from web crawling, the other channels are highly bespoke, so this section only discusses data crawling. Data crawling does not place high demands on infrastructure, so in the following calculations we only consider development-related costs.

Before development, the more important task is to choose suitable data sources: crawling from the right sources can significantly improve the model’s performance in specific domains. Once the data source is determined, development and crawling costs mainly come from the following aspects:

  1. Development cost

$Budget_{dev} = (S_{dev} × D_{initial}) + (S_{dev} × D_{update})$

Where $S_{dev}$ is the developer’s daily rate, and $D_{initial}$ and $D_{update}$ are the time for initial development and for updating the crawling code after website changes, respectively. The complexity of the website, its verification mechanisms, request complexity, and so on will all affect the development time.

  2. Maintenance cost

$Budget_{ ops} = S_{ops} × D_{crawl} × α$

Where $S_{ops}$ is the maintenance engineer’s daily rate and $D_{crawl}$ is the crawl duration (days). Maintenance may not be full-time, so a coefficient $α (0 < α ≤ 1)$ is introduced to represent the actual proportion of maintenance time needed. If the data needs continuous updating or the crawling period is very long, maintenance personnel must keep the crawler running normally and respond to website changes. If the crawler system uses a distributed strategy, more maintenance support may be needed.

  3. IP proxy pool

$Budget_{ip} = \frac{N_{req}}{N_{\text{req per IP}}} \times C_{ip}$

Where $N_{req}$ is the total number of requests, $N_{\text{req per IP}}$ is the number of requests each IP can handle, and $C_{ip}$ is the unit price of each IP. Factors such as the website’s IP restriction policy, total data volume, IP quality, IP geographic location requirements, proxy type, etc. will affect the price.

  4. Crawling material cost

$Budget_{mat} = C_{mem} \times N_{mem} \times \frac{D_{crawl}}{D_{\text{mem validity}}}$

Where $C_{mem}$ is the price of a single membership, $N_{mem}$ is the number of memberships required, and $D_{\text{mem validity}}$ is the validity period of a membership (in days). Factors such as membership tier and concurrency strategy will affect the final budget. If the target website requires registration or a membership to download, this cost needs to be considered.

So overall:

$Budget_{total} = S_{dev} \times (D_{initial} + D_{update}) + S_{ops} \times D_{crawl} \times \alpha + \frac{N_{req}}{N_{\text{req per IP}}} \times C_{ip} + \frac{C_{mem} \times N_{mem} \times D_{crawl}}{D_{\text{mem validity}}}$
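Putting the pieces together, the sketch below turns the total crawling budget formula into a small calculator; every rate and volume in the example call is a hypothetical placeholder, not an ABAKA AI quote.

```python
# A minimal sketch of the total crawling budget above. Every figure is a
# hypothetical placeholder; substitute your own rates and volumes.
def crawl_budget(
    s_dev: float,        # developer day rate (USD/day)
    d_initial: float,    # initial development time (days)
    d_update: float,     # rework time after site changes (days)
    s_ops: float,        # maintenance engineer day rate (USD/day)
    d_crawl: float,      # total crawl duration (days)
    alpha: float,        # fraction of maintenance time actually needed, 0 < alpha <= 1
    n_req: float,        # total number of requests
    n_req_per_ip: float, # requests one proxy IP can absorb before rotation
    c_ip: float,         # unit price per proxy IP (USD)
    c_mem: float = 0.0,  # price of one membership (USD), if the site requires it
    n_mem: int = 0,      # number of memberships
    d_validity: float = 30.0,  # membership validity (days)
) -> float:
    dev = s_dev * (d_initial + d_update)
    ops = s_ops * d_crawl * alpha
    proxies = (n_req / n_req_per_ip) * c_ip
    memberships = c_mem * n_mem * (d_crawl / d_validity) if n_mem else 0.0
    return dev + ops + proxies + memberships

# Example: a mid-difficulty vertical-domain site (all numbers illustrative).
print(crawl_budget(s_dev=400, d_initial=10, d_update=3, s_ops=350, d_crawl=60,
                   alpha=0.2, n_req=5_000_000, n_req_per_ip=20_000, c_ip=0.5,
                   c_mem=30, n_mem=2, d_validity=30))   # ~9,600 USD
```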

Generally speaking, a vertical domain website costs between 1,500 and 15,000 USD depending on the difficulty, with large social networking sites costing more. ABAKA AI can provide you with deeper insights and more timely, more unique, higher quality data, and reduce the total acquisition cost by 70%, helping you train excellent large language models across various dimensions.

2.4. Document Information Extraction

A large amount of high-quality LLM pre-training data exists in the form of PDFs or scanned images. Due to the diversity of layouts and formats and the varying quality of scanned images, utilizing this data to build datasets is a challenging task, requiring the conversion of this content into data formats like markdown for use. The core problems mainly focus on two aspects: extracting content information and layout information (including body text, titles, figure captions, images, tables, formulas) and handling the relationships between layout elements.

When processing multiple open-source datasets, ABAKA AI observed several excellent open-source solutions, such as PP-StructureV2, Marker, Vary, and Nougat, but they each have shortcomings. PP-StructureV2 cannot identify LaTeX format content and lacks necessary post-processing steps; Marker covers fewer languages and doesn’t handle figures well; Nougat has limited support for multi-column data and can identify limited languages, while Vary / Vary-toy consumes more computational resources.

Based on these situations, ABAKA AI, as a member of the Multimodal Art Projection (M-A-P) team, fully participated in building the completely open-source large language model MAP-Neo, which also open-sourced the Document Convert Pipeline. This pipeline can better balance performance and computational overhead, while the decoupling between modules brings better interpretability and makes it easier to upgrade, add, and replace different modules, providing a more flexible, efficient, and CPU-friendly solution.

In addition to using models for conversion, many vendors provide similar services, such as mathpix, Doc2x, Paodin PDFlux, pix2text, X Information, X Xun Cloud Large Model Knowledge Engine Document Parsing, etc. Therefore, we provide two ways to calculate costs below:

  1. Self-built conversion service cost

$Budget_{convert} = (\frac{N_{pages}}{R_{process}}) × C_{node} × (1 + F_{complexity}) + C_{integration}$

Where $N_{pages}$ is the total number of pages, $R_{process}$ is the number of pages processed per node per day, $C_{node}$ is the price per node per day, and $F_{complexity}$ is the document complexity factor ($0 ≤ F_{complexity} ≤ 1$). Generally speaking, the layouts and fonts of magazines and newspapers are more complex, while literature and patents contain richer images and tables; these factors need to be considered when setting budgets. $C_{integration}$ is the cost of deployment, strategy/model updates, and maintenance, and it will vary greatly depending on the task.

  2. Third-party service cost

$Budget_{convert} = \sum_{i=1}^{n} C_{tier,i} \times N_{pages,i} + C_{integration}$

Where $n$ is the number of price tiers, $C_{tier,i}$ is the price per page for the i-th tier, $N_{pages,i}$ is the number of pages in the i-th tier, $C_{integration}$ is the cost of API integration and maintenance.

The choice between these methods depends on multiple factors, including the number and type of documents, required conversion quality, availability of internal resources, and budget constraints. In fact, in most cases, easy data is converted using one’s own servers, while difficult data uses commercial-grade services.
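As a rough way to compare the two options, the sketch below implements both budget formulas side by side; the throughputs, node prices, tier prices, and integration costs in the example are assumptions for illustration only.

```python
# A minimal sketch comparing the two document-conversion budgets above.
# All prices and throughputs are assumptions for illustration only.
def self_built_cost(n_pages: int, pages_per_node_day: int, node_price_day: float,
                    complexity: float, integration: float) -> float:
    """Budget_convert for a self-hosted pipeline (0 <= complexity <= 1)."""
    node_days = n_pages / pages_per_node_day
    return node_days * node_price_day * (1 + complexity) + integration

def third_party_cost(tiers: list[tuple[float, int]], integration: float) -> float:
    """Budget_convert for a paid API with tiered pricing: [(price_per_page, pages), ...]."""
    return sum(price * pages for price, pages in tiers) + integration

n_pages = 10_000_000
print(self_built_cost(n_pages, pages_per_node_day=50_000, node_price_day=30,
                      complexity=0.3, integration=3_000))            # ~10,800 USD
print(third_party_cost([(0.025, 1_000_000), (0.01, 9_000_000)],
                       integration=1_000))                           # ~116,000 USD
```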

2.5. Training Data Cleaning

Although the raw data obtained through web crawling, document conversion, and open-source datasets provides a foundation for model training, this data usually contains noise, errors, biases, and false information, which will reduce the training effectiveness of the model. Therefore, data cleaning becomes a key step in improving model performance and reliability. To obtain high-quality data, ABAKA AI can provide you with cleaner and more refined data cleaning, significantly improving data quality, thereby enhancing the model’s performance on specific tasks, strengthening the model’s ability to understand complex patterns, and reducing misleading learning due to data issues.

Fineweb data cleaning pipeline

Before starting to clean, the more important task is to formulate appropriate cleaning strategies. This requires a thorough understanding of data characteristics, model requirements, and potential data quality issues, and the strategy should take into account data scale, complexity, domain characteristics, and so on. For cost estimation, take the Matrix dataset of the MAP-Neo large model, built with the participation of ABAKA AI and Ge Zhang et al. [3], as an example: the released Matrix dataset contains 4.7T tokens and is arguably one of the highest quality and largest bilingual datasets available. The general approach to cleaning the Matrix dataset follows the principles of “from coarse to fine” and “from simple to complex”. We can divide the cleaning steps into the following main stages:

  1. Heuristic filtering: Heuristic rule filtering is the first line of defense, aimed at quickly identifying and deleting low-quality data. This step has low computational cost but can significantly reduce the amount of data for subsequent processing. Filtering criteria include: URLs; a blacklist word table; a gibberish text filter; document length; the proportion of special characters; the proportion of short, continuous, or incomplete lines; and repeated words, n-grams, or paragraphs. The filtering thresholds are based on statistical analysis of large document samples. Heuristic rules can effectively identify and remove low-quality data, preventing low-quality pre-training corpora from affecting model performance. Because the team used composite data from many sources, it designed dedicated cleaning methods and tailored rules for each source to keep data quality consistent.

  2. Data deduplication: Many studies have shown that repetitive text may degrade model performance, making deduplication a key step in corpus processing (although this point is somewhat controversial: heavily repeated data may in fact signal that this portion of the data is high quality, which is an important feature. For example, Fineweb’s view is that more deduplication does not necessarily mean better performance; if deduplication is performed across dumps, performance may actually get worse).

    a. Exact duplication: Exact document deduplication checks whether an entire text is completely identical to another text; if so, the duplicate is deleted. Due to the large amount of data, clusters must be used for processing, and out-of-memory problems may occur. In practice, we store text data in batches in different storage buckets and then process each bucket in turn to remove duplicates.

    b. Near-duplicate: For near-duplicates, we use the MinHash LSH deduplication method to remove them as thoroughly as possible. It is particularly suitable for web data and is widely used for similarity search and duplicate detection in large datasets, handling the very common case where the page content is essentially the same but the scattered template blocks of the web pages differ. The principle of MinHash is to represent a set with a small number of hash values, which can then be used to estimate the Jaccard similarity between two sets (a minimal sketch of this step appears after this list). The computational cost of this step is still fairly high.

    c. Similar lines: To address the same content appearing multiple times within a text, a direct method is to split the text into lines using specific delimiters and then compare the similarity between lines; if they are similar, the later lines are deleted.

    d. In addition, paragraph deduplication and substring deduplication were also performed to achieve better results.

  3. Quality screening: After data cleaning, Fineweb-edu used the Llama3-70B-Instruct model to score the data and trained a BERT-like classification model, which was then used to filter the data, greatly improving data quality. In addition to using models for quality screening, many developers use fastText models for language identification when cleaning CC datasets.
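For step 2b above, here is a minimal near-duplicate detection sketch in the MinHash LSH style, written with the third-party datasketch package for brevity. It is not the MAP-Neo production pipeline; the 0.8 Jaccard threshold and 5-character shingles are illustrative choices.

```python
# A minimal near-duplicate detection sketch (step 2b), assuming the third-party
# `datasketch` package; thresholds and shingle sizes are illustrative.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128, ngram: int = 5) -> MinHash:
    """Build a MinHash signature from character n-gram shingles of a document."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - ngram + 1, 1)):
        m.update(text[i:i + ngram].encode("utf8"))
    return m

docs = {
    "a": "The quick brown fox jumps over the lazy dog near the river bank.",
    "b": "The quick brown fox jumps over the lazy dog near the river banks.",  # near-dup of "a"
    "c": "Pre-training corpora mix web pages, books, code and academic text.",
}

# Index documents one by one; flag any that collide with an earlier one at Jaccard >= 0.8.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
duplicates = []
for key, text in docs.items():
    sig = minhash_of(text)
    if lsh.query(sig):          # an earlier, highly similar document already exists
        duplicates.append(key)
    else:
        lsh.insert(key, sig)

print("near-duplicates to drop:", duplicates)   # expected: ['b']
```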

On the left is the retention rate for processing English data and on the right is the retention rate for Chinese

Deduplication did not show the expected performance improvement in this experiment

Based on the above steps, we can calculate the cost of data cleaning:

  1. Engineer debugging and rule determination cost

$Budget_{engineer} = S_{eng} \times (T_{rules} + T_{debug})$

Where $S_{eng}$ is the developer’s daily salary (USD/day), and $T_{rules}$ and $T_{debug}$ are the time (in days) required to formulate and to optimize the rules, respectively.

  2. Storage costs

$Budget_{storage} = C_{storage} \times V_{data} \times T_{retention}$
Where $C_{storage}$ is the storage cost per TB per month, $V_{data}$ is the total data volume (TB), $T_{retention}$ is the data retention time (months).

  3. Computation costs

$Budget_{compute} = \sum_{i=1}^{n} [C_i \times \frac{V_{data,i}}{R_i} \times (1 + \beta_i \times (F_{comm} + F_{ops}))]$

Where $i$ denotes the processing stage (1 to n), $C_i$ is the unit cost of computing resources for the i-th stage (USD/day), $V_{data,i}$ is the data volume for the i-th stage (TB), $R_i$ is the processing rate for the i-th stage (TB/day), $\beta_i$ is a binary indicator of whether the i-th stage uses cluster processing (0 for single-node, 1 for cluster), and $F_{comm}$ and $F_{ops}$ are the communication and operational overheads of using a cluster. Clusters are cumbersome and costly, which is why heuristic filtering is performed first (a small calculator over these stages is sketched after this list).

  4. Quality screening

$Budget_{quality} = C_{train} \times T_{training} + C_{\text{data annotation}} + C_{inference} \times \frac{V_{data}}{R_{inference}}$

Where $C_{train}$ and $C_{inference}$ are the computational costs for training and inference, which usually differ significantly in price, $T_{training}$ is the training time (days), $C_{\text{data annotation}}$ is the annotation cost, and $\frac{V_{data}}{R_{inference}}$ is the time needed to run inference over all the data.
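The computation-cost formula in step 3 becomes easier to reason about as code; the stage table below (costs, volumes, rates, cluster flags) is entirely illustrative, not real ABAKA AI pricing.

```python
# A minimal sketch of the multi-stage computation budget above; the stage table
# is purely illustrative.
def compute_budget(stages, f_comm: float = 0.15, f_ops: float = 0.10) -> float:
    """Sum C_i * (V_i / R_i) * (1 + beta_i * (F_comm + F_ops)) over all stages."""
    total = 0.0
    for cost_per_day, volume_tb, rate_tb_per_day, uses_cluster in stages:
        overhead = 1 + (1 if uses_cluster else 0) * (f_comm + f_ops)
        total += cost_per_day * (volume_tb / rate_tb_per_day) * overhead
    return total

stages = [
    # (C_i USD/day, V_i TB, R_i TB/day, cluster?)
    (200,  500, 50, False),   # heuristic filtering: cheap, single nodes
    (1200, 120, 10, True),    # MinHash LSH dedup: cluster, slower
    (900,   40,  8, True),    # model-based quality screening inference
]
print(f"compute budget: {compute_budget(stages):,.0f} USD")
```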

2.6. Data Cost Calculation

High-quality data processing comes at a cost. From data acquisition to the final cleaning process, each step involves complex computations and human resource investments, all of which translate into actual costs. This chapter will combine ABAKA AI’s previous content and rich experience to provide you with some feasible ideas, hoping to help you calculate data costs when implementing LLMs.

Based on the data processing flow described earlier, we can roughly divide data costs into the following main categories: storage costs, data acquisition costs, data conversion costs, and data cleaning costs. We hope to help you establish an intuitive budget system through ABAKA AI’s past rich experience:

  1. Storage costs: In this field, data scale far exceeds that of ordinary projects, with pre-training datasets reaching the PB level. A single machine cannot meet such large-scale storage needs, and projects also have high bandwidth requirements, so distributed storage is generally used. Distributed storage scales horizontally to meet growing storage needs, provides data backup and fault-tolerance mechanisms for high reliability, and improves I/O performance through multi-node parallel reads and writes. Generally, the capacity price of distributed storage is about 85 USD/TB (NVMe + HDD), meaning 1 PB of usable storage space costs about 85,000 USD. Adding security redundancy, network equipment, and security equipment, the cost approaches 99,000 USD/PB.
  2. Data acquisition: All historical data from a well-known large website can be estimated at around 42,500-70,500 USD, with incremental updates costing about 14,000 USD annually. For vertical domain websites, it could be anywhere from 4,200 to 14,000 USD. Video websites are three to five times more expensive than ordinary websites (bandwidth, storage), and overseas websites are two to three times more expensive (overseas proxies, overseas servers, compliance). Assuming you need to crawl 8 mainstream social media and news websites plus 15 vertical domain websites (such as code, mathematics, and finance), a budget of 706,000 USD would be appropriate.
  3. Document information extraction: Based on ABAKA AI’s experience, using the Pipeline developed by ABAKA AI for document conversion is more cost-effective and flexible. If consumer-grade GPUs are used for conversion, the cost per page is about 0.000035 USD, far lower than mathpix’s 0.025 / 0.01 USD per page. Of course, many capable domestic vendors are now working in this area, and we look forward to better models and cheaper prices from these service providers. All in all, including gaps and debugging time, estimate about 14,000 USD for every 10,000,000 pages of documents (80% using your own model + 20% using third-party services).
  4. Data cleaning: The cost of this step mainly depends on how many data sources there are and which domains they cover. When processing very dirty data, ABAKA AI once used over 1,000 cores for about a month, adding many special rules to obtain higher quality data, with a data retention rate of less than 1%. This part of the cost can therefore be calculated as follows:

$Budget_{\text{clean}} = S_{\text{eng}} + \frac{V_{\text{data}}}{100\,\text{TB}} \times C_{\text{base}}$

That is, the data cleaning cost for each domain consists of two weeks’ salary for an algorithm engineer plus 2,800 USD for every 100 TB cleaned, assuming the cost grows linearly with data volume once the cluster is set up. Even if, like Fineweb-edu, you use Llama3-70B and a BERT-like model for quality screening, the price remains quite affordable: simply increase the per-100 TB cost slightly.

In summary, preparing pre-training data for LLMs is a complex and costly process. It involves multiple stages, including data acquisition, storage, document information extraction, and data cleaning, each requiring careful planning and substantial investment. The quality and diversity of data are crucial to the model’s final performance, so each stage should be optimized as much as possible within budget constraints. At the same time, we find that the value of experienced algorithm engineers cannot be overlooked. Their experience and expertise can help teams avoid many potential pitfalls and detours. In LLM projects, the cost of taking detours due to human resource issues is often surprisingly high, potentially leading to a waste of considerable time and resources.

3. SFT & RLHF Stages

In the training process of large language models (LLMs), Supervised Fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) are two closely connected key stages. Although these two stages differ in technical implementation and specific objectives, they share significant similarities in terms of data requirements and cost composition. In this chapter, we combine these two stages for discussion, primarily because their core costs are concentrated on data annotation and requirement definition, a characteristic that results in many commonalities in data preparation and cost estimation.

3.1. Characteristics of SFT Datasets

SFT datasets consist of a series of text pairs, including “instruction input” and “answer output”. “Instruction input” represents requests made by humans to the model, covering various types such as classification, summarization, rewriting, etc. “Answer output” is the response generated by the model based on the instruction, meeting human expectations. There are four methods to construct instruction fine-tuning datasets: manual creation; model generation, such as using the Self-Instruct method; collecting and improving existing open-source datasets; and combining the above three methods.

Different ways to build SFT datasets

There are generally two approaches to constructing artificially generated datasets. The first approach involves directly creating instruction text sets according to given requirements and rules by company employees, volunteers, annotation platform staff, and others. Whether designing instruction sets, writing annotation guidelines, or conducting actual data annotation and quality control, it requires a significant investment of human time and effort. For example, the creation of the Databricks-dolly-15k dataset involved thousands of Databricks employees who generated over 15,000 records across multiple instruction categories. The second approach involves scraping human-generated real question-and-answer data from web pages and standardizing it into an instruction format. Examples include datasets like InstructionWild v2, LCCC, and Zhihu-KOL, which are built by aggregating and organizing content from social chats, code-related Q&As, and other sources.

In ABAKA AI’s past practice, the first approach has been more commonly used to construct datasets. Liu, Yang, et al. [2] likewise believe that datasets constructed in this manner are of higher quality and cleaner because they are processed and reviewed by professional annotators, and that after human processing they become more interpretable and more consistent with human understanding. Researchers also retain flexible control over the training samples and can adjust them for different tasks, making such datasets more versatile.

ABAKA AI possesses high-quality finished datasets across multiple domains

3.2. Characteristics of RLHF Datasets

RLHF datasets are collections of instructions that provide preference evaluations for multiple responses to the same input prompt. Typically, they consist of instruction pairs with different responses, including feedback from humans or other models. This setup reflects the relative preferences of humans or models for different responses in a given task or context. The feedback information in preference datasets is usually expressed through voting, ranking, scoring, or other forms of comparison.

Preference datasets are primarily used in the alignment phase of large models, aiming to help align model outputs more closely with human preferences and expectations. Alignment with human preferences is mainly reflected in three aspects: practicality (the ability to follow instructions), honesty (avoiding fabrication of information), and safety (avoiding the generation of illegal or harmful information).

Different ways to build the RLHF dataset

RLHF (Reinforcement Learning from Human Feedback) and RLAIF (Reinforcement Learning from AI Feedback) both utilize reinforcement learning methods to optimize models using feedback signals. In addition to fine-tuning with instruction datasets, preference datasets can be used to train reward models. Subsequently, the Proximal Policy Optimization (PPO) algorithm can be applied for further fine-tuning based on feedback from the reward model.

3.3. Data Cost Calculation

In the SFT and RLHF stages, data costs primarily come from the following aspects:

  1. Rule Design Cost

$Budget_{analysis} = T_{total} \times (R_{expert} \times S_{expert} + R_{engineer} \times S_{engineer} + R_{user} \times S_{user})$

In this, $R_{x}$ $(0 < R_{x} ≤ 1)$ represents the participation ratio of each role. Algorithm engineers ($S_{engineer}$) understand the model’s capability boundaries, domain experts ($S_{expert}$) provide professional knowledge and insights, and users ($S_{user}$) offer frontline usage scenarios and requirement feedback. This step is both necessary and important: carefully designed rules can significantly improve data quality, which directly affects model performance, and good rule design increases annotation efficiency and reduces rework. Although a detailed rule design process may increase initial costs, its value far exceeds the expense; it not only improves data and model quality but also brings long-term benefits to the entire project and organization.

  2. Instruction Dataset Construction Cost

$Budget_{instruction} = \frac{N_{instructions}}{R_{\text{creation speed}}} \times S_{annotator} + \frac{N_{instructions} \times R_{review}}{R_{\text{review speed}}} \times S_{reviewer}$

Where $N_{instructions}$ is the total number of instructions, $R_{\text{creation speed}}$ is the number of instructions an annotator can produce per hour, $S_{annotator}$ is the average hourly wage of annotators, $R_{review}$ is the review sampling rate, and $S_{reviewer}$ and $R_{\text{review speed}}$ are the average hourly wage of reviewers and the number of instructions a reviewer can review per hour, respectively.

  3. RLHF Dataset Construction Cost

$Budget_{RLHF} = T_{generation} \times C_{\text{GPU cluster}} + \frac{N_{instructions} \times \alpha}{R_{\text{ranking speed}}} \times S_{annotator} + Budget_{review}$

The first part is the inference cost for generating responses, and the second part is the cost of manual annotation. The choice of annotation method and strategy greatly affects $\alpha$. For example, if each instruction has $N_{responses}$ responses that need to be compared pairwise, then the manual annotation cost becomes:

$\frac{N_{instructions} \times C(N_{responses}, 2)}{R_{\text{ranking speed}}} \times S_{annotator}$

If a rating system is used instead, $R_{\text{ranking speed}}$ increases significantly, so choosing an appropriate evaluation method is a key factor in constructing RLHF datasets: it not only affects data quality but also directly determines the cost structure (the sketch below illustrates how the evaluation method changes the annotation bill). The choice of review strategy likewise has a significant impact on costs. Given the complexity of these factors and their interactions, it is genuinely difficult to give a universal cost formula, which is why we have not presented one.
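The sketch below puts the instruction-construction and preference-annotation formulas side by side and shows how pairwise comparison ($C(N, 2)$ judgements per instruction) compares with a rating scheme; all wages and speeds are assumed figures, not ABAKA AI pricing.

```python
# A minimal sketch of how the evaluation method drives SFT/RLHF annotation cost.
# Wages and speeds are illustrative assumptions.
from math import comb

def sft_budget(n_instr, create_per_hr, wage_annotator,
               review_rate, review_per_hr, wage_reviewer) -> float:
    """Instruction-construction cost: writing plus sampled review."""
    return (n_instr / create_per_hr) * wage_annotator \
         + (n_instr * review_rate / review_per_hr) * wage_reviewer

def rlhf_annotation_budget(n_instr, n_responses, speed_per_hr, wage,
                           pairwise: bool) -> float:
    """Human-feedback cost: pairwise comparisons grow as C(N, 2); ratings grow as N."""
    judgements = comb(n_responses, 2) if pairwise else n_responses
    return (n_instr * judgements / speed_per_hr) * wage

print(sft_budget(2_000, create_per_hr=4, wage_annotator=20,
                 review_rate=0.2, review_per_hr=10, wage_reviewer=30))   # ~11,200 USD
print(rlhf_annotation_budget(5_000, n_responses=4, speed_per_hr=30,
                             wage=20, pairwise=True))                    # ~20,000 USD
print(rlhf_annotation_budget(5_000, n_responses=4, speed_per_hr=60,
                             wage=20, pairwise=False))                   # ~6,700 USD
```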

In practice, it’s often necessary to validate and optimize evaluation and audit strategies through small-scale pilot tests before expanding to the full dataset. This iterative approach not only helps optimize costs but also continuously improves data quality and annotation guidelines throughout the process.

Based on ABAKA AI’s past experience, assume we collect 1,000 IMO-level math problems. Since the requirements are already well defined, the main costs are concentrated on annotation and auditing. The cost for annotators is 20 USD per hour, at an estimated rate of one problem per hour; including other expenses, the budget comes to roughly 28,000 USD. However, if we adopt ABAKA AI’s RLHF data construction method and use modern proof tools such as LEAN, processing efficiency is much higher than free-form response construction, handling approximately 4-6 pairs per hour.


At this point, we have established a comprehensive evaluation system that allows us to assess data prices according to requirements.

For example, suppose leadership wants the model to possess knowledge of a specific domain, or even to become state-of-the-art in that field. We can choose CPT (continual pre-training) to add the knowledge. Based on the D-CPT Law [4] and RegMix [5], we can estimate that approximately 100B tokens of domain data might be needed: we can crawl 12 target websites to cover 70B tokens, and the remaining 30B can be filtered from public datasets. After CPT, we can add a few thousand SFT data points. The data portion might cost around 42,000 USD, including approximately 28,000 USD for crawling 12 websites, about 2,800 USD for downloading and filtering several dozen TB of data using the DeepSeekMath approach, and SFT data construction at about 6 USD per entry, i.e. 12,000 USD for 2,000 entries.
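For completeness, here is the back-of-the-envelope arithmetic behind that figure; all numbers are taken from the example above.

```python
# Re-deriving the illustrative CPT-plus-SFT budget above (figures from the text).
crawl_12_sites = 28_000        # targeted crawling covering ~70B tokens
filter_public  = 2_800         # download + filter ~30B tokens from open datasets
sft_entries    = 2_000 * 6     # ~6 USD per SFT entry, 2,000 entries
total = crawl_12_sites + filter_public + sft_entries
print(total)                   # ~42,800 USD, i.e. "around 42,000 USD" as quoted
```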

| Cost Item | Cost / USD |
| --- | --- |
| Pre-training from scratch | 140,000 - 7 million |
| CPT | 70,000 - 1 million |
| SFT | 5,600 - 140,000 per domain |
| RLHF data | 1,400 - 56,000 per domain |

The above estimates are based on current market data and ABAKA’s years of industry experience, providing the most common budget range framework to help you more intuitively estimate overall data cost expenses.

ABAKA AI can reduce costs by 40%-60% at various stages based on the above framework. In the process of building high-quality training datasets, ABAKA provides professional solutions based on rich data processing experience. The intelligent data engineering platform MooreData Platform and highly specialized, standardized data processing services offered by ABAKA AI empower the construction of training data, helping you train LLMs using high-quality datasets and enabling you to better understand the resources and investment required for your project.

4. Reference

  1. Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. “Training Compute-Optimal Large Language Models.” arXiv, March 29, 2022. http://arxiv.org/abs/2203.15556 .

  2. Liu, Yang, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. “Datasets for Large Language Models: A Comprehensive Survey.” arXiv, February 27, 2024. http://arxiv.org/abs/2402.18041 .

  3. Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, et al. “MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series.” arXiv, June 2, 2024. http://arxiv.org/abs/2405.19327 .

  4. Que, Haoran, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, et al. “D-CPT Law: Domain-Specific Continual Pre-Training Scaling Law for Large Language Models.” arXiv, June 3, 2024. http://arxiv.org/abs/2406.01375 .

  5. Liu, Qian, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. “RegMix: Data Mixture as Regression for Language Model Pre-Training.” arXiv, July 1, 2024. http://arxiv.org/abs/2407.01492 .
