#P14793. [NERC 2025] LLM Training
Description
You are given a text dataset. Your task is to train an LLM (Large Language Model) and find the minimal possible loss. No kidding.
A text dataset is an array of $n$ texts $t_1, t_2, \ldots, t_n$. Each text $t_i$ is a sequence of $n_i$ tokens $t_i[1], t_i[2], \ldots, t_i[n_i]$. We define the set of tokens $T$ as the set of all tokens that appear in at least one text $t_i$. Additionally, for each text $t_i$, there is a set of positions $L_i \subseteq \{1, 2, \ldots, n_i\}$. The token $t_i[j]$ is generated by LLM if $j \in L_i$ and is written by the user if $j \notin L_i$.
Let us define LLM with context size $k$ as a probabilistic model $P_k$, such that it defines the probability distribution of the next token of the sequence, depending on a context $w$ --- a sequence of length between $0$ and $k$ (inclusive) whose elements are from $T$. Thus the probabilistic model is a large table of probabilities $P_k(\text{next} | w)$, defined for any context $w$, and any token $\text{next} \in T$. Conditions $0 \le P_k(\text{next} | w) \le 1$ and $\sum\limits_{\text{next} \in T} P_k(\text{next} | w) = 1$ should be satisfied.
The loss function of LLM with the context size $k$ is the following function, defined for $P_k$:
$$\mathcal{L}_k(P_k) \,\, = \,\, \sum_{i=1}^{n} \,\, \sum_{j\in L_i} \, -\log_2 P_k\!\left( \underbrace{t_i[j]}_{\text{next token}} \ \middle|\ \underbrace{t_i[\max(1,j-k)\,..\,j-1]}_{\text{context}} \right)$$

Here $t_i[l..r]$ is the substring from the $l$-th to the $r$-th token, and $t_i[l..r]$ with $l > r$ is an empty string. So, for each text $t_i$ and for each token $t_i[j]$ that is generated by LLM, we add to the loss the negative logarithm (base 2) of the probability that this token will be generated, depending on the substring of $k$ previous tokens (or the whole prefix, if it has length less than $k$). If the probability is zero, we assume that the negative logarithm is $+\infty$. This loss function is known as the (base 2) Cross-Entropy Loss over the LLM-generated positions. The smaller the loss function value $\mathcal{L}_k(P_k)$, the better the LLM is.
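For illustration, the following sketch evaluates $\mathcal{L}_k(P_k)$ directly from this definition, assuming $P_k$ is stored as a Python dictionary from context tuples to next-token probabilities; the representation and the function name are illustrative only.

```python
import math

def cross_entropy_loss(texts, llm_positions, P, k):
    """Evaluate L_k(P_k) exactly as defined above.

    texts         -- list of token lists t_1..t_n
    llm_positions -- list of sets L_i (1-based positions generated by the LLM)
    P             -- dict: context tuple -> dict {next token: probability}
    k             -- context size
    """
    total = 0.0
    for tokens, generated in zip(texts, llm_positions):
        for j in range(1, len(tokens) + 1):                  # 1-based positions
            if j not in generated:
                continue                                      # user-written token, no loss added
            context = tuple(tokens[max(1, j - k) - 1 : j - 1])
            p = P.get(context, {}).get(tokens[j - 1], 0.0)
            total += math.inf if p == 0.0 else -math.log2(p)
    return total
```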
For each $k$ from $0$ to $\max\limits_{i} n_i - 1$, calculate the minimum possible loss that could be obtained for some $P_k$ --- LLM with context size $k$. It can be proved that this minimum is reachable and is not infinite.
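To see why the minimum is finite and how it could be computed in principle: for a fixed $k$, the loss splits into independent groups of generated positions that share the same context, and within each group the sum of negative log-probabilities is minimized by taking $P_k(\text{next} | w)$ equal to the empirical frequency of each next token (Gibbs' inequality), which is never zero for tokens that actually occur. The sketch below follows this reasoning, assuming, as the sample outputs indicate, that answers are needed for $k = 0, 1, \ldots, \max_i n_i - 1$; it is only a specification-level brute force, far too slow for real limits.

```python
import math
from collections import Counter, defaultdict

def minimal_losses(texts, llm_positions):
    """For every k, group generated positions by context; each group contributes
    its size times the entropy (in bits) of its empirical next-token distribution,
    which is the best any P_k can achieve for that group."""
    answers = []
    for k in range(max(len(t) for t in texts)):               # k = 0 .. max n_i - 1
        groups = defaultdict(Counter)                          # context -> next-token counts
        for tokens, generated in zip(texts, llm_positions):
            for j in generated:                                # 1-based generated positions
                context = tuple(tokens[max(1, j - k) - 1 : j - 1])
                groups[context][tokens[j - 1]] += 1
        loss = 0.0
        for counter in groups.values():
            group_size = sum(counter.values())
            # optimal P_k(next | context) = count / group_size
            loss += sum(c * math.log2(group_size / c) for c in counter.values())
        answers.append(loss)
    return answers
```

On the first example below this returns $6, 6, 4, 4, 0$, matching the expected output.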
Input Format
The first line contains a single integer $n$ () --- the number of texts in the dataset. $n$ text descriptions follow.
The first line of the $i$-th text description contains a single integer $n_i$ () --- the length of the text $t_i$.
The next line contains $n_i$ strings $t_i[1]$, $t_i[2]$, $\ldots$, $t_i[n_i]$ () --- the tokens of the text $t_i$. Each token consists of symbols with ASCII codes from $33$ to $126$ (printable characters).
The next line contains a string of $n_i$ letters $\texttt{L}$ and $\texttt{U}$, which encodes the set $L_i$. All positions with the letter $\texttt{L}$ are generated by LLM, and all positions with the letter $\texttt{U}$ are written by the user. So $L_i = \{\, j : \text{the } j\text{-th letter of this string is } \texttt{L} \,\}$. It is guaranteed that the last token is generated by LLM, so $n_i \in L_i$.
It is guaranteed that the sum of $n_i$ for all $i$ ($1 \le i \le n$) does not exceed .
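For clarity, a minimal reader for this format might look as follows; the function name and the returned representation (token lists plus 1-based sets $L_i$) are illustrative choices, not prescribed by the problem.

```python
import sys

def read_dataset(stream=sys.stdin):
    """Read the dataset in the format described above.
    Returns (texts, llm_positions): token lists t_i and 1-based sets L_i."""
    n = int(stream.readline())
    texts, llm_positions = [], []
    for _ in range(n):
        n_i = int(stream.readline())
        tokens = stream.readline().split()                   # n_i space-separated tokens
        mask = stream.readline().split()[0]                   # n_i letters 'L' / 'U'
        assert len(tokens) == n_i and len(mask) == n_i
        texts.append(tokens)
        llm_positions.append({j + 1 for j, c in enumerate(mask) if c == 'L'})
    return texts, llm_positions
```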
Output Format
Print $\max\limits_{i} n_i$ real numbers: for each $k$ from $0$ to $\max\limits_{i} n_i - 1$, print the minimum possible loss for all possible $P_k$ --- LLMs with context size $k$.
Your answers will be accepted if their absolute or relative errors do not exceed the required precision; formally, if $a$ is your answer and $b$ is the jury's answer, then $\frac{|a - b|}{\max(1, |b|)}$ should not exceed that precision.
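A common way to implement such a check is sketched below; the tolerance `eps` is left as a parameter because its concrete value is not reproduced in this statement.

```python
def accepted(a, b, eps):
    """a = contestant's answer, b = jury's answer, eps = allowed error."""
    return abs(a - b) <= eps * max(1.0, abs(b))
```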
Sample Input 1
4
5
1 + 1 = 2
UUUUL
5
1 + 2 = 3
UUUUL
5
2 + 1 = 3
UUUUL
5
2 + 2 = 4
UUUUL
Sample Output 1
6.000000000000
6.000000000000
4.000000000000
4.000000000000
0.000000000000
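For instance, in the first example the only generated token of each text is the last one. For $k = 0$ the context is always empty and the generated tokens are $\texttt{2}, \texttt{3}, \texttt{3}, \texttt{4}$, so the best model assigns them probabilities $\frac14, \frac12, \frac12, \frac14$ and the loss is $2 + 1 + 1 + 2 = 6$. For $k = 2$ the contexts are "1 =" and "2 =", each followed by two different tokens with probability $\frac12$ each, giving a loss of $4$; for $k = 4$ all four contexts are distinct, every generated token can get probability $1$, and the loss is $0$.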
Sample Input 2
4
4
N E F <EOS>
LLLL
5
N E R C <EOS>
LLLLL
6
N E E R C <EOS>
LLLLLL
5
I C P C <EOS>
LLLLL
Sample Output 2
55.683674395584
12.490224995673
8.000000000000
8.000000000000
8.000000000000
8.000000000000
Sample Input 3
1
16
a b a c a b a d b a b d a b a c
ULLULLLLLLULLLLL
Sample Output 3
22.595941331507
12.464393446710
5.245112497837
2.000000000000
0.000000000000
0.000000000000
0.000000000000
0.000000000000
0.000000000000
0.000000000000
0.000000000000
0.000000000000
0.000000000000
0.000000000000
0.000000000000
0.000000000000
Sample Input 4
2
4
WA WA WA AC
LULL
4
AC AC WA AC
LLUL
Sample Output 4
5.509775004327
4.754887502163
4.000000000000
2.000000000000