BaseBenchDataloader
The BaseBenchDataloader class serves as an abstract base class for implementing benchmarking dataloaders used to evaluate the reasoning capabilities of models. This documentation explains the class structure, its methods, and their usage.
Class Overview
Attributes
- dataset: A Dataset or DatasetDict object containing the questions and answers used for evaluation.
- progress_bar: A rich.progress.Progress instance used for tracking evaluation progress.
- question_key: The key in the dataset used for accessing questions.
- answer_key: The key in the dataset used for accessing answers.
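These attributes are typically populated by a concrete subclass; the base class documentation does not prescribe how. A minimal sketch of a subclass constructor that sets them is shown below (the dataset name, config, and column names are placeholders, the import path follows the Usage Example further down, and the abstract methods are omitted for brevity):
import rich.progress
from datasets import load_dataset
from my_project.bench_dataloader import BaseBenchDataloader

class GSM8KDataloader(BaseBenchDataloader):
    def __init__(self):
        # Load a Hugging Face dataset; "gsm8k"/"main"/"test" are illustrative only.
        self.dataset = load_dataset("gsm8k", "main", split="test")
        # Progress bar used to track evaluation progress.
        self.progress_bar = rich.progress.Progress()
        # Column names holding the questions and reference answers.
        self.question_key = "question"
        self.answer_key = "answer"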
Constructor
BaseBenchDataloader()
Initializes a new instance of the BaseBenchDataloader class. This is an abstract class and cannot be instantiated directly.
Methods
Static Methods
initial_default_prompt()
Returns the default system prompt for the evaluation.
@staticmethod
def initial_default_prompt() -> str:
return ""
eval_single_question(predicted_answer: str | int, answer: str | int) -> bool
Evaluates the correctness of a single predicted answer.
- Parameters:
  - predicted_answer: The answer predicted by the model.
  - answer: The correct answer from the dataset.
- Returns:
  - bool: Whether the prediction is correct.
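The default comparison logic is not shown in this documentation; a minimal sketch of an override that does normalized exact matching (the normalization rules are an assumption):
def eval_single_question(self, predicted_answer: str | int, answer: str | int) -> bool:
    # Compare as trimmed, lower-cased strings so "42" and " 42 " count as equal.
    return str(predicted_answer).strip().lower() == str(answer).strip().lower()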
vote_majority(output: List[str | int], answer: str | int) -> bool
Implements majority voting for multiple outputs from the model.
- Parameters:
  - output: List of model outputs.
  - answer: The correct answer.
- Returns:
  - bool: Whether the majority voted correctly.
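One common way to implement this is to take the most frequent output and judge it with the single-answer check; the Usage Example below instead counts strict-majority exact matches. A hedged sketch of the first approach:
from collections import Counter
from typing import List

def vote_majority(self, output: List[str | int], answer: str | int) -> bool:
    # Pick the most common answer among the sampled outputs...
    majority_answer, _count = Counter(output).most_common(1)[0]
    # ...and score it like a single prediction.
    return self.eval_single_question(majority_answer, answer)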
Abstract Methods
process_question(example: dict) -> dict
Processes a question from the dataset. This method must be implemented in derived classes.
extract_answer(output: str) -> str | int
Extracts the answer from the model’s output. This method must be implemented in derived classes.
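What these two methods should return depends on the benchmark format. A hedged sketch for a dataset whose reference answers end with a final number (the returned key names and the regular expression are assumptions):
import re

class MyDataloader(BaseBenchDataloader):
    def process_question(self, example: dict) -> dict:
        # Keep only the fields the evaluation loop needs.
        return {
            "question": example[self.question_key],
            "answer": example[self.answer_key],
        }

    def extract_answer(self, output: str) -> str | int:
        # Treat the last number in the model's output as its final answer.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
        return numbers[-1] if numbers else output.strip()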
Asynchronous Methods
evaluate_model(model: LiteLLM_Model | MajorityVoteModel, system_prompt: str | None = None) -> tuple[List[dict], float]
Evaluates the performance of a model on the dataset.
- Parameters:
  - model: The model instance to evaluate. It can be either LiteLLM_Model or MajorityVoteModel.
  - system_prompt: An optional system prompt string. If not provided, the default prompt is used.
- Returns:
  - tuple[List[dict], float]: A tuple containing:
    - results: A list of dictionaries, one per evaluation entry, with keys for the question, response, correct answer, and correctness.
    - accuracy: The overall accuracy of the model.
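The exact keys of each results entry are not spelled out above; a hedged sketch of driving an evaluation and inspecting the output, assuming a "correct" flag on each entry (run the coroutine with asyncio.run or inside an existing event loop):
async def run_eval(loader, model):
    # system_prompt overrides the default from initial_default_prompt().
    results, accuracy = await loader.evaluate_model(model, system_prompt="Think step by step.")
    # "correct" is an assumed key name for the per-entry correctness flag.
    wrong = [r for r in results if not r.get("correct", False)]
    print(f"Accuracy: {accuracy:.2%} ({len(wrong)} of {len(results)} incorrect)")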
Usage Example
Example Workflow
import asyncio

from my_project.bench_dataloader import BaseBenchDataloader
from llm_evaluation_in_reasoning.eval.model import LiteLLM_Model

class MyDataloader(BaseBenchDataloader):
    def process_question(self, example):
        # Custom processing logic (placeholder: return the example unchanged)
        return example
    def extract_answer(self, output):
        # Custom answer extraction (placeholder: return the raw output)
        return output
    def eval_single_question(self, predicted_answer, answer):
        return predicted_answer == answer
    def vote_majority(self, output, answer):
        return output.count(answer) > len(output) / 2

# Instantiate and evaluate (evaluate_model is a coroutine, so run it in an event loop)
async def main():
    my_loader = MyDataloader()
    model = LiteLLM_Model()
    results, accuracy = await my_loader.evaluate_model(model)
    print(f"Accuracy: {accuracy:.2%}")

asyncio.run(main())
Logging and Error Handling
Logging
This class uses the Python logging module to log progress, debug information, and errors during evaluation.
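To see that output during a run, configure logging before starting the evaluation, for example:
import logging

# Emit debug-level messages (including the dataloader's) to stderr.
logging.basicConfig(level=logging.DEBUG)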
Error Handling
Errors encountered during the evaluation process are logged and stored in the results list together with the associated question and error details.
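The schema of an error entry is not documented here; assuming each one carries an error field alongside the question (both key names are hypothetical), failed entries could be collected like this:
# "error" and "question" are assumed key names; adjust to the actual results schema.
failed = [entry for entry in results if "error" in entry]
for entry in failed:
    print(entry.get("question"), "->", entry["error"])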
Dependencies
- logging: Standard Python logging library.
- rich: Library for rich text and progress bar visualization.
- datasets: Library for loading and processing datasets.
- LiteLLM_Model, MajorityVoteModel: Model classes used for evaluation.
For more details, refer to the GitHub repository.