ARC
ARC, or the AI2 Reasoning Challenge, is a dataset used to benchmark language models' reasoning abilities. The benchmark consists of 7,787 multiple-choice questions drawn from science exams for grades 3 to 9. The dataset includes two modes: easy and challenge, with the latter featuring more difficult questions that require advanced reasoning.
To learn more about the dataset and its construction, you can read the original paper here.
Arguments
There are three optional arguments when using the ARC benchmark:
- [Optional] n_problems: the number of problems for model evaluation. By default, this is set to all problems available in the selected benchmark mode.
- [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
- [Optional] mode: an ARCMode enum that selects the evaluation mode. This is set to ARCMode.EASY by default. deepeval currently supports two modes: EASY and CHALLENGE (see the snippet below for selecting CHALLENGE mode).
Both EASY and CHALLENGE modes consist of multiple-choice questions. However, CHALLENGE questions are more difficult and require more advanced reasoning.
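For instance, a benchmark targeting the CHALLENGE split can be constructed as follows. This is a minimal sketch using the same constructor and imports shown in the full example below.
from deepeval.benchmarks import ARC
from deepeval.benchmarks.modes import ARCMode

# Evaluate the harder CHALLENGE split with 5-shot prompting (the default and maximum)
challenge_benchmark = ARC(
    n_shots=5,
    mode=ARCMode.CHALLENGE
)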
Example
The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on 100 EASY-mode problems from ARC using 3-shot prompting.
from deepeval.benchmarks import ARC
from deepeval.benchmarks.modes import ARCMode
# Define benchmark with specific n_problems and n_shots in easy mode
benchmark = ARC(
    n_problems=100,
    n_shots=3,
    mode=ARCMode.EASY
)
# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
The overall_score ranges from 0 to 1 and represents the fraction of questions answered correctly. Performance in both modes is measured with an exact-match scorer: a prediction counts as correct only when the model's chosen answer exactly matches the expected answer.
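For intuition, the sketch below shows how an exact-match accuracy could be computed over a list of answer choices. It is an illustrative stand-in, not deepeval's internal scorer; the function name and inputs are hypothetical.
# Illustrative only: a minimal exact-match accuracy calculation,
# not deepeval's internal scorer.
def exact_match_accuracy(predictions: list[str], targets: list[str]) -> float:
    # A prediction counts as correct only if it matches the target exactly,
    # e.g. the predicted answer letter "B" equals the gold answer "B".
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)

# Example: 3 out of 4 answers match exactly, so the score is 0.75
print(exact_match_accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"]))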