Winogrande
Winogrande is a dataset of 44K binary-choice problems, inspired by the original Winograd Schema Challenge (WSC) benchmark for commonsense reasoning. It has been adjusted to increase both scale and difficulty.
Learn more about the construction of WinoGrande here.
Arguments
There are two optional arguments when using the Winogrande benchmark:
- [Optional] n_problems: the number of problems for model evaluation. By default, this is set to 1267 (all problems).
- [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
Example
The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on 10 problems in Winogrande using 3-shot CoT prompting.
from deepeval.benchmarks import Winogrande
# Define benchmark with n_problems and n_shots
benchmark = Winogrande(
n_problems=10,
n_shots=3,
)
# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The score is computed via exact matching: the proportion of questions for which the model produces the exactly correct answer (i.e. 'A' or 'B') out of the total number of questions.
As a result, using more few-shot examples (n_shots) can greatly improve the model's robustness in generating answers in the exact correct format, boosting the overall score.
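The exact-match scoring described above can be sketched as follows. This is an illustrative helper, not deepeval's internal implementation: a prediction counts as correct only if it equals the gold label exactly.

```python
def exact_match_score(predictions, answers):
    # Hypothetical helper illustrating exact-match scoring:
    # a prediction counts only if it equals the gold label exactly ('A' or 'B').
    correct = sum(1 for pred, gold in zip(predictions, answers) if pred == gold)
    return correct / len(answers)

# A model answering 3 of 4 questions in the exact expected format scores 0.75;
# an answer like "The answer is A" would not match and scores 0 for that question.
print(exact_match_score(["A", "B", "A", "B"], ["A", "B", "B", "B"]))  # 0.75
```

This strictness is why few-shot examples matter: they condition the model to emit a bare 'A' or 'B' rather than a full sentence that would fail the exact match.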