Traditional tools for information retrieval (IR) evaluation,
such as TREC's trec_eval,
have outdated command-line interfaces with many unused features,
or `switches', accumulated over the years. New IR researchers
usually find them cumbersome, which steepens
the learning curve. We introduce a platform-independent application
for IR evaluation with an easy-to-use graphical interface: the
TREC_Files Evaluator. The application supports most of the standard
measures used for evaluation in TREC, CLEF, and elsewhere, such
as MAP, P10, P20, and bpref, as well as the Averaged Normalized
Modified Retrieval Rank (ANMRR) proposed by MPEG for image retrieval
evaluation. Additional features include a batch mode and statistical
significance testing of the results against a pre-selected baseline.
The program is built with Silverlight, an application
framework for writing rich Internet applications.
Its run-time environment is available for most browsers. It runs
on Microsoft Windows and Mac OS X and, through the Moonlight plugin,
on Linux, FreeBSD, and other open source *nix platforms.
A user must first load the relevance assessments file (qrels),
which represents the ground truth of the experiment. Next, the
user loads the experimental results file. Both files must be
in the standard TREC format. Then, all the available standard
evaluation measures can be calculated with the press of a button.
All calculations take place locally, i.e., on the client
side.
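The workflow above can be illustrated with a minimal Python sketch. It is not the TREC_Files Evaluator's code; it only assumes the standard TREC layouts (qrels lines of the form `topic iteration docno judgment`, run lines of the form `topic Q0 docno rank score tag`). The function names are hypothetical, ties in score keep their file order (a simplification), and topics missing from the run are assumed to score zero.

```python
from collections import defaultdict
from typing import Dict, List, Set


def parse_qrels(text: str) -> Dict[str, Set[str]]:
    """Map each topic id to the set of documents judged relevant."""
    relevant = defaultdict(set)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, judgment = parts
        if int(judgment) > 0:
            relevant[topic].add(doc_id)
    return dict(relevant)


def parse_run(text: str) -> Dict[str, List[str]]:
    """Map each topic id to its documents, ordered by decreasing score."""
    scored = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        topic, _q0, doc_id, _rank, score, _tag = parts
        scored[topic].append((float(score), doc_id))
    return {
        topic: [doc for _score, doc in sorted(pairs, key=lambda p: -p[0])]
        for topic, pairs in scored.items()
    }


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k positions filled by relevant documents."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def evaluate(qrels: Dict[str, Set[str]], run: Dict[str, List[str]], k: int = 10) -> Dict[str, float]:
    """Average AP and P@k over all topics with at least one relevant document."""
    topics = sorted(qrels)
    if not topics:
        raise ValueError("qrels contain no relevant documents")
    ap = [average_precision(run.get(t, []), qrels[t]) for t in topics]
    pk = [precision_at(run.get(t, []), qrels[t], k) for t in topics]
    return {"MAP": sum(ap) / len(ap), f"P{k}": sum(pk) / len(pk)}
```

Calling `evaluate(parse_qrels(qrels_text), parse_run(run_text))` on the contents of the two files returns a dictionary with the MAP and P10 values for that run.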
The batch mode allows a user to evaluate several
TREC files together and calculate all the measures without manual
intervention. At the same time, the system can check statistical
significance against a selected baseline run, using a one-tailed
bootstrap test at significance levels 0.05, 0.01, and 0.001.
The results are presented in an interactive table and can be
sorted on any measure. Additionally, the TREC_Files Evaluator
can export the results in a LaTeX-friendly format.
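The significance check can be sketched as follows. The text does not specify the exact bootstrap procedure, so this sketch assumes one common variant: a paired, one-tailed bootstrap over per-topic score differences, with the differences shifted to zero mean to simulate the null hypothesis. The function names, the number of resamples, and the fixed seed are illustrative choices, not details of the TREC_Files Evaluator.

```python
import random
from typing import Optional, Sequence


def bootstrap_p_value(system: Sequence[float], baseline: Sequence[float],
                      samples: int = 10000, seed: int = 0) -> float:
    """One-tailed p-value for 'system is better than baseline' on paired topic scores."""
    if len(system) != len(baseline) or not system:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [s - b for s, b in zip(system, baseline)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Shift the differences so their mean is zero (the null hypothesis).
    centered = [d - observed for d in diffs]
    rng = random.Random(seed)
    extreme = 0
    for _ in range(samples):
        resample_mean = sum(rng.choice(centered) for _ in range(n)) / n
        if resample_mean >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (extreme + 1) / (samples + 1)


def significance_level(p_value: float) -> Optional[float]:
    """Return the strictest of 0.001, 0.01, 0.05 that p_value reaches, or None."""
    for level in (0.001, 0.01, 0.05):
        if p_value < level:
            return level
    return None
```

Here `system` and `baseline` hold per-topic values of one measure (for example, average precision) in the same topic order. A batch run would repeat the test once per measure and per non-baseline run.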
Given that MAP, P10, P20, and bpref are among the main evaluation
criteria for retrieval results in TREC, CLEF, and elsewhere,
the TREC_Files Evaluator can help both
new researchers and organizers of multi-group controlled experiments
evaluate submitted results more easily. Moreover, the
inclusion of the ANMRR evaluation metric \cite{b2} will encourage
researchers from the image retrieval field to employ the standard
TREC format for evaluating their experiments.
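For readers unfamiliar with ANMRR, the sketch below follows the definition as it is commonly stated for MPEG-7: for a query with NG ground-truth items, ranks beyond K = min(4 NG, 2 GTM) are replaced by the penalty 1.25 K, where GTM is the largest ground-truth size over all queries. The result is normalized to [0, 1], with 0 meaning perfect retrieval. The authoritative definition is the one in \cite{b2}; the function names here are hypothetical and the code is not taken from the TREC_Files Evaluator.

```python
from typing import Dict, List, Set


def nmrr(ranking: List[str], ground_truth: Set[str], max_ground_truth: int) -> float:
    """Normalized Modified Retrieval Rank for one query (0 = perfect, 1 = nothing found)."""
    ng = len(ground_truth)
    if ng == 0:
        raise ValueError("query has no ground-truth items")
    k = min(4 * ng, 2 * max_ground_truth)
    penalty = 1.25 * k
    # Rank of the first occurrence of each retrieved document (1-based).
    positions: Dict[str, int] = {}
    for rank, doc in enumerate(ranking, start=1):
        positions.setdefault(doc, rank)
    # Items ranked beyond K, or not retrieved at all, receive the penalty rank.
    ranks = [
        positions[doc] if doc in positions and positions[doc] <= k else penalty
        for doc in ground_truth
    ]
    average_rank = sum(ranks) / ng
    modified_rank = average_rank - 0.5 * (1 + ng)
    return modified_rank / (penalty - 0.5 * (1 + ng))


def anmrr(rankings: Dict[str, List[str]], ground_truths: Dict[str, Set[str]]) -> float:
    """Average NMRR over all queries that have ground truth."""
    if not ground_truths:
        raise ValueError("no queries to evaluate")
    gtm = max(len(items) for items in ground_truths.values())
    scores = [
        nmrr(rankings.get(query, []), items, gtm)
        for query, items in ground_truths.items()
    ]
    return sum(scores) / len(scores)
```

Because the inputs are a ranked list and a set of relevant items per query, the same qrels and run files used for MAP or bpref can feed this computation.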
Future plans include the implementation of more evaluation
metrics, such as R-prec and nDCG, and support for users
to plug in their own metric implementations.