Grading system requirements


Intro

This document’s main goal is to outline the necessary and preferred features that a grading system for the IOI should support. At the same time, it also documents the current state of many details of the evaluation process that are often left unspecified (both to contestants and to hosts).

Hosts need to make sure that the grading system they will use supports all the necessary features and as many preferred features as possible; moreover, if the grading system differs from the current state in any of the unspecified details, they should notify the ITC and/or the ISC.

At the pre-IOI meeting, the ISC might ask for additional features, mostly about support for non-standard task types.


Summary

For details see the following sections.

Required features

  • Public announcements in English.
  • English interface.
  • Serve all translated task statements.
  • Serve additional files for each task.
  • Output only, batch and interactive tasks.
  • Subtask scoring (pass/no-pass and partial scoring).
  • Fractional point scoring.
  • Configurability of whether subtasks can be solved independently in different submissions.
  • Live evaluation with a latency of a small number of minutes.
  • Full feedback for contestants.
  • Details for each subtask, in particular the index of the first failed testcase.
  • Ability to remove feedback during the contest for a specific task.
  • Live external scoreboard supporting moderate traffic.

Preferable features

  • Private communications with contestants.
  • Announcement notifications.
  • IP autologin.
  • Two steps task type support.
  • Parallel evaluation of testcases.
  • Deprioritization of evaluations not shown to the contestants.
  • Statistics and pretty graphs.


Communication

Announcements

The grading system must have the ability to display announcements on-screen in English to all contestants. Announcements should be easily visible (preferably with notifications) and sent concurrently with every audio announcement. As not all contestants will understand English, such announcements should be kept to a minimum and take a standard form such as “Task X has been rejudged”.

Private communications

If the host desires to use the grading system to handle clarification requests, the system must be able to receive questions from contestants and send answers back. All communication should use a Unicode encoding. Paper must always be offered as a clarification channel, as it is very difficult to set up the contestants’ machines to allow contestants to write in their own alphabet.


Interface

Logging in

It is preferable to have the contestants already logged into the grading system interface based on their IP address. The contestant’s name, and optionally their photograph, should be shown in the interface to help catch mistakes in the seating plan.

Language

The interface must be available at least in English; no other localizations are required. If other localizations are provided, national teams must be able to contribute localizations before the IOI, to avoid giving an unfair advantage to contestants working in their own language (English, as the official IOI language, is not considered in this).

The interface should be publicly available (e.g., in a practice mode) months before the IOI, to allow contestants to get accustomed to it.

Task statement

The grading system must serve the task statements in all the languages they were translated into. It is preferable to have English and the contestant’s main language (as nominated by their delegation leader) highlighted.

Task attachments

The grading system must be able to serve additional files for each task; these often include sample data, sample libraries and graders (i.e., source files). In most past IOIs, task attachments have also been provided pre-installed on the contestants’ machines, but they should always be available on the web interface in case a contestant modifies them and wants to redownload them.


Tasks

Task types

It is not easy to define or name task types, as two people will have three ideas on how to name them, but here’s a possible list (drawings are stolen from “CMS configuration for tasks”, which provides examples of rare tasks like Odometer and Supper, IOI 2012):

  • Output only (frequently present): the contestants are provided with the input file of each testcase, and they must provide the output file for each. Scoring is often trivialized (one testcase per subtask, see scoring section).
  • Batch (always present): the submission (a function implementation) is linked with a host-provided grader that takes care of I/O. The output is evaluated by diffing it against that of a correct solution, or with a checker (a program that verifies whether the proposed solution is correct). Note that on some occasions the submission is also able to interact with the grader by calling a specified function.

[Diagram: batch task evaluation]

  • Interactive (frequently present): similar to Batch, but in this case the grader holds a secret, so for security reasons the I/O is managed by a separate process entirely under the control of the host (a minimal sketch of such a setup follows this list). On some occasions the secret can be protected while keeping the Batch structure by “encrypting” it in memory, but this might considerably lower the security; such a decision should be taken in agreement with the ISC. Note that oftentimes the submission is able to interact with the manager by calling some function. Also, a substantial number of messages sent across the FIFO might take a considerable amount of time, which might hide the complexity of the contestant’s solution or in some cases even make the FIFO approach completely unfeasible.

[Diagram: interactive task evaluation]

  • Two steps (occasionally present, in some form): the contestants must submit two functions (either in two files or in a single file), one nicknamed the “encoder” and the other the “decoder” (not that they necessarily need to have these roles). The input is preprocessed by the encoder and then passed to the decoder through a FIFO, or by simple file passing. There might be several complications: multiple back-and-forths between encoder and decoder, secrets in the graders (which should prompt the implementers to split the encoder and/or the decoder into further processes), or validation of the requirements on enc(x) by a process running over it between the two steps.

[Diagram: two steps task evaluation]
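
As an illustration of the interactive setup above, here is a minimal Python sketch of how a grading system might wire a host-controlled manager to the contestant process through POSIX FIFOs. The convention of passing the input file and FIFO paths as command-line arguments is an assumption for illustration only, and a real system would run the contestant inside a sandbox that measures CPU time rather than the wall-clock timeout used here.

    # Illustrative sketch only: spawn the manager (which holds the secret and
    # reads the testcase input) and the contestant process, connected by two
    # POSIX FIFOs. Argument conventions are assumptions.
    import os
    import subprocess
    import tempfile

    def run_interactive(manager_cmd, contestant_cmd, input_file, time_limit):
        workdir = tempfile.mkdtemp()
        m_to_c = os.path.join(workdir, "manager_to_contestant")
        c_to_m = os.path.join(workdir, "contestant_to_manager")
        os.mkfifo(m_to_c)
        os.mkfifo(c_to_m)

        # Only the manager sees the input; the contestant sees only the FIFOs.
        manager = subprocess.Popen(manager_cmd + [input_file, m_to_c, c_to_m])
        contestant = subprocess.Popen(contestant_cmd + [m_to_c, c_to_m])

        try:
            # A real system would enforce a CPU-time limit in a sandbox;
            # this wall-clock timeout is a simplification.
            contestant.wait(timeout=time_limit)
        except subprocess.TimeoutExpired:
            contestant.kill()
        manager.wait()
        return manager.returncode  # outcome as reported by the manager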

Task types requirements

The grading system must support batch, output only and interactive tasks. If it does not support two steps tasks, or cannot be adapted to support them at the ISC’s request, this should be communicated to the ISC before the pre-IOI meeting.


Scoring

Score of a task

The score of a contestant C at the IOI is S(C) = \sum_i S(C, T_i), where the T_i range over all the tasks in the contest. The score S(C, T) of a contestant C on a task T goes from 0 to 100. It is computed as \max_i S_i(C, T), where S_i := S_i(C, T) is the score of the i-th submission of C on the task T. S_i is usually an integer, but occasionally it might be a floating point number; the task statement should specify how to round it.

Subtasks

Each task has several subtasks; the score of a submission is computed as the sum over all subtasks: S_i = \sum_{subtask j} ST_{i, j}. In particular, if T has 2 subtasks, each worth 50 points, and C makes two submissions, one that solves the first subtask but not the second, and one that solves the second but not the first, then S(C, T) = 50, not 100. Note that it might happen that a subtask with score zero precedes a subtask with non-zero score.

As before, ST_{i, j} might be a fractional number and the task statement should specify how to round it (special attention must be paid to representability problems if the score is stored as a floating point number); it might also happen that the scores of different subtasks have different formats (see Parrot, IOI 2011). The score on a subtask, ST_{i, j}, is between 0 and a maximum score mST_j set in the task statement.
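
The representability concern can be made concrete with a small Python example; keeping scores as exact rationals (or scaled integers) internally and rounding only for display is one way, among others, to avoid it:

    # Partial scores that "should" sum to an integer may not, in binary
    # floating point:
    parts = [0.1] * 10           # ten testcases worth 0.1 points each
    print(sum(parts) == 1.0)     # False: the sum is 0.9999999999999999

    # Keeping scores as exact rationals avoids the problem entirely.
    from fractions import Fraction
    parts = [Fraction(1, 10)] * 10
    print(sum(parts) == 1)       # True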

A change to the computation of S(C, T) from all the ST_{i, j} is being trialled. With this change, S(C, T) = \sum_{subtask j} \max_{submission i} ST_{i, j}, instead of the previous S(C, T) = \max_{submission i} \sum_{subtask j} ST_{i, j}. The effect on the previous example is that the contestant would obtain 100 points. It is required that the grading system support a configuration option to choose which method to use.
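The following Python sketch (with illustrative names) contrasts the two aggregation rules on the example above:

    # scores[i][j] is ST_{i,j}: the score of submission i on subtask j.

    def max_of_sums(scores):
        # Classic rule: S(C, T) = \max_i \sum_j ST_{i,j}
        return max(sum(submission) for submission in scores)

    def sum_of_maxes(scores):
        # Trialled rule: S(C, T) = \sum_j \max_i ST_{i,j}
        return sum(max(column) for column in zip(*scores))

    # Two 50-point subtasks; one submission solves only the first,
    # the other only the second.
    scores = [[50, 0], [0, 50]]
    print(max_of_sums(scores))   # 50
    print(sum_of_maxes(scores))  # 100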

A subtask comprises one or more testcases; a testcase is a single run of the contestant’s submission against a single (set of) input data. In most cases, each testcase returns a pass/no-pass outcome, and ST_{i, j} = mST_j if all testcases pass, zero otherwise. Some subtasks give partial scores (to ensure we can discriminate between contestants); there are many ways in which the score can be computed; preferably, the grading system should be easily adaptable to the following and other modalities (a sketch of these follows the list):

  • each testcase gives a contribution to the score: ST_{i, j} = \sum_{testcase t} score(t);
  • each testcase gives a score, and the subtask score is the minimum of these: ST_{i, j} = \min_{testcases t} score(t).
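
A minimal Python sketch of these modalities, plus the plain pass/no-pass case, assuming for concreteness that each testcase yields a normalized score(t) in [0, 1]:

    # outcomes is the list of per-testcase scores in [0, 1]; max_score is mST_j.

    def all_or_nothing(outcomes, max_score):
        # Pass/no-pass: full score only if every testcase passed.
        return max_score if all(o == 1 for o in outcomes) else 0

    def sum_of_testcases(outcomes, max_score):
        # Each testcase contributes its share: ST = \sum_t score(t).
        return max_score * sum(outcomes) / len(outcomes)

    def min_of_testcases(outcomes, max_score):
        # The subtask is only as good as its worst testcase: ST = \min_t score(t).
        return max_score * min(outcomes)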

Feedback

Submissions should be evaluated within a small number of minutes, possibly sharding testcases to different workers to speed up the evaluation in case of light load.

Contestants are able to see the total score and the score for each subtask. For each subtask, as a general rule only a limited number of bits of information must be shown to the contestants (a sketch of such restricted feedback follows this list). For pass/no-pass subtasks, these include:

  • if it exists, the index within the subtask of the first testcase T that did not pass (but not the total number of testcases);
  • if it exists, the reason for T not passing: time limit exceeded, wrong answer, execution error (but in this case not the specific signal), or memory limit exceeded (if the system is able to recognize it unambiguously);
  • possibly, a very small hint on how close the timing was to the limit (as in “slightly above time limit”).
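
For illustration, a Python sketch of reducing full per-testcase results to the feedback above (the data structure and field names are made up):

    # testcases is the ordered list of results for one subtask; each entry is
    # a dict like {"passed": bool, "reason": "TLE"|"WA"|"RE"|"MLE"|None}.

    def subtask_feedback(testcases):
        for index, tc in enumerate(testcases):
            if not tc["passed"]:
                # Reveal only the index of the first failure and a coarse
                # reason; never the total number of testcases, nor the
                # specific signal behind an execution error.
                return {"first_failed": index, "reason": tc["reason"]}
        return {"first_failed": None}  # all testcases passed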

Note that the set of information disclosed was not prescribed in past years, and could well change in the coming years or for specific tasks. For this reason, it is preferable that the grading system be easily modifiable to change the released information. For example, in past years these different strategies have been used:

  1. for each testcase, provide details (pass/no-pass, reason for failure among TLE, WA, Signal X, and memory and time usage);
  2. the same, but stopping after the first no-pass testcase (other testcases should be evaluated, possibly at lower priority, for troubleshooting reasons);
  3. provide statistics about the subtask: number of testcases by pass/no-pass with reason, optionally with maximum memory used, maximum time used.

Testcases that cannot contribute to the score (e.g., testcases of subtasks that already have a no-pass result) should be evaluated nonetheless, even if not displayed to the contestants. If this is the case, it is preferable to evaluate them with lower priority.

If a reevaluation occurs, a communication should be sent to the contestant. Preferably, the grading system should be able to compute the differences in scoring after the reevaluation, in order to notify only the affected contestants if their number is small.

The grading system must be able to remove all the feedback during the contest (if the ISC decides that for some reason the feedback is misleading).

External scoreboard

A live scoreboard should be publicly accessible at a URL published on the IOI website; it should stay live from the beginning of day 1 until the official scoreboard is published statically on the website.

The scoreboard should not explicitly indicate medals; nonetheless, only official contestants entering the medal computation should be included, and those who did not turn up should be excluded as soon as possible (preferably before day 2).

The scoreboard should be able to support moderately high traffic: in 2015 there were peaks of 2.5k requests per hour (under one request per second), but if the scoreboard does not auto-update, the number of requests will be much higher.


Statistics

It is preferable for the grading system to offer statistics and/or graphs on the contest (the more the better); in particular:

  • breakdown of submissions and contestants by programming language;
  • submission rate during the contest;
  • evaluation latency during the contest;
  • score distributions (global, and by task).

Appendix: tasks that stressed the grading system

  • Parrot, IOI 2011: a basic two steps task; it is safe for the encoder to scramble the result, as long as some “encryption” is added (e.g., by summing an arbitrary constant) so that the decoder can realize that the contestant tried to cheat if it finds unexpected numbers.
  • Supper, IOI 2012: another two steps task; it was actually implemented not with “encryption” but with an actual validator that ran between encoder and decoder; moreover, the output of the decoder needed to be evaluated with a checker.
  • Odometer, IOI 2012: from the contestant’s point of view this was an output only task (each “output” that the contestant provided was an odometer program), but it was then evaluated as if it were a batch task in the odometer language, with testcases and subtasks.
  • Scales, IOI 2015: a peculiar scoring function.