Inter-rater reliability refers to statistical measures of how consistently different raters score the same thing. A rater is someone who scores or measures a performance, behavior, or skill in a human or animal. Examples of raters include a job interviewer, a psychologist counting how many times a subject scratches their head during an experiment, and a scientist observing how many times an ape picks up a toy.
It is important for the raters' observations to agree as closely as possible, because consistent observations make the measurement reliable, and an unreliable measurement cannot be valid. If the raters differ significantly in their observations, then the measurement instrument or the methodology is flawed and needs to be refined. In some cases the raters may have been trained in different ways and need to be retrained so that they all count observations the same way.
A few statistical measures are used to test whether the differences between raters are significant; common choices include percent agreement, Cohen's kappa, and the intraclass correlation coefficient. As an example, consider a job performance assessment carried out by office managers. If an employee received a score of 9 (with 10 being perfect) from three managers and a score of 2 from a fourth, an inter-rater reliability analysis would indicate that something is wrong with the scoring method. There could be many explanations for this lack of consensus (some managers misunderstood the scoring system and applied it incorrectly, the low-scoring manager held a grudge against the employee, etc.), and inter-rater reliability exposes these possible issues so they can be corrected.
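One of the measures mentioned above, Cohen's kappa, compares the observed agreement between two raters against the agreement expected by chance. The sketch below is a minimal from-scratch implementation; the manager names and ratings are hypothetical, invented only to mirror the scenario of one manager disagreeing with the others across several employees.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who each scored the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e is the agreement expected by chance
    (based on each rater's marginal category frequencies).
    """
    n = len(ratings_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal frequency per category.
    categories = set(ratings_a) | set(ratings_b)
    p_e = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n)
        for c in categories
    )
    if p_e == 1:  # both raters gave one constant score; define kappa as 1
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores given by three managers to the same five employees.
manager_1 = [9, 8, 9, 7, 9]
manager_2 = [9, 8, 9, 7, 9]   # agrees with manager_1 on every employee
manager_3 = [2, 3, 2, 4, 2]   # systematically scores far lower

print(cohens_kappa(manager_1, manager_2))  # perfect agreement: 1.0
print(cohens_kappa(manager_1, manager_3))  # no agreement at all
```

Pairwise kappas like these make the outlier visible: managers 1 and 2 agree perfectly, while manager 3 never matches either of them, which is the signal that the scoring method (or that rater's training) needs review.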