Rich Bragg and Dan Egnor, Feb 2025
This paper describes the way voter rankings are combined into a final ranked list. (Readers are assumed to be familiar with the TERPECA nomination and voting process.) These notes are current for the 2024 awards year.
There’s no perfect way to combine diverse preferences into a single ranking (ask any scholar of political science!). TERPECA’s algorithm is intended to do a reasonable job capturing the overall consensus, with a bias to lower ranking in case of uncertainty or disagreement. If most people rank X above Y, then X should generally come above Y in the overall ranking.
Counting systems like Borda count are popular, but simple counting is skewed by which comparisons are made (each item’s so-called “strength of schedule”). A middling competitor compared to many weak opponents will score more “wins” than a strong competitor compared only to a few other strong competitors.
To do better, various systems also take opponent strength into account. Notable examples include the Elo rating system used for chess, Instant Runoff Voting used in some elections, Microsoft TrueSkill for online game matchmaking, various sports rating systems, and more besides. TERPECA uses one of these…
TERPECA uses a variant of “Keener’s method”, named after James Keener who wrote an influential paper on ranking techniques which used US college football as a motivating example. TERPECA’s algorithm is based on what Keener calls the “direct approach”.
Keener’s method uses a sparse preference matrix that captures known pairwise comparisons between items being compared. The matrix is square, with one row and one column for every item being compared. Cells contain betterness values, non-negative numbers that represent the confidence that the row item is “better than” the column item. For any given pair of items A and B:

- If A is better than B, `matrix[A][B]` (A’s row, B’s column) should be high, and `matrix[B][A]` should be low.
- If B is better than A, `matrix[A][B]` should be low, and `matrix[B][A]` should be high.
- If A and B were never compared, `matrix[A][B]` and `matrix[B][A]` should both be low or zero.

(Note, TERPECA’s matrix uses a slightly different scale; more on that later.)
For example, this is a preference matrix for ice cream flavors:
Flavor | vs. 🍫 | vs. 🍓 | vs. 🍦 | vs. 💩 | vs. 🦄 |
---|---|---|---|---|---|
🍫 Chocolate | - | 2 | 2 | 4 | 0 |
🍓 Strawberry | 1 | - | 0 | 5 | 0 |
🍦 Vanilla | 3 | 0 | - | 5 | 0 |
💩 Poop | 1 | 0 | 0 | - | 0 |
🦄 Unicorn Milk | 5 | 5 | 5 | 5 | - |
(Poop is unpopular, everyone loves unicorn milk, and vanilla ~> chocolate ~> strawberry)
The actual betterness metric depends on the domain. Sports might use the fraction of games won against that opponent. Ice cream might use consumer surveys. For TERPECA, betterness is based on voter rankings, as discussed below. Betterness must be normalized by comparison count, so each row’s sum is the average comparative betterness for that row’s item – a value which is still dependent on strength of schedule, but not on raw comparison count.
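As a minimal sketch of that normalization (hypothetical names; TERPECA’s actual winningness computation is much more involved, as described below):

```typescript
// Sketch: divide each raw betterness score by the row item's comparison
// count, so each row sums to an average rather than a raw total.
// `rawScore` and `comparisons` are hypothetical inputs.
function normalizeByComparisonCount(
  rawScore: number[][],
  comparisons: number[], // number of comparisons item i participated in
): number[][] {
  return rawScore.map((row, i) =>
    row.map((score) => (comparisons[i] > 0 ? score / comparisons[i] : 0)),
  );
}
```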
Then, the eigenvector of the preference matrix corresponding to its largest eigenvalue is found; this eigenvector assigns a schedule-independent “strength score” to every item, usable for ranking.
For example, these are strength scores for the ice cream matrix above (computed with an online eigenvector calculator):
Flavor | Strength | Remultiplication cross-check |
---|---|---|
🍫 Chocolate | 0.292 | 2*0.158 + 2*0.300 + 4*0.071 = 1.200 ~= 0.292 * 4.1 |
🍓 Strawberry | 0.158 | 1*0.292 + 5*0.071 = 0.647 ~= 0.158 * 4.1 |
🍦 Vanilla | 0.300 | 3*0.292 + 5*0.071 = 1.231 ~= 0.300 * 4.1 |
💩 Poop | 0.071 | 1*0.292 = 0.292 ~= 0.071 * 4.1 |
🦄 Unicorn Milk | 1.000 | 5*0.292 + 5*0.158 + 5*0.300 + 5*0.071 = 4.105 ~= 1.000 * 4.1 |
And these are ice cream flavors ranked by strength score:
Rank | Flavor | Strength |
---|---|---|
#1 🏆 | 🦄 Unicorn Milk | 1.000 |
#2 | 🍦 Vanilla | 0.300 |
#3 | 🍫 Chocolate | 0.292 |
#4 | 🍓 Strawberry | 0.158 |
#5 | 💩 Poop | 0.071 |
It may seem like quite a leap of logic to suddenly apply an abstract concept from linear algebra! In reality the matrix has been crafted for this to make sense. Intuitively, each item’s strength is the sum, over all opponents, of the comparison score times that opponent’s strength; that is exactly what the eigenvector equation expresses, with the eigenvalue as the scaling factor.
By the Perron-Frobenius theorem, the largest eigenvalue has a unique eigenvector (up to scaling) with all-positive entries, provided every betterness value is nonnegative and the overall matrix is “irreducible”, meaning there is a path of nonzero comparisons from every item to every other item. (Otherwise there would be multiple “islands” of items with no comparisons between the islands.)
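As an illustrative sketch (not from the TERPECA codebase), irreducibility can be checked by treating each nonzero cell as a directed edge and testing strong connectivity:

```typescript
// Sketch: a matrix is irreducible when its directed graph (an edge i -> j
// wherever matrix[i][j] > 0) is strongly connected. Equivalently: item 0
// reaches every item, and every item reaches item 0.
function isIrreducible(matrix: number[][]): boolean {
  const n = matrix.length;
  const reachesAllFromZero = (edge: (from: number, to: number) => boolean): boolean => {
    const seen = new Set<number>([0]);
    const queue = [0];
    while (queue.length > 0) {
      const from = queue.shift()!;
      for (let to = 0; to < n; to++) {
        if (!seen.has(to) && edge(from, to)) {
          seen.add(to);
          queue.push(to);
        }
      }
    }
    return seen.size === n;
  };
  // Forward edges, then reversed edges (reachability back to item 0).
  return reachesAllFromZero((a, b) => matrix[a][b] > 0) &&
         reachesAllFromZero((a, b) => matrix[b][a] > 0);
}
```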
This technique is not magically perfect, but it does have some useful properties: strength scores account for opponent strength rather than raw win counts, and items can be ranked against each other even when they were never directly compared.
For subjective ratings like TERPECA, there is no single correct way to compute winningness (TERPECA’s betterness metric); the formula has to be empirically tuned for sensible outcomes. In fact, most of the complexity of TERPECA ranking lies in computing pairwise winningness from voter rankings…
TERPECA uses a multi-factor strategy to calculate the winningness of each X vs. Y comparison:

1. Tally weighted and unweighted wins and losses for every pair of rooms across all voter rankings.
2. Convert those tallies into direct confidence values using the Wilson confidence interval midpoint.
3. Add indirect evidence for transitive relations through third-party “pivot” rooms.
4. Combine direct and indirect evidence into a final confidence value for each preference matrix cell.

These steps and their wrinkles are described in more detail below.
For every X vs. Y comparison, “wins” (when X is above Y in a voter’s ranking) and “losses” (when X is below Y) are counted in both weighted and unweighted totals across all voter rankings:
```mermaid
flowchart LR
subgraph For every voter ranking list
subgraph For each room X on the list
portion("p = 1 / versions ranked") --o roomWinUnw & roomLossUnw
portion --> weight("weight = p / sqrt(list size)") --o roomWin & roomLoss
subgraph For each room Y below X
roomWinUnw("unweightedWins[X vs. Y] += p")
roomWin("weightedWins[X vs. Y] += weight")
end
subgraph For each room Y above X
roomLossUnw("unweightedLosses[X vs. Y] += p")
roomLoss("weightedLosses[X vs. Y] += weight")
end
end
end
```
In the special case where variants of a room were ranked separately but ultimately scored as one room, and the same voter ranked multiple versions in the same list, wins and losses are apportioned fractionally (p in the diagram above) for weighted and unweighted tallies. This is rare.
Inputs to weighted tallies are further divided by the square root of the voter’s list size, to soften the effect of voters with long lists.
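Here is a minimal TypeScript sketch of one way to implement this tally pass, consistent with the worked example that follows; all names are hypothetical, and the production code in analyze.component.ts differs in detail:

```typescript
// Sketch of the tally pass above. A ranking is a voter's ordered list of
// room IDs, best first; scoringId() maps a ranked variant to the room it is
// ultimately scored as (identity unless variants are merged).
type Tally = Map<string, number>; // keyed by "X|Y"

function tallyRanking(
  ranking: string[],
  scoringId: (variant: string) => string,
  unweightedWins: Tally, weightedWins: Tally,
  unweightedLosses: Tally, weightedLosses: Tally,
) {
  const bump = (t: Tally, x: string, y: string, v: number) =>
    t.set(`${x}|${y}`, (t.get(`${x}|${y}`) ?? 0) + v);
  const versions = (room: string) =>
    ranking.filter((r) => scoringId(r) === room).length;
  for (let i = 0; i < ranking.length; i++) {
    for (let j = i + 1; j < ranking.length; j++) {
      const x = scoringId(ranking[i]); // ranked above
      const y = scoringId(ranking[j]); // ranked below
      if (x === y) continue; // two variants of the same room: no comparison
      // Apportion fractionally across ranked variants of each room
      // (p in the diagram above).
      const p = 1 / (versions(x) * versions(y));
      const weight = p / Math.sqrt(ranking.length);
      bump(unweightedWins, x, y, p);   bump(weightedWins, x, y, weight);
      bump(unweightedLosses, y, x, p); bump(weightedLosses, y, x, weight);
    }
  }
}
```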
For example, if a voter submitted this ranking:
Rank | Room |
---|---|
1 | 😣 Pain Room (hard mode) |
2 | 🦄 Fluffy Unicorns |
3 | 😫 Pain Room (hard with pain) |
4 | 🔢 Escape The Eigenvector |
5 | 😔 Pain Room (easy mode) |
6 | 💩 Poop Ice Cream: The Escape Room Experience |
This voter would have a weight of 1/sqrt(6) ~= 0.408, and if Pain Room variants are merged for scoring, variant-specific wins and losses would be divided by 3, so the following wins and losses would be tallied:
Comparison | Wins | Weighted wins | Losses | Weighted losses | Notes |
---|---|---|---|---|---|
😔😣😫 vs. 🦄 | 0.333 | 0.136 | 0.667 | 0.272 | 1 variant wins, 2 lose |
😔😣😫 vs. 🔢 | 0.667 | 0.272 | 0.333 | 0.136 | 2 variants win, 1 loses |
😔😣😫 vs. 💩 | 1.0 | 0.408 | 0 | 0 | all variants win |
🦄 vs. 😔😣😫 | 0.667 | 0.272 | 0.333 | 0.136 | wins against 2 variants, loses against 1 |
🦄 vs. 🔢 | 1.0 | 0.408 | 0 | 0 | wins |
🦄 vs. 💩 | 1.0 | 0.408 | 0 | 0 | wins |
🔢 vs. 🦄 | 0 | 0 | 1.0 | 0.408 | loses |
🔢 vs. 😔😣😫 | 0.333 | 0.136 | 0.667 | 0.272 | wins against 1 variant, loses against 2 |
🔢 vs. 💩 | 1.0 | 0.408 | 0 | 0 | wins |
💩 vs. 😔😣😫 | 0 | 0 | 1.0 | 0.408 | loses |
💩 vs. 🦄 | 0 | 0 | 1.0 | 0.408 | loses |
💩 vs. 🔢 | 0 | 0 | 1.0 | 0.408 | loses |
(Note the symmetry that the scores for X vs. Y are the win<>loss mirror of Y vs. X.)
Weighted and unweighted wins and losses are tallied across all valid rankings for every pair of rooms, and these values are used in subsequent steps.
For the eigenvector algorithm to work well, preference matrix betterness scores should represent the confidence that one room should be ranked higher than another. For example, 70/100 voters ranking X above Y gives more confidence than 4/5 voters doing so.
For every X vs. Y comparison, TERPECA uses the Wilson confidence interval midpoint with a p-value of 0.05 to convert modified win/loss counts to confidence values:
```mermaid
flowchart LR
subgraph for every room X
subgraph "for every room Y (not X)"
reinflation("reinflation for X vs. Y = (unweighted wins + losses) / (weighted wins + losses)")
reinflation --> reinflatedWins & reinflatedLosses
weightedWins("weightedWins") --> reinflatedWins
weightedLosses("weightedLosses") --> reinflatedLosses
reinflatedWins((×)) --> directWilson
reinflatedLosses((×)) --> directWilson
directWilson((Wilson conf.)) --> directConfidence("directConfidence [X vs. Y]")
end
end
```
The weighted wins and losses from earlier are “reinflated” (scaled back up by the inverse of the average weight) before the Wilson confidence midpoint is computed. The weighting and reinflation mean that while the influence of “supervoters” on the ranking of a room is reduced, the actual number of comparisons made is still taken into account when determining confidence.
For example, if rooms had these wins and losses (symmetric cases elided):
Comparison | Wins | Weighted wins | Losses | Weighted losses |
---|---|---|---|---|
🍎 vs. 🍌 | 70 | 30 | 30 | 20 |
🍎 vs. 🍒 | 4 | 3 | 8 | 6 |
🍌 vs. 🍒 | 1 | 1 | 0 | 0 |
The computed values would be as follows:
Comparison | Reinflation factor | Reinflated wins | Reinflated losses | Wilson score |
---|---|---|---|---|
🍎 vs. 🍌 | (70+30)/(30+20)=2 | 60 | 40 | 0.596 |
🍎 vs. 🍒 | (4+8)/(3+6)≈1.33 | 4 | 8 | 0.374 |
🍌 vs. 🍒 | (1+0)/(1+0)=1 | 1 | 0 | 0.603 |
Note again that TERPECA uses the Wilson confidence interval midpoint, not the more commonly used lower bound! With p=0.05, the midpoint is roughly equivalent to always adding two wins and two losses, then taking the simple win fraction.
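A sketch of the conversion in TypeScript, reproducing the table above (names are hypothetical):

```typescript
// Sketch of the conversion above; z = 1.96 corresponds to the two-sided
// 95% interval (p = 0.05).
const Z = 1.96;

// Midpoint of the Wilson score interval for `wins` successes out of
// `wins + losses` trials.
function wilsonMidpoint(wins: number, losses: number): number {
  const n = wins + losses;
  if (n === 0) return 0.5; // assumed convention: no evidence either way
  const pHat = wins / n;
  const z2 = Z * Z;
  return (pHat + z2 / (2 * n)) / (1 + z2 / n);
}

// Reinflate weighted counts to the original comparison volume, then score.
function directConfidence(
  wins: number, weightedWins: number,
  losses: number, weightedLosses: number,
): number {
  const reinflation = (wins + losses) / (weightedWins + weightedLosses);
  return wilsonMidpoint(weightedWins * reinflation, weightedLosses * reinflation);
}

// From the table above: directConfidence(70, 30, 30, 20) ≈ 0.596
```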
This “directConfidence” value would be suitable for a preference matrix, but TERPECA has more factors to consider…
The eigenvalue method does infer transitive relations (if A beats B, and B beats C, then A beats C), but TERPECA adds more explicit weighting for transitive relations, which empirically improves ranking quality. This may be the most subtle aspect of the ranking algorithm.
```mermaid
flowchart LR
subgraph For every room X
subgraph "For every room Y (not X)"
subgraph "For every room P (not X, Y)"
pivotFilter("valid pivot?") --> pivotWinConfidence & pivotLossConfidence
pivotWinConfidence("confidence for X > P > Y")
pivotLossConfidence("confidence for X < P < Y")
end
pivotWinConfidence --o indirectWinConfidence("indirectWinConfidence for X > (all P) > Y")
pivotLossConfidence --o indirectLossConfidence("indirectLossConfidence for X < (all P) < Y")
indirectWinConfidence & indirectLossConfidence --> reverseConfidence
reverseConfidence((inv. Wilson)) --> indirectWins & indirectLosses
indirectWins("indirectWins[X vs. Y]")
indirectLosses("indirectLosses[X vs. Y]")
end
end
```
For every possible X vs. Y comparison (even pairs with no direct comparison), other rooms are examined for suitability as third-party “pivots”. A room P is a valid pivot for X and Y only if it has direct comparison data with both X and Y.
For valid pivots, a “win confidence” (confidence in X > P > Y) and a “loss confidence” (confidence in X < P < Y) are computed as if by this pseudocode:
```
pivotStrength =
  min(weightedWins[X vs. P], weightedLosses[Y vs. P]) +
  min(weightedLosses[X vs. P], weightedWins[Y vs. P])

scale =
  min(pivotStrength, sqrt(pivotStrength)) /  // to reduce the impact of any one pivot
  (directConfidence[X vs. P] + directConfidence[Y vs. P])

indirectWinConfidence[X vs. Y] += scale * directConfidence[X vs. P]
indirectLossConfidence[X vs. Y] += scale * directConfidence[Y vs. P]
totalScale[X vs. Y] += scale
```
The indirect confidence value totals are then scaled as needed so the sum is no greater than the number of pivots, to better reflect the actual number of independent data points. Finally, these confidence values (derived from directConfidence values) are put through an inverse of the Wilson confidence midpoint computation to produce synthetic “indirectWins[X vs. Y]” and “indirectLosses[X vs. Y]” values.
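As an algebraic sketch, the Wilson midpoint formula can be inverted in closed form; how the effective comparison count n is derived from the pivot totals is an assumption here, not taken from the source:

```typescript
// Sketch: closed-form inverse of the Wilson midpoint. Given a target
// confidence c and an effective comparison count n (assumed input), solve
//   c = (w/n + z²/(2n)) / (1 + z²/n)
// for w, with losses = n - w.
const Z = 1.96;

function inverseWilsonMidpoint(c: number, n: number): { wins: number; losses: number } {
  const z2 = Z * Z;
  const wins = c * (n + z2) - z2 / 2; // may need clamping to [0, n] for extreme c
  return { wins, losses: n - wins };
}
```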
As a last step, for every X vs. Y comparison, wins and losses from direct comparisons (weighted) and indirect comparisons (synthetic, as above) are added into total win/loss scores for another Wilson confidence evaluation. (Note that this does not use directConfidence values, which only feed into indirect comparisons.)
```mermaid
flowchart LR
subgraph For every room X
subgraph "For every room Y (not X)"
directWins("weightedWins") --> allWinsAdd
indirectWins("indirectWins") --> allWinsAdd
allWinsAdd((#plus;)) --> finalWilson
directLosses("weightedLosses") --> allLossesAdd
indirectLosses("indirectLosses") --> allLossesAdd
allLossesAdd((#plus;)) --> finalWilson
finalWilson((Wilson conf.)) --> finalConfidence("matrix[X vs. Y]")
end
end
```
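In sketch form, assuming the wilsonMidpoint helper from earlier (names are hypothetical):

```typescript
// Sketch: the final matrix cell combines direct (weighted) and indirect
// (synthetic) win/loss evidence in one more Wilson midpoint evaluation.
function finalConfidence(
  weightedWins: number, indirectWins: number,
  weightedLosses: number, indirectLosses: number,
): number {
  return wilsonMidpoint(weightedWins + indirectWins, weightedLosses + indirectLosses);
}
```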
Once the pairwise winningness values are computed as above, the power iteration method is used to find the eigenvector with the largest eigenvalue. As discussed above, this eigenvector is directly usable for sorting the final ranked list.
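For example, a minimal power iteration sketch, applied to the ice cream matrix from earlier, reproduces the strength scores shown above:

```typescript
// Sketch: power iteration converges to the dominant eigenvector, which is
// the vector of strength scores.
function powerIteration(matrix: number[][], iterations = 100): number[] {
  const n = matrix.length;
  let v: number[] = new Array(n).fill(1);
  for (let it = 0; it < iterations; it++) {
    const next = matrix.map((row) => row.reduce((sum, m, j) => sum + m * v[j], 0));
    const max = Math.max(...next);
    v = next.map((x) => x / max); // normalize so the strongest item scores 1.0
  }
  return v;
}

// Flavors in order: [🍫, 🍓, 🍦, 💩, 🦄]
const iceCream = [
  [0, 2, 2, 4, 0],
  [1, 0, 0, 5, 0],
  [3, 0, 0, 5, 0],
  [1, 0, 0, 0, 0],
  [5, 5, 5, 5, 0],
];
console.log(powerIteration(iceCream)); // ≈ [0.292, 0.158, 0.300, 0.071, 1.000]
```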
The TERPECA winningness computation may seem arbitrary in several ways…
These wrinkles were all added to solve real observed problems, most of which were related to the interplay of confidence and evidence across the matrix.
If a very experienced player ranks a game particularly high or low, this can impact unweighted win/loss counts by hundreds of points across the grid. However, this is only one person’s opinion, and can lead to outlier results that don’t reflect consensus.
Conversely, if dozens of voters clearly prefer one room to another, that should weigh heavily into ranking, even if the difference in the Wilson confidence score is relatively small.
This is all to say that, within the matrix, some data points are more “independent” than others, which is not accounted for by Keener’s algorithm or other competitive ranking systems, most of which assume each data point is independent (typically the outcome of a separately played game). So the wrinkles above, including pivot analysis and the various reweighting factors, have been introduced and revised over time to tweak the algorithm into better reflecting “the will of the people” as best we can establish it.
See also: literature on ranking systems, and the implementation in voter-portal/…/analyze.component.ts (only visible to project members).