Rich Bragg and Dan Egnor, Feb 2025
This paper describes the way voter rankings are combined into a final ranked list. (Readers are assumed to be familiar with the TERPECA nomination and voting process.) These notes are current for the 2024 awards year.
There’s no perfect way to combine diverse preferences into a single ranking (ask any scholar of political science!). TERPECA’s algorithm is intended to do a reasonable job capturing the overall consensus, with a bias to lower ranking in case of uncertainty or disagreement. If most people rank X above Y, then X should generally come above Y in the overall ranking.
Counting systems like Borda count are popular, but simple counting is skewed by which comparisons are made (each item’s so-called “strength of schedule”). A middling competitor compared to many weak opponents will score more “wins” than a strong competitor compared only to a few other strong competitors.
To do better, various systems also take opponent strength into account. Notable examples include the Elo rating system used for chess, Instant Runoff Voting used in some elections, Microsoft TrueSkill for online game matchmaking, various sports rating systems, and more besides. TERPECA uses one of these…
TERPECA uses a variant of “Keener’s method”, named after James Keener who wrote an influential paper on ranking techniques which used US college football as a motivating example. TERPECA’s algorithm is based on what Keener calls the “direct approach”.
Keener’s method uses a sparse preference matrix that captures known pairwise comparisons between items being compared. The matrix is square, with one row and one column for every item being compared. Cells contain betterness values, non-negative numbers that represent the confidence that the row item is “better than” the column item. For any given pair of items A and B:

- If A is better than B, `matrix[A][B]` (A’s row, B’s column) should be high, and `matrix[B][A]` should be low.
- If B is better than A, `matrix[A][B]` should be low, and `matrix[B][A]` should be high.
- If A and B were never compared, `matrix[A][B]` and `matrix[B][A]` should both be low or zero.

(Note, TERPECA’s matrix uses a slightly different scale; more on that later.)
For example, this is a preference matrix for ice cream flavors:
Flavor | vs. 🍫 | vs. 🍓 | vs. 🍦 | vs. 💩 | vs. 🦄 |
---|---|---|---|---|---|
🍫 Chocolate | - | 2 | 2 | 4 | 0 |
🍓 Strawberry | 1 | - | 0 | 5 | 0 |
🍦 Vanilla | 3 | 0 | - | 5 | 0 |
💩 Poop | 1 | 0 | 0 | - | 0 |
🦄 Unicorn Milk | 5 | 5 | 5 | 5 | - |
(Poop is unpopular, everyone loves unicorn milk, and vanilla ~> chocolate ~> strawberry)
The actual betterness metric depends on the domain. Sports might use the fraction of games won against that opponent. Ice cream might use consumer surveys. For TERPECA, betterness is based on voter rankings, as discussed below. Betterness must be normalized by comparison count, so each row’s sum is the average comparative betterness for that row’s item – a value which is still dependent on strength of schedule, but not on raw comparison count.
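As a minimal sketch of that normalization (hypothetical names; TERPECA’s actual winningness computation is much more involved, as described below):

```typescript
// Sketch: divide each raw betterness score by the row item's comparison
// count, so each row sums to an average rather than a raw total.
// `rawScore` and `comparisons` are hypothetical inputs.
function normalizeByComparisonCount(
  rawScore: number[][],
  comparisons: number[], // number of comparisons item i participated in
): number[][] {
  return rawScore.map((row, i) =>
    row.map((score) => (comparisons[i] > 0 ? score / comparisons[i] : 0)),
  );
}
```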
Then, the eigenvector of the preference matrix corresponding to its largest eigenvalue is found; this eigenvector assigns a schedule-independent “strength score” to every item, usable for ranking.
For example, these are strength scores for the ice cream matrix above (computed with an online eigenvector calculator):
Flavor | Strength | Remultiplication cross-check |
---|---|---|
🍫 Chocolate | 0.292 | 2*0.158 + 2*0.300 + 4*0.071 = 1.200 ~= 0.292 * 4.1 |
🍓 Strawberry | 0.158 | 1*0.292 + 5*0.071 = 0.647 ~= 0.158 * 4.1 |
🍦 Vanilla | 0.300 | 3*0.292 + 5*0.071 = 1.231 ~= 0.300 * 4.1 |
💩 Poop | 0.071 | 1*0.292 = 0.292 ~= 0.071 * 4.1 |
🦄 Unicorn Milk | 1.000 | 5*0.292 + 5*0.158 + 5*0.300 + 5*0.071 = 4.105 ~= 1.000 * 4.1 |
And these are ice cream flavors ranked by strength score:
Rank | Flavor | Strength |
---|---|---|
#1 🏆 | 🦄 Unicorn Milk | 1.000 |
#2 | 🍦 Vanilla | 0.300 |
#3 | 🍫 Chocolate | 0.292 |
#4 | 🍓 Strawberry | 0.158 |
#5 | 💩 Poop | 0.071 |
It may seem like quite a leap of logic to suddenly apply an abstract concept from linear algebra! In reality the matrix has been crafted for this to make sense. Intuitively, each item’s strength is the sum, over all opponents, of the comparison score times that opponent’s strength; that is exactly what the eigenvector equation expresses, with the eigenvalue as the scaling factor.
By the Perron-Frobenius theorem, the largest eigenvalue has a unique eigenvector (up to scaling) with all-positive entries, provided every betterness value is nonnegative and the overall matrix is “irreducible”, meaning there is a path of nonzero comparisons from every item to every other item. (Otherwise there would be multiple “islands” of items with no comparisons between the islands.)
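As an illustrative sketch (not from the TERPECA codebase), irreducibility can be checked by treating each nonzero cell as a directed edge and testing strong connectivity:

```typescript
// Sketch: a matrix is irreducible when its directed graph (an edge i -> j
// wherever matrix[i][j] > 0) is strongly connected. Equivalently: item 0
// reaches every item, and every item reaches item 0.
function isIrreducible(matrix: number[][]): boolean {
  const n = matrix.length;
  const reachesAllFromZero = (edge: (from: number, to: number) => boolean): boolean => {
    const seen = new Set<number>([0]);
    const queue = [0];
    while (queue.length > 0) {
      const from = queue.shift()!;
      for (let to = 0; to < n; to++) {
        if (!seen.has(to) && edge(from, to)) {
          seen.add(to);
          queue.push(to);
        }
      }
    }
    return seen.size === n;
  };
  // Forward edges, then reversed edges (reachability back to item 0).
  return reachesAllFromZero((a, b) => matrix[a][b] > 0) &&
         reachesAllFromZero((a, b) => matrix[b][a] > 0);
}
```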
This technique is not magically perfect, but it does have some useful properties: strength scores account for opponent strength rather than raw win counts, and items can be ranked against each other even when they were never directly compared.
For subjective ratings like TERPECA, there is no single correct way to compute winningness (TERPECA’s betterness metric); the formula has to be empirically tuned for sensible outcomes. In fact, most of the complexity of TERPECA ranking lies in computing pairwise winningness from voter rankings…
TERPECA uses a multi-factor strategy to calculate the winningness of each X vs. Y comparison:

1. Tally weighted and unweighted wins and losses for every pair of rooms across all voter rankings.
2. Convert those tallies into direct confidence values using the Wilson confidence interval midpoint.
3. Add indirect evidence for transitive relations through third-party “pivot” rooms.
4. Combine direct and indirect evidence into a final confidence value for each preference matrix cell.

These steps and their wrinkles are described in more detail below.
For every X vs. Y comparison, “wins” (when X is above Y in a voter’s ranking) and “losses” (when X is below Y) are counted in both weighted and unweighted totals across all voter rankings:
```mermaid
flowchart LR
subgraph For every voter ranking list
subgraph For each room X on the list
portion("p = 1 / versions ranked") --o roomWinUnw & roomLossUnw
portion --> weight("weight = p / sqrt(list size)") --o roomWin & roomLoss
subgraph For each room Y below X
roomWinUnw("unweightedWins[X vs. Y] += p")
roomWin("weightedWins[X vs. Y] += weight")
end
subgraph For each room Y above X
roomLossUnw("unweightedLosses[X vs. Y] += p")
roomLoss("weightedLosses[X vs. Y] += weight")
end
end
end
```
In the special case where variants of a room were ranked separately but ultimately scored as one room, and the same voter ranked multiple versions in the same list, wins and losses are apportioned fractionally (p in the diagram above) for weighted and unweighted tallies. This is rare.
Inputs to weighted tallies are further divided by the square root of the voter’s list size, to soften the effect of voters with long lists.
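Here is a minimal TypeScript sketch of one way to implement this tally pass, consistent with the worked example that follows; all names are hypothetical, and the production code in analyze.component.ts differs in detail:

```typescript
// Sketch of the tally pass above. A ranking is a voter's ordered list of
// room IDs, best first; scoringId() maps a ranked variant to the room it is
// ultimately scored as (identity unless variants are merged).
type Tally = Map<string, number>; // keyed by "X|Y"

function tallyRanking(
  ranking: string[],
  scoringId: (variant: string) => string,
  unweightedWins: Tally, weightedWins: Tally,
  unweightedLosses: Tally, weightedLosses: Tally,
) {
  const bump = (t: Tally, x: string, y: string, v: number) =>
    t.set(`${x}|${y}`, (t.get(`${x}|${y}`) ?? 0) + v);
  const versions = (room: string) =>
    ranking.filter((r) => scoringId(r) === room).length;
  for (let i = 0; i < ranking.length; i++) {
    for (let j = i + 1; j < ranking.length; j++) {
      const x = scoringId(ranking[i]); // ranked above
      const y = scoringId(ranking[j]); // ranked below
      if (x === y) continue; // two variants of the same room: no comparison
      // Apportion fractionally across ranked variants of each room
      // (p in the diagram above).
      const p = 1 / (versions(x) * versions(y));
      const weight = p / Math.sqrt(ranking.length);
      bump(unweightedWins, x, y, p);   bump(weightedWins, x, y, weight);
      bump(unweightedLosses, y, x, p); bump(weightedLosses, y, x, weight);
    }
  }
}
```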
For example, if a voter submitted this ranking:
Rank | Room |
---|---|
1 | 😣 Pain Room (hard mode) |
2 | 🦄 Fluffy Unicorns |
3 | 😫 Pain Room (hard with pain) |
4 | 🔢 Escape The Eigenvector |
5 | 😔 Pain Room (easy mode) |
6 | 💩 Poop Ice Cream: The Escape Room Experience |
This voter would have a weight of 1/sqrt(6) ~= 0.408, and if Pain Room variants are merged for scoring, variant-specific wins and losses would be divided by 3, so the following wins and losses would be tallied:
Comparison | Wins | Weighted wins | Losses | Weighted losses | Notes |
---|---|---|---|---|---|
😔😣😫 vs. 🦄 | 0.333 | 0.136 | 0.667 | 0.272 | 1 variant wins, 2 lose |
😔😣😫 vs. 🔢 | 0.667 | 0.272 | 0.333 | 0.136 | 2 variants win, 1 loses |
😔😣😫 vs. 💩 | 1.0 | 0.408 | 0 | 0 | all variants win |
🦄 vs. 😔😣😫 | 0.667 | 0.272 | 0.333 | 0.136 | wins against 2 variants, loses against 1 |
🦄 vs. 🔢 | 1.0 | 0.408 | 0 | 0 | wins |
🦄 vs. 💩 | 1.0 | 0.408 | 0 | 0 | wins |
🔢 vs. 🦄 | 0 | 0 | 1.0 | 0.408 | loses |
🔢 vs. 😔😣😫 | 0.333 | 0.136 | 0.667 | 0.272 | wins against 1 variant, loses against 2 |
🔢 vs. 💩 | 1.0 | 0.408 | 0 | 0 | wins |
💩 vs. 😔😣😫 | 0 | 0 | 1.0 | 0.408 | loses |
💩 vs. 🦄 | 0 | 0 | 1.0 | 0.408 | loses |
💩 vs. 🔢 | 0 | 0 | 1.0 | 0.408 | loses |
(Note the symmetry that the scores for X vs. Y are the win<>loss mirror of Y vs. X.)
Weighted and unweighted wins and losses are tallied across all valid rankings for every pair of rooms, and these values are used in subsequent steps.
For the eigenvector algorithm to work well, preference matrix betterness scores should represent the confidence that one room should be ranked higher than another. For example, 70/100 voters ranking X above Y gives more confidence than 4/5 voters doing so.
For every X vs. Y comparison, TERPECA uses the Wilson confidence interval midpoint with a p-value of 0.05 to convert modified win/loss counts to confidence values:
```mermaid
flowchart LR
subgraph for every room X
subgraph "for every room Y (not X)"
reinflation("reinflation for X vs. Y = (unweighted wins + losses) / (weighted wins + losses)")
reinflation --> reinflatedWins & reinflatedLosses
weightedWins("weightedWins") --> reinflatedWins
weightedLosses("weightedLosses") --> reinflatedLosses
reinflatedWins((×)) --> directWilson
reinflatedLosses((×)) --> directWilson
directWilson((Wilson conf.)) --> directConfidence("directConfidence [X vs. Y]")
end
end
```
The weighted wins and losses from earlier are “reinflated” (scaled back up by the inverse of the average weight) before the Wilson confidence midpoint is computed. The weighting and reinflation mean that while the influence of “supervoters” on the ranking of a room is reduced, the actual number of comparisons made is still taken into account when determining confidence.
For example, if rooms had these wins and losses (symmetric cases elided):
Comparison | Wins | Weighted wins | Losses | Weighted losses |
---|---|---|---|---|
🍎 vs. 🍌 | 70 | 30 | 30 | 20 |
🍎 vs. 🍒 | 4 | 3 | 8 | 6 |
🍌 vs. 🍒 | 1 | 1 | 0 | 0 |
The computed values would be as follows:
Comparison | Reinflation factor | Reinflated wins | Reinflated losses | Wilson score |
---|---|---|---|---|
🍎 vs. 🍌 | (70+30)/(30+20)=2 | 60 | 40 | 0.596 |
🍎 vs. 🍒 | (4+8)/(3+6)≈1.33 | 4 | 8 | 0.374 |
🍌 vs. 🍒 | (1+0)/(1+0)=1 | 1 | 0 | 0.603 |
Note again that TERPECA uses the Wilson confidence interval midpoint, not the more commonly used lower bound! With p=0.05, the midpoint is roughly equivalent to always adding two wins and two losses, then taking the simple win fraction.
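A sketch of the conversion in TypeScript, reproducing the table above (names are hypothetical):

```typescript
// Sketch of the conversion above; z = 1.96 corresponds to the two-sided
// 95% interval (p = 0.05).
const Z = 1.96;

// Midpoint of the Wilson score interval for `wins` successes out of
// `wins + losses` trials.
function wilsonMidpoint(wins: number, losses: number): number {
  const n = wins + losses;
  if (n === 0) return 0.5; // assumed convention: no evidence either way
  const pHat = wins / n;
  const z2 = Z * Z;
  return (pHat + z2 / (2 * n)) / (1 + z2 / n);
}

// Reinflate weighted counts to the original comparison volume, then score.
function directConfidence(
  wins: number, weightedWins: number,
  losses: number, weightedLosses: number,
): number {
  const reinflation = (wins + losses) / (weightedWins + weightedLosses);
  return wilsonMidpoint(weightedWins * reinflation, weightedLosses * reinflation);
}

// From the table above: directConfidence(70, 30, 30, 20) ≈ 0.596
```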
This “directConfidence” value would be suitable for a preference matrix, but TERPECA has more factors to consider…
The eigenvalue method does infer transitive relations (if A beats B, and B beats C, then A beats C), but TERPECA adds more explicit weighting for transitive relations, which empirically improves ranking quality. This may be the most subtle aspect of the ranking algorithm.
```mermaid
flowchart LR
subgraph For every room X
subgraph "For every room Y (not X)"
subgraph "For every room P (not X, Y)"
pivotFilter("valid pivot?") --> pivotWinConfidence & pivotLossConfidence
pivotWinConfidence("confidence for X > P > Y")
pivotLossConfidence("confidence for X < P < Y")
end
pivotWinConfidence --o indirectWinConfidence("indirectWinConfidence for X > (all P) > Y")
pivotLossConfidence --o indirectLossConfidence("indirectLossConfidence for X < (all P) < Y")
indirectWinConfidence & indirectLossConfidence --> reverseConfidence
reverseConfidence((inv. Wilson)) --> indirectWins & indirectLosses
indirectWins("indirectWins[X vs. Y]")
indirectLosses("indirectLosses[X vs. Y]")
end
end
```
For every possible X vs. Y comparison (even pairs with no direct comparison), other rooms are examined for suitability as third-party “pivots”. A room P is a valid pivot for X and Y only if it has direct comparison data with both X and Y.
For valid pivots, a “win confidence” (confidence in X > P > Y) and a “loss confidence” (confidence in X < P < Y) are computed as if by this pseudocode:
```
pivotStrength =
  min(weightedWins[X vs. P], weightedLosses[Y vs. P]) +
  min(weightedLosses[X vs. P], weightedWins[Y vs. P])

scale =
  min(pivotStrength, sqrt(pivotStrength)) /  // to reduce the impact of any one pivot
  (directConfidence[X vs. P] + directConfidence[Y vs. P])

indirectWinConfidence[X vs. Y] += scale * directConfidence[X vs. P]
indirectLossConfidence[X vs. Y] += scale * directConfidence[Y vs. P]
totalScale[X vs. Y] += scale
```
The indirect confidence value totals are then scaled as needed so the sum is no greater than the number of pivots, to better reflect the actual number of independent data points. Finally, these confidence values (derived from directConfidence values) are put through an inverse of the Wilson confidence midpoint computation to produce synthetic “indirectWins[X vs. Y]” and “indirectLosses[X vs. Y]” values.
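As an algebraic sketch, the Wilson midpoint formula can be inverted in closed form; how the effective comparison count n is derived from the pivot totals is an assumption here, not taken from the source:

```typescript
// Sketch: closed-form inverse of the Wilson midpoint. Given a target
// confidence c and an effective comparison count n (assumed input), solve
//   c = (w/n + z²/(2n)) / (1 + z²/n)
// for w, with losses = n - w.
const Z = 1.96;

function inverseWilsonMidpoint(c: number, n: number): { wins: number; losses: number } {
  const z2 = Z * Z;
  const wins = c * (n + z2) - z2 / 2; // may need clamping to [0, n] for extreme c
  return { wins, losses: n - wins };
}
```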
As a last step, for every X vs. Y comparison, wins and losses from direct comparisons (weighted) and indirect comparisons (synthetic, as above) are added into total win/loss scores for another Wilson confidence evaluation. (Note that this does not use directConfidence values, which only feed into indirect comparisons.)
```mermaid
flowchart LR
subgraph For every room X
subgraph "For every room Y (not X)"
directWins("weightedWins") --> allWinsAdd
indirectWins("indirectWins") --> allWinsAdd
allWinsAdd((#plus;)) --> finalWilson
directLosses("weightedLosses") --> allLossesAdd
indirectLosses("indirectLosses") --> allLossesAdd
allLossesAdd((#plus;)) --> finalWilson
finalWilson((Wilson conf.)) --> finalConfidence("matrix[X vs. Y]")
end
end
```
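In sketch form, assuming the wilsonMidpoint helper from earlier (names are hypothetical):

```typescript
// Sketch: the final matrix cell combines direct (weighted) and indirect
// (synthetic) win/loss evidence in one more Wilson midpoint evaluation.
function finalConfidence(
  weightedWins: number, indirectWins: number,
  weightedLosses: number, indirectLosses: number,
): number {
  return wilsonMidpoint(weightedWins + indirectWins, weightedLosses + indirectLosses);
}
```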
Once the pairwise winningness values are computed as above, the power iteration method is used to find the eigenvector with the largest eigenvalue. As discussed above, this eigenvector is directly usable for sorting the final ranked list.
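For example, a minimal power iteration sketch, applied to the ice cream matrix from earlier, reproduces the strength scores shown above:

```typescript
// Sketch: power iteration converges to the dominant eigenvector, which is
// the vector of strength scores.
function powerIteration(matrix: number[][], iterations = 100): number[] {
  const n = matrix.length;
  let v: number[] = new Array(n).fill(1);
  for (let it = 0; it < iterations; it++) {
    const next = matrix.map((row) => row.reduce((sum, m, j) => sum + m * v[j], 0));
    const max = Math.max(...next);
    v = next.map((x) => x / max); // normalize so the strongest item scores 1.0
  }
  return v;
}

// Flavors in order: [🍫, 🍓, 🍦, 💩, 🦄]
const iceCream = [
  [0, 2, 2, 4, 0],
  [1, 0, 0, 5, 0],
  [3, 0, 0, 5, 0],
  [1, 0, 0, 0, 0],
  [5, 5, 5, 5, 0],
];
console.log(powerIteration(iceCream)); // ≈ [0.292, 0.158, 0.300, 0.071, 1.000]
```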
The TERPECA winningness computation may seem arbitrary in several ways…
These wrinkles were all added to solve real observed problems, most of which were related to the interplay of confidence and evidence across the matrix.
If a very experienced player ranks a game particularly high or low, this can impact unweighted win/loss counts by hundreds of points across the grid. However, this is only one person’s opinion, and can lead to outlier results that don’t reflect consensus.
Conversely, if dozens of voters clearly prefer one room to another, that should weigh heavily into ranking, even if the difference in the Wilson confidence score is relatively small.
This is all to say that, within the matrix, some data points are more “independent” than others, which is not accounted for by Keener’s algorithm or other competitive ranking systems, most of which assume each data point is independent (typically the outcome of a separately played game). So the wrinkles above, including pivot analysis and the various reweighting factors, have been introduced and revised over time to tweak the algorithm into better reflecting “the will of the people” as best we can establish it.
See also: literature on ranking systems, and the implementation in voter-portal/…/analyze.component.ts (only visible to project members).