Gentleman Jon —

Now it’s time for you to create a magical formula that given all the cool information you can find in logs and regarding the class played will give you a note representing your efficiency in the game.

Captainhax, 30/8/2013

I like the idea progression of the rating idea, would be cool to see that sort of statistical analysis combined with TF2Pickup for example to help create balanced teams.

Admirable, 23/3/2014

[Image: “challenge accepted”]

Thanks to zoob, without whom none of this work would have been possible. Apologies to graph visualisation fans; this is much more like a lengthy diatribe.

Lobbies. Pickups. Valve.

Valve don’t care about 6v6. Well, that’s an exaggeration, but there’s not going to be a comp-based lobby system any time soon, and anybody interested in why already knows the reasons. This isn’t a discussion about the rights and wrongs of their approach; it’s about doing what’s necessary for ourselves.

Since the legendary TF2 lobby came into existence players have been expected to play nice in games featuring varying skill levels (to put it politely) and tolerate totally unbalanced games that were doomed from the start.

Some mix bots on IRC channels tried to avoid this by being member-only, and the legendary pickup2 implements a manually moderated player/class rating. TF2Pickup was originally built with an ETF2L div balancer, but this suffers from some of the vagaries of the ETF2L API, and of course only gives information about players on ETF2L.

What’s needed is an automatic way to assess players from their results, producing skill ratings that adapt over time to improvements or declines in gameplay skill and, most importantly, provide more balanced games when teams are organised using those ratings.

What follows is a broad-brush description of a much more detailed process to try to produce a practical solution to this problem, the results of which have since been handed over to those with the biggest pickup web services around.

The Beginning

With the death of TF2Lobby and the rise of Center and Pickup, the debate about balancing games was renewed on the ETF2L forums, and for once a public request for ways to improve game balance went out. I was intrigued.

The articles I’ve written are just for fun, to call people out, wax lyrical about the greats, and mainly to generate a bit of interest and keep people thinking about the game and the scene beyond the shelf life of a cast. But on the serious side I have developed a lot of practical experience extracting data from logs, which put me in a position to analyse different approaches and find a practical solution.

It seemed obvious, assuming an appropriate way of assessing skill could be found, that a proper balancing system would open up all kinds of possibilities for improving the in-game experience and potentially giving incremental and valuable feedback to players on their progress and genuine skill level. Contacts were made, wheels were set in motion, and the first job, downloading tens of thousands of logs of TF2 Center and TF2 Pickup games, began. <3 logs.tf.

Why ranking just on basic gameplay stats is bad

One of the first suggestions, from fraac, who has some practical experience balancing games in a private group, was to simply balance games on kills + assists per minute. The problem with an approach like this is that the population of the services it would apply to is far too large. Who you rack up your stats against is critical in assessing the value of them, so in a community with thousands of players it’s impossible for a basic per-minute stat to have any meaning in relation to a random game, because the rating contains no information about who they have played.

To solve this problem a rating system that gives a number based on player performances in the context of the ratings of those they played against is necessary, so if you beat players with a high rating you should get more points, and if you lose to players with low ratings you should lose points. In this way the skill across the whole population becomes more measurable as ratings are spread around merely by playing.

With a proper rating points system when you play a game and do well the points you pick up will have come from hundreds of other players in hundreds of other games over the previous weeks and months, and when you do badly the points you lose are spread around everyone you played, and everyone they will play.

For this reason it’s clear that balancing lobbies with a large population is the same thing as rating players. You inevitably end up with a list of numbers that are your best-guess skill ratings for particular players.

Bill Gates helps out

Another early suggestion was a simple win/loss rating. If you win more games than you lose then you must be doing something right, but this falls victim to the flaws pointed out above. However, there’s an existing, well-known ranking system for this kind of situation: Microsoft’s Trueskill, which is used for measuring player skill based on results in team games. Just what we’re looking for, right?

To investigate this I made a test group of 2000 logs. The idea was to process all the logs with the Trueskill algorithm, and then examine the last 15% to see if it had produced the ability to predict games (and therefore give the ability to create unpredictable games) and reduce the number of 5-0 games.
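Roughly, that first pass looks like the sketch below, using the open-source trueskill Python package. It’s a minimal illustration rather than the production pipeline; the roster and winner fields, and the 85/15 chronological split, are assumptions for the sake of the example.

    import trueskill

    env = trueskill.TrueSkill()
    ratings = {}  # player id -> trueskill.Rating

    def rate_log(blu_players, red_players, winner):
        """Update every player's rating from one log; winner is 'blu' or 'red'."""
        blu = [ratings.setdefault(p, env.create_rating()) for p in blu_players]
        red = [ratings.setdefault(p, env.create_rating()) for p in red_players]
        ranks = [0, 1] if winner == 'blu' else [1, 0]  # lower rank = winning team
        new_blu, new_red = env.rate([blu, red], ranks=ranks)
        for player, new in zip(blu_players + red_players, list(new_blu) + list(new_red)):
            ratings[player] = new

    # Rate the first 85% of logs in chronological order, then test predictions
    # on the remaining 15% by checking env.quality([blu, red]) against results.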

There is an important distinction in that goal: just because you can’t know beforehand who is going to win a game doesn’t mean it will be close. The competitive leagues are littered with games between closely matched teams where one side takes a massive beating, even winning best-of-3 matches with an emphatic reverse in the middle.

The initial results were not promising

[Figure 1]

The vertical axis is the difference in round score at the end of a given match and the horizontal is the Trueskill score. The idea is that as the Trueskill score gets closer to one the game should be more even, but as we can see from the trend line and the r² correlation there is almost no relationship.
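As a methodological aside, an r² like the ones quoted throughout is just the square of the correlation coefficient of a fitted trend line, which scipy will hand you directly. The lists below are placeholder values, not the real dataset.

    from scipy import stats

    match_quality = [0.62, 0.71, 0.55, 0.80, 0.91]  # Trueskill score per game (placeholder)
    round_diff = [4, 1, 5, 2, 3]                    # final round-score difference per game

    fit = stats.linregress(match_quality, round_diff)
    print(f"slope {fit.slope:.3f}, r² = {fit.rvalue ** 2:.3f}")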

To provide a form of control comparison I also analysed basic in game stats with no attention given to balancing them against opponents. The results showed just how weak Trueskill had been up to this point.

[Figure 2]

This was the best-matching raw in-game stat. Note that I’m not telling you which stat it is; I’m obfuscating anything that could give clues as to the secret sauce that powers the final system (although these early stats bear no real resemblance to it). As can be seen, the r² correlation is almost as good as the Trueskill one.

A matter of class

The most obvious refinement available to us is to base the rating on player wins and losses per class. There was no particular additional complication in doing this with Trueskill; we just had to store a bit more data per player.
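As a sketch of what “a bit more data” means, the rating store simply gains a class dimension. The class names and lookup below are illustrative, not the exact implementation.

    import trueskill

    env = trueskill.TrueSkill()
    class_ratings = {}  # (player id, class name) -> trueskill.Rating

    def rating_for(player_id, class_name):
        """Fetch or create the rating for the class a player ran in this log."""
        return class_ratings.setdefault((player_id, class_name), env.create_rating())

    # e.g. one roster becomes a list of per-class ratings:
    blu = [rating_for(p, c) for p, c in [("alice", "demoman"), ("bob", "medic")]]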

[Figure 3]

So the correlation is 5 times better than before, but 0.097 still amounts to almost no useful correlation with game outcomes. Comparing it to a base stat also refined by class does show, however, that Trueskill is starting to compare better to raw in-game stats.

[Figure 4]

Here the best-matching raw gameplay stat produced less than half as strong a correlation with the final result. And that is effectively it for the win/loss hypothesis. A player’s individual input into the final result of a game is so small that even advanced statistical analysis can’t establish more than one tenth of their influence, and the best class-based rating system produces little more than statistical sludge.

Bill Gates ruins everything after all

So a new approach was needed, a new way of viewing a game for rating purposes. What if we changed our perspective so that it wasn’t a battle to win the game, but instead player versus player to amass particular stats, with each player a team by themselves? From using ordinary in-game stats I was already getting a good picture of which stats were the most useful for this, so I plugged Trueskill into it and eagerly awaited the results…

I might still be waiting for all I know. I’m hardly using a beast PC (i5 processor), but given a list of 18 players in a Highlander game I realised after 20 minutes of waiting without result that Trueskill wasn’t going to be computationally efficient enough to do the job, either for my analysis or in a real-world application. The reason for this is that the algorithm uses multiple recursive functions; that is, it compares everything to everything else several times, and as the number of players goes up this workload rises exponentially.

I had to abandon Trueskill, but at least it meant that there would be no legal problems given that Microsoft owns the patent. They would hardly be likely to look kindly on a game produced by a competitor being enhanced by its use.

So I needed another rating system. Fortunately the need for rating systems isn’t new, and the two obvious options were ELO and Glicko. The Glicko system is newer and incorporates a value not only for rating but also for the certainty of that rating, so it becomes fuzzier with inactivity. Despite these advantages it’s designed to be applied not after every game but after a group of games, so it didn’t fit the scenario perfectly.

This left the venerable ELO as the prime candidate, and if it’s good enough for millions of chess players worldwide then it’s good enough for us. Some adaptations were required to get a multi-player ELO algorithm working, but once it was in place it was time to run through the test logs again. Starting from scratch, I ran it without taking individual class skills into account.
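I won’t spell out the exact adaptation, but a common way to stretch two-player ELO to teams is to rate each team as a virtual player at the average of its members’ ratings, apply the standard expected-score formula and push the same delta onto every player. The sketch below shows that idea; the K factor and 400-point scale are conventional chess-style assumptions rather than the values actually used.

    K = 32  # update step size; a tuning assumption

    def expected_score(rating_a, rating_b):
        """Standard ELO expectation that a beats b."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    def update_teams(ratings, blu_players, red_players, blu_score):
        """blu_score is 1 for a blu win, 0 for a red win, 0.5 for a draw."""
        blu_avg = sum(ratings[p] for p in blu_players) / len(blu_players)
        red_avg = sum(ratings[p] for p in red_players) / len(red_players)
        delta = K * (blu_score - expected_score(blu_avg, red_avg))
        for p in blu_players:
            ratings[p] += delta
        for p in red_players:
            ratings[p] -= delta

    ratings = {p: 1000.0 for p in ["a", "b", "c", "d"]}
    update_teams(ratings, ["a", "b"], ["c", "d"], blu_score=1)

Unlike the Trueskill factor graph this is a handful of arithmetic operations per game, which is why the computational problem simply disappears.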

[Figure 5]

Now this is an improvement: twice as good a correlation as the Trueskill win/loss model, and beginning to enter the realms of usefulness even without looking at classes. The vertical axis shows the round differences in the log set and the horizontal axis shows the rating difference between teams. Negative numbers are an advantage to the blue team and positive to red.

I felt like I’d hit on a key improvement; it had to be possible to find better stats to measure, so at this point I greatly increased the number of logs in the analysis. The next obvious step was to apply class-specific ratings.

[Figure 6]

Once again, another step change in accuracy. Having achieved this level of improvement I wanted to look at what practical effects this might have on a game and whether it was reasonable to conclude that games balanced using this rating system would be closer.

To do this I decided to measure things in terms of standard deviations of rating differences between teams. In case you don’t follow that: a game where the rating difference is higher should be both more predictable and more likely to be a 5-0. Standard deviation was used as the unit of measurement because it’s neutral to the specific numbers involved; you just take the SD of all the rating differences in the dataset, and multiples of that SD value automatically give you the brackets to use as measuring yardsticks.

The idea is that if you ran a balancing algorithm against this data it would create only the most closely matched teams and avoid the most unbalanced ones, so games that have already been played with very close ratings should theoretically show what it’s like for every game a balancer creates.
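Mechanically the bracketing works something like the sketch below: take the absolute rating difference for every game, compute the standard deviation across the dataset, and bucket games into multiples of that SD so each band can be checked for its 5-0 rate. The game tuples and band edges are placeholders.

    from statistics import pstdev

    # (team rating difference, final round difference) per game -- placeholder data
    games = [(120.0, 5), (15.0, 1), (300.0, 5), (40.0, 2), (75.0, 0)]

    sd = pstdev(abs(diff) for diff, _ in games)

    def band(rating_diff):
        x = abs(rating_diff) / sd
        if x >= 2.0:
            return "2+ SD"
        if x >= 1.0:
            return "1 to 2 SD"
        if x >= 0.5:
            return "0.5 to 1 SD"
        return "0 to 0.5 SD"

    by_band = {}
    for rating_diff, round_diff in games:
        by_band.setdefault(band(rating_diff), []).append(round_diff)

    for name, diffs in by_band.items():
        five_zero = sum(1 for d in diffs if d >= 5) / len(diffs)
        print(f"{name:12s} games={len(diffs)}  5-0 rate={five_zero:.0%}")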

[Figure 7]

This is the most biased group I looked at: games that should be both predictable and over quickly. A 0.7 correlation is good (roughly speaking, the balancer has accounted for 70% of the factors that go into a win/loss in this game type) and 5-0 games run at 65% in this set. I looked at a number of bands between this one and the narrowest, and they showed a steady trend towards unpredictable games with closer scores until I got to the last one.

[Figure 8]

At 0 to 0.5 standard deviations difference in team ratings, the 0.08 correlation means results are highly unpredictable and 5-0 games run at 33%. But there’s a turd in this punchbowl: Highlander games are slower paced, and it’s a much easier game mode from which to eliminate highly biased scorelines.

6v6 games are much faster paced: a full team wipe can result in losing 3 control points at a time, and games can be over in 10 minutes. So what are the effects on 6v6 only? First, the unbalanced games.

[Figure 9]

These are actually more predictable than the mixed set. With smaller teams it seems that smashing the opposition in 6v6 is more likely; as pointed out above, however uneven the game, it’s harder to get a 5-0 in Highlander. So far so good, but unfortunately the trend doesn’t continue so positively.

[Figure 10]

This is an even narrower band of results, 0 to 0.1 standard deviations difference, and although predictability is effectively eliminated, proving that calling the game beforehand should be impossible, 5-0s still run at 44%. This is still an improvement on the wider data set, but short of the mixed set with Highlander included. By way of trying to comfort myself I looked at another indicator of competitiveness: the time taken to reach a 5-round win difference.

Rating difference   Time to a 5-round win difference
2+ SD               00:14:20
1 to 2 SD           00:16:18
0.5 to 1 SD         00:17:59
0 to 0.1 SD         00:19:12

So at least in very closely rated games, even if they end up as a 5-0 win, there was more resistance.

Developing the Secret Sauce

What followed was several weeks running simulations using different combinations of stats, trying things out for medics, working on a number of datasets and examining results. In any system like this one of the main problems is players gaming it, so it’s important to avoid clear publication of the details. The secret sauce remains a reasonably closely guarded secret (unless one of the few who knows it in detail flaps their gums publicly; speaking as the originator of the work, I’d really rather you didn’t).

Having found several key improvements, 5-0 rolls were brought down to about 28% in the test set. At this point I wanted to see how my balancer would compare to, and potentially affect, existing pickup systems individually. The big ones are TF2 Center and TF2 Pickup, and I also wanted to compare it with the best pickup balancer available, the pickup 2 manually moderated system. Center has no balancer; TF2 Pickup has an ETF2L division balancer. I tested my balancer against the full set of logs from those services, going way back into the mists of time.

In order to avoid prejudicing people’s view of the services as they stand too much, I’ll show some interesting data on the potential for improvement rather than the raw existing levels of one-sided games and so on, but as pickup 2 is invite-only anyway I’ll show its actual figures below the table.


Service                       TF2 Center   TF2 Pickup   Pickup 2
Predictability improvement    0.31         0.2          0.06
One-sided games improvement   27%          6%           8%

Pickup 2’s predictability could go down, but it’s already very unpredictable. However, it still runs at 25% one-sided rolls, and my research indicates that could be brought down to around 17% if the theory matches reality.

One obvious thing from the above is that the TF2 Pickup figures for one-sided games don’t look as if they can be improved much, and I have struggled with the reason why. Ultimately I think it comes down to the kind of player you get in the various services. Please note this is not a judgement; we all start somewhere, and it’s merely what I feel is a realistic reflection of the status quo.

I think it’s fair to say that TF2 Center has the broadest skill range, which means tactical awareness and team coordination are at their lowest. It’s more common for players not only to be unable to adapt to a tactical problem, but to not even be genuinely aware of why they’re winning or losing. In this scenario there are big gains to be made simply by levelling out the measurable playing skill between teams.

TF2 Pickup on the other hand has a narrower spectrum of competitive players with some experience, who know how to press an advantage and are competitive enough to go for the kill when it’s available. However, they still aren’t armed with the full tool set only top players have, and although they’re able to identify and use an advantage they may not be able to adapt and change tactics. Demoralisation may also be key here.

In pickup 2 broadly speaking you have players with the best game sense and the greatest degree of tactical insight (numerous top prem snobs just snorted their pizza out of their noses at this description of div 2 scrubs) who are able to diagnose problems and change their way of playing, particularly if Ipz is screaming at them, and who accept losing the least.

Final tweaks & exploitation

There are numerous considerations about further implementation and how best to exploit the ratings, but probably the most important one is in the balancer itself.

Partitioning numbers into sets is a classic computer science problem, and balancing TF2 teams is a very specific subset of it. There are some esoteric solutions that can’t practically be adapted for use, so I’ve experimented with an adaptation of the simple greedy method. The balancer acts like a pair of team captains picking the best player available for the class slots they have yet to fill until the teams are complete. After each round of picking, the “captain” with the lowest-ranked players so far goes first.

The results from this are pretty good and quite often get within the narrowest bracket I defined, 0 to 0.1 standard deviations from the “norm”. However, the further you get along the picking process the more susceptible it becomes to a very strong or weak player unbalancing the teams. Because of this I implemented a second step that would be difficult to justify when dealing with very large data sets: I look for swaps between teams that would make the ratings closer, until no further improving swaps are possible. This routinely gives extremely close ratings, and it’s simple, fast and easy to implement.
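For anyone who wants to reproduce the idea, a rough sketch of the two steps follows. It deliberately ignores class slots (each pick is simply the best remaining player, where the real balancer picks per unfilled class slot) and uses placeholder ratings.

    def greedy_pick(players):
        """players: name -> rating. Captains alternate picks, weakest team first."""
        pool = sorted(players, key=players.get, reverse=True)
        team_a, team_b = [], []
        while pool:
            weaker, stronger = sorted((team_a, team_b),
                                      key=lambda t: sum(players[p] for p in t))
            weaker.append(pool.pop(0))
            if pool:
                stronger.append(pool.pop(0))
        return team_a, team_b

    def improve_by_swaps(players, team_a, team_b):
        """Apply the swap that most narrows the rating gap, until none helps."""
        def gap(a, b):
            return abs(sum(players[p] for p in a) - sum(players[p] for p in b))
        improved = True
        while improved:
            improved = False
            best = None
            for i in range(len(team_a)):
                for j in range(len(team_b)):
                    a2 = team_a[:i] + [team_b[j]] + team_a[i + 1:]
                    b2 = team_b[:j] + [team_a[i]] + team_b[j + 1:]
                    if gap(a2, b2) < gap(team_a, team_b) and (best is None or gap(a2, b2) < gap(*best)):
                        best = (a2, b2)
            if best:
                team_a, team_b = best
                improved = True
        return team_a, team_b

    ratings = {"a": 1100, "b": 1040, "c": 1010, "d": 990, "e": 960, "f": 900}
    team_a, team_b = improve_by_swaps(ratings, *greedy_pick(ratings))

In a class-slotted lobby the natural restriction is to only consider swaps between players on the same class, which keeps the search tiny for a 12 or 18 player game.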

I have no doubt improvements in measurement are possible. The death of Trueskill is an annoyance, as the eerie qualities of Bayesian analysis produce excellent results for this kind of problem, and I’ve considered ways of bringing it back but haven’t had the time to experiment with them. Admittedly the patent problem with Microsoft might rear its head if I succeeded. However, just plugging in a pet statistic without testing is not acceptable to me; a lot of things were tried and discarded because they didn’t work (including a number of very surprising factors), and I wouldn’t call adding something without proper analysis, just because someone likes it, an improvement.

Finally there is the strong possibility of tiered lobby play with a mature data set. As the best players float to the top there’s every chance of segregating out the plebeians to provide more specialised services for all: no abusive carries smashing the bottom 10% of players whilst spamming binds, and no witless medics standing there in confusion while their team screams “right click!!!” in a Mumble they haven’t joined. There is of course the potential for problems. In any preconceived group, let’s say “prem players only”, there are going to be weaker individuals. They will eventually lose their points and fall to a lower tier, so it has to be accepted that if you’re using the rating to create playing levels it’s not a friends club. If no segregation is needed then this isn’t an issue.

In the wild

I don’t really want to talk about any implementations in the wild as I don’t want to prejudice any users for or against a service and I don’t have intimate knowledge of their approaches or how closely they have stuck to my recommendations. Exploitation of the system, implementation details and publicising are up to the services in question. I just felt this research process deserved to be documented openly with some limitations to protect the integrity of the rating. The knowledge is out there now, it’s up to others to make it happen if they want to.

I did look at the possibility of an independent world ranking by processing every log available, but the problem of populations not mixing was too difficult to overcome. This had two primary manifestations. The first is that top players don’t play their main class when mixing with the proletariat anywhere near enough for their superiority to become imprinted on the ratings. There’s no hierarchical route into top games except league games, which form a very small proportion, so there aren’t enough steps for players to climb, taking points as they go, before they get there.

The second is at the other end of the spectrum: players who hardly ever play anybody outside a small group and who gain large ratings by taking all their friends’ points off them. If they don’t mix with the general populace either, they will never lose points to more skilled players and will inhabit a false position in the hierarchy. There are solutions to this, but introducing weightings that vary ratings based on the breadth of opponents faced adds a whole extra layer of complexity.

There is also a problem with class balance. Because the classes aren’t balanced, some will regularly find themselves leaking points at the bottom of the scoreboard despite reasonable play. In this scenario it’s possible to see a kind of churn amongst weak-class players, dropping out of one skill tier only to return and then drop out again, and the opposite is possible: strong-class players becoming cemented in place because they can always pick up points from weaker classes regardless of poor play. It’s hard to know what to do about this without resorting to clunky manual intervention to artificially boost certain classes, but as far as I know it’s yet to become a problem in practice, so only time will tell.

So, um, you mentioned a rating? And you’ve looked at all those logs…

You just need to be validated, don’t you? Sorry chaps, the details of the secret experiment will remain just that. Not everyone plays seriously in these services, and variations from people’s expectations will only undermine the work to the extent that it’s statistically proven above. Oh alright, I’ll give you one: Mike is the best soldier. Surprised? Well, at least it shows it works on the obvious. Worst soldier ever in pickup 2? Turbomonkey. You see, it does work.

One thing I can show that might be of interest is the relative rating values that players amass on the various classes.

Demoman 1.14
Heavy 1.13
Soldier 1.09
Sniper 1.06
Scout 1.03
Medic 1.01
Engineer 1.01
Spy 1.01
Pyro 0.97

Obviously Highlander affects these things, and my apologies to any Pyro loyalists. If manual class rebalancing had to take place, this seems like a reasonable starting point. Also, it’s notable that the ratings don’t average to 1. I think this is down to inflation, a common problem with ELO but not something that should have any negative connotations for a balancer. It could be down to a tiny rounding error, but meh: it’s several hundred thousand player performances producing a minuscule nudge upwards, and it won’t have affected the balancer.

I also realise these values are highly debatable, but efforts to include various other stats that might bring things around to your way of thinking dilute the effectiveness of the approach, and in this case the statistics are king. Everything has to be determined by the large-scale trends, and these are the numbers that just fall out.

I have a pickup site/bot/group! Help me!

Anyone with a legit pickup system of any kind that has an existing population large enough to justify usage can potentially benefit from this; feel free to add me to discuss it. Don’t add me if you’re just curious; I won’t tell you anything and I’ll probably just block your begging ass. If you have a service and not much manpower, I may even implement a service that does the thinking for you and just gives back the players and their new ratings, or maybe a balanced list.