On the last battles panel, the client will no longer query the server every few seconds for new battle data. Instead, the server will push the information to the impacted players only, every time a new battle is available. When checking the last battles of another player from the leaderboard, battles will be refreshed every minute.
The Sprint of the Legends of Code and Magic contest was on and everything seemed to be working fine. We knew that in a 4-hour contest, we had little room for trouble, be it game bugs or performance issues. Everyone was pretty excited about the format and the game itself (a card game like HearthStone).
A few minutes into the Sprint, the dream turned into a nightmare: one player discovered a game-breaking bug, and others started experiencing lag.
Everything that we feared could happen, happened. The gloomy memory of the Reddit Hug came back to haunt us. As promised, here is an overview of what went on.
Activate Cheat Code
I’ll just touch on the game-breaking bug since we fixed it quickly and only a few players were impacted. It’s the type of bug that you think can’t happen. After weeks of tests, such a central part of the game cannot fail…
“An AI just summoned a 6-mana creature with just one mana. And its mana went to -5. (translated from French)”
I didn’t believe it at first, but after watching the replay, I was left feeling disillusioned:
We had created test AIs that tried to summon every creature in their hand without checking whether they had the required mana. For the bug to occur, though, the player had to output “PASS” before the “SUMMON” action. In that case, the referee didn’t check whether the following actions were valid.
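To make the failure mode concrete, here is a minimal sketch of a turn processor that validates every action, including those that follow a “PASS”. It is not the actual referee code: the function `apply_actions`, the tuple format of the actions, and the `hand` mapping are all hypothetical names chosen for illustration.

```python
def apply_actions(mana, hand, actions):
    """Apply one turn's actions, validating each of them.

    `hand` maps a hypothetical card id to its mana cost.
    Returns (remaining_mana, summoned_card_ids).
    """
    summoned = []
    for action in actions:
        kind = action[0]
        if kind == "PASS":
            # PASS does nothing, but it must not disable checks on later actions.
            continue
        if kind == "SUMMON":
            card_id = action[1]
            cost = hand.get(card_id)
            if cost is None or card_id in summoned:
                raise ValueError(f"card {card_id} is not in hand")
            if cost > mana:
                raise ValueError(f"not enough mana to summon card {card_id}")
            mana -= cost
            summoned.append(card_id)
        else:
            raise ValueError(f"unknown action {kind}")
    return mana, summoned
```

With this check in place, `apply_actions(1, {1: 6}, [("PASS",), ("SUMMON", 1)])` raises an error instead of leaving the player at -5 mana.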
There is no good game without a nice little bug, is there?
The SQL Request That Took Longer than Expected
We had been warned, and we knew what was at stake.
“As we already discussed on Discord, the issue with a 4-hour Sprint with Wood leagues is that the bosses can take 2 hours to finish their own ranking…I really hope that CodinGame will address this issue before tomorrow […]”
Magus (in the forum)
Ranking and Submission Battles
To compute the rankings of players when they hit the “SUBMIT” button, we launch enough matches between their AI and other players’ AIs. The more matches we launch, the more accurate the final ranking will be. And the more time it will take.
However, in Wood leagues, a fast ranking is more interesting than an accurate one. So, as usual, the system launches only a few matches per AI.
To get promoted to an upper league, a player needs to rank higher than the boss of the current league. Each boss is submitted too and needs to finish its battles before anyone gets promoted. We made sure the bosses could quickly finish their matches too.
For each contest we create with the community, we also try to limit the duration of games. This is mostly to avoid stale, boring games where nothing happens, but still, the quicker a game can be played, the better it is for everyone. The duration of a LoCaM game looked acceptable. Moreover, in Wood leagues, no AI would use the full 100 ms allowed every turn.
If the servers could keep up with the increase in load, the matches would be computed quickly.
And they did. “Codemachines”, as we call them, were not overwhelmed by the surge of users during the contest.
So what happened?
DataBase Was K.O.
The main issue came from the service that refreshes the last battles panel. Every 3 seconds, the panel (when open) requests the last battles of the player so it can display them. This service runs a specific SQL request, which first searches for the last battles of the player in a huge table (around 300 million rows), and once it gets them, returns information related to each battle from two other tables.
At some point during the Sprint, Postgres (we use PostgreSQL) changed its strategy to compute this request. Even though this first search should return just a few rows, we think that Postgres expected a lot more. Instead of using the indexes to collect data from the two other tables, it decided to do a sequential scan on each table. As these two tables each contain more than 1 million rows, the time to run the entire request exploded: it was multiplied by around 50.
As the number of calls to the service was quite high, the database started to have difficulties handling all the requests, in particular the requests from the ranking job that decides which agents (AIs) every bot in submission should play against. That’s why the submitted agents didn’t play more matches despite the codemachines being “free.” From there, it was a vicious circle.
A lot of players submitted their code to get ranked, and eventually promoted (that’s normal). More and more people with a bot in the process of submission were waiting for the results on the last battles panel. More requests were sent to the database. More lag. GG WP.
Some players had an issue with the time it took to get promoted once they were above the boss at the end of their battles. The countdown showed “promotion in 0 H 0 mn 0 s”. This issue came from the same root cause: the database was overloaded, and the job that determines which agents should be promoted was blocked.
When we identified the issue, the first thing we did to counter it was to decrease the refresh rate of the last battles. We changed the interval from 3 seconds to 20 seconds. This had almost no effect, since players needed to reload their IDE to pick up the change.
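As a rough illustration of why the refresh interval matters, here is a back-of-the-envelope calculation. Only the 3-second and 20-second intervals come from the incident; the number of open panels is an assumption picked for the example.

```python
def requests_per_second(open_panels, refresh_interval_s):
    """Average request rate generated by panels that each poll on a timer."""
    return open_panels / refresh_interval_s


OPEN_PANELS = 1000  # hypothetical figure, not a measured value

before = requests_per_second(OPEN_PANELS, 3)
after = requests_per_second(OPEN_PANELS, 20)
print(f"{before:.0f} req/s at 3 s, {after:.0f} req/s at 20 s")
```

Whatever the real number of panels was, going from 3 to 20 seconds divides the polling load by about 6.7, but only for the clients that actually loaded the new setting.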
We also tried to further reduce the number of matches each AI had to play before being ranked, from 20 matches to just 5. It doesn’t mean, though, that they were playing only 5 matches, but definitely fewer than before (more on that later).
We also removed mirror matches (which ensure that a player plays as many matches as first player as they do as second player). As everyone was more or less blocked, it didn’t improve the situation much.
Then we tried to understand why the request was taking so long. To our knowledge, we had already optimized it… After some investigation (from home, as it was already late, which didn’t help), we modified the first “sub-request,” the one that gets the last battles, so that it returns at most 500 results (using “limit 500”), even though it never returns more than that. This way, Postgres wouldn’t expect a large number of rows to be returned.
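The shape of such a request could look like the sketch below. The table and column names (`player_battle`, `battle_info`, `battle_result`, and so on) are invented for illustration, since the real schema isn't shown here. The example runs against an in-memory SQLite database only so that it is self-contained; SQLite's planner is not Postgres's, so this shows the structure of the fix, not its effect on the execution plan.

```python
import sqlite3

# Hypothetical schema: one big table of battles per player, plus two detail tables.
QUERY = """
SELECT b.battle_id, i.duration_ms, r.winner_id
FROM (
    SELECT battle_id
    FROM player_battle
    WHERE player_id = ?
    ORDER BY battle_id DESC
    LIMIT 500  -- explicit upper bound on the rows fed to the joins
) AS b
JOIN battle_info i ON i.battle_id = b.battle_id
JOIN battle_result r ON r.battle_id = b.battle_id
ORDER BY b.battle_id DESC
"""


def last_battles(conn, player_id):
    """Return the player's most recent battles with their details."""
    return conn.execute(QUERY, (player_id,)).fetchall()


def demo_connection():
    """Build a tiny in-memory database with made-up data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE player_battle (player_id INTEGER, battle_id INTEGER);
        CREATE INDEX idx_player_battle ON player_battle (player_id, battle_id);
        CREATE TABLE battle_info (battle_id INTEGER PRIMARY KEY, duration_ms INTEGER);
        CREATE TABLE battle_result (battle_id INTEGER PRIMARY KEY, winner_id INTEGER);
    """)
    for battle_id in range(1, 6):
        player_id = 1 if battle_id % 2 else 2
        conn.execute("INSERT INTO player_battle VALUES (?, ?)", (player_id, battle_id))
        conn.execute("INSERT INTO battle_info VALUES (?, ?)", (battle_id, 1000 * battle_id))
        conn.execute("INSERT INTO battle_result VALUES (?, ?)", (battle_id, player_id))
    return conn
```

The explicit limit does not change the result when a player has fewer than 500 battles; it only gives the planner a hard ceiling on how many rows the inner search can produce.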
It worked. The remaining issue was that the database had a lot of pending requests in the queue. It took quite some time to reduce the queue of requests, and the Sprint was already reaching its end.
Areas for Improvement
This incident highlighted some areas for improvement.
First, there are too many services requesting information from the database in the IDE. As mentioned previously, there is the refresh of last battles, but in the same panel, you also have the refresh of the leaderboard. There is also the refresh of the division (what we call a league). Instead of relying only on services that pull data, we should try to rely on a service that pushes data to the players every time there is an update (like in Clash of Code), and if the service fails, rely on the pull service.
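A push-first design with a pull fallback could be sketched as follows. Everything here is hypothetical (the `BattleFeed` and `LastBattlesPanel` classes, the shape of a battle record); it is an in-memory stand-in meant to show the control flow, not a description of how the platform implements it.

```python
class BattleFeed:
    """Hypothetical in-memory stand-in for a server push channel."""

    def __init__(self):
        self._subscribers = {}  # player_id -> list of callbacks

    def subscribe(self, player_id, callback):
        self._subscribers.setdefault(player_id, []).append(callback)

    def publish(self, battle):
        # Notify only the players who took part in the battle.
        for player_id in battle["players"]:
            for callback in self._subscribers.get(player_id, []):
                callback(battle)


class LastBattlesPanel:
    """Panel that prefers pushed updates and polls only as a fallback."""

    def __init__(self, player_id, feed, fetch_last_battles):
        self.player_id = player_id
        self.battles = []
        self._fetch = fetch_last_battles  # the existing pull service
        self.push_ok = True
        try:
            feed.subscribe(player_id, self.on_push)
        except Exception:
            # Push channel unavailable: fall back to pulling.
            self.push_ok = False

    def on_push(self, battle):
        self.battles.append(battle)

    def tick(self):
        """Called on a timer; queries the server only if push is unavailable."""
        if not self.push_ok:
            self.battles = self._fetch(self.player_id)
```

In this sketch, a panel whose subscription succeeded never calls the pull service, so the database only sees polling traffic from clients whose push channel failed.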
We’re also thinking about implementing a safeguard that would allow us to prioritize the service calls we get in case of overload, and thus maintain the core functionalities of the platform.
Second area of improvement: we should clean some of our tables. Some battles cannot be accessed anymore on the platform, so we should remove the corresponding rows from the 300-million-row table. It will take some time, but it’s doable. We initially kept them for statistical purposes, but nothing prevents us from keeping the statistics somewhere else.
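One possible shape for such a safeguard is priority-based load shedding: when the backlog grows, the least important calls are rejected first. The sketch below is only an illustration of that idea; the priority levels, the thresholds, and the `admit` function are assumptions, not a design we have settled on.

```python
CORE, NORMAL, COSMETIC = 0, 1, 2  # hypothetical priority levels


def admit(priority, pending_requests, capacity):
    """Decide whether to serve a request given the current backlog.

    Thresholds are arbitrary illustrations: cosmetic calls are dropped at
    50% load, normal calls at 80%, core calls only when the queue is full.
    """
    load = pending_requests / capacity
    if priority == CORE:
        return load < 1.0
    if priority == NORMAL:
        return load < 0.8
    return load < 0.5
```

For example, if the ranking job were tagged `CORE` and the last battles refresh `COSMETIC`, then at 60% load `admit(COSMETIC, 60, 100)` would reject the refresh while `admit(CORE, 60, 100)` would still let the ranking job through.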
Finally, we need to think about how we “count” the matches during a submission. When a player hits the “SUBMIT” button, we run 10 matches against other players’ agents from everywhere in the league. It helps us determine roughly where the agent should finish in the ranking. These matches are played in parallel.
Next, we run a second series of matches (from 20 matches in Wood leagues to 120 in the Legend league), one match after another (so sequentially), against players very close to the position of the agent. However, if the agent is selected to play a match against another agent that is also in the process of submission, this match doesn’t count toward the total of the first agent’s matches.
One of the drawbacks of this system is that an agent could play a lot of matches, yet not advance in its own submission process. This problem is more complex than it seems. We haven’t found any obvious applicable solution for now.
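The counting rule can be illustrated with a small model. This is a simplified reading of the rule described above, with hypothetical names (`Submission`, `play_match`): a match counts toward the submission of the agent it was launched for, and not toward the submission of the opponent it was launched against.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    name: str
    required: int      # counted matches needed to finish the submission
    counted: int = 0   # matches launched for this agent's own submission
    played: int = 0    # every match the agent took part in

    @property
    def done(self):
        return self.counted >= self.required


def play_match(initiator, opponent):
    """Record one match launched for `initiator` against `opponent`."""
    initiator.played += 1
    initiator.counted += 1
    # The opponent plays too, but gets no credit toward its own submission.
    opponent.played += 1


a = Submission("A", required=20)
b = Submission("B", required=20)
for _ in range(15):
    play_match(b, a)  # A is repeatedly picked as B's opponent
```

After this loop, agent A has played 15 matches but still has 0 counted matches, which is exactly the situation where an agent is busy yet makes no progress in its own submission.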
Once again, I’d like to express our apologies for what happened. I know you were expecting a lot from this new Sprint format, and trust me, we were too. I understand your frustration; as a player myself, I was very frustrated. We haven’t given up on this format, though, as many players have told me they liked it. We’ll try to improve it further: the league format is an impediment, as is the timing during the week.
If you have any questions, feel free to ask them; I’ll be happy to (try to) answer them 🙂
Otherwise, I’ll see you in the Silver league, since the Marathon part of the contest is still ongoing!
About the Author:
Fond of challenges and a fan of board games, I like to solve people’s problems. I have been CodinGame’s Community Manager since July 2016, and I still code a bit in Java.