Tobold's Blog
Wednesday, December 09, 2009
Self-defeating success

Living in Europe, it is only today that World of Warcraft is patched to 3.3 for me. But apparently the patch day in the US went roughly like this: Blizzard patches exciting new dungeons and a tool designed to make dungeons more popular into WoW. Lots of people want to try out the new dungeons and LFG tool. Instance servers break down. Film at 11.

As spinks so correctly remarks, there are advantages to patching a day later. Nevertheless I wouldn't be surprised if I couldn't do an instance tonight and got stuck with "additional instances can not be opened" and similar errors instead.

Creating popular content for MMOs isn't easy to begin with. But it must be frustrating to know that if you are *too* successful in this, the success becomes self-defeating. My programming skills are basic, and I haven't got a clue about network protocols and such; but as a player I do know that if a significant part of the population of a server is trying to do the same thing, the server bugs out or crashes, even if I can't explain the technical details behind it. The opening of the gates of Ahn'Qiraj was more a slide-show for me than a world event, which is probably why world events are distributed over many zones nowadays.

But while that is annoying for world events, which only last a very limited time, I won't be all that annoyed if patch 3.3 isn't working today, because it will be around for much longer. The new Icecrown dungeons will remain relevant content until Cataclysm comes out, which is probably still many months ahead. The new LFG system will remain with us even after that. So as predictable as instances not working on patch day is, it is likewise predictable that they will start working soon, after the first rush towards them has subsided and Blizzard has fixed a few more bugs that became evident during the "stress test" that is a patch day.

I just wonder whether we will ever have the technology to create really massive events in our massively multiplayer online games.
I pretty much stayed at work (weird huh?) and when I came home I spent the night catching up on my favorite blogs.

Last I heard instance servers were still down at 1 AM EST in the US (technically the day after patch day).

I agree, the instances are not going anywhere and I am not in a world top ranked guild that wants that first kill on everything. So I rest assured, knowing the game will still be here tomorrow.
Scalability is one of the most mindbending areas of computer science, and having events where thousands of players all interact with each other is indeed extremely difficult, because the requirements grow much faster than the number of players (roughly with the square of the player count if everyone can interact with everyone).

However, the instance issue is not one of these really hard problems. It's solvable, because the bulk of the work has already been divided into tolerable chunks. At most, the instance server has to keep track of 80 players, and usually just five. All you need is enough instance servers.
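To put rough numbers on this, here is a minimal Python sketch. It assumes, purely for illustration, that server work scales with the number of player pairs that can interact; the 3000-player population is a made-up figure, and `pairs` and `instanced_pairs` are hypothetical helper names.

```python
def pairs(players):
    """Number of distinct player pairs that could interact with each other."""
    return players * (players - 1) // 2


def instanced_pairs(total_players, group_size):
    """Pairs to track when players are split into separate instances.

    Players in different instances never interact, so only pairs inside
    each group count. A leftover partial group is counted as well.
    """
    full_groups, rest = divmod(total_players, group_size)
    return full_groups * pairs(group_size) + pairs(rest)


# Assumed figure for illustration only: 3000 players online at once.
print(pairs(5))                    # one five-man dungeon
print(pairs(80))                   # the largest case mentioned above
print(pairs(3000))                 # everyone in one place
print(instanced_pairs(3000, 5))    # the same players in five-man instances
```

Under that assumption, the instanced case grows linearly with the number of groups, which is why adding more instance servers is enough.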

But what is enough? To handle peak demand and have the hardware wasting energy on non-peak hours? Or accommodate average demand and have outages during peaks? Mythic opted for the former and Blizzard opted for the latter. The latter does generate grumbles now and then, but it keeps most of the people happy most of the time. But for peak demand, one possible solution is cloud computing.

Basically, you have a massive cluster of general-purpose computers spread all around the globe. These computers are shared among many companies, and unless all have peak demand at the same time, there are always enough computers free to handle someone's peak demand. In quiet hours, they can shut down some of the computers and restart them when they expect demand to rise. The power grid works like this. Because power from a nuclear plant is the same as power from a wind turbine or a coal plant, it doesn't matter where the power is actually produced, and output can be adjusted without any of the users having to do anything on their end.
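The sharing argument can be sketched with a few lines of Python. All the numbers below are invented for illustration, as are the company names and the two helper functions; the only point is that a shared pool has to cover the combined peak, not the sum of everyone's individual peaks.

```python
def dedicated_capacity(demands):
    """Servers needed if every company provisions for its own peak."""
    return sum(max(series) for series in demands.values())


def pooled_capacity(demands):
    """Servers needed if all companies share one pool.

    The pool only has to cover the highest combined demand in any time slot.
    """
    totals_per_slot = [sum(slot) for slot in zip(*demands.values())]
    return max(totals_per_slot)


# Hypothetical server demand per time slot for three companies
# whose peaks fall at different times.
demands = {
    "game_studio": [10, 10, 80, 20, 10, 10],
    "retailer":    [15, 60, 15, 15, 15, 15],
    "video_site":  [20, 20, 20, 20, 70, 20],
}

print(dedicated_capacity(demands))
print(pooled_capacity(demands))
```

If all three companies peaked in the same slot, the two numbers would be equal and the pool would save nothing.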

But what would it take to host instance servers in the cloud? A complete redesign, unfortunately. The instance servers rely on certain services like login servers, chat servers and character databases to be there. Any references to those need to be abstracted away, so that the instance server can run even if the cloud load balancing algorithms decide to host that particular instance server on a different continent.

And in some cases, those services can become bottlenecks. We all have probably been frustrated when our friends are still in the game while we're stuck on the login screen. Or the character gets stuck looting. So those services need to be transformed into cloud services as well. And turning a centralized database that needs to be up-to-date at all times into a scalable cloud service is easier said than done.
Well, the servers came up in a timely fashion, which is a good sign. Whether the instance servers have problems remains to be seen. I'm back on the night shift tonight so it has a few days to settle down for me :)
Server queues can be fixed by adding more servers. There's little point in buying them, however, as in a month the current number of servers will more than suffice.

In my opinion the solution to this problem is hiring more servers for patch and release days. They can be put to good use for a month after the patch hits so no one gets queues. And if after a month the number of players goes down you can just stop hiring.
Self-defeating seems a bit strong.

Patch day issues are part and parcel of playing and I doubt anyone is cancelling their sub over not being able to do their daily heroic today.
@Stabs: If everyone cancelled their sub over not being able to do their heroic daily today WoW would be shut down in a few days - have you not read the patch notes??!!!
It was definitely glitchy but I didn't struggle too much in running three PUGs on two toons. The biggest problem was the players; I had one quit on each run, fortunately right at the start.

One tank quit right after we got an "additional instances can't be started" message (the only time I saw that, BTW) and we had to wait about 5 minutes for another tank. That was the "worst case" scenario, so things were really pretty good.

Other players reported getting kicked out of instances partway through, but I didn't have this problem. My biggest issue was that loading screens took a long time -- maybe 2 or 3 minutes each time. I was worried that my PUG partners would assume I was unavailable and kick me out.

As to server capacity, I think Hirvox's 'peak demand' point hits the nail on the head. If the servers weren't groaning a bit on patch day, that would mean a ton of extra capacity on normal days, which is money down the toilet. But I doubt 'the cloud' is a good solution. I'd bet that transaction latency would be too big an issue to spread servers globally. Database replication would also be a big challenge. (This is probably the reason why players are reporting lost items when drops are traded between toons from different realms.)
I guess you never played Asheron’s Call then, otherwise you’d know that truly massive events were commonplace in that MMORPG. AC was given monthly, yes I said monthly, content updates as well as server-wide quests and events.
The question is more like: will the ability to have really huge amounts of players with graphics acceptable to said players ever coexist?

I'd imagine we could have some truly massive game events today without a bit of lag--- if we would accept Ultima Online level graphics.
My attempt at running Forge of Souls was not very successful. After probably half an hour of trying to get in, getting in, someone leaving, and getting kicked when we entered LFM and completed the group, we finally killed the first boss, and then I think the instance server went down.

Then I got on my DK (different server) and used the tool to run a random level-appropriate instance. I was almost instantly offered a full AN group as a tank. The teleport was slow, but worked. Then I had to run out and to the instance anyway to get the quests outside.
Can't say I'm really fussed by the patch so I won't be adding to the server stress.

It will pan out within 10 days and be back to normal for 90% of the playing population.

The new LFG tool is 3 years overdue and too late for me to bat my eyelids at. Other than that, it's another patch to yawn at if you don't raid.
I don't think it goes as far as self defeating. It's actually a good problem to have, as far as problems go. Sure beats having the hardware and not the players!

Allow me to make a poor analogy: a store location has room for 100 customers inside. This is perfectly fine for any normal day of the year. However, they have this great sale on Black Friday and people flock to the store because they want to buy NOW NOW NOW. The store is full, with about 150 crammed inside and 100 more people outside. Some are grumbling because of all the jostling and not being able to get in. However, most are relatively happy. Is this a self-defeating situation for the store?
"@Stabs: If everyone cancelled their sub over not being able to do their heroic daily today WoW would be shut down in a few days - have you not read the patch notes??!!!"

That's my point Daergal.

WoW has lots of players therefore patch days are difficult therefore lots of people whine therefore it could cost them money but in fact no one cancels.

Therefore it's success not self-defeating success.

Blizzard is not in any way defeated. In fact it lets them stretch out their rather thin current content for another day.
"I'd imagine we could have some truly massive game events today without a bit of lag--- if we would accept Ultima Online level graphics."
A popular method for playing large fleet battles in Eve Online is zooming out and turning all effects off until the ships are basically represented by 16x16-pixel sprites. It isn't a panacea, but it helps quite a bit if you don't have a top-of-the-line computer.
"A popular method for playing large fleet battles in Eve Online is zooming out and turning all effects off until the ships are basically represented by 16x16-pixel sprites. It isn't a panacea, but it helps quite a bit if you don't have a top-of-the-line computer." (Hirvox)

I was thinking about this too, Hirvox.

I wonder if some future MMO will have different modes for large battles. In Eve we all at times zoom in and admire (or if you're Gallente laugh at) our ship's look. But then in a big fight we zoom out and play with a tiny dot as an avatar.

Yet the server still needs to send information just in case someone zooms in.

No one is going to look at the go faster hydrofoils on the side of my space ship but they are still sent as information to every other player.

What if you play as a ship that you can look at most of the time but switch into dot mode for large fights?

Or in WoW type games play a detailed avatar but play a generic stick-man with low detail for huge battles.

It would perhaps have been a way for AoC and Warhammer Online to deliver the large scale pvp they promised but failed to give us.
If you ask me (you didn't, but I'll pretend you did) the new LFG tool is a serious case of two steps forward, one step back.

It's great that I can now queue up for instances with players from other realms. That will certainly go some way towards alleviating the dearth of people looking to run non-Northrend dungeons.

But it's no longer possible to see who is queuing for the dungeons or chat with them. Instead you get dumped* into a dungeon with four other players, one of whom says they can tank and another claiming to be able to heal. Hopefully you don't wind up with anyone off your list of ninjas, because you don't have any control over that either.


* Well I assume you'll get into a dungeon eventually, because I certainly couldn't last night despite trying on three toons on three servers.
The bottom line... will anyone quit over patch day woes?

Maybe .001% of the playerbase. It is typically accepted that if something like lag or temporarily not being able to access content is the reason you left WoW, then you were already out the door; it was going to happen anyway.

Should Blizzard do anything about their patches? Sure... but they don't need to.

Why throw more money at a problem that will self-correct in a week?

(and fixing lag isn't as easy as turning on a few extra servers)

Lag can arise in a lot of places:

Your computer - If you look at the path from your network card up to the scene you see on your screen, there are a lot of points where latency is introduced. First is the way your network card treats the packets. As probably 99.99% of homes don't have a special network card, all information is treated with the same priority, big packets or small packets. Here you have an insertion point of latency.
Then you have the PCI bus, which connects all the devices in your computer, then the main CPU and RAM. Oh, and also your video card's memory and processor.

Well, as you see, unless you have a real PC made for gaming, part of the problem is at your desk.

Going to the network part: if you have a DSL connection, you're going to an L2 aggregator called a DSLAM, which gathers hundreds of connections onto one pipe (latency here) and delivers them to an L3 aggregator called a BRAS (more latency), then to the core of the network (hey, if your provider doesn't have a service specifically to speed up your gaming packets, you're one in a billion trying to get somewhere), and after some hops (routers) you arrive at Blizzard's datacenter.

For MMORPGs, here's where things get nasty.

Remember all those latency problems on your desktop? Well, they also happen on servers. Of course servers are better prepared for heavy load and have many architectural features that help them improve performance, but somewhere over 70% load, congestion issues will introduce latency.

An instance is probably a memory space where that environment exists and all action happens. So you'll see a lot of information moving from the memory to the main processor and back.

More instances running at the same time means more memory is allocated and the bus between that memory and the processor is used more heavily.

This way, only with a HUGE datacenter would you avoid the kind of lag that happens whenever Bliz puts new content live.

In every business model there is a point beyond which you're losing money. Crashing instances are that point. That's why most of the time you'll see a notice that more instances cannot be launched. It's a protection mechanism to keep things under control, so that you have fun in an instance rather than a laggy, boring one.
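A protection mechanism like that can be sketched as a simple cap on concurrent instances. This is only an illustration of the idea, not how Blizzard's servers actually work; the `InstanceServer` class, its method names and the limit of 2 are all hypothetical.

```python
class InstanceServer:
    """Toy instance host that refuses new instances once it is full."""

    def __init__(self, max_instances):
        self.max_instances = max_instances  # assumed hard cap per server
        self.active = set()
        self._next_id = 1

    def open_instance(self):
        """Return a new instance id, or None if the cap has been reached."""
        if len(self.active) >= self.max_instances:
            # Refusing here keeps the already running instances responsive.
            return None
        instance_id = self._next_id
        self._next_id += 1
        self.active.add(instance_id)
        return instance_id

    def close_instance(self, instance_id):
        """Free the slot when a group leaves its dungeon."""
        self.active.discard(instance_id)


server = InstanceServer(max_instances=2)
first = server.open_instance()
second = server.open_instance()
third = server.open_instance()      # server is full at this point
print(first, second, third)
server.close_instance(first)
print(server.open_instance())       # a slot is free again
```

The refused group sees an error, but the groups already inside keep a playable dungeon, which is the trade-off described above.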

Probably using cloud computing could help, but still you'll have to pay for the servers, energy, internet connections... there's a break point that adding more servers will cost so much that you'll need to pay more for the game.

Regarding the data that's kept at Blizz, you'll mostly find only numbers or such on the servers. Everything else is already on your computer. All your gear is just an ID number to the servers. It's your computer's job to draw that as your character.

A cloud solution could be an algorithm that "knows" which servers are more loaded than others and redirects users away from them to less loaded ones. Then again, all this requires processing time, and this information competes for CPU time with the game itself...
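In its simplest form, such an algorithm just sends each new instance to whichever server currently reports the lowest load. Here is a minimal sketch of that, with hypothetical server names and load figures; real load balancers weigh far more than one number.

```python
def pick_least_loaded(loads):
    """Return the name of the server with the fewest running instances."""
    return min(loads, key=loads.get)


def assign(loads, new_instances):
    """Place new instances one by one on the least loaded server.

    Returns the list of chosen servers and the resulting load table.
    The input dictionary is not modified.
    """
    loads = dict(loads)
    placements = []
    for _ in range(new_instances):
        target = pick_least_loaded(loads)
        loads[target] += 1
        placements.append(target)
    return placements, loads


# Hypothetical current load: number of instances running on each server.
current = {"server_a": 5, "server_b": 2, "server_c": 3}
placements, after = assign(current, 4)
print(placements)
print(after)
```

Even this tiny version needs an up-to-date load table for every server, which is exactly the extra bookkeeping cost mentioned above.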

I do think that Bliz offers the best it can. Of course that's not what we think is the best.

MMORPGs really bring real challenges to cloud and distributed computing.
"Yet the server still needs to send information just in case someone zooms in."
This is less of an issue in Eve, where the only visual differences between ships are the gun types and active modules. The modular T3 ships are closer to what characters in other MMORPGs are like, though. Nevertheless, Eve actually does contain a few optimizations for that: Guns on ships are not drawn until the ship actually fires and the gun information is sent anyway. Similarly, active module information is not sent until the modules are activated and the client needs to draw effects like shield boosters or target painters.

Mostly, this would help with loading lag when you enter the battle. Several hundreds of players each using more than a dozen pieces of equipment means that quite a bit of information needs to be sent and processed on a very short notice.

"It would perhaps have been a way for AoC and Warhammer Online to deliver the large scale pvp they promised but failed to give us."
I'm thinking a bit ahead here, but in supermassive battles one could use a system (appropriately named MASSIVE), similar to the one that was used in the LotR movie battles. Basically, the server abstracts away combatants that are not directly involved with a certain player. If you're 200 yards away from a group of players, the server just sends information about the average composition of the group and the client draws a mock battle containing fifty orcs, forty humans and ten elves. Only when the player gets close enough to plausibly contribute to the battle does the server start sending actual information about those players.
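As a rough sketch of that idea in Python: the server decides per group whether a viewer gets full player data or only a headcount by race. The 200-yard cutoff comes from the example above; the data layout and the `build_update` function are made up for illustration and are not taken from any real game server.

```python
import math
from collections import Counter


def build_update(viewer_pos, groups, detail_range=200.0):
    """Build what the server would send to one viewer.

    Groups closer than detail_range are sent in full; more distant groups
    are reduced to a count per race, which the client can use to draw a
    mock battle.
    """
    update = []
    for group in groups:
        distance = math.dist(viewer_pos, group["center"])
        if distance < detail_range:
            update.append({"kind": "full", "players": group["players"]})
        else:
            composition = Counter(p["race"] for p in group["players"])
            update.append({"kind": "summary", "composition": dict(composition)})
    return update


# Hypothetical battlefield with one nearby and one distant group.
groups = [
    {"center": (10.0, 0.0),
     "players": [{"name": "A", "race": "human"}, {"name": "B", "race": "elf"}]},
    {"center": (500.0, 0.0),
     "players": [{"name": "C", "race": "orc"}, {"name": "D", "race": "orc"},
                 {"name": "E", "race": "human"}]},
]

for entry in build_update((0.0, 0.0), groups):
    print(entry)
```

The summary stays the same size no matter how many players are in the distant group, which is where the bandwidth saving would come from.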
Since I haven't been and won't be able to play this week... can I spec my low level priest into healing and level up in dungeons now? Or is LFG at low levels still empty?
@Álvaro: Um, yes. That's all basic networking and computer hardware knowledge, and I assumed everyone knew that already. I don't see why you're bringing up internal buses as sources of latency, considering that those are several orders of magnitude faster than your disk or your network connection, so they are highly unlikely to be the bottlenecks.

WoW in particular is designed for low-to-medium-end machines, and the network latency is usually minimal, assuming that you're connecting to a datacenter on the same continent. Crossing the oceans is always going to cause latency, no matter what you do. But even then, the per-player network latency is not the bottleneck unless you insist on running a Torrent client or any other heavy bandwidth-using program in the background while you play. Even the built-in background downloader can be set to start downloading after you exit the game.

In any case, the point of cloud computing from a business model perspective is that any individual company is not required to pay for servers capable of handling the peak demand until they actually need it. And when that happens, the load-balancing algorithms can automatically allocate extra servers for them from the global pool. And because the servers are interchangeable, those servers are free to serve other companies during off-peak periods.
Tobold, this is a double standard.
Others are lame for not sizing their infrastructure properly.
Blizzard is just too damn successful.
They are simply not held to the same standards as everybody else.
Nevertheless, if after this patch their only problem is server load, then we should all praise Blizzard's QA team and the focus that is placed on that area.

What's hilarious, though, is that when players are asked to name the best and worst things about Blizzard, some might say polish is the best and the time it takes Blizzard to release an update is the worst. lol
"Tobold, this is a double standard.
Others are lame for not sizing their infrastructure properly.
Blizzard is just too damn successful."

Indeed it is. Funnily enough, Blizzard initially didn't size their infrastructure properly at launch, and have admitted that in the latest Blizzcast. While they had reserved spare servers, their estimates were wildly off the mark.

One of the developers mentioned an anecdote where the developers had just returned from some trade show where Star Wars Galaxies was being shown, disillusioned at their chances of ever putting up a fight versus the almighty Star Wars franchise. One of their bigwigs tried to snap them out of it by saying that WoW would have a million users at its peak. The developers rolled their eyes and derided the estimate as hopelessly optimistic.

That said, Blizzard is well-known for gradual iteration and polish, and their capacity planning and load balancing is not an exception. Cross-realm battlegrounds, arenas and now instance groups were all measures designed to spread the load out more evenly. There was no point having queues on one realm when another realm had trouble getting even one battleground/instance run started. They may not move as quickly as customers would like, but at least they are going in the right general direction.