7-9 scale - the discussion

I love this idea, and it’s how I keep my own database of game rankings. There are plenty of people who know how to convert scores to standard normal distributions; it would be pretty simple to just convert the raw scores, even for an aggregation site like Game Rankings.

Hardly anyone knows how to interpret standard deviations, so I think the final score should give an actual percentage as well. For example, a score of 80% would mean that the game achieves an aggregate ranking higher than 80% of all the games reviewed. The only problem is that scores wouldn’t be static; they could change as the universe of reviewed games gets bigger. I don’t know how difficult that would be to maintain at a commercial site, but it’s not a problem for my Excel spreadsheet.
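For the curious, here’s roughly what that conversion looks like in code: a minimal sketch assuming each site’s scores are roughly normally distributed. The site names and numbers are invented purely for illustration.

```python
from statistics import NormalDist, mean, stdev

# Invented example data: raw scores grouped by review site.
site_scores = {
    "SiteA": [7.5, 8.0, 9.0, 6.5, 8.5],  # a "7-9 scale" outlet
    "SiteB": [2.0, 3.0, 4.0, 3.5, 5.0],  # a five-star outlet that uses the whole scale
}

def percentile(score, scores):
    """Standardize a raw score against its own site's mean and standard
    deviation, then map the z-score through the normal CDF, so 80 means
    'ranked higher than roughly 80% of the games this site has reviewed'."""
    z = (score - mean(scores)) / stdev(scores)
    return 100 * NormalDist().cdf(z)

# An 8.0 from the inflated site and a 4-star score from the strict one
# come out comparable once each is read against its own distribution.
print(round(percentile(8.0, site_scores["SiteA"])))  # ~54
print(round(percentile(4.0, site_scores["SiteB"])))  # ~67
```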

Oh shit, yeah, sorry, I forgot about GamerDad. My first stop for game reviews, and so I naturally forgot to mention it…

And I’m sorry to say that GameSpy and GameSpot (and the utterly terrible EGM) suffer from the MacReview syndrome that IGN is so well known for, but with added Edge-like arsiness and snideness in their reviews and rating systems. V. Poor.

If I believed what they had written I wouldn’t have bought Gun or Shadow the Hedgehog, and I’d have missed out on a lot of entertainment (OK, so Shadow is a little flawed, but only a little…)

PSM used to do something like that: they’d place the score of a game in the context of 3 or 4 other games in that genre, so you could see where it fell (in their opinion, of course) in comparison with its competition. Then they stopped doing that. Wasn’t “edgy” enough, I guess.

[quote]I love this idea, and it’s how I keep my own database of game rankings. There are plenty of people who know how to convert scores to standard normal distributions; it would be pretty simple to just convert the raw scores, even for an aggregation site like Game Rankings.

Hardly anyone knows how to interpret standard deviations, so I think the final score should give an actual percentage as well. For example, a score of 80% would mean that the game achieves an aggregate ranking higher than 80% of all the games reviewed. The only problem is that scores wouldn’t be static; they could change as the universe of reviewed games gets bigger. I don’t know how difficult that would be to maintain at a commercial site, but it’s not a problem for my Excel spreadsheet.[/quote]
Cool, nice to know it actually works.
I can see a percentage ranking of that sort being useful, as well as graphs of score distribution curves showing where a game falls on them. You could restrict the set to games in that genre, to games by that reviewer, and so on.

Making this tool available to the reviewer when he’s submitting the score would be a big step up, too, since it would force him to really evaluate in context.

We struggled with the rating system for our site. Everyone had a different opinion: some wanted the 1-100 rating to match industry standards, but few folks wanted to deal with that level of granularity (which ends up being arbitrary, anyway). Some (like myself) didn’t want any numbers whatsoever, but it was argued that to get listed on sites like GameRankings, we had to have numbers of SOME sort.

In the end we went with a 5-point system: 5s are games we love, 1s are crap, and everything else usually falls in between.

Interesting idea about the aggregate number. I’ll have to look over what we have to date to see if that makes any difference.

Not edgy, maybe, but also not useful. No offense to the people who brought it up, but trying to come up with ways to make ratings more mathematically accurate is missing the point. A rating is not a precise, objective measurement. It’s just one person’s opinion, and any given pub has a bunch of different people generating those opinions. I mean, sure, you can come up with some model that says “game X scored higher than 80% of the other games in our database,” but what does that actually mean? Pretty much nothing. It’s just an arbitrary number. You’d probably have a different arbitrary number if they had given the review to a different person. I don’t understand what meaningful information I’m supposed to be able to derive by comparing how one person rated one game to how a different person rated a different game.

Ratings are already way more complicated than they need to be; they don’t need to get more complicated. Percentages, ten point scales with decimals, multiple rating categories cooked down into mean averages… all that stuff is utterly pointless. I’d be happy with a rating system that had only two ratings (“thumbs up” and “thumbs down”). I’d even be happy with no ratings at all.

I’ve always been a big proponent of the Thumbs Up/Thumbs Down method, but you simply do not get listed on GameRankings if you do that, and apparently that’s a big means of getting people to visit your website.

It sucks.

–Dave

GameRankings is a curse.

I agree. I think they also won’t put you in there unless your reviews are some arbitrary (rather long) length. I wanted to shorten up the reviews at GamerDad and it just can’t be done and still be listed at Gamerankings.

–Dave

Unless you’re someone who wants a new fighting game and wants to know which one PSM thinks is better: Tekken, SC, VF, MK, etc. Are you really saying comparisons are irrelevant to a review or body of reviews?

Mind elaborating? It’s a curse that it collects scores from various sites and presents an average?

[quote]Mind elaborating? It’s a curse that it collects scores from various sites and presents an average?[/quote]

Basically, yeah, it is. They’re interpreting the rating systems to mean whatever they want. Since you and I know that many sites never use anything below a 5 on their 1-10 scales, GameRankings translates all reviews at those sites into scores of about 70% or higher, which means even bad games look much better.

Those sites/mags that use five-star scales and use the whole thing end up waaaaayyyyy at the bottom of the list of reviews if they give something three stars while everyone else was giving a game 7.5 or 8.0. What happens is that unless you give games something in that 7-9 range, no one ever clicks on your review. Given this, is it any surprise that many of the smaller sites tend to have seemingly inflated scores so that they’re a bit higher on the list?
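To put toy numbers on that (all invented), the naive linear conversion an aggregator applies is basically:

```python
# Naive aggregator conversion: score / scale_max * 100, with no regard
# for how the outlet actually uses its scale (figures invented).
def naive_percent(score, scale_max):
    return 100 * score / scale_max

# A score from a site that never dips below 5 on its 10-point scale...
print(naive_percent(7.0, 10))  # 70.0, reads as "decent"
# ...versus an honest middle-of-the-road 3 stars from a site using all 5:
print(naive_percent(3.0, 5))   # 60.0, reads as worse, despite meaning "average"
```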

Gamerankings is partially to blame for the mess that is online game reviews.

–Dave

It does suck. GameRankings should have paid more attention to how Rotten Tomatoes does things: they boil review scores down to the simplest common component (liked it/didn’t like it) rather than the most complex. If your review database service can’t handle reviews that stray too far from its format, then you have a sucky service. It’s not unreasonable to think that pubs shouldn’t have to tailor their format to suit GameRankings’ needs.

Every time we give a game 3 or even 3.5 stars, we get angry phone calls from publishers threatening to pull advertising because we’ve “trashed” their game. What are they talking about? The “60” or “70” that Gamerankings attaches to our score, which is invariably 10-15 points lower than scores those games get on other sites. We’ve even sent GameRankings a conversion chart to help them line up our scores with the 7-9 scale, and they just ignore it.
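Applying a chart like that would be trivial on their end; a hypothetical version (the values here are invented for illustration, not our actual chart) is just a lookup:

```python
# Hypothetical conversion chart lining a 5-star scale up with the
# industry's de facto 7-9 range (values invented for illustration).
star_to_percent = {
    5.0: 95, 4.5: 90, 4.0: 85, 3.5: 80, 3.0: 75,
    2.5: 65, 2.0: 55, 1.5: 45, 1.0: 35,
}

# 3 stars reads as a 75 instead of the literal 60 an aggregator computes.
print(star_to_percent[3.0])
```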

I’ll have some more to say on this shortly, but I think the idea of a metasite that uses standard deviations from a site’s average to put review scores in perspective is absolutely brilliant.

Except at Rotten Tomatoes, if you use a 5-star scale, they count 3 stars as a rotten tomato. At X-Play, where I do some reviews, a 3 out of 5 is not a pan.

Ratings suck.

It’s still better than GameRankings’ system. That discrepancy would only apply to a small number of your reviews (the ones that score between 3 and wherever your real breaking point would fall… 2.5?), and it could easily be fixed by simply asking individual sites where they want the breaking point to fall.
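That fix is cheap to implement, too. A sketch, with hypothetical outlets and cutoffs that the sites themselves would supply:

```python
# Each outlet declares where "recommended" starts on its own scale
# (outlet names and cutoffs are hypothetical).
breaking_points = {
    "DefaultFiveStar": 3.5,  # Rotten Tomatoes-style: 3 stars counts as rotten
    "X-Play": 3.0,           # per the post above, 3 out of 5 is not a pan
}

def thumbs(outlet, score):
    """Reduce a raw score to a binary verdict using the outlet's own cutoff."""
    return "up" if score >= breaking_points[outlet] else "down"

print(thumbs("DefaultFiveStar", 3.0))  # down
print(thumbs("X-Play", 3.0))           # up
```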

But yeah, ratings suck.

Change the word ‘ratings’ to ‘grades’ and presto-chango! You have a discussion about what’s wrong with education in America.

You seem to be implying that a system of measurement is useless unless it is 100% error-proof and objective. I take the point of view that any measuring device, even heart rate monitors and quantum clocks, has some degree of error. It’s up to the community of users/consumers to decide whether a specific measurement system is good enough.

In theory, aggregation sites like GameRankings should provide the best index of quality: in test construction theory, aggregating scores tends to reduce error, since error or bias in one score is likely to be “evened out” by biases in other scores. In practice, GameRankings has a clumsy method of normalizing scores from different review sites (summarized well by Dave Long), so some of its potential is wasted.
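A quick simulation of that error-reduction claim, under the toy assumption that each review is a game’s “true” quality plus independent reviewer noise (all numbers invented):

```python
import random
from statistics import mean, stdev

random.seed(0)
TRUE_QUALITY = 7.0    # the game's "real" score (toy assumption)
REVIEWER_NOISE = 1.5  # standard deviation of individual bias/error (invented)

def aggregate(n_reviews):
    """Average n noisy reviews of the same game."""
    return mean(random.gauss(TRUE_QUALITY, REVIEWER_NOISE) for _ in range(n_reviews))

# The spread of the aggregate across repeated trials shrinks as reviews
# are added, roughly as noise / sqrt(n): about 1.5, 0.67, and 0.30 here.
for n in (1, 5, 25):
    trials = [aggregate(n) for _ in range(10_000)]
    print(n, round(stdev(trials), 2))
```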

Here are the top 5 rated games in my database, representing an unholy amalgamation of different sites, scoring systems, reviewers, and genres:
Half-Life 2
Half-Life
Civilization 4
Unreal Tournament
Unreal Tournament 2004

Here are the bottom 5:
Extreme Paintball
Catfight
Swamp Buggy Racing
Panty Raider
MTV: Celebrity Deathmatch

If ratings are entirely subjective and arbitrary, then there should be no difference in actual quality between these two lists, since I used a method of ranking that is, to you, no different than flipping a coin. There’s no way any regular reader of Qt3 would see the quality of these two lists as equal. Therefore, rating systems – even the 7-9 ones – must be somewhat useful as an index of actual quality.

I’ve never suggested that people should use a rating alone to decide whether to buy a game, and I subscribe to all the magazines because I love to read reviews. It’s obvious to me, however, that rating systems do track actual game quality, even though each individual review has some subjectivity, error, and bias (as used in measurement theory). It appears that you can compare rankings across different sites that use different rating systems, assuming you properly apply methods of equalizing the data (like normalizing scores). In a perfect (statistical) world, each site would have a single reviewer. In practice, it appears consistent editorial policy curbs individual excess in grading standards.

You should record and publish those phone calls.

Yeah, or do the “Can we quote you on that?” like Bauman says he does when he gets those calls.

–Dave