How (Not) to Read the Data

The problem with statistics, as you might have noticed from various less-than-honest news sources, is that not all numbers were created equal. Graphs and other data visualisations can be misunderstood, misinterpreted, and misused - even accidentally!

In general terms, you need to pay attention to these things:

  • The source - Where is the data coming from? Is this an accurate sample of the group you're studying?
  • The method - How is the data being collected? No method is perfect, so what are the possible weak points?
  • Yourself - Do you understand everything correctly? Are you expecting a certain result (and possibly avoiding contradictions)?
  • The reader - Will your audience understand what you're trying to say? Could they misinterpret something?

This, incidentally, applies just as much to us (supplying the graphing tools) as to you (reading and interpreting the graphs). But hey! At least you can let us know if we're getting it wrong. :D

The source - AO3 tag system

Quoting wrangletangle:

Stats are fun! Please use responsibly.

This is your regular wrangler reminder to everyone to please add disclaimers on your stats meta that use AO3 because:

  • AO3 does not represent all of fandom, or even most of English-speaking fandom.
  • AO3 hosts more works than just fic. Please refer to “works” instead of “fics” to be accurate.
  • AO3 numbers are subject to wrangling work speeds and necessary decisions, including metatagging. Numbers can change drastically when guidelines are changed or when technical issues force us to adjust organization. Works can be delayed being added to a filter if they use a non-canonical tag and we are short-staffed or that fandom is not currently staffed.
  • These are just the tags people remember to tag with, so please use language like “tagged with TAGNAME”, not “contain CONCEPT”.
  • Users decide ratings and categories for themselves, and they may have different ideas of what those mean. Not all “mature” works are the same, so again, “tagged with TAGNAME” instead of “are TAGNAME”.
  • AO3 tags are wrangled specifically to help users find things, not to create accurate statistical data. For example, tags that contain two or more separate concepts that don’t modify each other will generally not be wrangled at all. The work still contains the concept; the tag is simply not wrangleable.
  • Tagging is always, always in flux. Every tagging decision is made by a human being, and judgment is often involved, so what one person would connect, another might not. These individual decisions add up quickly for large tags, some on the order of several per day.
  • There’s a bug that makes the top 10 filters sometimes inaccurate. Coders are working on it.
  • There’s another bug that makes some works disappear from the filters on canonization, synning, or rename. Also working on it.

Just because the numbers are easy to pull doesn’t indicate that they’re accurate or they mean what we want them to mean. Even broadly general conclusions may be insupportable; please always disclaim.

(original post, published with permission)

The method - tag API

Beneath it all, what we're doing is really just "dumb" website scraping - we're using the same numbers you can see when you open the tag page yourself. As a result, we know only what AO3 knows (or doesn't).

  • If there isn't a tag, there isn't a filter page and we can't get the data for it.
  • Tags that look like duplicates are shown separately when both are in use (e.g. character names in some fandoms).
  • Where we show percentages for tags, they do not add up to 100%, because one work can carry several tags and the groups overlap - and we cannot, at this time, tell you how they overlap.
  • Especially for the warnings and categories, authors' understanding varies wildly (e.g. some people tag "Gen" when there is no sex, some when the relationship isn't the focus, and others when there is no romantic relationship at all).
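
To make the point about percentages concrete, here is a minimal sketch using made-up data. The four works, their category tags, and the function name tag_percentages are all invented for illustration; this is not how our tool is built, just the arithmetic behind why overlapping tags push the total past 100%.

```python
from collections import Counter

# Hypothetical sample: each work is the set of category tags its author chose.
# One work carries two tags, which is where the overlap comes from.
works = [
    {"F/M"},
    {"F/M", "M/M"},
    {"M/M"},
    {"Gen"},
]


def tag_percentages(works):
    """Return, for each tag, the share of works tagged with it (in percent)."""
    counts = Counter(tag for tags in works for tag in tags)
    total = len(works)
    return {tag: 100 * n / total for tag, n in counts.items()}


percentages = tag_percentages(works)
total_percent = sum(percentages.values())

for tag, pct in sorted(percentages.items()):
    print(f"tagged with {tag}: {pct:.0f}%")
print(f"sum of percentages: {total_percent:.0f}%")
```

Each percentage is measured against the same total number of works, so a work with two tags is counted once in each group. From the per-tag numbers alone there is no way to tell which works sit in more than one group, which is exactly the information the tag pages don't give us.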

Additional Reading