By Camestros Felapton
Firstly, for those who are fans of such things, those Wikipedia views do follow a power-law distribution (sort of). Here is a graph of the rank and frequency with logarithmic axes.
It's a straight line until we get to stories ranked 500 and beyond, where it turns a corner. How is this useful? It isn't, but it is always fun to see.
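For anyone who wants to try the same rank-versus-frequency check at home, here is a minimal sketch. It is not the spreadsheet work behind the graph above: the function names and the toy counts are made up for illustration, and it only uses the Python standard library. The idea is that a power law shows up as a straight line on log-log axes, so the slope of that line is the thing to look at.

```python
import math

def rank_frequency(views):
    """Sort view counts in descending order and pair each with its rank (1 = most viewed)."""
    ordered = sorted(views, reverse=True)
    return [(rank, count) for rank, count in enumerate(ordered, start=1)]

def loglog_slope(pairs):
    """Least-squares slope and intercept of log10(count) against log10(rank)."""
    # Zero counts have no logarithm, so they are skipped.
    kept = [(r, c) for r, c in pairs if c > 0]
    xs = [math.log10(r) for r, _ in kept]
    ys = [math.log10(c) for _, c in kept]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented counts that follow count = 10000 / rank exactly (not the real Wikipedia data).
toy_views = [10000 / r for r in range(1, 51)]
slope, intercept = loglog_slope(rank_frequency(toy_views))
print(slope, intercept)
```

With real data you would expect the fitted slope to hold for the top-ranked stories and then drift where the line "turns a corner", so fitting only the first few hundred ranks is a reasonable variation.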
Meanwhile, on Twitter, ErsatzCulture suggested comparing the data set I'd scraped together with one available on the Internet Speculative Fiction Database (ISFDB) http://www.isfdb.org/cgi-bin/stats.cgi?15
That was an interesting idea. My list has 694 notable stories; the ISFDB list had the 500 stories with the most views in the previous week. Safe to assume there would be a lot of overlap! It turned out that I could only match 115 stories on both lists! Now, I haven't done a careful recheck for issues such as variations in title or author name that might lead to matches being missed, so there may be more if I picked through the data.
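One way to reduce missed matches from title and author variations is to compare normalised keys rather than raw strings. The sketch below is an assumption about how that could be done, not what I actually ran: `normalise`, `match_key` and `overlap` are hypothetical helpers, and the sample rows are toy data rather than the real lists.

```python
import re
import unicodedata

def normalise(text):
    """Lower-case, strip accents and punctuation, drop a leading article, collapse spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    text = " ".join(text.split())
    return re.sub(r"^(the|a|an) ", "", text)

def match_key(title, author):
    """Key used for matching: normalised title plus normalised author."""
    return (normalise(title), normalise(author))

def overlap(list_a, list_b):
    """Return the rows of list_a whose key also appears in list_b."""
    keys_b = {match_key(title, author) for title, author in list_b}
    return [(t, a) for t, a in list_a if match_key(t, a) in keys_b]

# Toy rows for illustration only; the second entry in each list is invented.
wiki_rows = [
    ("I Have No Mouth, and I Must Scream", "Harlan Ellison"),
    ("The Example Story", "A. N. Author"),
]
isfdb_rows = [
    ("I Have No Mouth and I Must Scream", "Harlan Ellison"),
    ("A Different Story", "Someone Else"),
]
print(overlap(wiki_rows, isfdb_rows))
```

This only catches punctuation, accent, case and leading-article differences. Pseudonyms, subtitles and reordered author names would still need a manual pass or fuzzy matching.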
A second issue is that the original Wikipedia list has some significant omissions. I mentioned Omelas yesterday, but Harlan Ellison's "'Repent, Harlequin!' Said the Ticktockman" was also missing from the Wiki list, even though it does have its own Wikipedia page (4,229 views in the past 30 days).
Still, 115 isn't nothing and we've got a basic idea here. Seems fair to think that some stories just get more attention than others and that latent trait should manifest in both data sets? Right?
Yuck. Aside from one really, really heavily viewed story, our basic idea doesn't seem very strong. That really interesting story is "I Have No Mouth, and I Must Scream" by Harlan Ellison. Computer data really loves that story. It's the second most viewed story in both data sets.
The other stories look like they are all hiding in a corner to avoid Harlan. However, we now have a big disparity in the magnitude of views because power laws are in play. Let's throw logs at the axes again (how data scientists chop wood).
I could slap a trendline on that and the R² value would be OKish (35%) but that's nearly all Harlan Ellison as an outlier grabbing all the attention. Take out Harlan and that value drops to 0.054 (0.5%). Half a per cent isn't great in any discipline.
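To see how much a single outlier can prop up a trendline, here is a rough sketch that computes R² on log-transformed views with and without the biggest point. The pairs are invented toy numbers, not the 115 matched stories, and the helper names are my own; for a straight-line fit, R² is just the squared Pearson correlation, which is what the code calculates.

```python
import math

def r_squared(xs, ys):
    """Squared Pearson correlation, which equals R² for a simple linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

def log_r_squared(pairs):
    """R² after taking log10 of both view counts (all counts must be positive)."""
    xs = [math.log10(a) for a, _ in pairs]
    ys = [math.log10(b) for _, b in pairs]
    return r_squared(xs, ys)

# Invented (wikipedia_views, isfdb_views) pairs; the last one plays the role of the outlier.
toy_pairs = [(120, 40), (300, 15), (80, 60), (500, 35), (200, 20), (90000, 9000)]

# Treat the story with the most Wikipedia views as the outlier to drop.
outlier = max(toy_pairs, key=lambda p: p[0])
with_outlier = log_r_squared(toy_pairs)
without_outlier = log_r_squared([p for p in toy_pairs if p != outlier])
print(with_outlier, without_outlier)
```

In the toy data the fit looks respectable with the outlier and much weaker without it, which is the same pattern as the Harlan effect described above, just with made-up numbers.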
So, at the end of all that, can I claim I have demonstrated anything? Well, we have strong statistical evidence that Harlan Ellison still attracts a lot of attention in weird and unusual ways.