Mega content analysis?

Many times I’ve been fascinated by visualizations of vast amounts of data. One of the first impressive ones I saw was “The Dumpster”, and today I saw another interesting TED talk featuring the work of Jonathan Harris, who describes himself as “an artist and storyteller working primarily on the Internet”. Indeed, I think he presented some quite impressive projects. You can judge for yourself (warning – it’s almost 20 minutes):

Listening to the talk generated two main streams of ideas in my head: one more critical, the other more practical. I’ll start with the critical one, to leave a better taste at the end, for all in all I am impressed by Harris’s work.

The critical comment is mostly about the way things were presented, and particularly the language used during the presentation. The projects presented in the video deal mainly with graphic representation of user-generated content. As an audience, we have no idea how exactly the content was sampled, with the exception of the “we feel fine” project, where we know it is based on a simple search for variations of the word “feel” in some mysterious database of blogs that purports to cover the “internet”. It is not stated, but with the exception of “time capsule” (more here), the other two projects seemed to target an English-speaking audience only. Watching the video, I assume the other projects work in a more or less similar fashion. So far, it seems OK. Right?
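For illustration, the kind of sampling described for “we feel fine” can be sketched in a few lines of Python. This is not Harris’s actual code – just a hypothetical reconstruction of what “a simple search for variations of the word feel” might look like, with a made-up blog post as input:

```python
import re

# Variations of "feel" to search for, as the project description suggests.
FEEL_PATTERN = re.compile(r"\b(feel|feels|feeling|felt)\b", re.IGNORECASE)

def extract_feelings(text):
    """Return the sentences of a blog post that contain a variation of 'feel'."""
    # Naive sentence split on end-of-sentence punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if FEEL_PATTERN.search(s)]

post = ("I went to the market today. I feel exhausted, but happy. "
        "The weather was great! Yesterday I felt much worse.")
print(extract_feelings(post))
# → ['I feel exhausted, but happy.', 'Yesterday I felt much worse.']
```

Of course, the real system would also have to crawl the blogs in the first place – and that crawl is exactly the sampling step we, as an audience, know nothing about.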

What did disturb me in the talk is the chutzpah of Harris to refer to the entire human race while describing his projects. That stands out particularly for me as someone trying to study what is labeled the “digital divide”. First of all, however shocking it may be to some technology enthusiasts, 82.4% of the world’s population are still not online (as of June 30, 2007). Another shocking fact is that not the entire world population speaks English. I have a study from 2000 by Suppiramaniam Nanthikesan stating that back then English speakers constituted only 48.5%–50.9% of the online population (I know I need more up-to-date data). Given that internet penetration growth rates in developing countries were higher than those in the developed world, I doubt that 100% of people online are English speakers. Finally, most people online do not create content. For example, according to a very recent PEW report, only about 8% of Americans take full advantage of contemporary technology, which includes blogging. In other words, the other 92% are barely creating content (so much for web 2.0, but that is a subject for another post). The bottom line is that I think Jonathan is blowing the thing out of proportion. I have to admit that at no stage did Jonathan claim to be doing scientific work. He is an artist, and as such I might be judging him too harshly. My problem is probably that this talk is representative of a trend, and that is why it triggered all of the above. At the same time, I also have to admit he is doing great marketing for his projects.

Another annoying thing was the comparison between Greek mythology and contemporary news. Leaving aside the fact that we, as an audience, have no idea which newspapers were sampled in the “universe” project (currently the top story on Harris’s homepage), I think it is a bit overambitious to suggest that this is what creates the mythology of our times. As much as our current history is written to a large degree through mass media, scaling news onto a timeline of centuries requires some selection and self-criticism. Ironically, in the video, one of the “myths” shown by the program was Anna Nicole Smith. With all due respect to her cultural contribution, I will be surprised if, in 100 years, she is remembered as a cultural representation of our times. So, again, as impressive as the tools are, I think it is important to keep these kinds of things in proportion, even if you are an artist.

Now, having said all the above, I admit that the presented tools are interesting. This takes me to the more practical line of thought. As I mentioned above, at the end of the day, what he is attempting is a sort of mega content analysis. Setting aside the bombastic statements, the amount of data analyzed in each project is huge, and the data are fascinating in themselves. I think the attempt to mechanize massive content analysis is really interesting, as is its graphic representation. Indeed, the internet presents a good and convenient opportunity to study texts – both user-generated content and digitized versions of the mainstream media. Eventually we will have to figure out how to do it in ways other than manually.
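To make this concrete, here is a minimal, hypothetical sketch of one mechanized content-analysis step: counting occurrences of analyst-chosen terms across a corpus, a coding task traditionally done by hand. The corpus and the terms below are made up purely for illustration:

```python
import re
from collections import Counter

def term_frequencies(documents, terms):
    """Count how often each term of interest appears across a corpus.
    A toy stand-in for the manual coding step of content analysis."""
    counts = Counter()
    for doc in documents:
        # Lowercase and split into word tokens.
        for word in re.findall(r"[a-z']+", doc.lower()):
            if word in terms:
                counts[word] += 1
    return counts

corpus = [
    "War dominates the headlines again.",
    "Peace talks resume amid hopes of peace.",
    "Markets react to war fears.",
]
print(term_frequencies(corpus, {"war", "peace"}))
```

Trivial as it is, scaled to millions of documents and paired with a visualization layer like Harris’s, this is roughly the pipeline his projects seem to be built on.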

Recently I had a conversation with Claire Cardie from the Information Science program at Cornell, who is working on natural language understanding. It would be pretty amazing if we could combine the tools she is working on with the visualization capabilities of Harris’s projects and the knowledge accumulated in social science (particularly communication :). If configured correctly, I believe Jonathan’s visualization tools could help both with the analysis of the data and, just as importantly, with explaining it to people, thus opening up this convoluted field of social science to a broader public. I think this kind of cooperation has great potential.

On the off chance that Jonathan Harris is reading this post, I will mention a project I am participating in at the moment. It is all about user-generated content on a wiki platform, and our aim is to have a lot of content for analysis at the end of the day. I will definitely blog more about this project in the future, but Jonathan, if you are reading this, you are welcome to drop me an email :)

7 Responses to “Mega content analysis?”

  1. Nadya Says:

    I wish I could watch the videos you post!

    An OT question: is there any way I can get email notifications when you reply to my comments in your blog?

  2. Dima Says:

Why can’t you? Is the broadband not working? Or did you not get it in the end?

Good question… I know that at the bottom of this page there is a link to the RSS feed for comments… I am checking how I can have it emailed…

  3. Nadya Says:

I did get it, but it sucks :) There is not much difference from dial-up in terms of speed.

  4. Dima Says:

    Hmm… digital divide!

  5. Nikos Says:

I didn’t find this talk very interesting. Surely there are lots of data, but the visualization doesn’t add much value to them. Watching this, I wasn’t able to draw any conclusion that would have been hard to come to previously. I guess my point can be made clearer if you contrast this with the talks Hans Rosling has given at TED this year and last. You should especially watch the end of this year’s talk. It’s shocking!

I don’t know if posting links in the comments works, but you can find his videos here (2007) http://www.ted.com/index.php/talks/view/id/140 and here (2006) http://www.ted.com/index.php/talks/view/id/92 You can also use his tool by going to http://tools.google.com/gapminder

  6. Dima Says:

    I am glad to know that you are reading this blog :)

Thanks for the links! I saw at least one of those videos not that long ago and played with his tool. Interesting indeed. And I think my idea of combining Harris’s visualizations with some serious research data is basically along the same lines.

  7. lisa Says:

    Rosling’s main point in his first TED talk was that visualisation is important, because with good tools, the data becomes so much more accessible and correlatable (if there’s such a word) so you can see trends and draw conclusions from it more quickly. And so in this sense, Harris’ projects are very interesting — definitely some sexy visualisations, but not necessarily producing any interesting results.

    I think you raised an important point with the natural language analysis. I know many groups are dabbling in this, but despite Chomsky breaking it all down into seemingly simple universal rules, it’s been much harder to replicate that in software, even just for English! If you remember, this was a part of Stephen Rondel’s vision for devices of the future too — that they would understand natural language and make the information it contains computable. One day…
