Tuesday, 31 January 2012

Semantics and homophily on Twitter

I took about 600 recent tweets with the hashtag #NHSbill and analysed them for word clusters using Latent Dirichlet Allocation (a hierarchical Bayesian technique). I've been thinking of building a tool that could (1) help identify discussion themes, (2) find related tweets, and (3) suggest tweeps whose views are similar to or different from your own.
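For anyone curious what the clustering step involves, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is not the code behind the results in this post: the function names (lda_gibbs, top_words), the hyperparameter values and the toy token lists are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA to tokenised documents with collapsed Gibbs sampling.

    Returns (topic_word, doc_topic): per-topic word counts and
    per-document topic counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    topic_total = [0] * n_topics                        # tokens per topic
    topic_word = [Counter() for _ in range(n_topics)]   # word counts per topic
    doc_topic = [[0] * n_topics for _ in docs]          # topic counts per doc

    # Start by giving every token a random topic.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            topic_total[z] += 1
            topic_word[z][w] += 1
            doc_topic[d][z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of the counts...
                z = assign[d][i]
                topic_total[z] -= 1
                topic_word[z][w] -= 1
                doc_topic[d][z] -= 1
                # ...and resample its topic given all the other tokens.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                topic_total[z] += 1
                topic_word[z][w] += 1
                doc_topic[d][z] += 1

    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """The n most frequent words assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0]
            for counts in topic_word]


# Toy, made-up token lists standing in for tokenised tweets.
toy_tweets = [
    ["#nhs", "proud", "healthcare"],
    ["doctors", "oppose", "#nhsreform"],
    ["proud", "#nhs", "professionals"],
    ["doctors", "#nhsreform", "oppose", "pensions"],
]
topics, _ = lda_gibbs(toy_tweets, n_topics=2, n_iter=50)
for words in top_words(topics):
    print(" ".join(words))
```

With only a few hundred short tweets the clusters can move around between runs, so it is worth trying several seeds and topic counts before reading much into any one list.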

This is a stab at step (1). The collections of words below summarise the discussion themes and, this being Twitter, include associated tweeps (hope you don't mind the associations) and links.

#nhs @silv24 proud
disaster healthcare standing
talking professionals Lansley

@martinmckee @safcbob
#nhsreform doctors
week Andrew @ijgreener
@ingridjohanna66 RCPsych

500 GPs [involved in] commissioning groups
want clinical withdrawn
amended http://www.bbc.co.uk/news/health-16771304

@rcplondon @djnicholl oppose
extraordinary mtg
@welsh_gas_doc saying
gen care pensions

support royal
alternative Lansley
paper Colleges
r4 says Times

@cpeedell @clarercgp
@marcuschown @leicesterliz
wrong modification
enough thrown out

What was frustrating was that @profchrisham kept popping in and out of the model as I was fiddling with the stop word list ('a', 'the', 'and', etc.). Supportive voices for the bill were uncommon in the sample of tweets, so I might go for a bigger sample. Like I said before, maybe I need to find a debate on Twitter that is less, eh, one-sided.
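To make the stop word fiddling concrete, here is a rough sketch of a tokeniser that drops stop words but keeps @mentions, #hashtags and links as single tokens. The stop word set, the regular expression and the name tokenise are assumptions for illustration, not what I actually ran.

```python
import re

# Hypothetical stop word list; changing it changes which tokens reach the model.
STOP_WORDS = {"a", "the", "and", "of", "to", "in", "is", "rt"}

# Match a whole URL first, otherwise a word with an optional @ or # prefix.
TOKEN_RE = re.compile(r"https?://\S+|[@#]?\w+(?:'\w+)*")


def tokenise(tweet, stop_words=STOP_WORDS):
    """Split a tweet into tokens, dropping stop words (case-insensitively)."""
    return [tok for tok in TOKEN_RE.findall(tweet)
            if tok.lower() not in stop_words]


# A made-up tweet, not one from the sample.
print(tokenise("The #NHSbill is a disaster says @rcplondon"))
```

Because the stop word list is a parameter, the same tweets can be re-tokenised with different lists to see which handles and words the model is sensitive to.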

Just finding discussion themes, step (1), could be a way of following the various threads in a live Twitter hashtag meeting, if it could be done live.


  1. @dean I've started pondering how we might build 'quick view' tools in R based around searches on the most recent tweets tagged in a particular way, e.g. http://blog.ouseful.info/2012/01/21/a-quick-view-over-a-mashe-google-spreadsheet-twitter-archive-of-ukgc2012-tweets/ and the comments on that post. I intend to start trying to collate scripts here: https://github.com/psychemedia/Twitter-Backchannel-Analysis