Latent Dirichlet Allocation
Part of a good SEO consultant's job is to research the latest developments in the industry and keep up to date with new theories and models. One such model I came across yesterday is called Latent Dirichlet Allocation (LDA). Wikipedia describes this method as 'a generative model that allows sets of observations to be explained by unobserved groups which explain why some parts of the data are similar' - essentially, in SEO terms, it offers a way of predicting how a site might rank based on the keywords and related keywords it uses.
SEOMoz.org gives us an interesting article on the correlation between LDA and Google's ranking outcomes. The article talks about some of the ways that search engines might rank two competing pages using a number of factors. One such factor is co-occurrence, which my Keyword Association Manipulation technique takes advantage of.
Using vector spaces to describe how similar keywords are to different topics, you can generate a model of how the co-occurrence of words relates to different topics. For example, if one page contains the text 'he drove his car fast', the word 'drove' would be more closely related to the topic of 'vehicle piloting' because of the word 'car' in the same passage, whereas in 'he drove the ball down the fairway' it would be more closely related to the topic of golf because it is near the words 'ball' and 'fairway'. A third passage, 'she drove away every man she got close to', would be related to a completely different topic altogether.
LDA uses topic modelling to sort content based on how closely its words are related to a topic. In the example above, the word 'drove' could be described as 60% related to vehicle piloting, 10% related to golf, 10% related to collective nouns (e.g. droves of people), and so on. If you then add the other keywords in the passage, you get a better idea of what the passage is talking about. For example, 'car' is 90% vehicle piloting and 0% golf (treated as 1% in the calculations, for the reason given in the note further down), so if you have 'drove' and 'car' together you can calculate their topic similarity as follows:
Passage 1:
Vehicle Piloting = 54% (drove: 60% * car: 90%)
Golf = 0.1% (drove: 10% * car: 1%)
So passage 1 is most likely to be related to vehicle piloting.
Passage 2:
Vehicle Piloting = 0.6% (drove: 60% * fairway: 1%)
Golf = 9.9% (drove: 10% * fairway: 99%)
So passage 2 is closer to the topic of Golf.
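The two passage calculations above can be reproduced in a few lines of Python. This is only a sketch of the simplified multiplication used in this post, not a full LDA implementation: the WEIGHTS table and the topic_scores and best_topic functions are names made up for illustration, and the percentages are simply the example figures from the text.

```python
from math import prod

# Illustrative word-to-topic weights, copied from the worked example in the post.
# These are made-up figures, not the output of a trained model.
WEIGHTS = {
    "drove":   {"vehicle piloting": 0.60, "golf": 0.10},
    "car":     {"vehicle piloting": 0.90, "golf": 0.01},
    "fairway": {"vehicle piloting": 0.01, "golf": 0.99},
}


def topic_scores(words, weights=WEIGHTS):
    """Score a passage on each topic by multiplying its words' topic weights."""
    topics = {topic for word in words for topic in weights[word]}
    return {topic: prod(weights[word][topic] for word in words) for topic in topics}


def best_topic(words, weights=WEIGHTS):
    """Return the topic with the highest multiplied score."""
    scores = topic_scores(words, weights)
    return max(scores, key=scores.get)


passage1 = topic_scores(["drove", "car"])      # 'he drove his car fast'
passage2 = topic_scores(["drove", "fairway"])  # 'he drove the ball down the fairway'

for name, scores in (("Passage 1", passage1), ("Passage 2", passage2)):
    for topic, score in sorted(scores.items()):
        print(f"{name}: {topic} = {score:.1%}")
```

Calling best_topic(["drove", "car"]) picks vehicle piloting and best_topic(["drove", "fairway"]) picks golf, matching the conclusions drawn for the two passages.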
Note: for the mathematics of topic modelling, I've set the maximum weighting to 99% and the minimum to 1%. The reason for this is that anything multiplied by 0 is 0, so even if you had 100 relevant keywords and 1 completely irrelevant one, the result of the multiplication would be 0. Furthermore, if you were then using an average relevancy for the calculation, you would be dividing 0 by a positive number, which is still 0, so the single irrelevant word would wipe out the contribution of every other word. On a more pragmatic level, it is likely that an irrelevant word could crop up in any passage on any topic, and it can either be discarded as a statistical outlier or have its effect muted with associative weighting (a topic for another time), so setting its relevancy to 1% allows the numbers to be calculated.
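As a rough illustration of why the 1% floor matters, the sketch below multiplies the same set of weights with and without the limits applied. The clamp function, the MIN_WEIGHT and MAX_WEIGHT constants and the sample weights are all assumptions made up for this example.

```python
from math import prod

MIN_WEIGHT = 0.01  # the 1% floor described in the note
MAX_WEIGHT = 0.99  # the 99% ceiling described in the note


def clamp(weight, low=MIN_WEIGHT, high=MAX_WEIGHT):
    """Keep a weight inside the [low, high] range."""
    return max(low, min(high, weight))


# Hypothetical weights for one topic: two relevant words and one irrelevant one.
raw_weights = [0.9, 0.8, 0.0]

unclamped = prod(raw_weights)                  # the single 0 wipes out the product
clamped = prod(clamp(w) for w in raw_weights)  # the floor keeps the product above 0

print(f"without limits: {unclamped}")
print(f"with limits:    {clamped:.4f}")
```

With the floor in place the irrelevant word still drags the score down sharply, but the relevant words continue to count for something.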
There are other factors that affect topic modelling, but I've covered the basic principles of how one form of topic modelling can be achieved.
