Data mining for the taboo: searching for what isn't there

 “there are […] unknown unknowns
– the ones we don’t know
we don’t know”
(not originally Donald Rumsfeld)

“the main dangers lie in the ‘unknown knowns’
—the disavowed beliefs […]
we pretend not to know about”
(Slavoj Žižek

Most search engines try to find text that exists, but what about searching for text that doesn’t?

Lest someone hands me a random text generator, I should qualify that question – what about searching for text that isn’t written but perhaps should be?

Why would you want such a thing?  To educate yourself or others perhaps.  To help find the “unknown unknowns” and “known unknowns” quoted above. To escape the social media echo chamber.  And so on (Paul Graham’s essay on what you can’t say seems relevant here).  Even new scientific paradigms could be characterised as being on a journey through the murky fields of unknown-known [1].

There is of course, a lot of work that tries to address this aim by comparing the coverage of different media sources.  But I’d like to demonstrate something more radical.  Can we algorithmically analyze a single source and find topics it unnaturally avoids covering? What I’ve documented here is an attempt to do this.  Call it a first draft, and feel free to improve upon it.  The code is on github.

The example shown here was an example result from processing the Reuters-21578 corpus, a collection of about 19,000 news bulletins from the 1990s.  The meaning of all the red and blue figures is explained in more detail below, along with the algorithm.  The short story is that the algorithm thinks that when Cuba is discussed, then ministers should be discussed as well.  In reality that doesn’t happen, so the algorithm flags it up.

Why did the algorithm make that prediction in the first place?  Because there are lots of words which link Cuba to ministers.  Some of these are so general as to be meaningless (“here”, “there”, “next”) but plenty aren’t, and that’s how probabilistic approaches work:

And why doesn’t it play out in practice?  I don’t know.  Maybe chance.  Or could it be that, back in the 1990s, US media wanted to avoid words associated with democracy when discussing the Cuban government?  Whatever your own political persuasion, I’m sure you’d agree that US-Cuba relations were pretty stony at the time. I’d like to think that the algorithm flagged up this case because that was reflected in media coverage.  That would be quite exciting.

The rest of this article describes how the algorithm works, and provides a whole load more results for you to explore at leisure.


Suppose we have a large corpus of text, divided into articles, such as a bunch of news reports, or Wikipedia pages.  If an article is to be selected at random, every word in the corpus has a certain probability of appearing in the selected article.

But suppose we know the article already contains a specific word, such as “trade”.  Now, the probability for all other words appearing in the article changes.  We might expect to see words like “dollars”, “balance” and “export” appear more often than usual in articles that contain “trade”.

So, based on the corpus we have, we can compute conditional probabilities for all pairs of words.  These are the blue figures shown above (though some are predicted not measured – I’ll come back to that).  I’ll notate these as P(i|j), which translates as “probability of i appearing in the article, given that j appears in the article”.  Computing all such probabilities results in a matrix of size (n x n) where n is the number of unique words in the corpus [2].

Now imagine for a moment that we don’t know P(i|j) for a certain pair of words.  Can we predict it from the other information we have?  Maybe.  For each other word k, we know the probability P(k|j) that k appears if j appears, and the probability P(i|k) that i appears if k appears.  You could try multiplying and adding these for every possible other word k to get a guess at the probability which I will call P’:

P'(i|j) = Σ_k P(i|k) P(k|j)

But wait!  We can’t sum probabilities like that, because it’s possible that more than one of the other words could appear at once.  How can we solve that problem?  Remembering a hack you were probably taught at school, that “at least once” means the same as “not zero times”, let’s work out the probability of i appearing with j through association with at least one intermediate word k:

1 – P'(i|j) = Σ_k 1-P(i|k) P(k|j)

This formula still isn’t right, because it assumes that the other words k occur independently of one another, which isn’t true either.  Hold that thought, I’ll come back to it later.  First let’s take a step back, remember that P’ was only meant to be a naive guess in the first place, then ask ourselves if it’s a good enough guess.

How do we test whether one variable can accurately predict another?  With a regression model.  But let’s not regress P against P’, because conditional probabilities aren’t actually very interesting.  The biggest values of P'(i|j) will be for common words that appear everywhere. Instead, let’s convert probabilities P and P’ to lifts L and L’.  L(i|j) means the factor by which j increases the probability of i, and is computed as P(i|j)/P(i).  These are the red figures given above – again, as you now understand, both predicted and actual.

Also for the sake of simplicity, let’s ignore all other words k apart from the 10 that are most likely to link j to i.  In other words, the 10 largest values of P(i|k) P(k|j).  This may seem odd, but as we go on it will make the results more comprehensible to us.

Bearing in mind that a lot of words don’t appear together at all – that is, our matrix of P(i|j) contains a lot of zeros – the most appropriate regression form might be a hurdle model. That means two models: one to predict whether a word pair will appear at all, and a second model to predict how often it appears, for the times when it does appear.

Testing these models on the Reuters dataset, the first model doesn’t work well (discrimination is actually negative – I have no explanation for this!).  The second model works alright, however: the correlation ρ between predicted and actual lift, given that the words do appear together in the first place, is 0.60.

So, it turns out that L’ is actually a reasonably good predictor of L.  But how does this help us data mine for the taboo, or “what isn’t being said”?  Well, let’s define “taboo” as a combination of words that the algorithm logically thinks should appear often, but doesn’t appear often in practice [3].  These combinations are outliers in the regression model, so let’s look at the biggest outliers.  To help us learn why they are outliers, let’s also look at words the algorithm used to link them.  (This is why restricting the algorithm to 10 links is useful).

Naive algorithm results

There are a bunch of ways we could define “biggest outliers”, so I’m going to use the following two:

(In any of the above results, click on individual lines to show the links between word pairs – i.e. reasons for predicted lift).

The results are tantalizing, but I don’t think I have sufficient knowledge of finance, nor business, to make sense of them.  Maybe you can!  I will note that a lot of “underoccurring” outliers are for word pairs that only occur once in practice: the results might be more intuitive if we insisted on a higher threshold, i.e. only considered word pairs that appear several times.  But then again, if we were data mining for what isn’t said, aren’t the results likely to be unintuitive?


How can this be improved on?  Firstly, the maths.  Earlier I mentioned some issues with the assumption of independence of k, and said “hold that thought”.  Ok, I actually wrote an extensive footnote about that, see [4] below.

Secondly, rather than looking for links between words, we could use other software to extract concepts from the text and find links between concepts rather than words. Intuitively that would filter out some noise from the results, as some results are obviously due to a confusion of words and concepts (“Puerto” and “Rico” appearing together more than expected, for example).  You would, however, have to be careful how the concept extracting process works: if “concepts” are also defined through a matrix of conditional probability such as this one, then some odd interactions between concept extraction and taboo mining may happen.

Third, you could apply this to a different corpus of text.  Funnily enough, I already have … on the entire English language version of Wikipedia.  I’m pleased to report that the first part of the hurdle model works considerably better there, and is able to predict whether word pairs appear at all with a discrimination of about 22%.  For pairs which do appear the correlation ρ is 0.53.  As to the results, I can’t understand them. Maybe it would help to split apart the fictional and factual content then have another go.

Finally, the obvious criticism here is that looking at the results is a bit like reading a horoscope, or worse, a Rorschach test!  Nebulous concepts are vaguely suggested, and how the reader interprets them reveals more about the reader than the text.  How can you prove this algorithm is any better than that?  The fact that there is significant correlation between predicted occurrence of word pairs, and their actual occurrence, is certainly encouraging.  Ultimately to prove the algorithm’s worth, though, a blind test is needed in which people read either algorithm output or randomly selected word pairs, and attempt to infer conclusions from them.

I wish I had more time for this, but alas I don’t, so having read this far – are you up for taking over the project?  Unless you’re a cultural genius, you’ll probably think that the algorithm needs improving a bit to become useful, but if you can make the right tweaks, it would be very exciting.

If you want to discuss it with me please do, otherwise, feel free to dive into the source and get on with it.  If not, well, thank you for reading this far nonetheless!


[1] Please accept my apologies for a gross oversimplification of Kuhn.

[2] Actually we discard very frequent words (appearing in more than 12% of articles) because they don’t tend to relate to specific topics.  We also discard very infrequent words (appearing fewer than 5 times overall) because they’re a sort of noise in the data.  And to keep computation manageable, we trim the remaining word list down to the 20,000 most frequent.

[3] As we are using hurdle models we restrict ourselves to combinations that do appear somewhere, but don’t appear as often as we think they should.

[4] The assumption of all other words k occurring independently is (I think) more formally stated as x,y {x,y}{i,j}=  P(x|y)=P(x).

How can we fix this? An exhaustive calculation would need to sum the probability of every combination of other words k appearing.  There are 2^(n-2) such combinations so clearly this isn’t possible, but I imagine you could sample the space of combinations of k at random to get an estimate. This means computing the probability of each P(k1  k2  k3  …|j) and each P(i | k1  k2  k3  …).  If you tried to measure these directly from the data, i.e. ditch the conditional probability matrix and see whether the combination actually exists, I imagine you would overfit the model, in the sense that most multi-word combinations would probably appear in either 0 or 1 articles.  This means you would make great predictions but have no residuals from which to infer what is taboo.

I actually tried an alternative approach using a greedy algorithm.  It works like this, for a maximum number of linking concepts m:

  1. Initialize prediction P'(i|j) to 0.
  2. Make a temporary copy of P(k|j) for all k, let’s call this Q(k|j).
  3. Repeat m times:
    1. Find the most probable word w to link j to i, i.e. pick w to maximize
      x = P(i|w) Q(w|j).
    2. Add x to our prediction P'(i|j).
    3. Modify Q(k|j) to make it conditional on w not occurring, in other words set Q(k|j) = (1-x) Q(k|j) for all k.
  4. P'(i|j) now contains our prediction.

This is identical to the naive algorithm discussed above for m=1, and I think (though haven’t proven!) that it’s mathematically correct for m=2.  As m gets higher, we over-correct for higher order terms e.g. the probability that the first two choices of w occur together.  Still, it’s an estimate, not a precise calculation: think of words as appearing in groups, and if one word is used to predict a link from i to j, then all words it occurs with get scored down to make their reuse as a predictor of an ij link less likely.  The crucial question, of course, is how well the greedy algorithm for estimating P’ performs overall?  I tried on the Reuters corpus for for m=2, 3, 10 and 20; in all cases performance is similar to the naive algorithm, sometimes worse, but for m=10 it is better at ρ=0.62.

Here are the greedy algorithm results, for the record.


Additional Assets

Showing 3 Reviews

  • Placeholder
    Subash Thota
    Originality of work
    Quality of writing
    Quality of figures
    Confidence in paper

    The title and introduction are appropriate for the content of the text. Furthermore, the article is well constructed and analysis was well performed. The article is well written and should be of great interest to the readers. Overall, it is an important study, Good Work !.

  • Placeholder
    Tom Stedall

    This seems like a very interesting method. I guess there will be outliers for different reasons? Some will be conceptual i.e. we would rightly expect that in the news some things are usually mentioned, or not usually mentioned, when certain things are under discussion. There will also presumably be a lot of outliers for other reasons too - idiomatic, linguistic, grammatical etc. How does the prediction algorithm (which is of course a choice of method) cope with these?

    On thing you could do therefore is look at the outliers for general nouns and verbs and non-conceptual words, perhaps those constituting the relatively small set of most commonly used words. Seeing what sort of outliers you got with these might give you a way to 'filter' the results, although you'd of course risk losing interesting stuff.

    More interestingly though, could you test it against known cases? for example frequently report on disproportionately small numbers of articles mentioning things whose patent relevance to each other seems inadmissible by corporate media. And this is
    just one of many possible test cases of course, and not necessarily the easiest.

    The results of this approach seem to be incredibly complex, and are certainly a little suggestive, and I think that both finding ways to simplify them and finding ways to test them against known cases would be good steps towards validating it. Nice work!

  • Placeholder
    AnonymousArmchair Critic

    Kuhn himself already grossly over simplified his subject matter - no apology for summarising Kuhn is required! 

    Just had a scan read of a print out while having a sit down ;-)

    Seems very topical with the BBC not using the word `refugee' until a week or so ago...

    be careful until I'd tested it a lot.  Is there a log of data from
    known censorship incidents?  E.g. about the Ryan Giggs Super Injunction
    (excluding twitter) from before the damn burst on that one?  What about
    that other Police shooting of a muslim family in London around the time
    of the Forest Gate one?  I forget the name, sorry.

    Luxembourg results are fun lol - is there a legit reason they're
    excluded from Moodys and S&Ps rating of their national debts though,
    e.g. they're below a threshold?  What's that Eduardo pcgg and Eduardo
    soriano?  I'm looking at you, Latin America!

    In general there's
    so much noise there it's hard to pickout a `smoking gun'.  Neat tool
    though, and nice way of showing the output.

    What happens if a lot of `original source material' is actually just copypasta from press releases - double counting, no?

    an uplift of a small probability always going to look artificially high
    - be wary about dividing by small numbers.  I suppose the field of
    interest is for a word that is used quite a lot but censored only in
    specific instances of colocation with others

    In the
    sciences, words tend to have one and one only, precisely defined
    meaning.  I'd be sure that you're not predicting combinations that are
    obviously nonsense, or you need an expert to rule out all the
    contradictory combinations, not just "quantitative hermeneutics".  Maybe
    rule that out as irrelevant anyway as above, idk.

    Does natural language and culture actually follow any sort of distribution in the first place?

    It crossed my mind that you could easily deliberately automate some censorship yourself to test this too.

    a tangent as well, could you use it to solve cryptic crosswords? 
    Setters sometimes deliberately use clues with answers outside the range
    of the thesaurus, btu which still have meaning to the person completeing
    the crossword.  Of course, the setters would then become aware of the
    tool just the same as google, and you'd provoke an other crossword arms
    race, and schisms like the current ones between `no dictionary', `no
    thesaurus' and `no google' people.

    This review has 2 comments. Click to view.


This article and its reviews are distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and redistribution in any medium, provided that the original author and source are credited.