(excerpt from my new book, question #65)
News clustering is a hard problem. Articles typically arrive at the clustering engine as a continuous stream, so a plain-vanilla batch approach is not feasible. For instance, straightforward k-means cannot work, for two reasons. First, the number of clusters cannot be known a priori, because topics evolve dynamically. Second, the articles themselves are not available a priori. More sophisticated strategies are therefore required.
One initial idea is to split the data into mini-batches (perhaps processed with Spark Streaming) and to cluster the content of each mini-batch independently. Clusters from different epochs (i.e. mini-batches) can then be chained together.
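As a minimal sketch of the chaining step, each epoch's clusters can be linked to the most similar cluster of the previous epoch, here by Jaccard overlap of their vocabularies. The function names, the threshold, and the toy data are all illustrative assumptions, not a specific library API:

```python
# Sketch: chain clusters across mini-batch epochs by vocabulary overlap.
# `epoch_clusters` is a list of epochs; each epoch is a list of clusters,
# and each cluster is represented by the set of words in its articles.

def jaccard(a, b):
    """Jaccard similarity between two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def chain_epochs(epoch_clusters, threshold=0.3):
    """Link each cluster to its best match in the previous epoch (or None)."""
    chains = []  # tuples: (epoch, cluster_idx, prev_cluster_idx or None)
    for t, clusters in enumerate(epoch_clusters):
        for i, cluster in enumerate(clusters):
            best, best_sim = None, threshold
            if t > 0:
                for j, prev in enumerate(epoch_clusters[t - 1]):
                    sim = jaccard(cluster, prev)
                    if sim > best_sim:
                        best, best_sim = j, sim
            chains.append((t, i, best))
    return chains

epochs = [
    [{"election", "vote", "poll"}, {"storm", "rain"}],    # epoch 0
    [{"election", "vote", "result"}, {"match", "goal"}],  # epoch 1
]
print(chain_epochs(epochs))
```

Here the first cluster of epoch 1 is chained to the "election" cluster of epoch 0, while the sports cluster starts a new chain.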
An additional intuition is to start with k seeds and to extend those initial k clusters whenever a new article arrives that is not similar enough to any existing group. In this way, new clusters are created dynamically as needed. As a further variant, we could re-cluster all the articles after a number of epochs, under the assumption that this will improve our target metric.
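The threshold-based extension can be sketched as follows, using bag-of-words vectors and cosine similarity; the class name, the threshold value, and the running-centroid update are illustrative assumptions:

```python
from collections import Counter
from math import sqrt

# Sketch of threshold-based incremental clustering: an incoming article
# joins the closest existing cluster, or opens a new cluster when nothing
# is similar enough.

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class StreamClusterer:
    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.centroids = []  # one word-count Counter per cluster

    def add(self, text: str) -> int:
        """Assign an article to a cluster, creating a new one if needed."""
        vec = Counter(text.lower().split())
        best, best_sim = -1, self.threshold
        for i, c in enumerate(self.centroids):
            sim = cosine(vec, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best == -1:
            self.centroids.append(vec)   # new topic: open a new cluster
            return len(self.centroids) - 1
        self.centroids[best] += vec      # fold article into running centroid
        return best

clusterer = StreamClusterer(threshold=0.3)
print(clusterer.add("election results announced tonight"))
print(clusterer.add("election vote results are in"))
print(clusterer.add("heavy storm hits the coast"))
```

The first two articles land in cluster 0; the storm article is below the threshold against that centroid and opens cluster 1. The periodic re-clustering variant would simply replay all stored articles through a fresh instance.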
In addition to that, we can look at the data and perhaps notice that many articles are near-duplicates. Hence, we could reduce the computational complexity by applying pseudo-linear techniques such as MinHash over shingles.
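A minimal MinHash sketch over word-level shingles might look like this; the signature length and the salted-hash construction are illustrative choices, and a production system would add locality-sensitive hashing (LSH) banding on top to avoid all-pairs comparisons:

```python
import hashlib

def shingles(text, k=3):
    """Set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per salted hash function; assumes a non-empty set."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots estimates the true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the prime minister announced a new economic plan today"
b = "the prime minister announced a new economic plan this morning"
c = "local team wins the championship after dramatic final"
sa, sb, sc = (minhash_signature(shingles(t)) for t in (a, b, c))
print(estimated_jaccard(sa, sb))  # high: near-duplicates
print(estimated_jaccard(sa, sc))  # low: unrelated
```

Comparing fixed-size signatures instead of full shingle sets is what makes the dedup step pseudo-linear in practice.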
More sophisticated methods aim at ranking the articles by importance. This is an even harder problem, again because of the dynamic nature of the content and because of the absence of links, which would otherwise allow PageRank-style computations. When links are unavailable, a two-layer model can be considered, where the importance of a news article depends on the importance of the originating news source, which in turn depends on the importance of the articles it emits. It can be proven that this recursive definition has a fixed-point solution.
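One way to realize the two-layer idea is a fixed-point iteration in the spirit of HITS, with a PageRank-style damping term so the scores do not collapse onto a single source. The function, the damping value, and the toy data below are all illustrative assumptions:

```python
# Two-layer importance: a source's score is the (normalized) sum of its
# articles' scores, and an article inherits its source's score, mixed with
# a small uniform term (damping) to keep the fixed point non-degenerate.

def two_layer_scores(articles_by_source, iterations=50, damping=0.85):
    sources = list(articles_by_source)
    n_articles = sum(len(v) for v in articles_by_source.values())
    source_score = {s: 1.0 for s in sources}
    article_score = {a: 1.0 for s in sources for a in articles_by_source[s]}
    for _ in range(iterations):
        # Layer 1: source importance from its articles, then normalize.
        for s in sources:
            source_score[s] = sum(article_score[a] for a in articles_by_source[s])
        norm = sum(source_score.values())
        for s in sources:
            source_score[s] /= norm
        # Layer 2: article importance inherited from its source.
        for s in sources:
            for a in articles_by_source[s]:
                article_score[a] = (damping * source_score[s]
                                    + (1 - damping) / n_articles)
    return source_score, article_score

data = {
    "agency_a": ["a1", "a2", "a3"],  # hypothetical source with 3 articles
    "blog_b": ["b1"],                # hypothetical source with 1 article
}
src, art = two_layer_scores(data)
print(src)
```

The iteration converges to a stable score vector in which the more prolific source dominates, illustrating the mutual reinforcement between the two layers.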
Even more sophisticated methods aim at extracting entities from the clusters, typically by running topic-model detection on the evolving mini-batches.
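A full topic model such as LDA would normally be run per mini-batch; as a dependency-free stand-in, the sketch below surfaces each cluster's most distinctive terms by comparing in-cluster frequency against corpus frequency (a simple lift score), which is a deliberately simplified substitute for real topic modeling:

```python
from collections import Counter

def distinctive_terms(clusters, top_n=2):
    """Per cluster, the terms most over-represented vs. the whole corpus."""
    corpus = Counter(w for docs in clusters for d in docs for w in d.lower().split())
    total = sum(corpus.values())
    result = []
    for docs in clusters:
        local = Counter(w for d in docs for w in d.lower().split())
        local_total = sum(local.values())
        # Lift: in-cluster rate divided by corpus rate.
        scored = {w: (c / local_total) / (corpus[w] / total)
                  for w, c in local.items()}
        top = sorted(scored.items(), key=lambda kv: -kv[1])[:top_n]
        result.append([w for w, _ in top])
    return result

clusters = [
    ["election results tonight", "vote count election"],
    ["storm warning coast", "storm rain flooding"],
]
print(distinctive_terms(clusters))
```

The extracted terms ("election", "storm") can then serve as cheap entity labels for the evolving clusters.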