Lately, I have been interested in Natural Language Processing (NLP) techniques and Topic Modeling. Topic Modeling identifies recurrent themes (topics) in a collection of documents (corpus). Consider, for instance, a set of newspaper articles. In tech articles, words that may occur more often than others could be: cloud, system, network, streaming. Likewise, in sports articles: goal, score, points, players, team, coach. This suggests that it is possible to take a corpus of news articles and sort it into topics: sport, technology, media, fashion, current affairs, etc.

Among the different methods available for topic modeling is Latent Dirichlet Allocation (LDA). In the context of machine learning, LDA was proposed by Blei et al. (2003). As far as I understand, it is a probabilistic, unsupervised algorithm that assumes that each document is a mixture of topics. However, bear in mind that LDA neither tells you how many topics a corpus contains nor names them. You have to provide, as an input, the number of topics you think there are in the corpus. LDA will return unnamed topics in the form of lists of weighted words, and from those, you will have to come up with their names.
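To make the inputs and outputs concrete, here is a minimal sketch of LDA fitted with a toy collapsed Gibbs sampler, written in plain Python. It is not the variational method from the original paper, and everything in it (the `fit_lda` function, the `alpha` and `beta` values, the four tiny "articles") is an assumption made up for illustration. A real project would more likely rely on an established library.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs is a list of documents, each a list of words.
    Returns (topics, mixtures): one word->weight dict per topic, and one
    list of topic proportions per document.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables updated during sampling.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by assigning every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample a topic given all the other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Turn the smoothed counts into probabilities.
    topics = [
        {
            vocab[v]: (topic_word[t][v] + beta) / (topic_total[t] + vocab_size * beta)
            for v in range(vocab_size)
        }
        for t in range(num_topics)
    ]
    mixtures = [
        [
            (doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
            for t in range(num_topics)
        ]
        for d, doc in enumerate(docs)
    ]
    return topics, mixtures


# Four made-up "articles" built from the example words above.
docs = [
    "cloud system network streaming cloud network".split(),
    "goal score players team coach goal".split(),
    "network cloud streaming system".split(),
    "team coach score points players".split(),
]

# The number of topics is an input: we have to guess it ourselves.
topics, mixtures = fit_lda(docs, num_topics=2)

for t, weights in enumerate(topics):
    top_words = sorted(weights, key=weights.get, reverse=True)[:3]
    print(f"Topic {t}: {top_words}")
```

Note that the output is just "Topic 0" and "Topic 1" with their heaviest words. Deciding that one of them means "technology" and the other "sport" is left to the reader.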

Now, how can we use the LDA algorithm? One of my hobbies is the collectible card game Magic: The Gathering (MTG). I started to play around 2000, and after a long break, I got back into it in 2018. It is a fascinating game, and I believe that there are a lot of interesting things to write about it in terms of data analysis. I also remember an interesting article from 2017 in which the author used LDA to establish deck archetypes.

In MTG, an archetype corresponds to the strategy used to win with a particular deck. For instance, Control decks are all about disrupting and stalling your opponent’s actions: countering spells, discarding cards from their hand, destroying or exiling their creatures, etc. Moreover, not all control decks are the same. You will find Mono-Blue Control decks, Blue-White (UW) Control decks, Mono-Black Control decks, and so on. You could call them sub-archetypes. Given the longevity of the game (28 years), its card pool (20000+), and its different formats (officially 7), you can probably guess that there are quite a lot of them. When reading a decklist, experienced players can tell which archetype it corresponds to based on just a few cards. For instance, one of my favorite archetypes, Tron, can be described roughly with four cards: Karn Liberated, Urza’s Mine, Urza’s Tower and Urza’s Power Plant. Therefore, if I see a decklist with these cards in it, it is safe to assume that it is a Tron deck.
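The "few signature cards" intuition can be written down as a tiny rule-based check. This is only a sketch of the manual approach an experienced player applies in their head, not of LDA itself. The `signature_overlap` helper and both partial decklists are made up for illustration, and the card quantities are assumptions rather than real lists.

```python
# The four cards named above as the rough signature of a Tron deck.
TRON_SIGNATURE = {
    "Karn Liberated",
    "Urza's Mine",
    "Urza's Tower",
    "Urza's Power Plant",
}


def signature_overlap(decklist, signature):
    """Return the fraction of signature cards that appear in the decklist.

    decklist maps card names to the number of copies played.
    """
    cards = set(decklist)
    return len(cards & signature) / len(signature)


# Hypothetical partial decklists (card name -> copies), for illustration only.
tron_like_deck = {
    "Karn Liberated": 4,
    "Urza's Mine": 4,
    "Urza's Tower": 4,
    "Urza's Power Plant": 4,
    "Forest": 5,
}
other_deck = {"Island": 20, "Counterspell": 4}

for name, deck in [("tron_like_deck", tron_like_deck), ("other_deck", other_deck)]:
    score = signature_overlap(deck, TRON_SIGNATURE)
    print(f"{name}: {score:.0%} of the Tron signature cards")
```

The catch is that someone has to write one signature per archetype by hand, which does not scale to every format and sub-archetype. That is precisely the part a topic model could learn from the data instead.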

With the paragraphs above, we have set the scene for the upcoming posts. We will use topic modeling to find archetypes in a dataset containing thousands of MTG decklists. We have a corpus of documents (the decklists), topics (archetypes), a vocabulary of words (card names), and a model (LDA). Even better, we have a starting point with the 2017 article! Also, note that since LDA returns unnamed topics, I will certainly have to put my MTG knowledge to the test to see whether they make any sense. What’s better than spending time looking at cards ((“Q(´▽`。)?