In the latest instalment of my Data, Information and Technology Applications (DITA) module we considered text as data and looked at some of the tools that can be used to draw meaning from this information and to visualise it.
There was one particular method of text data visualisation we discussed that I already had some experience of: word clouds.
In my previous life I worked in marketing, mainly at international law firms, and for a pitch document for prospective clients I wanted to do something that could give a brief overview of the firm’s practice areas and strengths in a visually pleasing way. It being 2014, a word cloud seemed appropriate.
To do this I collated all the peer, journal and client comments about the firm, uploaded them into a word cloud generator (I believe I used wordclouds.com), and then formatted the result to match the company logo and colours. The text data needed a bit of tidying up before it could be used: I had to define extra stop words to ensure that the words that made it into the word cloud were the most descriptive ones. Stop words are generally words like ‘the’, ‘and’ and ‘I’ that you wouldn’t necessarily want included, but I widened this to include some frequently used terms that aren’t very distinctive or descriptive, like ‘practice’. Furthermore, as names often consist of two words, I had to make sure they were treated as one. For example, the firm did a lot of work in Hong Kong, and I didn’t want the words Hong and Kong to appear separately on the word cloud.
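For anyone curious what that tidying up looks like in practice, here is a minimal Python sketch of the same idea. It is not what wordclouds.com does behind the scenes, and the stop word lists, the phrase list and the function name `word_frequencies` are all made up for illustration. It joins multi-word names with an underscore so they count as one token, then drops the stop words before counting.

```python
import re
from collections import Counter

# Hypothetical lists for illustration; a real stop word list would be much longer.
BASE_STOP_WORDS = {"the", "and", "i", "a", "in", "of", "is", "for"}
EXTRA_STOP_WORDS = {"practice", "firm"}  # frequent but not very descriptive
PHRASES = ["Hong Kong"]                  # names that must stay together


def word_frequencies(text, phrases=PHRASES,
                     stop_words=BASE_STOP_WORDS | EXTRA_STOP_WORDS):
    """Count words in text, keeping phrases whole and skipping stop words."""
    text = text.lower()
    for phrase in phrases:
        # Replace "hong kong" with "hong_kong" so it becomes a single token.
        joined = phrase.lower().replace(" ", "_")
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        text = re.sub(pattern, joined, text)
    # Tokens are runs of letters and underscores; punctuation is discarded.
    tokens = re.findall(r"[a-z_]+", text)
    return Counter(token for token in tokens if token not in stop_words)


sample = ("The firm is strong in Hong Kong. "
          "The Hong Kong practice is responsive and commercial.")
freqs = word_frequencies(sample)
print(freqs.most_common())
```

With the made-up sample sentence, ‘hong_kong’ is counted twice as a single term, and ‘practice’ and ‘firm’ never reach the counts at all.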
I can’t share a picture of the word cloud here for several reasons, but it was a success (as much as a word cloud can be). Although it could never be considered a comprehensive description or analysis, it acted as a thematic summary of the firm’s key practices, people and strengths.
However, despite word clouds being engaging and easy to make (even I could do it), they have several limitations as a form of text data visualisation. They are often not clear or accurate representations of the text they are trying to summarise. As Hearst (2019) notes, “they are biased towards making you notice words for which there are few alternatives”. Furthermore, because word clouds signify how frequently a word is used by its font size, longer words can sometimes appear to be larger than shorter ones, even if they were used less frequently.
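The point about word length can be shown with some rough arithmetic. The sketch below assumes a simple linear mapping from word count to font size and a crude estimate of the space a word takes up (character count times font size for the width, font size for the height). Both are my own assumptions for illustration, as are the counts; real word cloud generators use their own scaling and real font metrics.

```python
def font_size(count, max_count, min_size=12, max_size=60):
    """Assumed scheme: font size grows linearly with how often a word is used."""
    return min_size + (max_size - min_size) * count / max_count


def approx_area(word, size, char_width_ratio=0.6):
    """Crude estimate of the space a word occupies at a given font size."""
    width = len(word) * char_width_ratio * size
    height = size
    return width * height


# Made-up counts: the short word is used more often than the long one.
counts = {"tax": 10, "international": 8}
max_count = max(counts.values())

sizes = {word: font_size(count, max_count) for word, count in counts.items()}
areas = {word: approx_area(word, sizes[word]) for word in counts}

for word in counts:
    print(word, counts[word], round(sizes[word], 1), round(areas[word]))
```

Under these assumptions ‘tax’ gets the bigger font, but ‘international’ still covers roughly three times the area, so it can easily be the word your eye lands on first.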
As well as this, they are static and can also be overly simplistic or uninteresting. As Temple (2019) notes, “You typically get a mixture of obvious words and common words”, a point they highlight with the example that it isn’t surprising to learn that the word ‘Harry’ appears a lot in the Harry Potter books.
Word clouds have their place: they are for simple, thematic, visual summaries of text. For actual meaning or analysis to be drawn out, further steps have to be taken, either by improving the format or by using different tools. Hearst (2019) offers some ways of improving word clouds that would allow more meaning to be taken from them, including organising the words within the clouds into coherent groups and visually subdividing them. Alternatively, there are tools such as Voyant Tools that incorporate word clouds but are more interactive, and that provide further analysis dashboards for analysing text data and trends in word and phrase use across one or multiple sources.
For a bit of fun I knocked up a word cloud from the blog posts published so far by my classmates on the DITA module, which I think illustrates the benefits and limitations of word clouds nicely.
Hearst, M. (2019) Word Clouds: We Can’t Make Them Go Away, So Let’s Improve Them, Available at https://medium.com/multiple-views-visualization-research-explained/improving-word-clouds-9d4a04b0722b (Accessed: 25 November 2020).
Temple, S. (2019) Word Clouds Are Lame, Available at https://towardsdatascience.com/word-clouds-are-lame-263d9cbc49b7 (Accessed: 25 November 2020).