Word cloud generator in R : One killer function to do everything you need


As you may know, a word cloud (or tag cloud) is a text mining method to find the most frequently used words in a text. The procedure to generate a word cloud using R software has been described in my previous post.

The goal of this tutorial is to provide a simple word cloud generator function in the R programming language. This function can be used to create a word cloud from different sources, including :

  • Plain text
  • A text file
  • A URL of a web page

[Figure: word cloud of Martin Luther King's “I have a dream” speech]

Creating a word cloud requires at least five main text-mining steps (described in my previous post). All these steps can be performed with one line of R code using the rquery.wordcloud() function described in the next section.
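For readers who want to see what such a pipeline can look like when written out by hand, here is a minimal sketch using the tm and wordcloud packages. The breakdown into five steps, the sample sentences and the variable names are assumptions made for illustration; this is not the body of rquery.wordcloud().

```r
library(tm)            # corpus handling and text cleaning
library(wordcloud)     # word cloud plotting
library(RColorBrewer)  # color palettes

# Illustrative input (an assumption): any character vector works
text <- c("I have a dream that one day this nation will rise up.",
          "I have a dream today!",
          "Let freedom ring from every hill and every mountain.")

# Step 1: load the text into a corpus
docs <- VCorpus(VectorSource(text))

# Step 2: clean the text
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

# Step 3: build the term-document matrix
tdm <- TermDocumentMatrix(docs)

# Step 4: build the frequency table, most frequent words first
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
freqTable <- data.frame(word = names(v), freq = v)

# Step 5: draw the word cloud
set.seed(1234)  # makes the word placement reproducible
wordcloud(words = freqTable$word, freq = freqTable$freq,
          min.freq = 1, max.words = 200, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```

The min.freq value is set to 1 here only because the sample text is tiny; on a real document a higher threshold usually gives a more readable plot.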

R tag cloud generator function : rquery.wordcloud


Usage

The format of rquery.wordcloud() function is shown below :

rquery.wordcloud(x, type=c("text", "url", "file"), 
        lang="english", excludeWords = NULL, 
        textStemming = FALSE,  colorPalette="Dark2",
        max.words=200)

  • x : character string (plain text, web URL, txt file path)
  • type : specify whether x is a plain text, a web page URL or a .txt file path
  • lang : the language of the text. It must be specified so that the common stopwords (like ‘the’, ‘we’, ‘is’, ‘are’) can be removed from the text before further analysis. Supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish.
  • excludeWords : a vector containing your own stopwords to be eliminated from the text, e.g. c("word1", "word2")
  • textStemming : reduces words to their root form. Default value is FALSE. For example, stemming reduces the words “moving” and “moved” to the root word “move”.
  • colorPalette : Possible values are :
    • a color name (e.g. colorPalette = "blue")
    • a color code
    • a name of a color palette taken from the RColorBrewer package (e.g. colorPalette = "Dark2")
  • min.freq : words with frequency below min.freq will not be plotted
  • max.words : maximum number of words to be plotted; the least frequent terms are dropped
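To make the lang, excludeWords and textStemming arguments more concrete, the sketch below applies the corresponding operations by hand with tm and SnowballC. The sentence and the excluded word are made up for illustration, and the assumption is that the function relies on these or similar calls internally.

```r
library(tm)         # stopwords(), removeWords()
library(SnowballC)  # wordStem()

# Made-up sentence used only for this illustration
sentence <- "we are moving and we moved because the movers are here"

# lang = "english": remove the built-in English stopwords
no_stop <- removeWords(sentence, stopwords("english"))

# excludeWords: remove user-supplied words as well (hypothetical choice)
no_custom <- removeWords(no_stop, c("movers"))

# textStemming = TRUE: reduce each remaining word to its stem
tokens <- strsplit(trimws(gsub("\\s+", " ", no_custom)), " ")[[1]]
stems <- wordStem(tokens, language = "english")
stems
```

Stemming merges different forms of the same word, which concentrates their counts on a single term in the cloud; the price is that the plotted stems are not always dictionary words.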



Note that the rquery.wordcloud() function returns a list containing two objects :

  • tdm : term-document matrix which can be explored as illustrated in the next sections
  • freqTable : frequency table of words


Required R packages

The following packages are required for the rquery.wordcloud() function :

  • tm for text mining
  • SnowballC for text stemming
  • wordcloud for generating word cloud images
  • RCurl and XML packages to download and parse web pages
  • RColorBrewer for color palettes

Install these packages before using the rquery.wordcloud() function, as follows :

install.packages(c("tm", "SnowballC", "wordcloud", "RColorBrewer", "RCurl", "XML"))

Create a word cloud from a plain text file

A plain text file can easily be created using your favorite text editor (e.g : Word). The “I have a dream” speech (from Martin Luther King) is processed in the following example, but you can use any text you want :

  • Copy and paste the text into a plain text file
  • Save the file (e.g : ml.txt)

Then generate the word cloud using the R code below :

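source('http://www.sthda.com/upload/rquery_wordcloud.r')
# Path to the text file saved above (ml.txt is the example name used in this tutorial)
filePath <- "ml.txt"
res <- rquery.wordcloud(filePath, type = "file", lang = "english")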

[Figure: word cloud of the “I have a dream” speech]

Change the arguments max.words and min.freq to plot more words :

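  • max.words : maximum number of words to be plotted
  • min.freq : words with frequency below min.freq will not be plotted

# min.freq = 1 and max.words = 200 are illustrative values
res <- rquery.wordcloud(filePath, type = "file", lang = "english", min.freq = 1, max.words = 200)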

[Figure: word cloud of the “I have a dream” speech with more words plotted]

The above image shows that “will”, “freedom”, “dream”, “day” and “together” are among the most frequent words in Martin Luther King's “I have a dream” speech.
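The interaction between min.freq and max.words can be illustrated on a small frequency table. The sketch below uses the ten counts reported in the frequency table of this tutorial; select_words() is a hypothetical helper written for this illustration and only approximates what the plotting function does with the two thresholds.

```r
# Top-ten counts as reported in this tutorial's frequency table
freqTable <- data.frame(
  word = c("will", "freedom", "ring", "day", "dream",
           "let", "every", "able", "one", "together"),
  freq = c(17, 13, 12, 11, 11, 11, 9, 8, 8, 7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: which words survive the two thresholds?
select_words <- function(tab, min.freq, max.words) {
  tab <- tab[order(tab$freq, decreasing = TRUE), ]  # most frequent first
  tab <- tab[tab$freq >= min.freq, ]                # drop rare words
  head(tab, max.words)                              # cap the number of words
}

# A high min.freq keeps only the most frequent words
frequent <- select_words(freqTable, min.freq = 11, max.words = 200)$word

# A low max.words caps the cloud even when min.freq is permissive
top3 <- select_words(freqTable, min.freq = 1, max.words = 3)$word
```

Lowering min.freq and raising max.words both let more words through, which is why the two arguments are usually adjusted together.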

Change the color of the word cloud

The color of the word cloud can be changed using the argument colorPalette.

Allowed values for colorPalette :

  • a color name (e.g. colorPalette = "blue")
  • a color code
  • a name of a color palette taken from the RColorBrewer package (e.g. colorPalette = "Dark2")

The color palettes available in the RColorBrewer package are shown below :

[Figure: RColorBrewer palettes]
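The palette names can also be listed directly from an R session. The short sketch below uses only functions and objects exported by RColorBrewer; the variable names are arbitrary.

```r
library(RColorBrewer)

# Names, sizes and categories of all available palettes
head(brewer.pal.info)
palette_names <- rownames(brewer.pal.info)

# Hex color codes of a given palette
dark2 <- brewer.pal(8, "Dark2")
reds  <- brewer.pal(9, "Reds")

# Uncomment to draw a chart of every palette
# display.brewer.all()
```

Any name found in palette_names can be passed as a palette name to the colorPalette argument.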


# Reds color palette
res <- rquery.wordcloud(filePath, type = "file", lang = "english", colorPalette = "Reds")

[Figures: word clouds of the “I have a dream” speech drawn with different color palettes]

Operations on the result of rquery.wordcloud() function

As mentioned above, the result of rquery.wordcloud() is a list containing two objects :

  • tdm : term-document matrix
  • freqTable : frequency table

tdm <- res$tdm
freqTable <- res$freqTable

Frequency table of words


# Show the top10 words and their frequency
head(freqTable, 10)
             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7
# Bar plot of the frequency for the top10
barplot(freqTable[1:10,]$freq, las = 2, 
        names.arg = freqTable[1:10,]$word,
        col ="lightblue", main ="Most frequent words",
        ylab = "Word frequencies")

[Figure: bar plot of the ten most frequent words]

Operations on term-document matrix

The term-document matrix can be used to explore the frequent terms and their associations. The example below lists the words that occur at least four times :

findFreqTerms(tdm, lowfreq = 4)
 [1] "able"     "day"      "dream"    "every"    "faith"    "free"     "freedom"  "let"      "mountain" "nation"  
[11] "one"      "ring"     "shall"    "together" "will"    

You could also analyze the correlation (or association) between frequent terms. The R code below identifies which words are associated with “freedom” in the “I have a dream” speech :

findAssocs(tdm, terms = "freedom", corlimit = 0.3)
             freedom
let             0.89
ring            0.86
mississippi     0.34
mountainside    0.34
stone           0.34
every           0.32
mountain        0.32
state           0.32
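The association scores are correlations between the counts of two terms across the documents of the corpus. The base R sketch below reproduces that idea on a small matrix of made-up counts; it illustrates the principle and is not the internal code of findAssocs().

```r
# Made-up counts: rows are terms, columns are documents (e.g. paragraphs)
m <- rbind(
  freedom = c(2, 0, 1, 0),
  ring    = c(2, 0, 1, 0),
  dream   = c(0, 1, 0, 1),
  day     = c(1, 1, 0, 0)
)
colnames(m) <- paste0("doc", 1:4)

# Correlation of every other term with the target term
target <- "freedom"
others <- setdiff(rownames(m), target)
assoc <- sapply(others, function(term) cor(m[target, ], m[term, ]))

# Keep the terms at or above a correlation limit, strongest first
assoc <- sort(assoc, decreasing = TRUE)
assoc[assoc >= 0.3]
```

Terms that tend to appear in the same documents as the target get a score close to 1, while terms that appear in other documents get a low or negative score and are filtered out by the limit.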

Create a word cloud of a web page

In this section we’ll make a tag cloud of the following web page :


url = "http://www.sthda.com/english/wiki/create-and-format-powerpoint-documents-from-r-software"
rquery.wordcloud(x=url, type="url")

[Figure: word cloud of the web page]
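Before a web page can be analyzed, its HTML has to be turned into plain text. The sketch below shows one possible way to do this with the XML package, using a small inline HTML string so that it runs without a network connection. Keeping only the paragraph nodes is an assumption made for this illustration, not necessarily what rquery.wordcloud() does.

```r
library(XML)

# Inline HTML standing in for a downloaded page (made-up content).
# A real page could be fetched first, e.g. html <- RCurl::getURL(url)
html <- paste0(
  "<html><body><h1>Title</h1>",
  "<p>R makes word clouds.</p>",
  "<p>Word clouds are fun.</p>",
  "<script>var x = 1;</script>",
  "</body></html>"
)

# Parse the HTML and keep only the text of the paragraph nodes
doc <- htmlParse(html, asText = TRUE)
paragraphs <- xpathSApply(doc, "//p", xmlValue)
page_text <- paste(paragraphs, collapse = " ")
page_text
```

Selecting specific nodes keeps scripts and other markup out of the text, so they do not end up as words in the cloud.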


R code of rquery.wordcloud function

#++++++++++++++++++++++++++++++++++
# rquery.wordcloud() : Word cloud generator
# - http://www.sthda.com
#+++++++++++++++++++++++++++++++++++
# x : character string (plain text, web url, txt file path)
# type : specify whether x is a plain text, a web page url or a file path
# lang : the language of the text
# excludeWords : a vector of words to exclude from the text
# textStemming : reduces words to their root form
# colorPalette : the name of color palette taken from RColorBrewer package, 
  # or a color name, or a color code
# min.freq : words with frequency below min.freq will not be plotted
# max.words : Maximum number of words to be plotted. least frequent terms dropped
# value returned by the function : a list(tdm, freqTable)
rquery.wordcloud 
