As you may know, a word cloud (or tag cloud) is a text mining method to find the most frequently used words in a text. The procedure to generate a word cloud using R software has been described in my previous post.
The goal of this tutorial is to provide a simple word cloud generator function in the R programming language. This function can be used to create a word cloud from different sources, including :
- Plain text
- A .txt file
- A URL of a web page
Creating a word cloud requires at least five main text-mining steps (described in my previous post). All these steps can be performed with one line of R code using the rquery.wordcloud() function described in the next section.
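For readers who want to see what such a pipeline can look like, here is a minimal sketch using the tm and wordcloud packages. It is an illustration under assumptions, not the source code of rquery.wordcloud() : the sample sentences and the variable names (text, docs, freq) are made up for the example, and the cleaning steps shown are one common sequence.

```r
library(tm)
library(wordcloud)
library(RColorBrewer)

# Hypothetical sample text; replace it with your own character vector
text <- c("The cat sat on the mat.", "The cat ate the fish.")

# Load the text as a corpus (one document per element)
docs <- VCorpus(VectorSource(text))

# Clean the text: lower case, numbers, stopwords, punctuation, white space
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

# Build the term-document matrix and the frequency table
tdm <- TermDocumentMatrix(docs)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
freqTable <- data.frame(word = names(freq), freq = freq)

# Plot the word cloud
set.seed(1234)
wordcloud(words = freqTable$word, freq = freqTable$freq, min.freq = 1,
          max.words = 200, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```

Note that TermDocumentMatrix() ignores words shorter than three characters by default, so very short words never reach the plot.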
R tag cloud generator function : rquery.wordcloud
The format of rquery.wordcloud() function is shown below :
rquery.wordcloud(x, type=c("text", "url", "file"), lang="english", excludeWords = NULL, textStemming = FALSE, colorPalette="Dark2", max.words=200)
- x : character string (plain text, web URL, txt file path)
- type : specify whether x is a plain text, a web page URL or a .txt file path
- lang : the language of the text. This is important to be specified in order to remove the common stopwords (like ‘the’, ‘we’, ‘is’, ‘are’) from the text before further analysis. Supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish.
- excludeWords : a vector containing your own stopwords to be eliminated from the text, e.g. c("word1", "word2")
- textStemming : reduces words to their root form. Default value is FALSE. A stemming process reduces the words “moving” and “movement” to the root word, “move”.
- colorPalette : the name of a color palette taken from the RColorBrewer package (e.g. "Dark2"), a color name, or a color code
- min.freq : words with frequency below min.freq will not be plotted
- max.words : maximum number of words to be plotted; the least frequent terms are dropped
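To make the effect of min.freq and max.words concrete, the sketch below applies the same two rules to a plain frequency vector in base R. The helper select_words() is hypothetical and only mimics the behaviour described above; it is not part of rquery.wordcloud() or of the wordcloud package. The frequencies are the ones reported later in this article for the "I have a dream" speech.

```r
# Word frequencies reported later in this article
freq <- c(will = 17, freedom = 13, ring = 12, day = 11, dream = 11,
          let = 11, every = 9, able = 8, one = 8, together = 7)

# Hypothetical helper: return the words that would be kept for plotting
select_words <- function(freq, min.freq, max.words) {
  freq <- sort(freq, decreasing = TRUE)  # most frequent first
  freq <- freq[freq >= min.freq]         # drop words below min.freq
  head(freq, max.words)                  # keep at most max.words
}

# Only words appearing at least 10 times
select_words(freq, min.freq = 10, max.words = 200)

# At most the 3 most frequent words
select_words(freq, min.freq = 1, max.words = 3)
```

Lowering min.freq and raising max.words therefore both lead to more words on the plot.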
Required R packages
The following packages are required for the rquery.wordcloud() function :
- tm for text mining
- SnowballC for text stemming
- wordcloud for generating word cloud images
- RCurl and XML packages to download and parse web pages
- RColorBrewer for color palettes
Install these packages before using the rquery.wordcloud() function, as follows :
install.packages(c("tm", "SnowballC", "wordcloud", "RColorBrewer", "RCurl", "XML"))
Create a word cloud from a plain text file
A plain text file can easily be created using your favorite text editor (e.g. Word). The "I have a dream" speech (from Martin Luther King) is processed in the following example, but you can use any text you want :
- Save the file (e.g : ml.txt)
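A call on such a file could look like the sketch below. It assumes that rquery.wordcloud() has already been loaded into the R session (the call is skipped otherwise) and uses a small temporary ml.txt as a stand-in for your own file; the variable names filePath and res are chosen for the example.

```r
# Write two lines to a temporary ml.txt (stand-in for your own file)
filePath <- file.path(tempdir(), "ml.txt")
writeLines(c("I have a dream that one day this nation will rise up.",
             "Let freedom ring."), filePath)

# Call the function only if it has been loaded in the session
if (exists("rquery.wordcloud")) {
  res <- rquery.wordcloud(filePath, type = "file", lang = "english")
}
```

With a text this short, few words may reach the minimum frequency; a full-length text gives a more meaningful cloud.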
Change the arguments max.words and min.freq to plot more words :
The above image clearly shows that "will", "freedom", "dream", "day" and "together" are among the most frequent words in Martin Luther King's "I have a dream" speech.
Change the color of the word cloud
The color of the word cloud can be changed using the argument colorPalette.
Allowed values for colorPalette :
- a color name or a color code
- the name of a color palette taken from the RColorBrewer package (e.g. colorPalette = "Dark2")
The color palettes associated to RColorBrewer package are shown below :
# Reds color palette res
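The available palettes can also be inspected directly from R. The sketch below uses functions of the RColorBrewer package; the variable names dark2 and reds are chosen for the example.

```r
library(RColorBrewer)

# Display every palette shipped with RColorBrewer
display.brewer.all()

# Names, sizes and categories of the available palettes
head(brewer.pal.info)

# Colors of two palettes mentioned in this article
dark2 <- brewer.pal(8, "Dark2")
reds  <- brewer.pal(9, "Reds")
```

Each call to brewer.pal() returns a vector of hexadecimal color codes, so any of these values could also be passed individually as a color code.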
Operations on the result of rquery.wordcloud() function
As mentioned above, the result of rquery.wordcloud() is a list containing two objects :
- tdm : term-document matrix
- freqTable : frequency table
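The examples that follow use two variables, tdm and freqTable. Assuming the list elements are named tdm and freqTable, they could be extracted as sketched below. The list res built here is only a stand-in so that the snippet runs on its own; in practice it would be the value returned by rquery.wordcloud().

```r
# Stand-in for the list returned by rquery.wordcloud()
res <- list(
  tdm = NULL,  # would hold the term-document matrix
  freqTable = data.frame(word = c("will", "freedom", "ring"),
                         freq = c(17, 13, 12))
)

# Extract the two objects
tdm <- res$tdm
freqTable <- res$freqTable
```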
Frequency table of words
# Show the top 10 words and their frequency
head(freqTable, 10)

             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7

# Bar plot of the frequency for the top 10 words
barplot(freqTable[1:10,]$freq, las = 2,
        names.arg = freqTable[1:10,]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")
Operations on term-document matrix
The terms that occur at least four times can be listed with findFreqTerms() :
findFreqTerms(tdm, lowfreq = 4)
 "able" "day" "dream" "every" "faith" "free" "freedom" "let" "mountain" "nation"  "one" "ring" "shall" "together" "will"
You can also analyze the correlation (or association) between frequent terms. The R code below identifies which words are associated with "freedom" in the "I have a dream" speech :
findAssocs(tdm, terms = "freedom", corlimit = 0.3)
             freedom
let             0.89
ring            0.86
mississippi     0.34
mountainside    0.34
stone           0.34
every           0.32
mountain        0.32
state           0.32
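To get a feeling for what these numbers mean, the sketch below builds a tiny term-document matrix and compares findAssocs() with a correlation computed by hand. The four short documents are made up for the example; the association is understood here as the correlation between the counts of two terms across documents.

```r
library(tm)

# Hypothetical mini-corpus: each string is one document
text <- c("freedom ring", "freedom ring bell", "mountain stone", "mountain")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(text)))

# Associations with "freedom" reported by tm
assoc <- findAssocs(tdm, terms = "freedom", corlimit = 0.3)$freedom
assoc

# The same quantity computed by hand for one pair of terms
m <- as.matrix(tdm)
manual <- cor(m["freedom", ], m["ring", ])
manual
```

Here "freedom" and "ring" appear in exactly the same documents, so their association is maximal, while terms that never co-occur with "freedom" fall below corlimit and are not reported.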
Create a word cloud of a web page
In this section we’ll make a tag cloud of the following web page :
url = "http://www.sthda.com/english/wiki/create-and-format-powerpoint-documents-from-r-software"
rquery.wordcloud(x=url, type="url")
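Before any text mining can happen, the HTML of the page has to be turned into plain text. The sketch below shows one way to do this with the XML package, assuming the visible text sits in paragraph elements; the article does not show how rquery.wordcloud() handles this step, and the HTML snippet is a made-up stand-in for a downloaded page.

```r
library(XML)

# Hypothetical HTML snippet standing in for a downloaded page
# (a real page could be fetched first, e.g. with RCurl::getURL(url))
html <- "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>"

# Parse the HTML and keep the text of the paragraph elements
doc <- htmlParse(html, asText = TRUE)
paragraphs <- xpathSApply(doc, "//p", xmlValue)

# Collapse the paragraphs into a single character string
text <- paste(paragraphs, collapse = " ")
text
```

The resulting character string can then be processed like any other plain text.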
R code of rquery.wordcloud function
#++++++++++++++++++++++++++++++++++
# rquery.wordcloud() : Word cloud generator
# - http://www.sthda.com
#+++++++++++++++++++++++++++++++++++
# x : character string (plain text, web url, txt file path)
# type : specify whether x is a plain text, a web page url or a file path
# lang : the language of the text
# excludeWords : a vector of words to exclude from the text
# textStemming : reduces words to their root form
# colorPalette : the name of color palette taken from RColorBrewer package,
#   or a color name, or a color code
# min.freq : words with frequency below min.freq will not be plotted
# max.words : Maximum number of words to be plotted. least frequent terms dropped
# value returned by the function : a list(tdm, freqTable)
rquery.wordcloud