[python] Drawing word clouds with the wordcloud library

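A word cloud is a way of visualizing text data. It conveys the importance of each term through font size or color. Word clouds are widely used on social media because they let readers quickly pick out the most prominent terms. However, there is no agreed standard for word cloud output, and the layout carries no logical structure. Words whose frequencies differ greatly are easy to tell apart, but words with similar colors and similar frequencies are not. Word clouds are therefore poorly suited to scientific plotting. This article uses the Python library wordcloud to draw word clouds. wordcloud can be installed as follows: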

pip install wordcloud

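0 Overview of wordcloud plotting

All of the wordcloud library's functions for drawing word clouds are provided by its WordCloud class.

The WordCloud constructor is as follows: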

WordCloud(font_path=None, width=400, height=200, margin=2,
          ranks_only=None, prefer_horizontal=.9, mask=None, scale=1,
          color_func=None, max_words=200, min_font_size=4,
          stopwords=None, random_state=None, background_color='black',
          max_font_size=None, font_step=1, mode="RGB",
          relative_scaling='auto', regexp=None, collocations=True,
          colormap=None, normalize_plurals=True, contour_width=0,
          contour_color='black', repeat=False,
          include_numbers=False, min_word_length=0, collocation_threshold=30)

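The constructor parameters are described below:

Parameter          Type             Description
font_path          str              Path to the font file; required for Chinese word clouds
width              int              Width of the output canvas
height             int              Height of the output canvas
margin             int              Margin around each word on the canvas
prefer_horizontal  float            Ratio of attempts to place a word horizontally rather than vertically
mask               numpy array      If None, a rectangular canvas is used; if given, the cloud is drawn inside the mask and width and height are ignored
scale              float            Factor by which the canvas width and height are enlarged
color_func         callable         Function that assigns a color to each word
max_words          int              Maximum number of words to draw
min_font_size      int              Smallest font size used
stopwords          set of str       Words to exclude from the cloud
random_state       int              Seed for the random number generator, which affects word placement and colors
background_color   str              Background color
max_font_size      int              Largest font size used
font_step          int              Step size for the font
mode               str              Pillow image mode, e.g. "RGB" or "RGBA"
relative_scaling   float or 'auto'  How strongly word frequency influences font size
regexp             str              Regular expression used to split the input text into tokens
collocations       bool             Whether to include collocations (bigrams) of two words
colormap           str              Matplotlib colormap from which a color is drawn at random for each word; ignored if color_func is given
normalize_plurals  bool             Whether to replace English plurals with their singular form
contour_width      float            Width of the mask contour
contour_color      str              Color of the mask contour
repeat             bool             Whether to repeat words until max_words is reached
include_numbers    bool             Whether to include numbers as phrases
min_word_length    int              Minimum number of letters a word must have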
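
The color_func entry is easiest to understand from an example. The sketch below assumes the calling convention used in the wordcloud examples, where the function receives the word plus keyword arguments such as font_size, position, orientation and random_state, and returns a color string that Pillow understands. The name size_to_gray and the lightness formula are made up for illustration.

```python
def size_to_gray(word, font_size, position=None, orientation=None,
                 random_state=None, **kwargs):
    """Hypothetical color function: larger words get a darker gray."""
    # Clamp lightness to 10..80 percent so no word becomes pure black
    # or too light to read
    lightness = min(80, max(10, 80 - int(font_size)))
    return "hsl(0, 0%, {}%)".format(lightness)


print(size_to_gray("python", font_size=30))  # prints hsl(0, 0%, 50%)
```

Such a function could then be passed as WordCloud(color_func=size_to_gray) or as recolor(color_func=size_to_gray).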

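The main methods provided by the WordCloud class are:

  • generate_from_frequencies(frequencies): generate a word cloud from word frequencies
  • fit_words(frequencies): same as generate_from_frequencies
  • process_text(text): split the text into words, remove stopwords and return the word counts
  • generate_from_text(text): generate a word cloud from text
  • generate(text): same as generate_from_text
  • to_image: return the result as a Pillow image
  • recolor: recolor the existing word cloud
  • to_array: return the result as a NumPy array
  • to_file(filename): save the result to an image file
  • to_svg: return the result as an SVG string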
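
To illustrate the frequency-based entry point: generate_from_frequencies takes a mapping from word to frequency, which can be built with collections.Counter when the words are already tokenized. This is a minimal sketch; the tag list, the helper names and the output file name are made up, and wordcloud is imported inside the function only so that the counting step can be tried without the library installed.

```python
from collections import Counter


def count_words(words):
    """Return a plain dict mapping each word to its number of occurrences."""
    return dict(Counter(words))


def draw_from_counts(counts, filename="tags.png"):
    """Render a word cloud from a word -> count mapping and save it."""
    # Imported lazily so count_words() works without wordcloud installed
    from wordcloud import WordCloud
    wc = WordCloud(background_color="white")
    wc.generate_from_frequencies(counts)
    wc.to_file(filename)


# Hypothetical, already tokenized input
tags = ["python", "numpy", "python", "pandas", "python", "numpy"]
counts = count_words(tags)
print(counts)  # prints {'python': 3, 'numpy': 2, 'pandas': 1}
```

Calling draw_from_counts(counts) would then write the image to tags.png.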

1 Plotting examples

1.1 Word cloud from a single word

import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "hello"

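# np.ogrid returns two open-grid arrays, with shapes (n, 1) and (1, m)
x, y = np.ogrid[:300, :300]

# Define the drawing region: everything outside a circle of radius 130 is masked out (255)
mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
mask = 255 * mask.astype(int)

# Draw the word cloud. repeat=True repeats the input words until max_words is reached;
# scale sets the enlargement factor
wc = WordCloud(background_color="white", repeat=True, max_words=32, mask=mask, scale=1.5)
wc.generate(text)

plt.axis("off")
plt.imshow(wc, interpolation="bilinear")
plt.show()

# Write the result to a file
_ = wc.to_file("result.jpg")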

1.2 Basic plotting


from wordcloud import WordCloud

# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path,'w',encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path,'r',encoding='utf-8') as f:
    # text is a single string here
    text = f.read()
# Generate the word cloud; WordCloud tokenizes the input text and lays out the words
wordcloud = WordCloud().generate(text)

import matplotlib.pyplot as plt
plt.axis("off")
plt.imshow(wordcloud, interpolation='bilinear')
plt.show()

# Change the maximum font size that can be displayed
wordcloud = WordCloud(max_font_size=50).generate(text)

# Another way to display the result
image = wordcloud.to_image()
image.show()
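
max_font_size caps the largest word; how quickly the sizes fall off for less frequent words depends on relative_scaling. According to the wordcloud documentation, relative_scaling=0 considers only word ranks, while relative_scaling=1 makes a word that is twice as frequent twice as large. The function below is an illustrative sketch of that interpolation, not the library's code; next_font_size is a made-up name, and the real layout may shrink a word further if it does not fit.

```python
def next_font_size(font_size, freq, last_freq, relative_scaling=0.5):
    """Sketch: starting font size for a word, given the previous word's size.

    freq and last_freq are the frequencies of the current and the previous
    word (words are assumed to be processed from most to least frequent).
    """
    if relative_scaling == 0:
        # Rank only: the frequency ratio does not change the starting size
        return font_size
    factor = relative_scaling * (freq / last_freq) + (1 - relative_scaling)
    return int(round(factor * font_size))


# A word half as frequent as the previous one, whose size was 100
for rs in (0, 0.5, 1):
    print(rs, next_font_size(100, freq=0.5, last_freq=1.0, relative_scaling=rs))
# prints: 0 100, then 0.5 75, then 1 50
```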

1.3 Custom word cloud shape

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS

# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path,'w',encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path,'r',encoding='utf-8') as f:
    # text is a single string here
    text = f.read()

# To generate a word cloud with a particular shape, first prepare a mask image of that shape
# In the mask image, everything outside the target shape is blank (pure white)
mask = np.array(Image.open("mask.png"))

# Words to skip
stopwords = set(STOPWORDS)
# Also drop "better"
stopwords.add("better")

# contour_width sets the width of the mask outline; contour_color sets the outline color
# If the outline is drawn inaccurately, set contour_width=0 to skip drawing it
wc = WordCloud(background_color="white", max_words=2000, mask=mask,
               stopwords=stopwords, contour_width=2, contour_color='red', scale=2, repeat=True)

# Generate the word cloud
wc.generate(text)

# Save to a file
wc.to_file("result.png")

# Show the word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.figure()
# Show the mask image
plt.imshow(mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()
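
A common problem with custom shapes is a mask whose background is not pure white. The wordcloud documentation states that white entries are treated as masked out and all other entries are free to draw on. The helper below is a small sanity check under that assumption; drawable_fraction is a made-up name, and it works on nested lists as well as on a 2-D (grayscale) NumPy array.

```python
def drawable_fraction(mask, white=255):
    """Return the share of mask cells that are NOT pure white.

    mask is a 2-D sequence of grayscale values (0..255).
    """
    total = 0
    drawable = 0
    for row in mask:
        for value in row:
            total += 1
            if value != white:
                drawable += 1
    return drawable / total if total else 0.0


# Tiny hypothetical 3x4 mask: 0 marks the shape, 255 the background
tiny_mask = [
    [255, 0, 0, 255],
    [0, 0, 0, 0],
    [255, 0, 0, 255],
]
print(drawable_fraction(tiny_mask))
```

A value close to 1.0 means almost nothing is masked out, which usually indicates that the background is transparent or off-white rather than 255.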

1.4 Plotting from a word-frequency dictionary

# Install with: pip install multidict
import multidict as multidict

import numpy as np

import re
from PIL import Image
from wordcloud import WordCloud
import matplotlib.pyplot as plt

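# Count word frequencies
def getFrequencyDictForText(sentence):
    fullTermsDict = multidict.MultiDict()
    tmpDict = {}

    # Split on whitespace (spaces and newlines)
    for text in sentence.split():
        word = text.lower()
        # Skip common function words for a more tailored result.
        # fullmatch compares the whole token, so words that merely start
        # with "a", "be", "in", ... are kept
        if re.fullmatch("a|the|an|to|in|for|of|or|by|with|is|on|that|be", word):
            continue
        tmpDict[word] = tmpDict.get(word, 0) + 1
    # Build the frequency dictionary
    for key in tmpDict:
        fullTermsDict.add(key, tmpDict[key])
    return fullTermsDict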


def makeImage(text):
    mask = np.array(Image.open("mask.png"))

    wc = WordCloud(background_color="white", max_words=1000, mask=mask, repeat=True)
    wc.generate_from_frequencies(text)

    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()



# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path,'w',encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path,'r',encoding='utf-8') as f:
    # text is a single string here
    text = f.read()

# Build the word-frequency dictionary
fullTermsDict = getFrequencyDictForText(text)
# Draw
makeImage(fullTermsDict)
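
Splitting on whitespace leaves punctuation attached, so "ugly." and "ugly" would be counted as different words. One possible refinement, sketched here with only the standard library, is to extract words with a regular expression before counting. The pattern, the small stopword set and the function name are illustrative choices, not part of wordcloud.

```python
import re
from collections import Counter

# Illustrative stopword set; extend as needed
SKIP = {"a", "an", "the", "to", "in", "for", "of", "or", "by", "with",
        "is", "on", "that", "be"}


def word_counts(text, skip=SKIP):
    """Lowercase the text, extract words and count those not in skip."""
    # The pattern matches runs of letters and keeps contractions such as "aren't"
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(w for w in words if w not in skip)


sample = "Beautiful is better than ugly.\nExplicit is better than implicit."
counts = word_counts(sample)
print(counts["better"], counts["ugly"])  # prints 2 1
```

Because Counter is a dict subclass, the result can be passed to generate_from_frequencies in the same way as the dictionary built above.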

1.5 Changing colors

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Path of the text file
text_path = 'test.txt'
# Sample text
scr_text = '''The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!'''

# Save the sample text
with open(text_path,'w',encoding='utf-8') as f:
    f.write(scr_text)

# Read the text back
with open(text_path,'r',encoding='utf-8') as f:
    # text is a single string here
    text = f.read()

# Image source: https://github.com/amueller/word_cloud/blob/master/examples/alice_color.png
alice_coloring = np.array(Image.open("alice_color.png"))
stopwords = set(STOPWORDS)
stopwords.add("better")

wc = WordCloud(background_color="white", max_words=500, mask=alice_coloring,
               stopwords=stopwords, max_font_size=50, random_state=42, repeat=True)
# Generate the word cloud
wc.generate(text)
# Display it
image = wc.to_image()
image.show()


# Draw a word cloud colored like alice_coloring
# Extract colors from the image
image_colors = ImageColorGenerator(alice_coloring)
# Recolor the word cloud
wc.recolor(color_func=image_colors)
# Display it
image = wc.to_image()
image.show()

# Show the mask image
plt.imshow(alice_coloring, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()
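
ImageColorGenerator colors each word from the image region it covers. The idea can be sketched as averaging the pixels under a word's bounding box; the function below is a simplified illustration with made-up names and a nested-list image, not the library's implementation.

```python
def mean_patch_color(image, top, left, height, width):
    """Average the RGB pixels of a rectangular patch and return 'rgb(r, g, b)'.

    image is indexed as image[row][column] and holds (r, g, b) tuples.
    """
    pixels = [image[y][x]
              for y in range(top, top + height)
              for x in range(left, left + width)]
    if not pixels:
        raise ValueError("empty patch")
    n = len(pixels)
    # Integer mean of each channel
    r, g, b = (sum(p[i] for p in pixels) // n for i in range(3))
    return "rgb({}, {}, {})".format(r, g, b)


# Hypothetical 2x2 image: red and blue in the top row, black below
tiny_image = [
    [(255, 0, 0), (0, 0, 255)],
    [(0, 0, 0), (0, 0, 0)],
]
print(mean_patch_color(tiny_image, top=0, left=0, height=1, width=2))
# prints rgb(127, 0, 127)
```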

1.6 Setting colors for specific words

from wordcloud import (WordCloud, get_single_color_func)
import matplotlib.pyplot as plt


# Color function that assigns colors directly
class SimpleGroupedColorFunc(object):
    def __init__(self, color_to_words, default_color):
        # Colors for specific words
        self.word_to_color = {word: color
                              for (color, words) in color_to_words.items()
                              for word in words}
        # Color for all other words
        self.default_color = default_color

    def __call__(self, word, **kwargs):
        return self.word_to_color.get(word, self.default_color)


class GroupedColorFunc(object):

    def __init__(self, color_to_words, default_color):
        self.color_func_to_words = [
            (get_single_color_func(color), set(words))
            for (color, words) in color_to_words.items()]

        self.default_color_func = get_single_color_func(default_color)

    def get_color_func(self, word):
        """Returns a single_color_func associated with the word"""
        try:
            color_func = next(
                color_func for (color_func, words) in self.color_func_to_words
                if word in words)
        except StopIteration:
            color_func = self.default_color_func

        return color_func

    def __call__(self, word, **kwargs):
        return self.get_color_func(word)(word, **kwargs)


text = """The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!"""

# collocations controls whether two-word collocations (bigrams) are counted when generating from raw text
wc = WordCloud(collocations=False).generate(text.lower())

# Assign colors to specific words
color_to_words = {
    'green': ['beautiful', 'explicit', 'simple', 'sparse',
                'readability', 'rules', 'practicality',
                'explicitly', 'one', 'now', 'easy', 'obvious', 'better'],
    '#FF00FF': ['ugly', 'implicit', 'complex', 'complicated', 'nested',
            'dense', 'special', 'errors', 'silently', 'ambiguity',
            'guess', 'hard']
}

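# Words not listed above are drawn in grey
default_color = 'grey'

# Direct color function: every word gets exactly the color given in color_to_words,
# so all words in a group look identical
# grouped_color_simple = SimpleGroupedColorFunc(color_to_words, default_color)

# Finer-grained color function: get_single_color_func converts each color to HSV
# and varies its brightness, so words in a group get different shades
grouped_color = GroupedColorFunc(color_to_words, default_color)

# Apply the color function
wc.recolor(color_func=grouped_color)

# Plot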
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
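
The difference between the two color functions comes from get_single_color_func, which keeps the hue and saturation of the given color and varies its brightness. The sketch below illustrates that idea with the standard colorsys module; single_hue_color and the brightness range 0.2 to 1.0 are assumptions made for illustration rather than a copy of the library code.

```python
import colorsys
from random import Random


def single_hue_color(rgb, random_state=None):
    """Return an 'rgb(r, g, b)' string with the hue and saturation of rgb
    and a randomly chosen brightness (the V in HSV)."""
    rng = random_state if random_state is not None else Random()
    r, g, b = (channel / 255.0 for channel in rgb)
    hue, saturation, _ = colorsys.rgb_to_hsv(r, g, b)
    # Keep hue and saturation, pick a new brightness
    value = rng.uniform(0.2, 1.0)
    r2, g2, b2 = colorsys.hsv_to_rgb(hue, saturation, value)
    return "rgb({:.0f}, {:.0f}, {:.0f})".format(r2 * 255, g2 * 255, b2 * 255)


# Three shades of the same green, reproducible thanks to the seed
rng = Random(42)
shades = [single_hue_color((0, 128, 0), rng) for _ in range(3)]
print(shades)
```

Each entry has the form rgb(0, g, 0) with a different green level, which is why words in one group appear in varying shades.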

1.7 Drawing a Chinese word cloud

import jieba
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
import numpy as np
# Read the text
# Download: https://github.com/amueller/word_cloud/blob/master/examples/wc_cn/CalltoArms.txt
with open('CalltoArms.txt','r',encoding='utf-8') as f:
    text = f.read()

# A font file must be set for Chinese text
# Download: https://github.com/amueller/word_cloud/blob/master/examples/fonts/SourceHanSerif/SourceHanSerifK-Light.otf
font_path =  'SourceHanSerifK-Light.otf'

# List of words excluded from the word cloud
# Download: https://github.com/amueller/word_cloud/blob/master/examples/wc_cn/stopwords_cn_en.txt
stopwords_path = 'stopwords_cn_en.txt'
# Template image, used as the mask and as the color source
back_coloring = np.array(Image.open("alice_color.png"))

# New words to add to the jieba dictionary
userdict_list = ['阿Q', '孔乙己', '单四嫂子']


# Tokenize
def jieba_processing_txt(text):
    for word in userdict_list:
        jieba.add_word(word)

    mywordlist = []
    # Segment the text in accurate mode
    seg_list = jieba.cut(text, cut_all=False)
    liststr = "/ ".join(seg_list)

    with open(stopwords_path, encoding='utf-8') as f_stop:
        f_stop_text = f_stop.read()
        f_stop_seg_list = f_stop_text.splitlines()

    for myword in liststr.split('/'):
        if not (myword.strip() in f_stop_seg_list) and len(myword.strip()) > 1:
            mywordlist.append(myword)
    return ' '.join(mywordlist)
# Process the text
text = jieba_processing_txt(text)

# margin sets the spacing around each word
wc = WordCloud(font_path=font_path, background_color="black", max_words=2000, mask=back_coloring,
               max_font_size=100, random_state=42, width=1000, height=860, margin=5,
               contour_width=2,contour_color='blue')


wc.generate(text)

# Extract colors from the template image
image_colors_byImg = ImageColorGenerator(back_coloring)

plt.imshow(wc.recolor(color_func=image_colors_byImg), interpolation="bilinear")
plt.axis("off")
plt.figure()
plt.imshow(back_coloring, interpolation="bilinear")
plt.axis("off")
plt.show()
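
The stopword filtering inside jieba_processing_txt can also be written as a small helper that works on any iterable of tokens. This sketch keeps the same rule (drop stopwords and tokens of length 1 or less after stripping) but stores the stopwords in a set, so each membership check takes constant time; filter_tokens and the sample tokens are made up for illustration.

```python
def filter_tokens(tokens, stopwords):
    """Keep tokens that are longer than one character and not stopwords."""
    stop_set = set(stopwords)
    kept = []
    for token in tokens:
        token = token.strip()
        if len(token) > 1 and token not in stop_set:
            kept.append(token)
    return kept


# Hypothetical tokens, as a segmenter such as jieba might produce them
tokens = ["孔乙己", "是", "站着", "喝酒", "而", "穿", "长衫", "的", "唯一", "的", "人"]
stopwords = ["唯一", "的"]
print(" ".join(filter_tokens(tokens, stopwords)))  # prints 孔乙己 站着 喝酒 长衫
```

With jieba installed, filter_tokens(jieba.cut(text), stopword_lines) would produce the list that is joined with spaces before it is passed to wc.generate; stopword_lines stands for the lines read from the stopword file.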
