Association Analysis: A Worked Example of Association Rule Mining

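In the previous post we covered the concepts of association analysis and the principles behind several mining algorithms. In this post we walk through a worked example to show how such an algorithm is implemented and what it produces. Using a set of US Congressional voting records as the case study, we apply the Apriori algorithm with a minimum support of 30% and a minimum confidence of 90% to mine high-confidence rules.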

Readers who are not yet familiar with association rule mining may want to read this post first:

https://blog.ailemon.me/2018/08/11/association-analysis-association-rule-mining-algorithm/
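
As a brief reminder of the two thresholds used below, here are the standard definitions of support and confidence. The symbols D (the set of transactions), X and Y (itemsets) are notation introduced for this note and may differ from the notation in the earlier post.

```latex
\mathrm{supp}(X) = \frac{|\{\, t \in D : X \subseteq t \,\}|}{|D|},
\qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
```

In this example an itemset is kept only if its support is at least 0.3 and a rule derived from it has confidence of at least 0.9.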

The dataset used in this post:

http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

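Data analysis

In the file house-votes-84.data, each line represents one member of Congress. The first column is the member's party, and columns 2 through 17 hold that member's votes on 16 issues. "y" is a vote in favor, "n" is a vote against, and "?" is an abstention. Here the abstentions can be treated as noise in the dataset and should be removed before mining association rules.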
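
To make the row layout concrete, here is a minimal sketch of turning one line into a transaction of column indices. The helper name `to_transaction` and the sample record are made up for illustration; the record is not an actual row of the dataset.

```python
def to_transaction(line, vote='y'):
    """Return the 1-based issue indices on which the member cast `vote`."""
    fields = line.strip().split(',')
    # fields[0] is the party; fields[1:] are the 16 votes.
    # '?' entries match neither 'y' nor 'n', so they are dropped implicitly.
    return [i for i, v in enumerate(fields) if i >= 1 and v == vote]


# Hypothetical record, used only to show the format
sample = 'democrat,y,n,?,y,n,n,y,y,y,n,?,n,n,y,n,y'

print(to_transaction(sample, 'y'))  # issues this member supported
print(to_transaction(sample, 'n'))  # issues this member opposed
```

Each resulting list plays the role of one "shopping basket" in the usual market-basket formulation.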

We can mine the "y" votes and the "n" votes separately.

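Mining association rules

First, we build a dictionary with 16 keys, one per issue, and count how often each issue occurs. This gives the 1-itemsets, which are pruned by the support threshold (confidence only becomes meaningful once an itemset contains at least two items). Next, we pair up the surviving 1-itemsets, make another pass over the whole dataset to count how often each pair occurs together, and obtain the 2-itemsets, which are pruned by both support and confidence. Then, for every 2-itemset, we look for other 2-itemsets it can be merged with to form a 3-itemset: the two must share exactly the same prefix and differ only in the last element. The resulting 3-itemsets are again pruned by support and confidence. Finally, we repeat the procedure used for the 3-itemsets until no new itemsets can be generated.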
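
The prefix-join step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the full Apriori candidate generation: the function name `join_candidates` is hypothetical, itemsets are assumed to be sorted tuples, and the usual check that every k-subset of a candidate is itself frequent is omitted.

```python
def join_candidates(frequent):
    """Join k-itemsets (sorted tuples) that share their first k-1 items."""
    items = sorted(frequent)
    candidates = set()
    for idx, a in enumerate(items):
        for b in items[idx + 1:]:
            # Same prefix, different last element: merge into a (k+1)-itemset.
            # Because `items` is sorted, a[-1] < b[-1] whenever prefixes match.
            if a[:-1] == b[:-1]:
                candidates.add(a + (b[-1],))
    return sorted(candidates)


# Toy input: a handful of 2-itemsets over issue indices
frequent_2 = [(4, 5), (4, 6), (4, 14), (5, 6), (5, 14)]
print(join_candidates(frequent_2))
```

The candidates produced this way still have to be counted against the data and filtered by the thresholds before they qualify as results.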

Mining results

The 2-itemsets, 3-itemsets, 4-itemsets and so on are printed, each with its associated items and its count.
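
As one possible way to present such output, the sketch below formats a list of dictionaries, one per itemset size, each mapping an itemset tuple to its count. The helper `format_levels` is hypothetical and the counts in `toy` are invented for the example.

```python
def format_levels(levels):
    """Render a list of {itemset: count} dicts, one dict per itemset size."""
    lines = []
    for level in levels:
        if not level:
            continue
        size = len(next(iter(level)))
        lines.append('{}-itemsets'.format(size))
        # Most frequent itemsets first
        for itemset, count in sorted(level.items(), key=lambda kv: -kv[1]):
            lines.append('  {}: {}'.format(itemset, count))
    return lines


# Toy input with made-up counts
toy = [{(1, 2): 10, (1, 3): 12}, {(1, 2, 3): 9}]
print('\n'.join(format_levels(toy)))
```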

Full code listing

Python 3.x

Tested by the author under Python 3.6.

  • Data processing
def loadData(path = ''):
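  '''
  Load the raw data.
  Returns a list of rows, each row a list of comma-separated fields.
  '''
  # Use a context manager so the file is closed even if reading fails
  with open(path + 'house-votes-84.data') as f:
    txt = f.read()
  lst_txt = txt.split('\n')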
  
  data = []
  
  for txt_line in lst_txt:
    tmp=txt_line.split(',')
    data.append(tmp)
  
  return data

def preProcessing(data, vote_y_n):
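  '''
  Preprocess the data.
  Keep, for each member, the column indices whose vote equals vote_y_n
  ('y' or 'n'), so that rules can be mined for either kind of vote.
  '''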
  data_pre = []
  for data_line in data:
    tmp_data = []
    for i in range(1,len(data_line)):
      if(data_line[i] == vote_y_n):
        tmp_data.append(i)
    
    if(tmp_data == []):
      continue
    data_pre.append(tmp_data)
  
  return data_pre
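  • Association rule mining
def rule_mining(data, suport, coincidence):
  '''
  Mine association rules.
  Returns a list of dicts, one per itemset size starting at 2,
  each mapping an itemset (a tuple) to its count.
  '''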
  dic_1 = mining_first(data, suport, coincidence)
  #print(dic_1)
  dic_2 = mining_second(data, dic_1, suport, coincidence)
  #print(dic_2)
  dic_before = dic_2
  
  dic_r = []
  
  while(dic_before != {}):
    dic_r.append(dic_before)
    dic_3 = mining_third(data, dic_before, suport, coincidence)
    dic_before = dic_3
  
  return dic_r

def mining_first(data, suport, coincidence):
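  '''
  First mining pass.
  Count the single items and keep the 1-itemsets that meet the support
  threshold (confidence is validated but not used at this level).
  '''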
  dic = {}
  count = len(data)
  for data_line in data:
    for data_item in data_line:
      if(data_item in dic):
        dic[data_item] += 1
      else:
        dic[data_item] = 1
  
  assert (suport >=0 ) and (suport <= 1) , 'suport must be in 0-1'
  val_suport = int(count * suport)
  assert (coincidence >=0 ) and (coincidence <= 1) , 'coincidence must be in 0-1'
  dic_1 = {}
  for item in dic:
    if(dic[item] >= val_suport):
      dic_1[item] = dic[item]
  
  return dic_1
  
def mining_second(data, dic_before, suport, coincidence):
  '''
  Second mining pass.
  Build the 2-itemsets from the 1-itemsets in dic_before.
  '''
  dic = {}
  count = len(data)
  count2 = 0
  for data_line in data:
    # Number of items in this transaction
    count_item = len(data_line)
    # Count every pair of items
    for i in range(0, count_item - 1):
      for j in range(i + 1, count_item):
        if(data_line[i] in dic_before and data_line[j] in dic_before):
          count2 += 1
          tmp = (data_line[i], data_line[j])
          if(tmp in dic):
            dic[tmp] += 1
          else:
            dic[tmp] = 1
        else:
          continue
  #print(dic)
  assert (suport >=0 ) and (suport <= 1) , 'suport must be in 0-1'
  assert (coincidence >=0 ) and (coincidence <= 1) , 'coincidence must be in 0-1'
  
  dic_2 = {}
  for item in dic:
    count_item0 = dic_before[item[0]]
    count_item1 = dic_before[item[1]]
    # Check support, and confidence in at least one direction
    if((dic[item] >= suport * count) and ((dic[item] >= coincidence * count_item0) or (dic[item] >= coincidence * count_item1))):
      dic_2[item] = dic[item]
  
  return dic_2

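def mining_third(data, dic_before, suport, coincidence):
  '''
  Third and later mining passes.
  Build the 3-itemsets, 4-itemsets, ... n-itemsets from the itemsets
  in dic_before, which are one item shorter.
  '''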
  dic_3 = {}
  for item0 in dic_before:
    for item1 in dic_before:
      if(item0 != item1):
        #print(item0,item1)
        item_len = len(item0)
        equal = True
        tmp_item3 = []
        # Check whether the first n-1 items are identical
        for i in range(item_len - 1):
          tmp_item3.append(item0[i])
          if(item0[i] != item1[i]):
            equal = False
            break
        if(equal == True):
          minitem = min(item0[-1],item1[-1])
          maxitem = max(item0[-1],item1[-1])
          tmp_item3.append(minitem)
          tmp_item3.append(maxitem)
          tmp_item3 = tuple(tmp_item3)
          dic_3[tmp_item3] = 0
        else:
          continue
  #print('dic_3:',dic_3)
  for data_line in data:
    for item in dic_3:
      is_in = True
      for i in range(len(item)):
        if(item[i] not in data_line):
          is_in = False
      
      if(is_in == True):
        dic_3[item] += 1
  
  assert (suport >=0 ) and (suport <= 1) , 'suport must be in 0-1'
  assert (coincidence >=0 ) and (coincidence <= 1) , 'coincidence must be in 0-1'
  
  count = len(data)
  dic_3n = {}
  for item in dic_3:
    count_item0 = dic_before[item[:-1]]
    # Check support and confidence (antecedent: all items but the last)
    if((dic_3[item] >= suport * count) and (dic_3[item] >= coincidence * count_item0)):
      dic_3n[item] = dic_3[item]
  
  return dic_3n
  • Calling code
if(__name__=='__main__'):
  data_row = loadData()
  
  data_y = preProcessing(data_row, 'y')
  data_n = preProcessing(data_row, 'n')
  
  # Minimum support
  suport = 0.3
  # Minimum confidence
  coincidence = 0.9
  
  r_y = rule_mining(data_y, suport, coincidence)
  print('vote `y`:\n',r_y)
  r_n = rule_mining(data_n, suport, coincidence)
  print('vote `n`:\n',r_n)
  
  f=open('result_mining.txt','w')
  f.write('vote `y`:\n')
  f.write(str(r_y))
  f.write('\n\nvote `n`:\n')
  f.write(str(r_n))
  f.close()

Output

$ python3 vote.py
vote `y`:
[{(4, 5): 168, (4, 6): 160, (4, 14): 168, (5, 6): 197, (5, 14): 194, (6, 12): 159, (12, 14): 158, (8, 9): 192}, {(4, 5, 6): 156, (4, 5, 14): 160, (4, 6, 14): 151, (5, 6, 14): 180}, {(4, 5, 6, 14): 148}]
vote `n`:
[{(8, 9): 162, (4, 14): 163, (4, 5): 195, (5, 6): 137, (5, 14): 153}]
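
The numbers in these itemsets are column indices, so a small lookup makes the output easier to read. The sketch below is a hedged illustration: the issue names are given as we understand them from the dataset's documentation (the house-votes-84.names file) and should be checked against your own copy, and the helper `describe` is hypothetical. The two counts are copied from the `y` output above.

```python
# Issue names in column order (columns 2-17 of the data file).
# Assumption: taken from the dataset's .names file; verify before relying on it.
ISSUES = [
    'handicapped-infants',
    'water-project-cost-sharing',
    'adoption-of-the-budget-resolution',
    'physician-fee-freeze',
    'el-salvador-aid',
    'religious-groups-in-schools',
    'anti-satellite-test-ban',
    'aid-to-nicaraguan-contras',
    'mx-missile',
    'immigration',
    'synfuels-corporation-cutback',
    'education-spending',
    'superfund-right-to-sue',
    'crime',
    'duty-free-exports',
    'export-administration-act-south-africa',
]


def describe(itemset):
    """Translate 1-based issue indices into issue names."""
    return [ISSUES[i - 1] for i in itemset]


# Counts from the `y` run: (4, 5, 6) occurs 156 times, (4, 5, 6, 14) 148 times
count_antecedent = 156
count_full = 148
confidence = count_full / count_antecedent

print(describe((4, 5, 6)), '->', describe((14,)))
print('confidence = {:.3f}'.format(confidence))
```

Read this way, the 4-itemset says that members who voted "y" on issues 4, 5 and 6 also voted "y" on issue 14 in roughly 95% of cases, which clears the 90% confidence threshold.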

 
