2019-10-23

Building a Medical Knowledge Graph Question-Answering System from Scratch (Part 1): Data Collection

Table of contents

1. Data crawling module
2. Data storage module
3. Word segmentation module
4. Data building module
5. MongoDB notes

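I have started studying knowledge graphs but struggled to find a good hands-on project. I then found Liu Huanyong's open-source project on GitHub and decided to learn from his code. Many thanks to him for sharing it. Original project: 基于医药领域的知识图谱自动问答系统 (an automatic question-answering system based on a medical-domain knowledge graph).
These notes re-implement the same system by following his code and ideas, walking through the full process of building a KGQA application. They also serve as a record to keep my own study on track. Thanks again for sharing!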

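This first post covers data collection, i.e. the crawler. The original code crawls with urllib; because I kept hitting error 10061 (connection refused), I switched to requests. I previously wrote about what to do when a crawler gets banned in the post "Python爬虫爬数据相关问题解决" (Solving problems when crawling data with Python). It lists the fixes I tried for the errors I ran into, and they worked for me.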

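Building a knowledge graph starts with data. A QA system based on a medical knowledge graph needs medical data, and here it comes from 寻医问药 (xywy.com).

The data collection module lives in the /prepare_data folder.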

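One caveat before starting: if you store data in the database using the fields defined in Liu's data_spider.py, some fields will not show up as separate fields.
For example, Liu stores four fields (name, category, desc and attributes) under basic_info, so in the file exported from MongoDB these four fields are all nested inside basic_info. I therefore do not modify Liu's code further here; I only explain the approach to crawling and data cleaning, plus how MongoDB is used.
The JSON file used later to generate the knowledge graph is the medical.json that Liu provides on GitHub.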

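Data crawling module

The xywy.com homepage looks like this:

First, look at the disease links on the site, taking the common cold as an example:
Common cold: http://jib.xywy.com/il_sii_38.htm

The page contains the name, overview, causes, prevention, treatment and other information about the common cold. The crawler's job is to collect all of this.
Links for a few other diseases:
Hypertension: http://jib.xywy.com/il_sii_169.htm
Stomach disease: http://jib.xywy.com/il_sii_624.htm
Bronchitis: http://jib.xywy.com/il_sii_9014.htm
Insomnia: http://jib.xywy.com/il_sii_3484.htm
……
I will not list more. The links all share the prefix http://jib.xywy.com/il_sii_ and differ only in the number that follows, which identifies the disease. So the base URL is http://jib.xywy.com/il_sii_ , and looping over the numbers with string concatenation yields every page we need.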
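
The URL pattern described above can be sketched in a few lines. This is only an illustration: the helper name disease_url is made up for this example, and the ids are the ones listed above.

```python
BASE_URL = 'http://jib.xywy.com/il_sii_'

def disease_url(disease_id):
    """Build the page URL for one disease id by string concatenation."""
    return '%s%s.htm' % (BASE_URL, disease_id)

# ids taken from the example links above (common cold, hypertension, stomach disease)
urls = [disease_url(i) for i in (38, 169, 624)]
for url in urls:
    print(url)
```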

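To view a specific section for a disease, such as its overview, drugs or care, take hypertension (id 169) as an example:
http://jib.xywy.com/il_sii/gaishu/169.htm
The /gaishu/ segment selects the overview, and /drug/ selects the drugs section. The rest of the URL stays the same, so these URLs can also be built by concatenation. Once a URL is built, the data on that page is collected as follows: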

# Crawl every disease page
for page in range(1,11000):
    try:
        basic_url='http://jib.xywy.com/il_sii/gaishu/%s.htm'%page # overview
        cause_url='http://jib.xywy.com/il_sii/cause/%s.htm'%page # causes
        prevent_url='http://jib.xywy.com/il_sii/prevent/%s.htm'%page # prevention
        symptom_url='http://jib.xywy.com/il_sii/symptom/%s.htm'%page # symptoms
        inspect_url='http://jib.xywy.com/il_sii/inspect/%s.htm'%page # checks
        treat_url='http://jib.xywy.com/il_sii/treat/%s.htm'%page # treatment
        food_url = 'http://jib.xywy.com/il_sii/food/%s.htm' % page # diet
        drug_url = 'http://jib.xywy.com/il_sii/drug/%s.htm' % page # drugs
        data={}
        data['url'] = basic_url
        data['basic_info'] = self.basicinfo_spider(basic_url)
        data['cause_info'] = self.common_spider(cause_url)
        data['prevent_info'] = self.common_spider(prevent_url)
        data['symptom_info'] = self.symptom_spider(symptom_url)
        data['inspect_info'] = self.inspect_spider(inspect_url)
        data['treat_info'] = self.treat_spider(treat_url)
        data['food_info'] = self.food_spider(food_url)
        data['drug_info'] = self.drug_spider(drug_url)
        print(page, basic_url)
        # Store the record in the database
        # (insert_one replaces Collection.insert, which was removed in PyMongo 4)
        self.col.insert_one(data)
    except Exception as e:
        print(e,page)

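Before collecting information from each page, we need a function that fetches the page. I use requests with an explicit proxy setting here, because urllib.request kept failing with error 10061, so this function differs somewhat from Liu's code: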

# Request the HTML for a URL
def get_html(self,url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }
    # No proxy is configured for either scheme
    proxies = {
        'http': None,
        'https': None
    }
    response = requests.get(url=url, headers=headers, proxies=proxies)
    # The site serves GBK-encoded pages
    response.encoding = 'gbk'
    return response.text

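The page-parsing functions are explained below, one per feature.
P.S. A word on XPath first: XPath selects information from an HTML page (the same content you see after pressing F12). See these links:
xpath符号及用法说明 (XPath symbols and usage)
xpath测试 (XPath tester)
If you are not sure an XPath expression is right, test it online first or print the result yourself.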
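
As a small illustration of how XPath picks out parts of a page, here is a sketch that uses only the standard library's xml.etree.ElementTree, which supports a subset of XPath. It is a stand-in for lxml, not the code used in this post, and the SAMPLE fragment is a hypothetical, well-formed imitation of the page structure; the real pages are messier HTML and need lxml's etree.HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment imitating the structure of a disease page
SAMPLE = """
<html>
  <head><title>感冒,感冒的症状</title></head>
  <body>
    <div class="wrap mt10 nav-bar"><a>疾病百科</a><a>内科</a><a>呼吸内科</a></div>
    <div class="mt20 articl-know">
      <p><span>医保疾病：</span>否</p>
      <p><span>就诊科室：</span>内科 呼吸内科</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())

# Text of the <title> element, keeping only the part before the first comma
title = root.findtext('.//title').split(',')[0]

# Text of every <a> directly under the navigation <div>
category = [a.text for a in root.findall(".//div[@class='wrap mt10 nav-bar']/a")]

# Full text of each <p>, including nested tags (similar to XPath string(.))
attributes = [''.join(p.itertext()).strip()
              for p in root.findall(".//div[@class='mt20 articl-know']/p")]

print(title)
print(category)
print(attributes)
```

Printing intermediate results like this is a quick way to confirm that an expression selects what you expect before running a long crawl.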

The parsing functions for the different pages are much alike, with only small differences.

1.basicinfo_spider

def basicinfo_spider(self,url):
    html=self.get_html(url)
    selector=etree.HTML(html)
    # Page title; I changed one line of Liu's code here
    title = selector.xpath('//title/text()')[0].split(',')[0]
    # Category path of the disease, e.g. for the common cold: ['疾病百科', '内科', '呼吸内科']
    category=selector.xpath('//div[@class="wrap mt10 nav-bar"]/a/text()')
    # Disease description
    desc=selector.xpath('//div[@class="jib-articl-con jib-lh-articl"]/p/text()')
    # "Tips" box on the page (温馨提示)
    # Keep the lxml.etree._Element objects (no text() here) so that string(.) can be
    # applied below and the stray characters stripped
    ps=selector.xpath('//div[@class="mt20 articl-know"]/p')
    infobox=[]
    for p in ps:
        info=p.xpath('string(.)').replace('\r','').replace('\n','').replace('\xa0','').\
            replace('   ', '').replace('\t','')
        infobox.append(info)
    # Collect the fields in a dict
    basic_data = {}
    basic_data['category'] = category
    basic_data['name'] = title
    basic_data['desc'] = desc
    basic_data['attributes'] = infobox
    return basic_data

2.treat_spider

# Parse the treatment information (treat_infobox)
def treat_spider(self,url):
    html=self.get_html(url)
    selector=etree.HTML(html)
    ps=selector.xpath('//div[starts-with(@class,"mt20 articl-know")]/p')
    infobox=[]
    # Collect the treatment plan entries
    for p in ps:
        info = p.xpath('string(.)').replace('\r','').replace('\n','').replace('\xa0', '').replace('   ', '').replace('\t','')
        infobox.append(info)
    return infobox

3.drug_spider
My version of this function differs a little from Liu's, but the result is the same. His version is the simpler of the two.

# Parse the drug treatment information (drug_infobox)
def drug_spider(self,url):
    html=self.get_html(url)
    selector=etree.HTML(html)
    ps=selector.xpath('//div[starts-with(@class,"fl drug-pic-rec mr30")]/p/a')
    infobox=[]
    for p in ps:
        info=p.xpath('string(.)').replace('\r','').replace('\n','').replace('\xa0', '').replace('   ', '')\
            .replace('\t','').replace(' ','')
        infobox.append(info)
    return infobox

4.food_spider

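# Parse the diet information (food_infobox)
def food_spider(self,url):
    html=self.get_html(url)
    selector=etree.HTML(html)
    divs=selector.xpath('//div[@class="diet-img clearfix mt20"]')
    # The page has three food groups (good to eat / avoid / recommended), picked out by index
    try:
        food_data={}
        food_data['good']=divs[0].xpath('./div/p/text()')
        food_data['bad']=divs[1].xpath('./div/p/text()')
        # The key spelling 'recommond' is kept because build_data.py reads it under this name
        food_data['recommond']=divs[2].xpath('./div/p/text()')
    except IndexError:
        # Fewer than three groups on the page: store nothing
        return {}
    return food_data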

5.symptom_spider

# Parse the symptom information
def symptom_spider(self,url):
    html = self.get_html(url)
    selector = etree.HTML(html)
    symptoms = selector.xpath('//a[@class="gre"]/text()')
    ps = selector.xpath('//p')
    detail = []
    for p in ps:
        info = p.xpath('string(.)').replace('\r', '').replace('\n', '').replace('\xa0', '').replace('   ', '') \
            .replace('\t', '').replace(' ', '')
        detail.append(info)
    # Returned as a (symptoms, detail) pair; build_data.py reads the
    # symptom list as symptom_info[0]
    return symptoms,detail

6.inspect_spider

# Parse the check (examination) links
def inspect_spider(self,url):
    html=self.get_html(url)
    selector=etree.HTML(html)
    inspects=selector.xpath('//li[@class="check-item"]/a/@href')
    return inspects

7.common_spider

# Generic parser
def common_spider(self,url):
    # Used for both the prevention page and the causes page
    html=self.get_html(url)
    selector=etree.HTML(html)
    ps=selector.xpath('//p')
    infobox=[]
    for p in ps:
        info=p.xpath('string(.)').replace('\r', '').replace('\n', '').replace('\xa0', '').replace('   ', '') \
            .replace('\t', '').replace(' ', '')
        if info:
            infobox.append(info)
    return '\n'.join(infobox)
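
The parsing functions above all repeat the same chain of replace calls. One possible refactoring, shown only as a sketch, is a small helper; the name clean_text and the drop_spaces flag are my own and do not appear in Liu's code.

```python
def clean_text(text, drop_spaces=False):
    """Remove the whitespace debris stripped by the spider functions.

    The removal order mirrors the replace chain used in the parsers.
    Set drop_spaces=True for the variants that also delete single spaces.
    """
    for junk in ('\r', '\n', '\xa0', '   ', '\t'):
        text = text.replace(junk, '')
    if drop_spaces:
        text = text.replace(' ', '')
    return text

print(clean_text('医保疾病：\r\n\t否\xa0'))
print(clean_text('内科 呼吸内科', drop_spaces=True))
```

With such a helper, each parser body would shrink to something like info = clean_text(p.xpath('string(.)')), and a change to the cleanup rules would only need to be made in one place.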

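Finally, the class in data_spider.py has a module that crawls the check (examination) pages.
This differs slightly from Liu's version. His does not return data, but since I am not sure whether a return value will be needed later, I return data for now.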

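# Crawl the check (examination) pages
def inspect_crawl(self):
    data = {}
    for page in range(1,3685):
        try:
            url='http://jck.xywy.com/jc_%s.html'%page
            html = self.get_html(url)
            # Build a fresh dict per page: the insert adds an _id to the dict,
            # so reusing one dict would raise a duplicate-key error
            data = {'url': url, 'html': html}
            self.db['jc'].insert_one(data)
            print(url)
        except Exception as e:
            print(e)
    # Only the last record is returned
    return data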

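Data storage module

Once the crawler has fetched the data, it needs to be stored. Liu uses pymongo.
The only database I had installed was MySQL, but I use MongoDB here to stay consistent. First, install and configure MongoDB on Windows.
Reference:
mongodb在windows下的配置和安装 (Installing and configuring MongoDB on Windows)
Following that guide worked for me; it is one of the more reliable installation tutorials I have read.

Earlier, when I tested the MongoDB connection from a cmd window, it kept refusing the connection with error 10061. I eventually realized the mongod service was not running. Start it with: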

mongod --dbpath D:\mongodb\data\db

After it starts, test the MongoDB connection again:

First start the MongoDB server (mine is set up on drive F):

mongod --dbpath F:\mongodb\data\db

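Since this is a medical knowledge graph, I name the database medical, following Liu's naming.
Once mongod is running, open another cmd window and type mongo to enter the database shell. Run show dbs to see the current databases.

Create the medical database by running use medical. Running show dbs again still does not list it, because it holds no data yet.
That completes the database setup. Add the connection settings to the __init__ method in data_spider.py: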

def __init__(self):
    self.conn=pymongo.MongoClient()
    self.db=self.conn['medical']
    self.col=self.db['data']

With that configured, run the spider_main function:

handler=MedicalSpider()
# Fetch the check (examination) pages
handler.inspect_crawl()
# Crawl and store the full medical data
handler.spider_main()
print('Spider done.')

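Crawling and saving starts:

Crawling takes a long time. The first time I finished, my computer rebooted itself for no apparent reason and the data was gone. On the second run, the network dropped for a few seconds at around URL 3000, and the crawl jumped straight from 3000-something to 8000-something. With no better option I pressed Ctrl+C, changed the range and ran 4000-8000 as one batch, then 9000-11000 as another. The record count now looks like this:

I estimate it is a few hundred records short of what I expected. Oh well, it will do; not a big problem.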
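
Because each failed request is only printed and then skipped, a short network outage can silently lose a whole range of pages. One way to soften this, shown here only as a sketch, is to retry a failed fetch a few times before giving up. The helper fetch_with_retry and its parameters are hypothetical and not part of Liu's code; flaky_fetch merely simulates a network that fails twice.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on failure wait and retry, up to `retries` attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as err:
            last_error = err
            if attempt < retries:
                # Wait a little longer after each failed attempt
                time.sleep(delay * attempt)
    raise last_error

# Simulated fetcher: fails on the first two calls, then succeeds
calls = {'n': 0}

def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated network drop')
    return '<html>ok</html>'

result = fetch_with_retry(flaky_fetch, 'http://jib.xywy.com/il_sii/gaishu/38.htm',
                          retries=3, delay=0)
print(result, calls['n'])
```

In the spider, the fetch argument would be self.get_html. Pages that still fail after all retries could be written to a list and re-crawled later instead of re-running whole ranges by hand.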

The complete data_spider.py file:

#!/usr/bin/env python
#coding:utf-8

import requests
from lxml import etree
import pymongo


# Medical data crawler for xywy.com (寻医问药)
class MedicalSpider:
    def __init__(self):
        self.conn=pymongo.MongoClient()
        self.db=self.conn['medical']
        self.col=self.db['data']

    # Request the HTML for a URL
    def get_html(self,url):
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
        }
        # No proxy is configured for either scheme
        proxies = {
            'http': None,
            'https': None
        }
        response = requests.get(url=url, headers=headers, proxies=proxies)
        # The site serves GBK-encoded pages
        response.encoding = 'gbk'
        return response.text

    # Parse URLs out of a listing page
    def url_parser(self,content):
        selector=etree.HTML(content)
        urls=['http://www.anliguan.com'+i for i in selector.xpath('//h2[@class="item-title"]/a/@href')]
        return urls

    # Main crawl loop
    def spider_main(self):
        # Crawl every disease page
        for page in range(1,11000):
            try:
                basic_url='http://jib.xywy.com/il_sii/gaishu/%s.htm'%page # overview
                cause_url='http://jib.xywy.com/il_sii/cause/%s.htm'%page # causes
                prevent_url='http://jib.xywy.com/il_sii/prevent/%s.htm'%page # prevention
                symptom_url='http://jib.xywy.com/il_sii/symptom/%s.htm'%page # symptoms
                inspect_url='http://jib.xywy.com/il_sii/inspect/%s.htm'%page # checks
                treat_url='http://jib.xywy.com/il_sii/treat/%s.htm'%page # treatment
                food_url = 'http://jib.xywy.com/il_sii/food/%s.htm' % page # diet
                drug_url = 'http://jib.xywy.com/il_sii/drug/%s.htm' % page # drugs
                data={}
                data['url'] = basic_url
                data['basic_info'] = self.basicinfo_spider(basic_url)
                data['cause_info'] = self.common_spider(cause_url)
                data['prevent_info'] = self.common_spider(prevent_url)
                data['symptom_info'] = self.symptom_spider(symptom_url)
                data['inspect_info'] = self.inspect_spider(inspect_url)
                data['treat_info'] = self.treat_spider(treat_url)
                data['food_info'] = self.food_spider(food_url)
                data['drug_info'] = self.drug_spider(drug_url)
                print(page, basic_url)
                # insert_one replaces Collection.insert, which was removed in PyMongo 4
                self.col.insert_one(data)
            except Exception as e:
                print(e,page)

    # Parse the basic information
    def basicinfo_spider(self,url):
        html=self.get_html(url)
        selector=etree.HTML(html)
        # Page title; I changed one line of Liu's code here
        title = selector.xpath('//title/text()')[0].split(',')[0]
        # Category path of the disease, e.g. for the common cold: ['疾病百科', '内科', '呼吸内科']
        category=selector.xpath('//div[@class="wrap mt10 nav-bar"]/a/text()')
        # Disease description
        desc=selector.xpath('//div[@class="jib-articl-con jib-lh-articl"]/p/text()')
        # "Tips" box on the page (温馨提示)
        # Keep the lxml.etree._Element objects (no text() here) so that string(.) can be
        # applied below and the stray characters stripped
        ps=selector.xpath('//div[@class="mt20 articl-know"]/p')
        infobox=[]
        for p in ps:
            info=p.xpath('string(.)').replace('\r','').replace('\n','').replace('\xa0','').\
                replace('   ', '').replace('\t','')
            infobox.append(info)
        # Collect the fields in a dict
        basic_data = {}
        basic_data['category'] = category
        basic_data['name'] = title
        basic_data['desc'] = desc
        basic_data['attributes'] = infobox
        return basic_data

    # Parse the treatment information (treat_infobox)
    def treat_spider(self,url):
        html=self.get_html(url)
        selector=etree.HTML(html)
        ps=selector.xpath('//div[starts-with(@class,"mt20 articl-know")]/p')
        infobox=[]
        for p in ps:
            info = p.xpath('string(.)').replace('\r','').replace('\n','').replace('\xa0', '').replace('   ', '').replace('\t','')
            infobox.append(info)
        return infobox

    # Parse the drug treatment information (drug_infobox)
    def drug_spider(self,url):
        html=self.get_html(url)
        selector=etree.HTML(html)
        ps=selector.xpath('//div[starts-with(@class,"fl drug-pic-rec mr30")]/p/a')
        infobox=[]
        for p in ps:
            info=p.xpath('string(.)').replace('\r','').replace('\n','').replace('\xa0', '').replace('   ', '')\
                .replace('\t','').replace(' ','')
            infobox.append(info)
        return infobox

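    # Parse the diet information (food_infobox)
    def food_spider(self,url):
        html=self.get_html(url)
        selector=etree.HTML(html)
        divs=selector.xpath('//div[@class="diet-img clearfix mt20"]')
        # The page has three food groups (good to eat / avoid / recommended), picked out by index
        try:
            food_data={}
            food_data['good']=divs[0].xpath('./div/p/text()')
            food_data['bad']=divs[1].xpath('./div/p/text()')
            # The key spelling 'recommond' is kept because build_data.py reads it under this name
            food_data['recommond']=divs[2].xpath('./div/p/text()')
        except IndexError:
            # Fewer than three groups on the page: store nothing
            return {}
        return food_data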

    # Parse the symptom information
    def symptom_spider(self,url):
        html = self.get_html(url)
        selector = etree.HTML(html)
        symptoms = selector.xpath('//a[@class="gre"]/text()')
        ps = selector.xpath('//p')
        detail = []
        for p in ps:
            info = p.xpath('string(.)').replace('\r', '').replace('\n', '').replace('\xa0', '').replace('   ', '') \
                .replace('\t', '').replace(' ', '')
            detail.append(info)
        # Returned as a (symptoms, detail) pair; build_data.py reads the
        # symptom list as symptom_info[0]
        return symptoms,detail

    # Parse the check (examination) links
    def inspect_spider(self,url):
        html=self.get_html(url)
        selector=etree.HTML(html)
        inspects=selector.xpath('//li[@class="check-item"]/a/@href')
        return inspects

    # Generic parser
    def common_spider(self,url):
        # Used for both the prevention page and the causes page
        html=self.get_html(url)
        selector=etree.HTML(html)
        ps=selector.xpath('//p')
        infobox=[]
        for p in ps:
            info=p.xpath('string(.)').replace('\r', '').replace('\n', '').replace('\xa0', '').replace('   ', '') \
                .replace('\t', '').replace(' ', '')
            if info:
                infobox.append(info)
        return '\n'.join(infobox)

    # Crawl the check (examination) pages
    def inspect_crawl(self):
        data={}
        for page in range(1,3685):
            try:
                url='http://jck.xywy.com/jc_%s.html'%page
                html = self.get_html(url)
                # Build a fresh dict per page so each insert gets its own _id
                data={'url': url, 'html': html}
                self.db['jc'].insert_one(data)
                print(url)
            except Exception as e:
                print(e)
        # Only the last record is returned
        return data

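if __name__=='__main__':
    handler=MedicalSpider()
    # Fetch the check (examination) pages
    handler.inspect_crawl()
    # Crawl and store the full medical data
    handler.spider_main()
    # All data is now stored in the database
    print('Spider done.')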

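Word segmentation module

Liu defines a max_cut.py file for word segmentation. It implements forward maximum matching, backward maximum matching and bidirectional maximum matching.
However, the dict_path configured in __init__ points to a file that does not exist in the directory yet. I believe it only becomes available after the data has been built in MongoDB and exported.
So for now I just post max_cut.py here; the core ideas are explained in the comments.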

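#!/usr/bin/env python
#coding:utf-8

class CutWords():
    # Initial setup
    def __init__(self):
        dict_path='./disease.txt'
        self.word_dict,self.max_wordlen=self.load_words(dict_path)

    # Load the dictionary: one word per line
    def load_words(self,dict_path):
        words=set() # a set keeps the "in" lookups in the cut functions fast
        max_len=0
        # Assumption: the dictionary file is UTF-8 encoded
        with open(dict_path,encoding='utf-8') as f:
            for line in f:
                wd=line.strip()
                if not wd:
                    continue
                if len(wd)>max_len:
                    max_len=len(wd)
                # Store the word itself, not its length
                words.add(wd)
        return words,max_len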

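    # Forward maximum matching
    def max_forward_cut(self,sent):
        # 1. Scanning left to right, take up to m characters of the sentence as the candidate,
        #    where m is the length of the longest dictionary entry.
        # 2. Look the candidate up in the dictionary. On a match, cut it out as one word;
        #    otherwise shorten the candidate by one character and retry.
        cutlist=[]
        index=0 # current position in the text
        while index<len(sent):
            matched=False
            for i in range(self.max_wordlen,0,-1):
                cand_word=sent[index:index+i]
                if cand_word in self.word_dict:
                    cutlist.append(cand_word)
                    matched=True
                    break

            # No match: cut off a single character
            if not matched:
                cand_word=sent[index]
                cutlist.append(cand_word)
            # Advance by the number of characters actually consumed
            index+=len(cand_word)
        # Return the segmentation
        return cutlist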

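    # Backward maximum matching
    def max_backward_cut(self,sent):
        # 1. Scanning right to left, take up to m characters of the sentence as the candidate,
        #    where m is the length of the longest dictionary entry.
        # 2. Look the candidate up in the dictionary. On a match, cut it out as one word;
        #    otherwise shorten the candidate by one character and retry.
        cutlist=[]
        index=len(sent)
        while index>0:
            matched=False
            for i in range(self.max_wordlen,0,-1):
                # Clamp the start at 0 so the slice never wraps around to the end of the sentence
                cand_word=sent[max(index-i,0):index]
                # On a match, add the dictionary word to the result
                if cand_word in self.word_dict:
                    cutlist.append(cand_word)
                    matched=True
                    break
            # No match: cut off a single character
            if not matched:
                cand_word=sent[index-1]
                cutlist.append(cand_word)

            # Move left by the number of characters actually consumed
            index-=len(cand_word)
        # The words were collected right to left, so reverse them
        return cutlist[::-1]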

    # Bidirectional maximum matching
    def max_biward_cut(self,sent):
        '''
        Bidirectional maximum matching compares the result of forward maximum matching
        with the result of backward maximum matching and picks one of them.
        Heuristic rules:
        1. If the two results have different word counts, take the one with fewer words.
        2. If the word counts are equal:
        a. identical results mean there is no ambiguity, so either can be returned;
        b. different results: return the one with fewer single-character words.
        '''
        forward_cutlist=self.max_forward_cut(sent)
        backward_cutlist=self.max_backward_cut(sent)
        count_forward=len(forward_cutlist)
        count_backward=len(backward_cutlist)

        def compute_single(word_list):
            num=0
            # Count the words of length 1
            for word in word_list:
                if len(word)==1:
                    num+=1
            return num

        if count_forward==count_backward:
            if compute_single(forward_cutlist)>compute_single(backward_cutlist):
                return backward_cutlist
            else:
                return forward_cutlist
        elif count_backward>count_forward:
            return forward_cutlist
        else:
            return backward_cutlist
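
To see why the two directions can disagree, here is a minimal standalone sketch of forward and backward maximum matching on a toy dictionary. It is not Liu's class: the function names, the WORDS set and the sample sentence are chosen only for illustration.

```python
def forward_cut(sent, words, max_len):
    """Forward maximum matching: longest dictionary word first, left to right."""
    result, index = [], 0
    while index < len(sent):
        for size in range(max_len, 0, -1):
            cand = sent[index:index + size]
            if cand in words:
                break
        else:
            cand = sent[index]  # no dictionary word starts here: take one character
        result.append(cand)
        index += len(cand)
    return result

def backward_cut(sent, words, max_len):
    """Backward maximum matching: longest dictionary word first, right to left."""
    result, index = [], len(sent)
    while index > 0:
        for size in range(max_len, 0, -1):
            cand = sent[max(index - size, 0):index]
            if cand in words:
                break
        else:
            cand = sent[index - 1]  # no dictionary word ends here: take one character
        result.append(cand)
        index -= len(cand)
    return result[::-1]

# Toy dictionary and sentence, for illustration only
WORDS = {'研究', '研究生', '生命', '命', '的', '起源'}
SENT = '研究生命的起源'
MAX_LEN = max(len(w) for w in WORDS)

forward = forward_cut(SENT, WORDS, MAX_LEN)
backward = backward_cut(SENT, WORDS, MAX_LEN)
print(forward)   # ['研究生', '命', '的', '起源']
print(backward)  # ['研究', '生命', '的', '起源']
```

Both results contain four words, but the forward result has two single-character words and the backward result has one, so the bidirectional rule described above would return the backward segmentation.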

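Data building module

OK, now let's write build_data.py.
First, the __init__ method sets up the database connection, locates the directory of the current file, and selects the database and the collection under it.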

def __init__(self):
    # Connect to the database
    self.conn=pymongo.MongoClient()
    # cur_dir: directory containing this file
    # Liu's version split the path on '/' (Linux); os.path.dirname works on Windows too
    cur_dir=os.path.dirname(os.path.abspath(__file__))
    self.db=self.conn['medical']
    self.col=self.db['data']
    first_words=[i.strip() for i in open(os.path.join(cur_dir,'first_name.txt'))]
    alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
                 'u', 'v', 'w', 'x', 'y', 'z']
    nums = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
    self.stop_words = first_words + alphabets + nums
    # Maps the Chinese attribute names to the English field names used in the database
    self.key_dict={
        '医保比例': 'yibao_status',
        '患病比例': 'get_prob',
        '易感人群': 'easy_get',
        '传染方式': 'get_way',
        '就诊科室': 'cure_department',
        '治疗方式': 'cure_way',
        '治疗周期': 'cure_lasttime',
        '治愈率': 'cured_prob',
        '药品明细': 'drug_detail',
        '药品推荐': 'recommand_drug',
        '推荐': 'recommand_eat',
        '忌食': 'not_eat',
        '宜食': 'do_eat',
        '症状': 'symptom',
        '检查': 'check',
        '成因': 'cause',
        '预防措施': 'prevent',
        '所属类别': 'category',
        '简介': 'desc',
        '名称': 'name',
        '常用药品': 'common_drug',
        '治疗费用': 'cost_money',
        '并发症': 'acompany'
    }
    self.cuter=CutWords()

The collect_medical function then collects and cleans the data,
and writes the cleaned records back into the database.

# Collect the medical records
def collect_medical(self):
    # cates: categories seen so far
    cates=[]
    inspects=[]
    count=0
    for item in self.col.find():
        data={}
        basic_info=item['basic_info']
        name=basic_info['name']
        if not name:
            continue
        # The keys read below match the names used by spider_main in data_spider.py
        # Basic information
        data['名称']=name
        data['简介']='\n'.join(basic_info['desc']).replace('\r\n\t','')\
            .replace('\r\n\n\n','').replace(' ','').replace('\r\n','\n')
        category=basic_info['category']
        data['所属类别']=category
        cates+=category
        # Check (examination) links
        inspect=item['inspect_info']
        inspects+=inspect
        # "Tips" box
        attributes=basic_info['attributes']
        # Causes and prevention
        data['预防措施']=item['prevent_info']
        data['成因']=item['cause_info']
        # Symptoms: symptom_info[0] is the symptom list; drop entries that start with a stop word
        data['症状']=list(set([i for i in item["symptom_info"][0] if i and i[0] not in self.stop_words]))
        # NOTE: I am still not familiar with this part and need to check it against the table
        # Each attribute is split on the full-width colon into a key and a value
        for attr in attributes:
            attr_pair=attr.split('：')
            if len(attr_pair)==2:
                key=attr_pair[0]
                value=attr_pair[1]
                data[key]=value

        # Checks: look up the name of each check by its URL
        inspects=item['inspect_info']
        jcs=[]
        for inspect in inspects:
            jc_name=self.get_inspect(inspect)
            if jc_name:
                jcs.append(jc_name)
        data['检查']=jcs

        # Food
        food_info=item['food_info']
        # The data type needs checking here
        if food_info:
            data['宜食']=food_info['good']
            data['忌食']=food_info['bad']
            data['推荐']=food_info['recommond']

        # Drugs
        drug_info=item['drug_info']
        # The data type needs checking here
        data['药品推荐']=list(set([i.split('(')[-1].replace(')','') for i in drug_info]))
        data['药品明细']=drug_info
        data_modify={}
        for attr,value in data.items():
            # Look up the English field name
            attr_en=self.key_dict.get(attr)
            if attr_en:
                data_modify[attr_en]=value
            if attr_en in ['yibao_status', 'get_prob', 'easy_get', 'get_way', "cure_lasttime", "cured_prob"]:
                data_modify[attr_en]=value.replace(' ','').replace('\t','')
            elif attr_en in ['cure_department','cure_way','common_drug']:
                data_modify[attr_en]=[i for i in value.split(' ') if i]
            elif attr_en in ['acompany']:
                acompany = [i for i in self.cuter.max_biward_cut(data_modify[attr_en]) if len(i) > 1]
                data_modify[attr_en]=acompany
        try:
            # insert_one replaces Collection.insert, which was removed in PyMongo 4
            self.db['medical'].insert_one(data_modify)
            count+=1
            print(count)
        except Exception as e:
            print(e)
    return
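
The attribute handling inside collect_medical can be hard to follow in context. The sketch below isolates it under simplifying assumptions: parse_attributes is a hypothetical helper, KEY_DICT is a small excerpt of key_dict, and the sample strings are invented for illustration rather than taken from real crawled data.

```python
# Excerpt of the Chinese-to-English field mapping
KEY_DICT = {
    '就诊科室': 'cure_department',
    '治愈率': 'cured_prob',
}

def parse_attributes(attributes, key_dict):
    """Split 'key：value' strings on the full-width colon and rename known keys."""
    parsed = {}
    for attr in attributes:
        pair = attr.split('：')
        if len(pair) != 2:
            continue  # skip lines without exactly one full-width colon
        key, value = pair
        attr_en = key_dict.get(key)
        if attr_en:
            parsed[attr_en] = value
    return parsed

# Invented sample lines in the shape of the "tips" box
sample = ['就诊科室：内科 呼吸内科', '治愈率：95%', '提示：多喝水：注意休息', '没有冒号的一行']
parsed = parse_attributes(sample, KEY_DICT)

# List-valued fields such as cure_department are then split on spaces
departments = [d for d in parsed['cure_department'].split(' ') if d]
print(parsed)
print(departments)
```

Lines with zero or several full-width colons are dropped, which is the same behavior as the len(attr_pair)==2 check above.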

The complete build_data.py file:

#!/usr/bin/env python
#coding:utf-8

import pymongo
from lxml import etree
import os
# max_cut.py sits next to this file; a relative import (from .max_cut) fails
# when this file is run directly as a script
from max_cut import CutWords

class MedicalGraph:
    def __init__(self):
        # Connect to the database
        self.conn=pymongo.MongoClient()
        # cur_dir: directory containing this file
        # Liu's version split the path on '/' (Linux); os.path.dirname works on Windows too
        cur_dir=os.path.dirname(os.path.abspath(__file__))
        self.db=self.conn['medical']
        self.col=self.db['data']
        first_words=[i.strip() for i in open(os.path.join(cur_dir,'first_name.txt'))]
        alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
                     'u', 'v', 'w', 'x', 'y', 'z']
        nums = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
        self.stop_words = first_words + alphabets + nums
        # Maps the Chinese attribute names to the English field names used in the database
        self.key_dict={
            '医保比例': 'yibao_status',
            '患病比例': 'get_prob',
            '易感人群': 'easy_get',
            '传染方式': 'get_way',
            '就诊科室': 'cure_department',
            '治疗方式': 'cure_way',
            '治疗周期': 'cure_lasttime',
            '治愈率': 'cured_prob',
            '药品明细': 'drug_detail',
            '药品推荐': 'recommand_drug',
            '推荐': 'recommand_eat',
            '忌食': 'not_eat',
            '宜食': 'do_eat',
            '症状': 'symptom',
            '检查': 'check',
            '成因': 'cause',
            '预防措施': 'prevent',
            '所属类别': 'category',
            '简介': 'desc',
            '名称': 'name',
            '常用药品': 'common_drug',
            '治疗费用': 'cost_money',
            '并发症': 'acompany'
        }
        self.cuter=CutWords()

    # Collect the medical records
    def collect_medical(self):
        # cates: categories seen so far
        cates=[]
        inspects=[]
        count=0
        for item in self.col.find():
            data={}
            basic_info=item['basic_info']
            name=basic_info['name']
            if not name:
                continue
            # The keys read below match the names used by spider_main in data_spider.py
            # Basic information
            data['名称']=name
            data['简介']='\n'.join(basic_info['desc']).replace('\r\n\t','')\
                .replace('\r\n\n\n','').replace(' ','').replace('\r\n','\n')
            category=basic_info['category']
            data['所属类别']=category
            cates+=category
            # Check (examination) links
            inspect=item['inspect_info']
            inspects+=inspect
            # "Tips" box
            attributes=basic_info['attributes']
            # Causes and prevention
            data['预防措施']=item['prevent_info']
            data['成因']=item['cause_info']
            # Symptoms: symptom_info[0] is the symptom list; drop entries that start with a stop word
            data['症状']=list(set([i for i in item["symptom_info"][0] if i and i[0] not in self.stop_words]))
            # NOTE: I am still not familiar with this part and need to check it against the table
            # Each attribute is split on the full-width colon into a key and a value
            for attr in attributes:
                attr_pair=attr.split('：')
                if len(attr_pair)==2:
                    key=attr_pair[0]
                    value=attr_pair[1]
                    data[key]=value

            # Checks: look up the name of each check by its URL
            inspects=item['inspect_info']
            jcs=[]
            for inspect in inspects:
                jc_name=self.get_inspect(inspect)
                if jc_name:
                    jcs.append(jc_name)
            data['检查']=jcs

            # Food
            food_info=item['food_info']
            # The data type needs checking here
            if food_info:
                data['宜食']=food_info['good']
                data['忌食']=food_info['bad']
                data['推荐']=food_info['recommond']

            # Drugs
            drug_info=item['drug_info']
            # The data type needs checking here
            data['药品推荐']=list(set([i.split('(')[-1].replace(')','') for i in drug_info]))
            data['药品明细']=drug_info
            data_modify={}
            # TODO: read this odd-looking block closely and comment it properly
            for attr,value in data.items():
                # Look up the English field name
                attr_en=self.key_dict.get(attr)
                if attr_en:
                    data_modify[attr_en]=value
                if attr_en in ['yibao_status', 'get_prob', 'easy_get', 'get_way', "cure_lasttime", "cured_prob"]:
                    data_modify[attr_en]=value.replace(' ','').replace('\t','')
                elif attr_en in ['cure_department','cure_way','common_drug']:
                    data_modify[attr_en]=[i for i in value.split(' ') if i]
                elif attr_en in ['acompany']:
                    acompany = [i for i in self.cuter.max_biward_cut(data_modify[attr_en]) if len(i) > 1]
                    data_modify[attr_en]=acompany
            try:
                # insert_one replaces Collection.insert, which was removed in PyMongo 4
                self.db['medical'].insert_one(data_modify)
                count+=1
                print(count)
            except Exception as e:
                print(e)
        return

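    # Look up the name of a check by the URL of its page
    def get_inspect(self,url):
        res=self.db['jc'].find_one({'url':url})
        if not res:
            return ''
        else:
            # 'name' only exists after modify_jc has been run on the jc collection
            return res.get('name','')

    # Extract name and description from the stored check pages and save them
    def modify_jc(self):
        for item in self.db['jc'].find():
            url=item['url']
            content=item['html']
            selector=etree.HTML(content)
            name = selector.xpath('//title/text()')[0].split('结果分析')[0]
            desc = selector.xpath('//meta[@name="description"]/@content')[0].replace('\r\n\t', '')
            # update_one replaces Collection.update, which was removed in PyMongo 4
            self.db['jc'].update_one({'url': url}, {'$set': {'name': name, 'desc': desc}})

if __name__=='__main__':
    handler=MedicalGraph()
    handler.collect_medical()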

MongoDB notes

Having lost data before, I think it is worth building a good habit: back up after every crawl run. The backup command is:

mongodump -h localhost:27017 -d medical -o F:\mongoDB\backup

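Here medical is the database name, and the argument after -o is the output folder.
With a backup in place, lost data can be restored promptly. The restore command is: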

mongorestore -h localhost:27017 -d medical F:\mongoDB\backup\medical

Liu reads a medical.json file when building the knowledge graph later, so here I run a MongoDB command to export the data as JSON.

mongoexport --host localhost --port 27017 --collection data --db medical --out F:\MongoDB\export\medical.json

Options:
--collection: the collection to export
--db: the database name
--out: the path of the output file
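
By default mongoexport writes one JSON document per line rather than a single JSON array, so the exported file is read line by line. The following is a minimal sketch of doing that; load_export is a hypothetical helper, and the two sample records (including their _id values) are invented so that the example runs without a database.

```python
import json
import os
import tempfile

def load_export(path):
    """Read a mongoexport file that holds one JSON document per line."""
    records = []
    with open(path, encoding='utf-8') as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Invented sample lines in the shape mongoexport produces
sample_lines = [
    '{"_id": {"$oid": "5bb578b6831b973a137e3ee6"}, "name": "感冒"}',
    '{"_id": {"$oid": "5bb578b6831b973a137e3ee7"}, "name": "高血压"}',
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'medical.json')
    with open(path, 'w', encoding='utf-8') as fh:
        fh.write('\n'.join(sample_lines) + '\n')
    records = load_export(path)

names = [r['name'] for r in records]
print(names)
```

Calling json.load on the whole file would fail on such an export, because the file as a whole is not a single valid JSON value.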

The next post covers building the knowledge graph.

Klaus's Blog
