[NEW] CovidDialog-CH
Readme
We are collecting the COVID-19 Language-based AI research papers and datasets. If you have paper to recommend or any suggestions, please feel free to contact us.
We released a Chinese medical dialogue dataset (CovidDialog-CH)
about
COVID-19 and other
types of pneumonia. Patients who are concerned that they may be infected by COVID-19 or other
pneumonia consult doctors and doctors provide advice. There are 12789 consultations and each
consultation has at least 8 senteces.
*Note: Our datasets are only for academic usage.
Download
The dataset is built from chunyu and all copyrights of the data
belong to chunyu . It is available on the
github page
now, welcome to download and use it!
Example
The whole file is a python list, and each item in the list is a list of dictionary,as shown below:
[{'id': 'dis', 'text': '医生您好!发烧两天,最高38.7,
头痛,全身酸疼乏力,不咳嗽,不流涕,呼吸无异常。现已退烧,头痛基本消失,
但是咽喉及中央气管略有灼烧疼,胸口两侧对称跳着疼,头顶头皮一处跳着疼,
左腹一处偶尔也跳着疼,无力,不爱吃饭。无湖北接触史,偶尔外出去超市采购,
都戴口罩。前段时间压力大,连续一周熬夜做文字工作。请问这是什么情况啊?
是不是新冠肺炎?谢谢医生!(男,35岁)'}
{'id': 'doc', 'text': '您好!这种情况有多久了?'}
{'id': 'doc', 'text': '有没有鼻塞?'}
{'id': 'doc', 'text': '咳嗽?气喘?'}
{'id': 'dis', 'text': '三天了,前两天发烧头疼,今天不烧了'}
{'id': 'dis', 'text': '只在第一天晚上鼻塞过,不咳嗽,不气喘'}
{'id': 'doc', 'text': '咳嗽?'}
{'id': 'doc', 'text': '咳嗽?'}
{'id': 'doc', 'text': '胸痛?'}
{'id': 'doc', 'text': '应该是病毒性感冒'}
{'id': 'doc', 'text': '不像肺炎'}
{'id': 'doc', 'text': '建议多喝水,休息,按时吃感冒药
如日夜百服宁片或泰诺感冒片,另外可吃些含片,清火的药如清开灵或熊胆胶囊等。'}
{'id': 'dis', 'text': '不咳嗽,胸口两侧对称地方跳着疼,头皮,左腹各有跳着疼'}
{'id': 'dis', 'text': '对了,退烧后,左嘴角上下起了一些疱疹'}
{'id': 'doc', 'text': '还是考虑病毒性感冒'}
{'id': 'dis', 'text': '病毒性感冒是不是传染?家里有孩子,请问怎样防护?'}
{'id': 'doc', 'text': '感冒会传染,'}
{'id': 'doc', 'text': '可以适当隔离,戴口罩'}
{'id': 'dis', 'text': '好的,谢谢医生!'}
{'id': 'doc', 'text': '不客气'}]
The code for read "COVID_Dialogue_Dataset.pk":
with open('COVID_Dialogue_Dataset1.pk','rb') as f:
data = pickle.load(f)
data[0]
DX Medical Dialogue Dataset
Readme
The DX medical dialogue dataset is built for medical dialogue system,
reserving the original self-reports and interaction utterances
between doctors and patients.
We annotate five types of diseases, including
allergic rhinitis, upper respiratory infection, pneumonia, children
hand-foot-mouth disease,
and pediatric diarrhea. We
extract the symptoms that appear in self-reports and conversation and normalize them into 41
symptoms. Four annotators with medical background are invited to label the symptoms in both
self-reports and raw conversations. Symptoms
appearing in self-reports are regarded as explicit symptoms
while the others are implicit symptoms. The diseases of each
medical diagnosis conversation are labeled automatically.
*Note: Our datasets are only for academic usage.
Download
We collected data from
a Chinese online health-care community
where users asking doctors for medical diagnosis or professional medical advice.
Now the DX dataset is available in
Google Could
, welcome to download and use it!
- Data Statistics: There are 527 conversational data in total. 423
conversational data are selected as the training set 104 for
testing. More detailed dataset statistics are shown in table:
Disease Quantity Symptoms Allergic rhinitis 102 24 Upper respiratory tract infection 122 23 Pneumonia 100 29 Children hand-foot-mouth disease 101 22 Pediatric diarrhea 102 33 - Data Structure: The DX dataset consists of the following three files:
dxy_dialog_data_dialog_v2.pickle
: dialog data with user goals
all_norm_symptoms.txt
: all normalized symptoms
norm_symptom_list.txt
: the match of symptoms and their normalized one