What is Target-Guided Open-Domain Conversation?
Many practical open-domain dialogue applications have specific goals to achieve, e.g., psychotherapy, education, and recommendation.
To bridge the gap between open-domain chit-chat and task-oriented dialogue, we propose a new task called **target-guided open-domain conversation**. Given a target and a starting utterance, an agent is asked to chat with a user, starting from an arbitrary topic, and proactively guide the conversation toward the target.
Readme
To make our corpus suitable for turn-level keyword transition, we augment the data by automatically extracting the keywords of each utterance. Specifically, we apply a rule-based keyword extractor that selects candidate keywords based on word frequency, word length, and part-of-speech features. A keyword must meet all of the following conditions (a minimal code sketch follows the list):
- its frequency in the corpus is more than 2,000;
- its length is more than 2;
- its part-of-speech is noun, verb, or adjective.
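For concreteness, here is a minimal sketch of such a filter. It assumes NLTK for tokenization and POS tagging and a precomputed corpus-wide frequency counter; neither the toolkit nor the exact thresholds are specified in this README, and the Chinese data would need a Chinese tokenizer/tagger instead.

```python
# Minimal sketch of the rule-based keyword extractor described above.
# Assumptions (not specified in this README): NLTK is used for
# tokenization/POS tagging, and corpus_freq is a Counter precomputed
# over the whole corpus. Requires the NLTK 'punkt' and perceptron
# tagger resources to be downloaded.
from collections import Counter
import nltk

def extract_keywords(utterance: str, corpus_freq: Counter,
                     min_freq: int = 2000, min_len: int = 2) -> list:
    """Return candidate keywords of a single utterance."""
    tagged = nltk.pos_tag(nltk.word_tokenize(utterance.lower()))
    keywords = []
    for word, tag in tagged:
        is_content_word = tag.startswith(("NN", "VB", "JJ"))  # noun/verb/adjective
        if (is_content_word
                and corpus_freq[word] > min_freq   # frequency > 2000 in the corpus
                and len(word) > min_len):          # word length > 2
            keywords.append(word)
    return keywords
```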
*Note: our datasets are for academic use only.*
Download
A Chinese TGODC dataset from the Weibo corpus
The dataset is derived from a public multi-turn conversation corpus crawled from Sina Weibo, one of the most popular social platforms in China. It covers rich real-world topics from daily life, such as shopping, disease, and news.
We randomly split the dataset into three parts: a train set (90%), a validation set (5%), and a test set (5%). Please download the processed data and follow the instructions in the GitHub repository to set up the conversation data.
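As an illustration of the 90/5/5 split, here is a minimal sketch; it assumes the corpus is held as a list of conversations, and the actual preprocessing script may differ.

```python
import random

def split_corpus(conversations, seed=0):
    """Randomly split conversations into 90% train / 5% valid / 5% test.
    Illustrative only; the released splits were produced by the authors."""
    convs = list(conversations)
    random.Random(seed).shuffle(convs)           # deterministic shuffle
    n = len(convs)
    n_train, n_valid = int(0.9 * n), int(0.05 * n)
    return (convs[:n_train],                     # train
            convs[n_train:n_train + n_valid],    # valid
            convs[n_train + n_valid:])           # test
```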
Here are some details:
- Data statistics: the vocabulary size is around 67K. The table below lists more detailed statistics.
|                | Train     | Val    | Test   |
|----------------|-----------|--------|--------|
| #Conversations | 824,742   | 45,824 | 45,763 |
| #Utterances    | 1,104,438 | 60,893 | 60,893 |
| #Keyword types | 1,760     | 1,760  | 1,760  |
| #Avg. keywords | 2.6       | 2.6    | 2.6    |
The last row is the average number of keywords in each utterance.
An English TGODC dataset from the persona-chat corpus
The dataset is derived from the PersonaChat corpus (Zhang et al., 2018), where crowdworkers were asked to chat naturally given assigned personas. The conversations cover a broad range of topics such as work, family, and personal interests, and the discussion topics change frequently over the course of a conversation.
We re-split the data into train/valid/test sets, where the test set contains 500 conversations with relatively frequent keywords. Please download the processed data and follow the instructions in the GitHub repository to set up the conversation data.
Here are some details:
- Data statistics: the vocabulary size is around 19K. The table below lists more detailed statistics.
|                | Train   | Val   | Test  |
|----------------|---------|-------|-------|
| #Conversations | 8,939   | 500   | 500   |
| #Utterances    | 101,935 | 5,602 | 5,317 |
| #Keyword types | 2,678   | 2,080 | 1,571 |
| #Avg. keywords | 2.1     | 2.1   | 1.9   |
The last row is the average number of keywords in each utterance.
Structure
Both datasets share the following structure:
Data/
├── corpus.txt
├── start_corpus.txt
├── vocab.txt
├── embedding.txt
├── sample_start_corpus.txt
├── target_keywords_for_simulation.txt
├── train/
| ├── context.txt
| ├── keywords.txt
| ├── keywords_vocab.txt
| ├── label.txt
| ├── source.txt
| └── target.txt
├── test/
| ├── context.txt
| ├── keywords.txt
| ├── keywords_vocab.txt
| ├── label.txt
| ├── source.txt
| └── target.txt
└── valid/
├── context.txt
├── keywords.txt
├── keywords_vocab.txt
├── label.txt
├── source.txt
└── target.txt
The following txt files in the top-level Data/ directory are the corpus sources used as input:
- corpus.txt: all the utterances in the dataset
- start_corpus.txt: all the starting utterances in the dataset
- vocab.txt: vocabulary of the dataset
- embedding.txt: pretrained GloVe embeddings of the vocabulary
- sample_start_corpus.txt: 5 randomly sampled starting utterances for self-play simulation
- target_keywords_for_simulation.txt: 500 randomly sampled target keywords for self-play simulation
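As a quick illustration of how vocab.txt and embedding.txt might be consumed, here is a sketch that assumes the standard GloVe text format (one word per line, followed by its vector components); check the repository for the exact on-disk format.

```python
import numpy as np

def load_glove_embeddings(path):
    """Load embeddings from a GloVe-style text file: 'word v1 v2 ... vd' per line.
    NOTE: the exact format of embedding.txt is an assumption here."""
    vocab, vectors = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            vocab.append(parts[0])
            vectors.append(np.asarray(parts[1:], dtype=np.float32))
    word2id = {w: i for i, w in enumerate(vocab)}
    return word2id, np.stack(vectors)
```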
The following txt files exist in all three folders (./train/, ./test/, and ./valid/):
- context.txt: keywords extracted from the context (the last 2 utterances of the history)
- keywords.txt: keywords extracted from the next utterance
- keywords_vocab.txt: vocabulary of keywords in the corresponding set
- label.txt: label of the correct response among the candidates
- source.txt: the dialogue history
- target.txt: 20 candidate responses, including the correct one and 19 negative samples
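To make the layout concrete, here is a minimal sketch of reading one split. It assumes the files are line-aligned (line i of each file describes example i) and that the candidates in target.txt are tab-separated; both are assumptions that should be verified against the released data.

```python
def read_split(split_dir):
    """Yield (history, candidates, label) triples from one split directory.
    ASSUMPTIONS: line-aligned files and tab-separated candidate responses;
    verify both against the released data."""
    with open(f"{split_dir}/source.txt", encoding="utf-8") as src, \
         open(f"{split_dir}/target.txt", encoding="utf-8") as tgt, \
         open(f"{split_dir}/label.txt", encoding="utf-8") as lab:
        for history, cands, label in zip(src, tgt, lab):
            candidates = cands.rstrip("\n").split("\t")  # 20 candidates per line
            yield history.rstrip("\n"), candidates, label.strip()
```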