What is Target-Guided Open-Domain Conversation?

    Many practical open-domain dialogue applications have specific goals to achieve, such as psychotherapy, education, and recommendation.
    To bridge the gap between open-domain chit-chat and task-oriented dialogue, we propose a new task called **target-guided open-domain conversation**. Given a target and a starting utterance, an agent is asked to chat with a user starting from an arbitrary topic and proactively guide the conversation to the target.


Readme

    To make our corpus suitable for turn-level keyword transitions, we augment the data by automatically extracting keywords from each utterance.
    Specifically, we apply a rule-based keyword extractor that selects candidate keywords based on word frequency, word length, and Part-Of-Speech features. A keyword must meet all of the following conditions (see the sketch after this list):

  • Its word frequency in the corpus is greater than 2,000;
  • Its word length is greater than 2;
  • Its Part-Of-Speech tag is noun, verb, or adjective.
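
    Below is a minimal sketch of such an extractor for the English corpus, assuming NLTK's tokenizer and Penn Treebank POS tagger; the function and variable names are illustrative and not from the released code:

```python
from collections import Counter

import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data

# Penn Treebank tag prefixes for nouns, verbs, and adjectives.
CANDIDATE_POS_PREFIXES = ("NN", "VB", "JJ")

def extract_keywords(utterances, min_freq=2000, min_len=2):
    """Return a keyword list per utterance under the three rules above."""
    tokenized = [nltk.word_tokenize(u.lower()) for u in utterances]
    freq = Counter(tok for toks in tokenized for tok in toks)
    keywords = []
    for toks in tokenized:
        keywords.append([
            tok for tok, tag in nltk.pos_tag(toks)
            if freq[tok] > min_freq                      # frequency > 2,000 in the corpus
            and len(tok) > min_len                       # word length > 2
            and tag.startswith(CANDIDATE_POS_PREFIXES)   # noun, verb, or adjective
        ])
    return keywords
```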

*Note: Our datasets are for academic use only.*



Download

A Chinese TGODC dataset from the Weibo corpus


    The dataset is derived from a public multi-turn conversation corpus crawled from Sina Weibo, one of the most popular social platforms in China. The dataset covers rich real-world topics from daily life, such as shopping, disease, and news.
    We split the dataset randomly into three parts: a train set (90%), a validation set (5%), and a test set (5%). Please download the processed data and follow the instructions on GitHub to set up the conversation data.
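
    A minimal sketch of such a random 90/5/5 split, assuming the conversations are held in a Python list; the seed and function name are illustrative:

```python
import random

def split_dataset(conversations, seed=0):
    """Randomly split conversations into train (90%), validation (5%), test (5%)."""
    convs = list(conversations)
    random.Random(seed).shuffle(convs)
    n_train = int(0.90 * len(convs))
    n_val = int(0.05 * len(convs))
    train = convs[:n_train]
    val = convs[n_train:n_train + n_val]
    test = convs[n_train + n_val:]  # remaining ~5%
    return train, val, test
```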
    Here are some details:

  • Data Statistics: The vocabulary size is around 67K. The table below lists detailed data statistics.

                    Train      Val     Test
    #Conversations  824,742    45,824  45,763
    #Utterances     1,104,438  60,893  60,893
    #Keyword types  1,760      1,760   1,760
    #Avg. keywords  2.6        2.6     2.6

    The last row gives the average number of keywords per utterance.

An English TGODC dataset from the persona-chat corpus


    The dataset is derived from the PersonaChat corpus (Zhang et al., 2018), where crowdworkers were asked to chat naturally with given personas. The conversations cover a broad range of topics such as work, family, and personal interests, and the discussion topics change frequently over the course of a conversation.
    We re-split the data into train/valid/test sets, where the test set contains 500 conversations with relatively frequent keywords. Please download the processed data and follow the instructions on GitHub to set up the conversation data.
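
    One way to realize such a re-split is sketched below; the exact selection criterion is not specified here, so ranking conversations by the mean corpus frequency of their keywords is purely an assumption (the 500-conversation valid/test sizes follow the table below):

```python
from collections import Counter

def resplit_by_keyword_frequency(conversations, conv_keywords,
                                 test_size=500, valid_size=500):
    """Hold out the conversations whose keywords are most frequent overall."""
    freq = Counter(kw for kws in conv_keywords for kw in kws)

    def mean_freq(i):
        kws = conv_keywords[i]
        return sum(freq[k] for k in kws) / max(len(kws), 1)

    order = sorted(range(len(conversations)), key=mean_freq, reverse=True)
    test = [conversations[i] for i in order[:test_size]]
    valid = [conversations[i] for i in order[test_size:test_size + valid_size]]
    train = [conversations[i] for i in order[test_size + valid_size:]]
    return train, valid, test
```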
    Here are some details:

  • Data Statistics: The vocabulary size is around 19K. The table below lists detailed data statistics.

                    Train    Val    Test
    #Conversations  8,939    500    500
    #Utterances     101,935  5,602  5,317
    #Keyword types  2,678    2,080  1,571
    #Avg. keywords  2.1      2.1    1.9

    The last row gives the average number of keywords per utterance.


Structure

Both datasets share the following directory structure:

 Data/
 ├── corpus.txt
 ├── start_corpus.txt
 ├── vocab.txt
 ├── embedding.txt
 ├── sample_start_corpus.txt
 ├── target_keywords_for_simulation.txt
 ├── train/
 │   ├── context.txt
 │   ├── keywords.txt
 │   ├── keywords_vocab.txt
 │   ├── label.txt
 │   ├── source.txt
 │   └── target.txt
 ├── test/
 │   ├── context.txt
 │   ├── keywords.txt
 │   ├── keywords_vocab.txt
 │   ├── label.txt
 │   ├── source.txt
 │   └── target.txt
 └── valid/
     ├── context.txt
     ├── keywords.txt
     ├── keywords_vocab.txt
     ├── label.txt
     ├── source.txt
     └── target.txt


The following txt files at the top level are the corpus sources used as input (a loading sketch follows the list):

corpus.txt: all the utterances in the dataset
start_corpus.txt: all the starting utterances in the dataset
vocab.txt: vocabulary of the dataset
embedding.txt: pretrained GloVe embeddings of the vocabulary
sample_start_corpus.txt: 5 randomly sampled starting utterances for self-play simulation
target_keywords_for_simulation.txt: 500 randomly sampled target keywords for self-play simulation
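
A minimal sketch of loading these top-level files, assuming one token per line in vocab.txt and the standard GloVe text format ("token v1 v2 ... vd" per line) in embedding.txt; these layout details are assumptions:

```python
import numpy as np

def load_vocab(path="Data/vocab.txt"):
    """Read one token per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def load_embeddings(path="Data/embedding.txt"):
    """Read 'token v1 v2 ... vd' lines into a token -> vector dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            token, *values = line.rstrip().split(" ")
            vectors[token] = np.asarray(values, dtype=np.float32)
    return vectors
```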

The following txt files exist in each of the three folders ./train/, ./test/, and ./valid/ (see the loading sketch after the list):

context.txt: extracted keywords in the context (the last 2 utterances of the dialogue history)
keywords.txt: extracted keywords in the next utterance
keywords_vocab.txt: vocabulary of keywords in the corresponding set
label.txt: label of the correct response
source.txt: the dialogue history
target.txt: 20 candidate responses, including the correct one and 19 negative samples
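
A minimal sketch of reading one split, assuming the files are line-aligned (line i of each file belongs to the same example); the exact per-line layout of label.txt and target.txt is an assumption:

```python
import os

def load_split(split_dir="Data/train"):
    """Read the line-aligned files of one split into a list of examples."""
    def read_lines(name):
        with open(os.path.join(split_dir, name), encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    sources = read_lines("source.txt")     # dialogue histories
    contexts = read_lines("context.txt")   # keywords of the last 2 utterances
    keywords = read_lines("keywords.txt")  # keywords of the next utterance
    labels = read_lines("label.txt")       # label of the correct response
    targets = read_lines("target.txt")     # 20 candidate responses per example
    assert len(sources) == len(contexts) == len(keywords) == len(labels) == len(targets)
    return list(zip(sources, contexts, keywords, labels, targets))
```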