Question Classification with Deep Contextualized Transformer

Paper Type: Free Essay Subject: Computer Science
Wordcount: 3376 words Published: 18th May 2020


Early work in this field mainly used bag-of-words (BoW) features to classify sentence types. Many recent works apply supervised and deep-learning methods to question classification with promising results (Lee and Dernoncourt, 2016). However, most of these approaches treat the task as plain text classification and handle each sentence in isolation, so they cannot capture the contextual dependence among the words of the sentence. Context matters because a different order of the same words often gives a sentence a different meaning.
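To make the word-order point concrete, here is a minimal sketch of a bag-of-words representation. The helper `bag_of_words` and the two example sentences are illustrative assumptions, not part of the original work.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count lowercase tokens and discard their order (hypothetical helper)."""
    cleaned = sentence.lower().replace("?", " ").replace(".", " ")
    return Counter(cleaned.split())


statement = "You are coming tomorrow."
question = "Are you coming tomorrow?"

# Both sentences contain exactly the same words, so their BoW counts match
# even though one is a statement and the other is a question.
same_bow = bag_of_words(statement) == bag_of_words(question)
print(same_bow)
```

A classifier that sees only these counts cannot separate the two sentences, which is the limitation that contextualized representations are meant to address.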

 This work draws on recent advances in NLP research, such as BERT (Devlin et al., 2018) and ELMo (Peters et al., 2018), to produce a sentence classification model that quickly and correctly picks out the question sentences from the target text. Compared with regular algorithms for QA problems, the self-learning algorithm builds contextualized word representations to capture the meaning of each word in its sentence. Specifically, we use a hierarchical deep neural network with the self-learning algorithm to model different types of question text, including statement questions, a specific type of question found in questionnaires. The research aims to achieve state-of-the-art performance in classifying QA problems.


The latest work on Question and Answer problems uses the Stanford Parse Tree. We build on prior work and develop a new method that handles the Question and Answer problem with the Deep Contextualized Transformer in order to manage aberrant expressions. We conduct extensive evaluations on the SQuAD and SwDA datasets and show significant improvement in QA problem classification for industry needs. We also investigate the impact of different models on the accuracy and efficiency of the answers. The results show that our new method is more effective for QA problems, with higher accuracy.

Keywords: QA Classification, NLP, Self-learning, Self-attention

  1. Introduction

Question and Answer (QA) systems are in widespread demand in industry. Every week, a company may face hundreds or thousands of questionnaires about the products it publishes. QA is a major problem in Natural Language Processing (NLP), with applications such as question answering and sentence recognition. There are several types of sentences, such as Wh-questions, statement questions, and statements. Each type has a corresponding label such as question or

We demonstrate how performance can improve with a combination of models at different levels: the hierarchical deep neural network for classification, self-learning and self-attention models such as BERT for the word embeddings, and the previous labels of the training data from the SQuAD dataset. Finally, we explore different methods to find an effective way to classify the QA problem.

2. Related Work

We focus on two primary methods from recent research. One treats the task as text classification, in which each utterance is classified in isolation, while the other processes the text with contextualized word representation algorithms, such as BERT with self-attention or ELMo.

Text Classification: Lee and Dernoncourt (2016) build a vector representation for each utterance and use either an RNN or a CNN over it to predict the sentence type.

Self-learning: Devlin et al. (2018) used BERT, and Peters et al. (2018) used ELMo, to embed the text into vectors that capture the contextual relationships within the sentence for each utterance. RNN-based or CNN-based hierarchical neural networks are then used to learn and model multiple levels of utterances.

3. Model

The task of QA classification takes a sentence $S$ as input, which is a variable-length sequence of utterances $U = \{u_1, u_2, u_3, \dots, u_N\}$. Each utterance $u_i \in U$ has a length $l_i \in L$ and a corresponding target label $y_i \in Y$, which represents the QA result associated with the corresponding sentence.


 Figure 1. The graph of the model Architecture

 Figure 1 shows the overall architecture of the model, which involves the following main components: (1) a self-learning algorithm that encodes the sentence with self-attention, and (2) a combination-level RNN that handles the output of the encoder and classifies the label of the sentence. We describe the details below.

3.1 Context-aware Self-learning

The self-learning algorithm encodes a variable-length sentence into a fixed-size representation. There are two types of algorithm: one based on self-attention and another based only on deep contextualized word representations.

3.1.1 Deep contextualization word representation

The model uses the biLM to consider the different positions of utterances within the sequence. Inspired by Peters et al. (2018), we

will encode a variable-length sequence using an attention mechanism that considers the different positions, tokens, and segments within the sequence. Inspired by Devlin et al. (2018) and Tran et al. (2017), we pass the output of the Combination-Level RNN (Section 3.2) into a self-attentive encoder (Lin et al. 2017). The encoding yields a 2-D vector for each sentence. We follow Raheja and Tetreault (2019) and Liu et al. (2019) in explaining the modification below.

 The utterance $t_i$ is also mapped into the embedding layer, resulting in an s-dimensional embedding for each word in the sequence, based on the Transformer (Vaswani et al. 2017). The embedding is then fed into the bidirectional GRU layer.

 Following Raheja and Tetreault (2019), the contextual self-attention score can be computed as:


Here $W_{S1}$ is a weight matrix, $W_{S2}$ and $W_{S3}$ are parameter matrices, and $b$ is a bias vector. Equation 2 can be treated as a 2-layer MLP with bias and $d_a$ hidden units.
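For reference, the context-aware self-attention score in Raheja and Tetreault (2019) has the form below. It is reproduced on the assumption that the model here uses it unchanged. The symbols $H_i$ (the bidirectional-GRU hidden states of utterance $i$) and $\vec{g}_{i-1}$ (the previous utterance-level hidden state) come from that paper and are not defined in the text above.

```latex
S_i = W_{S2}\,\tanh\!\left( W_{S1} H_i^{\top} + W_{S3}\,\vec{g}_{i-1} + \vec{b} \right)
```

Read this way, the inner affine map followed by $\tanh$ is the first layer of the MLP and $W_{S2}$ is the second, which matches the 2-layer description.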

3.2 Combination-level RNN

The utterance representations $h_i$ from the previous two models are passed into the combination-level RNN. As shown in Figure 1, we concatenate all of the hidden layers into a final representation $R_i$ for each utterance. We then feed it into a CRF layer to capture the relationship between the labels and the context of the utterances. The labels of the utterances are not decoded independently; the layer should consider all of the relationships between the sentences, then

use PCA and t-SNE to reduce the dimensions from a higher level to a lower level. We then use the Combination-Level RNN (Section 3.2), which provides the previous hidden state of the utterance encoder. It gives us the contextual relationships within the sentences and combines all hidden states of the words in each sentence. After that, the deep contextualized word representation encoder encodes the combination into a 2-D vector for each sentence. We follow Peters et al. (2018) in explaining our modification below.

 An utterance $t_i$, which is the token sequence of the sentence, is mapped into the embedding layer. The deep contextualized representation uses a biLM to combine the forward and backward LMs. The process is formulated as:

 Moreover, we compute the weighting for the model as follows:

 In (1), $s_j^{task}$ are softmax-normalized weights, and the scalar parameter $\gamma^{task}$ allows the task model to scale the entire vector. In the simplest case, the representation selects the top layer and E(Rk) = .
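The weighting described for (1) corresponds to the task-specific layer combination in Peters et al. (2018), $\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_j s_j^{task} h_{k,j}^{LM}$. The sketch below illustrates that combination for a single token in plain Python. The function names, toy vectors, and weight values are assumptions made for illustration only, not the essay's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers, raw_weights, gamma=1.0):
    """Collapse the per-layer vectors of one token into a single vector.

    layers: list of equal-length vectors, one per biLM layer.
    raw_weights: unnormalized scalars, one per layer (softmaxed into s_j).
    gamma: task-specific scale applied to the whole mixed vector.
    """
    if len(layers) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [
        gamma * sum(s[j] * layers[j][d] for j in range(len(layers)))
        for d in range(dim)
    ]


# Toy example: three layers, two-dimensional token vectors.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = scalar_mix(layers, raw_weights=[0.0, 0.0, 0.0])          # equal weights
top_heavy = scalar_mix(layers, raw_weights=[-50.0, -50.0, 50.0])   # almost only the top layer
```

With equal raw weights every layer contributes one third; pushing nearly all the weight onto the last layer approximates the simplest case, in which only the top layer is used.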

3.1.2 Self-Attention

For each word in an utterance, we use a self-attention model to encode it; the most popular self-attention models are based on BERT (Devlin et al. 2018). The model

use the self-attention for the task. The several the Natural Language Toolkit Dataset (NLTK) (Bird and Loper, 2002) as another significant resource for the test case. We use the training, validation, and test splits as defined in Lee and Dernoncourt (2016).

 Table 1. Number of Sentences in the Dataset. |T| represents the number of classes and |N| represents the sentence size

 Table 1 shows the statistics for both datasets. Both contain many class labels for classifying the kind of sentence. There are some special DA classes in both datasets, such as Tag-Question in SwDA and Statement-Question in NLTK. In both datasets, question-type labels make up over 25% of each set.

5. Result

We compare the classification accuracy of our model with several other models (Table 2). Some methods use attention and deep contextualized word representations to model the sentences of questionnaire documents; even some of them




TF-IDF GloVe (2014)
Li and Wu (2016)
Lee and Dernoncourt (2016)
Our Method

Table 2.  QA Classification Accuracy of the different approaches

give out the most related decoder to decode them to the related labels.

3.3 Super-attractive

The model combines the final representations of the hidden layers produced by self-learning and self-attention. This helps us determine the labels of the utterances and output the result. The score we compute for the algorithm is the accuracy of the correct labels in the classification, as Hossin and Sulaiman (2015) suggest. We also apply an additional check for the question and answer problem: sentences whose classification is unsure are passed to a parse tree for a second classification. The parse tree we use is based on Huang et al. (2018); we use its Tensor Product Representation to rebuild the parse tree for our model. In our model, we use a bi-LSTM with attention to rebuild the parse tree and obtain a tree graph with POS tags, which is useful for clarifying the structure of the sentence. We then use this graph to analyze the structure of the utterances and classify the unsure sentences in the document. Finally, we present the combined result to the users so that they can check the question and answer problems.
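The scoring and the second-pass check for unsure sentences can be sketched as follows. This is a minimal illustration under stated assumptions: the confidence `threshold` of 0.6, the label names, and the `fallback` callable (standing in for the parse-tree classifier) are all hypothetical, since the text does not specify how an unsure sentence is detected.

```python
def accuracy(predicted, gold):
    """Fraction of utterances whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("label lists must have equal length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def classify_with_fallback(scores, fallback, threshold=0.6):
    """Pick the top-scoring label, or defer to `fallback` when unsure.

    scores: dict mapping label -> probability from the main model.
    fallback: callable returning a label (stands in for the parse-tree check).
    threshold: hypothetical confidence cut-off; not given in the text.
    """
    label, confidence = max(scores.items(), key=lambda kv: kv[1])
    return label if confidence >= threshold else fallback()


gold = ["question", "statement", "question"]
model_scores = [
    {"question": 0.9, "statement": 0.1},
    {"question": 0.2, "statement": 0.8},
    {"question": 0.45, "statement": 0.55},  # below threshold: goes to the fallback
]
predicted = [
    classify_with_fallback(s, fallback=lambda: "question") for s in model_scores
]
print(accuracy(predicted, gold))
```

Only the third sentence is routed to the fallback here, so the more expensive structural check runs on the uncertain cases rather than on every utterance.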

4. Data

We evaluate the accuracy of the classification model with one standard dataset, the Switchboard Dialogue Act Corpus (SwDA) (Jurafsky et al., 1997) consisting of 43 classes, and made the word extension with the Stanford Question Answering Dataset use


6. Conclusion

We developed a new model that performs QA classification with attention and compared it with commonly used algorithms by testing on the SwDA dataset. We experimented with different utterance representation methods and show that classification performance depends heavily on context details. Applying attention and the combination level to the classification, which has not previously been done for this kind of task, enables the model to learn more from the context and to capture the meaning of the words in an utterance more faithfully than before. This helps improve classification performance for these kinds of tasks.

 As future work, we will try more attention mechanisms, such as block self-attention (Shen et al., 2018b), hierarchical attention (Yang et al., 2016), or hypergraph attention (Bai et al., 2019), because they can incorporate information from different representations at various positions and capture both local and long-range context dependencies.


Ji Young Lee and Franck Dernoncourt. 2016. Sequential short-text classification with recurrent and convolutional neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 515–520. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.

Dan Jurafsky, Liz Shriberg, and Debra Biasca. 1997. Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual, draft 13. Technical report, University of Colorado at Boulder Technical Report 97-02.

Rajpurkar P, Zhang J, Lopyrev K, Liang P. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv [cs.CL].

Steven Bird and Edward Loper. July 2002. NLTK: The Natural Language Toolkit. arXiv [cs.CL]

Vipul Raheja and Joel Tetreault. May 2019. Dialogue Act Classification with Context-Aware Self-Attention. arXiv [cs.CL]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. July 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv [cs.CL]

Quan Hung Tran, Ingrid Zukerman, and Gholamreza Haffari. 2017. A hierarchical neural model for learning sequences of dialogue acts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 428–437. Association for Computational Linguistics.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In International Conference on Learning Representations 2017 (Conference Track).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.

Wei Li and Yunfang Wu. 2016. Multi-level gated recurrent neural network for dialog act classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1970–1979. The COLING 2016 Organizing Committee.


. International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.2

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics.

Qiuyuan Huang, Li Deng, Dapeng Wu, Chang Liu, and Xiaodong He. Feb 2018. Attentive Tensor Product Learning. arXiv [cs.CL].

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. 2018b. Bi-directional block self-attention for fast and memory-efficient sequence modeling. In International Conference on Learning Representations.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489. Association for Computational Linguistics.

Song Bai, Feihu Zhang, and Philip H.S. Torr. Jan 2019. Hypergraph Convolution and Hypergraph Attention. arXiv [cs.CL].


