Reading Comprehension Systems

Updated: Mar 26 2021

Who is this for: If you have a basic understanding of machine learning and neural nets and want to get a current picture of question answering systems. Links are provided if you want to delve further into each topic.

Acknowledgement: Many thanks to Danqi Chen's thesis for much of the background contained here.

Reading comprehension is probably the best way to figure out whether a machine can actually understand language. Machine learning systems for question answering have improved a great deal in the past few years, and so has their application to reading comprehension tasks.

To understand what question answering systems do, it's important to look at both the dataset and the models involved.

Early Systems

QUALM (1977)

First off, why is answering questions difficult? Although this paper is from 1977, it gives insight into the difficulty of answering questions at a conceptual level.

What's hard about answering questions is that you must understand what the question means. Lehnert looked at the various ways you can fail to understand a question and created a theory of the processes involved.

  1. The memory processes of question answering are independent of language
    • if someone starts speaking to you in a foreign language, you need to understand the vocabulary and grammar of that language but you do not need to acquire new cognitive processes for answering the question
  2. Questions are understood on many different levels
  3. Understanding questions entails conceptual categorization
    • Q: How are we going to eat tonight?
    • A: With silverware
    • The question was answered incorrectly because it was placed in the wrong conceptual category. The answer indicated that the question was asking about what instruments would be used for eating, but in the context, the question should have been understood as 'What are we going to do in order to eat tonight?'
  4. Context affects conceptual categorization
  5. Rules of conversational continuity are needed to understand some questions
    • J: Did B go to class?
    • M: I don't know.
    • J: Why not?
    • M: I wasn't there
    • M has to interpret a question of J's in terms of the previous dialog.
  6. A good answer is more than a correct answer: appropriateness counts
    • J: Hello M! Thanks for inviting me over.
    • M: Would you like something to drink?
    • J: Some coffee?
    • M: Sure
    • Here 'Some coffee?' is a polite response to an offer to drink something. J is saying he would like some coffee if M has any.
  7. Shifts in interpretive focus alter meaning
    • Q: Why did J roller skate to McDonald's last night?
    • There are different ways of interpreting this depending on the focus:
      • A: Because he was hungry. (focus on McDonald's)
      • A: Because he did not want to drive. (focus on roller skating)
  8. Focus directs attention to variations on expectations
    • In the above question about J roller skating to McDonald's, attention is drawn based on knowledge about J's behaviour. If J was very health conscious, then the focus of the question would be on McDonald's rather than roller skating.
  9. Ambiguity of focus rarely occurs in context
    • When a question is asked in the context of a story, focus is immediately established based on your memory.
  10. The more inference an answer carries, the better the answer
  11. A difficult retrieval problem may point to weaknesses in the memory representation
    • Context: J went into the kitchen and poured himself some milk.
    • Q: Where did J get the milk?
    • A: From the refrigerator
    • This is the natural answer anyone would give. But does anyone really think that there is a puddle of milk in the refrigerator? No, there is a milk container of some sort in the refrigerator, but very few people will mention this milk container.
    • A person must understand that mentioning the conceptual representation of a milk container is too obvious and does not need to be mentioned.
    • A conceptual representation of physical objects has to be used for questions about things in physical space.
  12. Good answers can involve knowledge state assessment
  13. The same question doesn't always get the same answer
    • There can be hundreds of ways to answer a question depending on how much information you want to provide, or the tone that you want to deliver the answer. How does the system decide this?
  14. Sometimes inference must be made at the time of question answering
  15. You can't expect to always find exactly what you had in mind
  16. A good search strategy knows when it has the answer: smart heuristics know when to quit

QUALM's two processes:

  • understanding the question
    • Conceptual parse
    • Memory internalization
    • Conceptual categorization
    • Inferential analysis
  • finding an answer

The system was based on hand-coded scripts, though, and is difficult to generalize to domains outside of its intended context.

Deep Read (1999)

Hirschman et al. created a reading comprehension dataset of 60 development and 60 test stories based on material for 3rd to 6th graders, where each story is followed by short-answer questions. Using a simple bag-of-words approach, the system picked an appropriate answer sentence 30-40% of the time.

For example:

2a (Question): Who gave books to the new library?
2b (Bag): {books gave library new the to who}

The system measured the match by the size of the intersection between the two word sets; a candidate answer sentence sharing, say, only the word books with the question bag would score a match of 1.

Other linguistic techniques were also used, such as stemming and pronoun resolution.
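A minimal sketch of this bag-of-words matching (the tokenizer and candidate sentences are illustrative, not taken from Deep Read itself):

```python
import re

def bag(text):
    """Lowercase a sentence and reduce it to a set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))

def score(question, sentence):
    """Deep Read-style match: size of the intersection of the two bags."""
    return len(bag(question) & bag(sentence))

def best_sentence(question, sentences):
    """Pick the passage sentence whose bag overlaps the question's bag most."""
    return max(sentences, key=lambda s: score(question, s))

sentences = [
    "The town built a new library last year.",
    "Mrs. Smith gave books to the new library.",
    "Children visit the library every week.",
]
print(best_sentence("Who gave books to the new library?", sentences))
# -> Mrs. Smith gave books to the new library.
```

Note that the scoring is purely lexical: a sentence can win on shared function words ("the", "to") without containing the answer at all, which is one reason the approach topped out around 30-40%.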

Early machine learning approaches

In early ML systems, question answering was framed as a supervised machine learning problem: researchers collected human-labeled training examples mapping a passage-question pair to the appropriate answer.

Here are examples of a few reading comprehension datasets

CNN/Daily Mail (cloze style) (Herman et al., 2015)

passage: ( @entity4 ) if you feel a ripple in the force today , it may be the news that the official @entity6 is getting its first gay character . according to the sci-fi website @entity9 , the upcoming novel “ @entity11 ” will feature a capable but flawed @entity13 official named @entity14 who “ also happens to be a lesbian . ” the character is the first gay figure in the official @entity6 – the movies , television shows , comics and books approved by @entity6 franchise owner @entity22 – according to @entity24 , editor of “ @entity6 ” books at @entity28 imprint @entity26 .

question: characters in “ @placeholder ” movies have gradually become more diverse

MCTest (multiple choice) (Richardson et al., 2013)

passage: Once upon a time, there was a cowgirl named Clementine. Orange was her favorite color. Her favorite food was the strawberry. She really liked her Blackberry phone, which allowed her to call her friends and family when out on the range. One day Clementine thought she needed a new pair of boots, so she went to the mall. Before Clementine went inside the mall, she smoked a cigarette. Then she got a new pair of boots. She couldn’t choose between brown and red. Finally she chose red, which the seller really liked. Once she got home, she found that her red boots didn’t match her blue cowgirl clothes, so she knew she needed to return them. She traded them for a brown pair. While she was there, she also bought a pretzel from Auntie Anne’s.

question: What did the cowgirl do before buying new boots?

potential answers: A. She ate an orange B. She ate a strawberry C. She called her friend D. She smoked a cigarette
answer: D

SQuAD (span prediction) (Rajpurkar et al., 2016)

passage: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi’s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

question: Which NFL team won Super Bowl 50?

answer: Denver Broncos

NarrativeQA (free-form text) (Kocisky et al., 2018)

passage: . . . In the eyes of the city, they are now considered frauds. Five years later, Ray owns an occult bookstore and works as an unpopular children's entertainer with Winston; Egon has returned to Columbia University to conduct experiments into human emotion; and Peter hosts a pseudo-psychic television show. Peter’s former girlfriend Dana Barrett has had a son, Oscar, with a violinist whom she married then divorced when he received an offer to join the London Symphony Orchestra...

question: How is Oscar related to Dana?

answer: He is her son.

Early machine learning models performed better than earlier rule-based systems, but they still had weaknesses that made them difficult to generalize. As Danqi Chen states:

  • the models relied on linguistic systems that were usually trained in one domain (e.g. news articles)
  • it is difficult to create features from linguistic representations
  • datasets were still too small (MCTest only used 1480 examples for training)

Deep learning approaches


The most prominent dataset used for deep learning QA approaches is SQuAD.


The creation of SQuAD (the Stanford Question Answering Dataset) was an important milestone because of its high quality and automatic evaluation. It contains 107,785 question-answer pairs on 536 Wikipedia articles, and the answer to each question is a span of text from the relevant passage. The SQuAD 2.0 dataset adds over 50,000 unanswerable questions written adversarially to look similar to answerable ones, so to do well on 2.0, systems must also determine when no answer is possible.

Human performance achieves an F1 score of 89.45%, while the best performing system achieves a score of 93.18% (Ant Service Intelligence Team) as of writing (March 2021). It's amazing that neural nets have surpassed human performance on this dataset!

The top performing models are end to end neural networks. They don't use hand-coded linguistic features and all features are learned. They also take advantage of word embeddings.

It has also inspired other SQuAD-style QA datasets in different languages.

Limitations include:

  • questions are restricted to those that must be answered using a single span in a passage (excludes yes/no, counting questions)
  • very few why and how questions in the dataset
  • most examples are simple and don't require complex reasoning through multiple sentences
  • uses a small subset of Wikipedia articles (536 out of over 6 million English articles)
  • questions are based on the passage (likely to mirror sentence structure and reuse same words, resulting in easier questions)

Natural Questions (Google)

Questions are real anonymized queries submitted to the Google search engine. An annotator is given the question along with a Wikipedia page from the top 5 search results, and annotates a long answer and a short answer if present on the page, or marks null if no answer is present.

There are 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development, and 7,842 examples with 5-way annotations for test.


Question: can you make and receive calls in airplane mode?
Wikipedia page: Airplane_mode
Long answer: Airplane mode, aeroplane mode, flight mode, offline mode, or standalone mode is a setting available on many smartphones, portable computers, and other electronic devices that, when activated, suspends radio-frequency signal transmission by the device, thereby disabling Bluetooth, telephony, and Wi-Fi. GPS may or may not be disabled, because it does not involve transmitting radio waves.
Short answer: BOOLEAN:NO

Each training example contains (question, wikipedia page, long answer, short answer) quadruples.


Strengths:

  • questions are natural because they are real queries from people
  • it is a larger dataset than previous QA datasets

NarrativeQA

This is a dataset that has two settings:

  • answer questions based on a summary of a book or movie (similar to SQuAD)
  • answer questions based on a full book or movie script

The answers are free-form, and annotators were encouraged to use their own words rather than copy from the passage, making this dataset very difficult. Because of the free-form answers, it is also hard to evaluate automatically.


Another challenging kind of QA dataset poses questions that require reasoning over multiple documents to answer.
The question types include which, who, what, how, where, when, and yes/no comparisons, among others.

Problem formulation and evaluation

Question answering tasks generally fall into one of these four categories:

  • Cloze style
    • the question contains a placeholder.
  • Multiple choice
  • Span prediction
    • the answer is a single span in the passage
  • Free form answer
    • an answer can be of any length

The evaluation metric for a QA system depends on the task.

For multiple choice and cloze style questions, accuracy is the percentage of questions answered exactly correctly.

For span prediction tasks, there are two scores that can be used:

  • exact match: gives full credit of 1.0 if the predicted answer is equal to the correct answer, and 0 otherwise
  • F1 score: measures the average word overlap between the predicted and correct answers
    • $F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$
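The two span-prediction scores above can be sketched as follows. The normalization here is simplified: the official SQuAD evaluation script also strips articles and punctuation before comparing.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase and split into word tokens (a simplified normalization)."""
    return re.findall(r"\w+", text.lower())

def exact_match(prediction, truth):
    """1.0 if the normalized prediction equals the normalized answer, else 0.0."""
    return float(tokens(prediction) == tokens(truth))

def f1(prediction, truth):
    """Token-level F1: harmonic mean of precision and recall over word overlap."""
    pred, gold = Counter(tokens(prediction)), Counter(tokens(truth))
    overlap = sum((pred & gold).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

print(exact_match("Denver Broncos", "denver broncos"))       # -> 1.0
print(round(f1("the Denver Broncos", "Denver Broncos"), 3))  # -> 0.8
```

The F1 score gives partial credit: "the Denver Broncos" misses exact match but still scores 0.8 because two of its three tokens overlap with the gold answer.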

For free-form tasks, there isn't an ideal evaluation metric yet.

Neural models

Deep neural net models have become SOTA in solving question answering tasks. There are three main building blocks for these models:

  • word embeddings
    • each word is represented by vectors that have been learned from large unlabeled text corpora
  • recurrent neural networks
    • they work better than standard feed-forward networks since inputs and outputs can be of different lengths and standard NNs don't share features across positions in the text. Maybe most importantly, they allow information to persist.
    • LSTMs are a kind of RNN that work much better than the standard RNN
  • attention mechanism
    • identifies information in an input most relevant to accomplishing a task
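The attention mechanism above can be sketched in plain Python as dot-product attention: each position's key is scored against a query, the scores are normalized with a softmax, and the values are combined by those weights. The keys, values, and query here are illustrative toy numbers.

```python
from math import exp

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values):
    """Dot-product attention: weight each value by how well its key
    matches the query, using a softmax over the similarity scores."""
    weights = softmax([dot(k, query) for k in keys])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

# Toy example: 3 positions with 4-dimensional keys and values.
keys = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
values = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
query = [0, 4, 0, 0]  # most similar to the key at position 1

context, weights = attention(query, keys, values)
print(weights)  # position 1 receives almost all of the attention weight
```

The resulting context vector is dominated by the value at the position whose key best matches the query, which is exactly the "most relevant information" behavior described above.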

Extractive QA

Jurafsky et al.

Extractive QA tasks are those where a user asks a question and the system finds the answer span within a collection of documents. For efficiency reasons, this is usually split up into two stages:

  • retrieval: returns relevant documents from the collection
  • reader: neural net model extracts the answer spans

The retrieval stage needs to index all the documents and find the n most relevant documents very quickly. Approaches range from sparse term-matching methods such as TF-IDF and BM25 to learned dense vector retrieval.
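A toy TF-IDF retriever gives the flavor of the sparse approach (the documents and tokenizer are illustrative; real systems use an inverted index and a tuned scheme like BM25):

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def tfidf_retrieve(question, documents, n=1):
    """Rank documents by the summed TF-IDF weight of the question's terms."""
    doc_tokens = [tokenize(d) for d in documents]
    N = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in doc_tokens for t in set(toks))
    # Inverse document frequency: rarer terms weigh more.
    idf = {t: math.log(N / df[t]) for t in df}
    q_terms = set(tokenize(question))
    def score(toks):
        tf = Counter(toks)
        return sum(tf[t] * idf.get(t, 0.0) for t in q_terms)
    ranked = sorted(range(N), key=lambda i: score(doc_tokens[i]), reverse=True)
    return [documents[i] for i in ranked[:n]]

docs = [
    "The cat sat on the mat.",
    "Super Bowl 50 was won by the Denver Broncos.",
    "The library opened in 1999.",
]
print(tfidf_retrieve("Which team won Super Bowl 50?", docs))
# -> ['Super Bowl 50 was won by the Denver Broncos.']
```

Because a term that appears in every document gets an IDF of zero, common words like "the" contribute nothing, and rare question terms like "bowl" dominate the ranking.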

As for the reader model, take the Stanford Attentive Reader as an example.


The question is encoded by mapping each word into its word embedding and applying a bi-directional LSTM.

The passage encoding is also similar. In addition to the word embeddings, each word in the passage also has some manual features (part-of-speech, named entity recognition tags, normalized term frequency).

The other part of passage encoding captures how relevant each passage word is with respect to the question. This is done by computing:

  • exact match
    • aligned question embeddings (calculated via dot products between mappings of the word embeddings)

The model trains two separate classifiers: one predicts the start position of the answer span, and the other predicts the end position.
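A minimal numpy sketch of this span-prediction step, with random matrices standing in for everything the real model learns (the passage encodings `P`, question vector `q`, and bilinear weights `W_start`, `W_end` are all illustrative placeholders for trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5                        # toy hidden size; passage of 5 tokens

# Stand-ins for the learned components: BiLSTM outputs for the passage
# tokens, a question summary vector, and two bilinear weight matrices.
P = rng.normal(size=(n, d))        # passage token encodings
q = rng.normal(size=d)             # question encoding
W_start = rng.normal(size=(d, d))  # weights for the start classifier
W_end = rng.normal(size=(d, d))    # separate weights for the end classifier

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Bilinear scores between each passage token and the question, turned
# into two independent distributions over positions.
p_start = softmax(P @ W_start @ q)  # distribution over start positions
p_end = softmax(P @ W_end @ q)      # distribution over end positions

print("P(start):", np.round(p_start, 3))
print("P(end):  ", np.round(p_end, 3))
```

At prediction time the model picks the start/end pair that maximizes the product of the two probabilities, subject to the end not preceding the start.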

Innovations in extractive QA models have come from the use of contextual word embeddings and different attention mechanisms.

Limitations of SQuAD based QA systems

There have been further advances in neural net models trained on SQuAD, to the point that they surpass human performance.

But there are criticisms of these SQuAD-based models.

Take this adversarial example from Jia and Liang, 2017.

Article: Super Bowl 50
Paragraph: "Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football Operations and General Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV."
Question: “What is the name of the quarterback who was 38 in Super Bowl XXXIII?”
Original prediction: John Elway
Prediction under adversary: Jeff Dean

The adversarial sentence in the paragraph is the last one, about Jeff Dean. It's considered adversarial because it adds information that resembles an answer but does not actually contain the answer. Jia and Liang noted that when adversarial sentences were added, the average F1 score of models dropped from 75% to 36%.

So why is this happening? It's fair to say that extractive QA models focus only on surface-level information in the text. They may just be very sophisticated text-matching programs that do not deeply understand the text. Since the SQuAD question-answer pairs themselves do not contain many questions that require deep understanding, this seems to be reflected in the models on the leaderboard.