Natural Language Processing

Looking for Natural Language Processing test answers and solutions? Browse our comprehensive collection of verified answers for Natural Language Processing at moodle.iitdh.ac.in.


Given a sentence with k tokens, how many n-grams with frequency greater than zero can be obtained from the sentence, where n is an arbitrary natural number?
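The counting argument can be checked with a small sketch. It assumes whitespace tokenization and a hypothetical helper name, `ngrams_in_sentence`. For a fixed n, a sentence of k tokens has k - n + 1 positions where an n-gram can start; summed over every n from 1 to k, that gives k(k + 1)/2 positions. The number of distinct n-grams can be smaller when tokens repeat.

```python
def ngrams_in_sentence(tokens):
    """Return the set of distinct n-grams (as tuples) for every n from 1 to len(tokens)."""
    k = len(tokens)
    return {
        tuple(tokens[i:i + n])
        for n in range(1, k + 1)      # every possible n-gram length
        for i in range(k - n + 1)     # every start position for that length
    }


tokens = "the cat sat down".split()   # illustrative sentence, all tokens distinct
k = len(tokens)
grams = ngrams_in_sentence(tokens)

# With all-distinct tokens, every position yields a different n-gram,
# so the total is k + (k - 1) + ... + 1 = k * (k + 1) / 2.
print(len(grams), k * (k + 1) // 2)   # prints: 10 10
```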

What would the adjusted count of the bigram {“A”, “B”} be if we had to observe the above maximum likelihood estimate for {“A”, “B”} without applying Laplace smoothing? Do not be concerned with finding a whole number.
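The estimate this question refers to is not reproduced on this page, so the sketch below uses made-up counts. It assumes the standard add-one definition: the smoothed probability is (C(A B) + 1) / (C(A) + V), and the adjusted count is the value c* for which c* / C(A) equals that probability. The function names and the numbers are hypothetical.

```python
def laplace_bigram_prob(c_ab, c_a, vocab_size):
    """Add-one smoothed estimate of P(B | A)."""
    return (c_ab + 1) / (c_a + vocab_size)


def adjusted_count(c_ab, c_a, vocab_size):
    """Count c* such that the unsmoothed ratio c* / C(A) equals the smoothed probability."""
    return laplace_bigram_prob(c_ab, c_a, vocab_size) * c_a


# Hypothetical counts: C(A B) = 3, C(A) = 10, vocabulary size V = 5.
c_star = adjusted_count(3, 10, 5)
print(c_star)        # roughly 2.67, not a whole number
print(c_star / 10)   # matches laplace_bigram_prob(3, 10, 5)
```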

Consider a vocabulary consisting of k tokens. How many n-grams can you construct from the vocabulary, where n is an arbitrary natural number? The frequency of the n-gram need not be greater than zero.
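One way to sanity-check the count for a fixed n: each of the n positions can hold any of the k vocabulary tokens independently, which gives k^n ordered sequences. The sketch below enumerates them for a toy vocabulary; `all_ngrams` is a hypothetical helper name.

```python
from itertools import product


def all_ngrams(vocab, n):
    """Enumerate every ordered sequence of n tokens drawn from vocab (repeats allowed)."""
    return list(product(vocab, repeat=n))


vocab = ["a", "b", "c"]   # toy vocabulary, k = 3
for n in (1, 2, 3):
    # The enumeration has k ** n entries: 3, 9 and 27 here.
    print(n, len(all_ngrams(vocab, n)), len(vocab) ** n)
```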

Considering the definition of edit distance that assigns a weight of 1 to insertions and deletions and a weight of 2 to substitutions, what is the minimum edit distance between "lead" and "deal"?
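A minimal sketch of the usual dynamic-programming solution with these weights follows. The function name and the row/column convention (rows follow the source string, columns the target) are choices made for this example.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Weighted Levenshtein distance computed with a full DP table."""
    rows, cols = len(source) + 1, len(target) + 1
    dp = [[0] * cols for _ in range(rows)]

    # First column: delete every source character; first row: insert every target character.
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + ins_cost

    for i in range(1, rows):
        for j in range(1, cols):
            # Matching characters cost nothing on the diagonal move.
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            dp[i][j] = min(
                dp[i - 1][j] + del_cost,    # delete from source
                dp[i][j - 1] + ins_cost,    # insert into source
                dp[i - 1][j - 1] + sub,     # substitute or match
            )
    return dp[rows - 1][cols - 1]


print(min_edit_distance("lead", "deal"))   # prints: 4
```

With these weights a substitution costs the same as a deletion plus an insertion, so the two mismatched end characters ("l" and "d") account for the whole distance while the shared middle "ea" is free.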

Naive Bayes is a generative model. Let P(d | c) be the probability of observing a document d given that it belongs to class c, and let P(c) be the fraction of documents belonging to class c. What are P(d | c) and P(c) respectively called?

c. Likelihood and Prior

A spam filter classifies emails as Spam (S) or Not Spam (¬S) using the Naïve Bayes algorithm. Given a dataset, the following probabilities are known:

P(S)=0.4 (40 % of emails are spam)

P(¬S)=0.6 (60% of emails are not spam).

70% of spam emails contain "offer" and 10% of non-spam emails contain "offer".

If a new email contains the word offer, find the probability that it is spam.
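The calculation is a direct application of Bayes' rule, P(S | offer) = P(offer | S) P(S) / P(offer), with P(offer) expanded over both classes. A short sketch using the numbers given in the question; `posterior_spam` is a hypothetical helper name.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_not_spam):
    """P(spam | word) via Bayes' rule with two classes."""
    p_not_spam = 1.0 - p_spam
    # Total probability of seeing the word in any email.
    evidence = p_word_given_spam * p_spam + p_word_given_not_spam * p_not_spam
    return p_word_given_spam * p_spam / evidence


# Values from the question: P(S) = 0.4, P(offer | S) = 0.7, P(offer | not S) = 0.1.
print(round(posterior_spam(0.4, 0.7, 0.1), 4))   # prints: 0.8235
```

The numerator is 0.7 × 0.4 = 0.28 and the evidence is 0.28 + 0.1 × 0.6 = 0.34, so the posterior is 0.28 / 0.34, about 0.82.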

Consider three language models, A, B and C. Upon evaluating each model on a test set, we observe that A obtains a perplexity of 962, B a perplexity of 170 and C a perplexity of 109. Which of the following is most likely to be the rationale behind the difference in performance between the three models?

a. A is a trigram model, B is a unigram model, C is a bigram model.


b. A and B trained on the test set.

c. A is a unigram model, B is a bigram model, C is a trigram model

d. A is a bigram model, B is a unigram model, C is a trigram model.
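For reference, perplexity is the inverse probability of the test set normalized by the number of tokens, so a lower score means the model assigns higher probability to the test data. Models that condition on more context typically reach lower perplexity on in-domain text. The sketch below computes perplexity from per-token probabilities; the function name and the probabilities are made up for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp of the negative mean log-probability the model gave each test token."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


# A model that assigns probability 1/4 to every token has perplexity close to 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Higher per-token probabilities give lower perplexity.
print(perplexity([0.5, 0.5, 0.5, 0.5]) < perplexity([0.1, 0.1, 0.1, 0.1]))   # prints: True
```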


Consider the dynamic-programming solution for finding the minimum edit distance between two strings of different lengths. Let the substitution cost be S, the insertion cost be I and the deletion cost be D. Let the element in the rth row and cth column of the DP table be denoted by the tuple (r, c).

After filling in r rows of the DP table, we attempt to fill in the cth column of the (r + 1)th row. The observed entries are (r, c) = 2, (r + 1, c - 1) = 3 and (r, c - 1) = 4. Furthermore, the terminal characters at (r + 1, c) are not equal. Deduce the entry at (r + 1, c) from the above information.
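The recurrence behind this question can be sketched as a single-cell update. The convention assumed here is that rows follow the source string, so moving down from (r, c) is a deletion and moving right from (r + 1, c - 1) is an insertion; under the opposite convention I and D swap roles. The function and parameter names are hypothetical.

```python
def next_cell(up, left, diag, chars_equal, S, I, D):
    """Value of DP cell (r + 1, c) given its three already-filled neighbours.

    up   = entry at (r, c)          -> reached by a deletion (assumed convention)
    left = entry at (r + 1, c - 1)  -> reached by an insertion (assumed convention)
    diag = entry at (r, c - 1)      -> reached by a substitution, free if the characters match
    """
    sub = 0 if chars_equal else S
    return min(up + D, left + I, diag + sub)


# Neighbour values from the question, with unequal characters: min(2 + D, 3 + I, 4 + S).
# Plugging in illustrative costs S = 2, I = 1, D = 1:
print(next_cell(up=2, left=3, diag=4, chars_equal=False, S=2, I=1, D=1))   # prints: 3
```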

Assume that your corpus consists of 1000 unique characters. The Byte Pair Encoding algorithm runs on your corpus for 500 iterations, creating a new merge in every iteration, and outputs a vocabulary at the end of its execution. What is the size of this vocabulary, i.e., how many elements does it contain?
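A toy version of the merge loop shows how the vocabulary grows. This is a simplified sketch, not a production tokenizer: it assumes the initial vocabulary is exactly the set of characters, that no end-of-word marker is added, and that each merge introduces one new symbol. Under those assumptions the final size is the base size plus the number of merges, which for the question is 1000 + 500 = 1500.

```python
from collections import Counter


def bpe_vocab(words, num_merges):
    """Run a minimal BPE merge loop and return the resulting vocabulary (a set of symbols)."""
    corpus = [list(w) for w in words]            # each word as a list of symbols
    vocab = {ch for w in corpus for ch in w}     # base vocabulary: unique characters
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))          # count adjacent symbol pairs
        if not pairs:
            break                                # nothing left to merge
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)                        # one new symbol per merge
        new_corpus = []
        for w in corpus:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(merged)           # replace the pair with the merged symbol
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return vocab


words = ["low", "lower", "lowest", "newer"]      # toy corpus with 8 unique characters
print(len(bpe_vocab(words, 0)), len(bpe_vocab(words, 3)))   # prints: 8 11
```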

Which of the following is a valid bigram from the sentence "I love NLP"?

c. ("I", "love", "NLP")
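A bigram is a sequence of exactly two adjacent tokens, so a three-token tuple such as ("I", "love", "NLP") is a trigram rather than a bigram. The sketch below lists the bigrams of the sentence, assuming whitespace tokenization; `bigrams` is a hypothetical helper name.

```python
def bigrams(sentence):
    """Return the list of adjacent token pairs in a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return list(zip(tokens, tokens[1:]))


print(bigrams("I love NLP"))   # prints: [('I', 'love'), ('love', 'NLP')]
```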