The Main Concept (MC) may or may not be explicitly mentioned in the sentence.
Debate topics are chosen that focus on a single concept and have at least 1000 sentences matching the query q1 = MC. From these, 100 topics are randomly selected as a development set (dev-set) and 50 as a test set.
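The topic-selection step above can be sketched as follows; `match_counts` and the function name are assumptions for illustration, not from the paper.

```python
import random

def select_topics(match_counts, n_dev=100, n_test=50, min_matches=1000, seed=0):
    # match_counts: hypothetical mapping from a topic's Main Concept to the
    # number of corpus sentences matching q1 = MC.
    eligible = [t for t, n in match_counts.items() if n >= min_matches]
    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(eligible)
    return eligible[:n_dev], eligible[n_dev:n_dev + n_test]
```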
The q1-set is divided into two classes, c1 and c2: c1 contains the sentences in which the token "that" appears before the MC, and c2 contains the rest. Class c1 consists of 183K sentences, giving a prior P(c1) = 0.0986.
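A minimal sketch of this split, assuming a single-token MC and simple whitespace tokenization (the paper's actual matching is over preprocessed text):

```python
def split_q1(sentences, mc):
    # c1: sentences where the token "that" appears before the Main Concept;
    # c2: the remaining q1-set sentences.
    c1, c2 = [], []
    for s in sentences:
        toks = s.lower().split()
        if mc in toks and "that" in toks[:toks.index(mc)]:
            c1.append(s)
        else:
            c2.append(s)
    return c1, c2
```

The prior P(c1) then follows as len(c1) divided by the total number of q1-set sentences.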
This is followed by standard pre-processing: tokenization, stop-word removal, lower-casing, and POS-tagging using OpenNLP.
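A toy version of the pre-processing step (the stop-word list is illustrative, and POS-tagging, done with OpenNLP in the paper, is omitted here):

```python
STOPWORDS = {"the", "a", "an", "is", "of"}  # toy list, not the paper's full list

def preprocess(sentence):
    # Lower-case, tokenize on whitespace, drop stop-words.
    return [t for t in sentence.lower().split() if t not in STOPWORDS]
```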
Formally, define n1 as the number of sentences in c1 that contain w in the sentence suffix (the part following "that"), and n2 as the number of sentences in c2 that contain w. Then
Psuff(c1|w) = n1 / (n1 + n2).
The Claim Lexicon (CL) is the set of words w that satisfy
Psuff(c1|w) > P(c1).
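The lexicon construction above can be sketched as a direct count-and-threshold, assuming the c1 suffixes and c2 sentences are already tokenized (function and argument names are illustrative):

```python
from collections import Counter

def build_claim_lexicon(c1_suffixes, c2_sentences):
    # c1_suffixes: for each c1 sentence, the tokens after "that" (its suffix).
    # c2_sentences: token lists for the c2 sentences.
    # n1[w] / n2[w]: number of c1 / c2 sentences containing w (set() avoids
    # double-counting a word repeated within one sentence).
    n1 = Counter(w for toks in c1_suffixes for w in set(toks))
    n2 = Counter(w for toks in c2_sentences for w in set(toks))
    p_c1 = len(c1_suffixes) / (len(c1_suffixes) + len(c2_sentences))
    # Keep w iff Psuff(c1|w) = n1/(n1+n2) exceeds the prior P(c1).
    return {w for w in n1 if n1[w] / (n1[w] + n2[w]) > p_c1}
```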
CL should contain words that are indicative of claims in a general sense. Nouns, single-character tokens, and country-specific terms are excluded. This results in a lexicon of 586 words.
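The exclusion rules can be sketched as a simple filter; `pos_of` is a hypothetical word-to-POS-tag mapping (the paper obtains tags from OpenNLP), and the country-specific-term check, which would need a gazetteer, is omitted:

```python
def filter_lexicon(lexicon, pos_of):
    # Drop nouns (Penn-style tags starting with "NN") and single-character tokens.
    return {w for w in lexicon
            if len(w) > 1 and not pos_of.get(w, "").startswith("NN")}
```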
Using the CL, corpus-wide claim detection can be performed by adding sentence re-ranking, boundary detection, and simple filters. Sentences are ranked by the average of two scores.
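The re-ranking step might look like the sketch below; the two scoring functions are placeholders, since the notes do not specify what the two scores are (e.g. they might be a lexicon-based score and an MC-proximity score):

```python
def rank_sentences(sentences, score_a, score_b, top_k=50):
    # score_a, score_b: hypothetical per-sentence scoring functions.
    # Rank by the average of the two scores, highest first.
    scored = sorted(sentences,
                    key=lambda s: (score_a(s) + score_b(s)) / 2,
                    reverse=True)
    return scored[:top_k]
```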
To evaluate performance, crowd labelling is applied to the predicted claims for the 150 topics in the dev and test sets. For each topic, at most the top 50 predictions were labelled, and a prediction is considered correct if a majority of annotators marked it as a claim.
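The majority-vote evaluation reduces to a precision computation over the labelled predictions, sketched here (structure and names are assumptions):

```python
def precision_at_k(labels_per_prediction):
    # labels_per_prediction: for each predicted claim, the list of crowd labels
    # (True = annotator marked it as a claim).
    # A prediction is correct iff a strict majority of its annotators agree.
    correct = sum(1 for labels in labels_per_prediction
                  if sum(labels) > len(labels) / 2)
    return correct / len(labels_per_prediction)
```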
This unsupervised approach outperforms previous work that used a supervised approach over a manually pre-selected set of articles.
Results on the test set are better than on the dev set, suggesting the method generalizes to new topics as well. Precision increases for topics that have more sentences matching the CSQ (claim sentence query).
Limitation: not suitable for complex topics or topics not well covered by Wikipedia, and the method focuses only on the pattern that→MC.