CoreNLPParser tag() should allow properties overloading · nltk/nltk#2112


Labels: bug, good first issue, stanford api

Description

With the current CoreNLPParser.tag(), Stanford CoreNLP retokenizes the input unexpectedly. Here the three separate '1111' tokens come back merged into a single token:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'),
 ('phone', 'O'),
 ('number', 'O'),
 ('is', 'O'),
 ('1111\xa01111\xa01111', 'NUMBER')]

The expected behavior is one tagged pair per input token:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111', 'DATE'), ('1111', 'DATE'), ('1111', 'DATE')]
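
The mismatch can be checked mechanically. The following is a minimal sketch that uses only the input and the two outputs quoted above; `is_aligned` is a hypothetical helper written for illustration and is not part of NLTK.

```python
def is_aligned(tokens, tagged):
    """Return True if there is exactly one (word, tag) pair per input token,
    in the same order and with the same surface form."""
    return [word for word, _tag in tagged] == list(tokens)


sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']

# Output currently returned by tag(), as quoted in this issue.
current = [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'),
           ('1111\xa01111\xa01111', 'NUMBER')]

# Output the issue expects.
expected = [('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'),
            ('1111', 'DATE'), ('1111', 'DATE'), ('1111', 'DATE')]

print(len(sent), len(current))     # 7 5
print(is_aligned(sent, current))   # False
print(is_aligned(sent, expected))  # True
```

A check like this is handy in downstream code that zips tags back onto the original tokens, since a silent length mismatch would shift every label after the merged token.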

The proposed solution is to let .tag() and .tag_sents() accept a properties argument (see https://github.com/nltk/nltk/blob/develop/nltk/parse/corenlp.py#L348) and to default it to properties = {'tokenize.whitespace': 'true'}, because tag_sents() joins the tokens with spaces before sending them to CoreNLP.


    def tag_sents(self, sentences, properties=None):
        """
        Tag multiple sentences.

        Takes multiple sentences as a list where each sentence is a list of
        tokens.

        :param sentences: Input sentences to tag
        :type sentences: list(list(str))
        :rtype: list(list(tuple(str, str)))
        """
        # Converting list(list(str)) -> list(str)
        sentences = (' '.join(words) for words in sentences)
        # Default to whitespace tokenization so CoreNLP keeps the caller's tokens.
        if properties is None:
            properties = {'tokenize.whitespace': 'true'}
        return [tagged[0] for tagged in self.raw_tag_sents(sentences, properties)]

    def tag(self, sentence, properties=None):
        """
        Tag a list of tokens.

        :rtype: list(tuple(str, str))

        >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
        >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
        >>> parser.tag(tokens)
        [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
        ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

        >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
        >>> tokens = "What is the airspeed of an unladen swallow ?".split()
        >>> parser.tag(tokens)
        [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
        ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
        ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
        """
        return self.tag_sents([sentence], properties)[0]

    def raw_tag_sents(self, sentences, properties=None):
        """
        Tag multiple sentences.

        Takes multiple sentences as a list where each sentence is a string.

        :param sentences: Input sentences to tag
        :type sentences: list(str)
        :rtype: list(list(list(tuple(str, str))))
        """
        default_properties = {'ssplit.isOneSentence': 'true',
                              'annotators': 'tokenize,ssplit,' }

        default_properties.update(properties or {})

        # Supports only 'pos' or 'ner' tags.
        assert self.tagtype in ['pos', 'ner']
        default_properties['annotators'] += self.tagtype
        for sentence in sentences:
            tagged_data = self.api_call(sentence, properties=default_properties)
            yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
                    for tagged_sentence in tagged_data['sentences']]

That should make CoreNLP respect the list of string tokens supplied by the user instead of retokenizing it.
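
To see which properties would reach the server under this proposal, the merging logic can be isolated from the HTTP call. This is a sketch only: `effective_properties` is a hypothetical standalone function that mirrors the dictionary handling in the proposed tag_sents() and raw_tag_sents(); it does not contact CoreNLP.

```python
def effective_properties(tagtype, properties=None):
    """Reproduce the property merging of the proposed tag_sents() and
    raw_tag_sents(), without calling the CoreNLP server."""
    # tag_sents(): fall back to whitespace tokenization only when the
    # caller passes nothing.
    if properties is None:
        properties = {'tokenize.whitespace': 'true'}

    # raw_tag_sents(): start from the base defaults, then let the
    # caller's properties override them.
    merged = {'ssplit.isOneSentence': 'true',
              'annotators': 'tokenize,ssplit,'}
    merged.update(properties)

    # Only 'pos' and 'ner' are supported, as in the original assert.
    assert tagtype in ('pos', 'ner')
    merged['annotators'] += tagtype
    return merged


print(effective_properties('ner'))
# {'ssplit.isOneSentence': 'true', 'annotators': 'tokenize,ssplit,ner',
#  'tokenize.whitespace': 'true'}
```

With no arguments from the caller, the request carries 'tokenize.whitespace': 'true', which is what keeps the space-joined tokens separate.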

Details on https://stackoverflow.com/questions/52250268/why-do-corenlp-ner-tagger-and-ner-tagger-join-the-separated-numbers-together

If .tag() can override the properties before they reach raw_tag_sents(), users can also easily handle cases like #1876.
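
One consequence of the default being applied only when properties is None: a caller who passes their own dictionary replaces the whitespace default rather than extending it. A small caller-side sketch that keeps both is shown below. `with_whitespace_tokenization` is a hypothetical helper, and the 'ner.useSUTime' key is used purely to illustrate an extra property.

```python
def with_whitespace_tokenization(properties=None):
    """Return a copy of the caller's properties that still asks CoreNLP
    to tokenize on whitespace, unless the caller set that key explicitly."""
    merged = {'tokenize.whitespace': 'true'}
    merged.update(properties or {})
    return merged


props = with_whitespace_tokenization({'ner.useSUTime': 'false'})
print(props)
# {'tokenize.whitespace': 'true', 'ner.useSUTime': 'false'}

# Assuming the signature proposed in this issue and a running server:
# ner_tagger.tag(sent, properties=props)
```

An alternative design would be for tag_sents() itself to merge the whitespace default into whatever the caller passes; the proposal as written does not do that.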

Contributor guide

Tech stack: python
Domain: backend, data
Issue type: feature
Difficulty: 2
Estimated time: 1-3 hours
Activity status: active
Clarity: clear
Prerequisites: Python, NLTK basics
Newbie friendliness: 70
Research direction: The issue requests adding a 'properties' parameter to the 'tag()' and 'tag_sents()' methods in 'nltk/parse/corenlp.py' so that users can override CoreNLP properties, specifically to avoid retokenization by passing {'tokenize.whitespace': 'true'}. The proposed changes are clearly outlined in the issue body, including code snippets. Review the current implementation of these methods (around line 348) and the 'raw_tag_sents()' method to understand how properties are handled. The fix involves modifying the method signatures and the default property handling. No linked PRs or assignees are present, so this is open for contribution.