samtecspg/articulate
Validate new user examples to avoid misaligned data
Open
#216 opened on May 18, 2018
api, enhancement, help wanted, needs design, nice to have, ui
Description
If you have a large training set that contains wrongly tagged examples, the warning only appears once rasa starts training with the crf_entity_extractor.
The warning looks something like this:
Misaligned entity annotation in sentence 'This is a misaligned sentence'. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don't include trailing whitespaces).
In rasa, the validation is performed in:
    def _from_json_to_crf(self,
                          message,  # type: Message
                          entity_offsets  # type: List[Tuple[int, int, Text]]
                          ):
        # type: (...) -> List[Tuple[Text, Text, Text, Text]]
        """Convert json examples to format of underlying crfsuite."""
        from spacy.gold import GoldParse

        doc = message.get("spacy_doc")
        gold = GoldParse(doc, entities=entity_offsets)
        ents = [l[5] for l in gold.orig_annot]
        if '-' in ents:
            logger.warn("Misaligned entity annotation in sentence '{}'. "
                        "Make sure the start and end values of the "
                        "annotated training examples end at token "
                        "boundaries (e.g. don't include trailing "
                        "whitespaces).".format(doc.text))
        if not self.component_config["BILOU_flag"]:
            for i, label in enumerate(ents):
                if self._bilou_from_label(label) in {"B", "I", "U", "L"}:
                    # removes BILOU prefix from label
                    ents[i] = self._entity_from_label(label)
        return self._from_text_to_crf(message, ents)
One option for validating new user examples is to expose a validate method from rasa that runs this check on the combinations generated by the new example and the entities highlighted in it.
This would require some UI work to let the user know that there is an error in the example they are trying to add.
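As a rough illustration of what such a check could look like, here is a minimal sketch that flags entities whose offsets do not fall on token boundaries. It is not rasa's API: `token_boundaries` and `find_misaligned_entities` are hypothetical names, and the regex tokenizer is only a stand-in for the spaCy tokenizer used at training time, so the two may disagree on some inputs. The entity dicts assume the `start` / `end` / `value` / `entity` keys of the rasa NLU JSON training format.

```python
import re
from typing import Dict, List, Set, Tuple

# Stand-in tokenizer (assumption): runs of word characters, or single
# punctuation marks. A real check should reuse the spaCy tokenizer that
# rasa trains with, otherwise the boundaries may differ.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def token_boundaries(text: str) -> Tuple[Set[int], Set[int]]:
    """Return the character offsets at which tokens start and end."""
    starts, ends = set(), set()
    for match in _TOKEN_RE.finditer(text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends


def find_misaligned_entities(text: str, entities: List[Dict]) -> List[Dict]:
    """Return entities whose start or end offset is not on a token boundary."""
    starts, ends = token_boundaries(text)
    return [
        ent for ent in entities
        if ent["start"] not in starts or ent["end"] not in ends
    ]


example_text = "Book a flight to New York today"
# Aligned: covers exactly "New York".
good = {"start": 17, "end": 25, "value": "New York", "entity": "city"}
# Misaligned: the span includes the trailing whitespace after "York".
bad = {"start": 17, "end": 26, "value": "New York ", "entity": "city"}

misaligned = find_misaligned_entities(example_text, [good, bad])
```

Here `misaligned` contains only `bad`, which the UI could use to reject the example or highlight the offending entity before it is saved.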