stanfordnlp/stanza

Distributed Processing with Stanza CoreNLP Interface

Open

#720 opened on June 13, 2021

Labels: enhancement, help wanted, question

Description

Hi,

I've really liked how Stanza just "works" out of the box over the past month or so. However, I have recently hit a wall, and the documentation on the Stanza CoreNLP client is a little sparse. The problem is this: I want to extract relations from large collections of text (~90k sentences on average). Running this sequentially on a single machine would be prohibitively time-consuming, so I want to build a distributed interface that runs the extractor on separate cores, each with a different chunk of the data.

I detail my current approach and issues below:

On one machine, I start the CoreNLP Java server with the following command:

java -Xmx5G -cp "/path/to/stanza_corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-port 9000 \
-timeout 60000 \
-threads 8 \
-maxCharLength 100000 \
-quiet False \
-annotators openie -preload -outputFormat serialized

Upon inspecting the logs, I find this message: java.lang.IllegalArgumentException: annotator "openie" requires annotation "IndexAnnotation". Apparently openie needs all the annotators it depends on to be loaded as well.
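For reference, the CoreNLP OpenIE documentation lists tokenize, ssplit, pos, lemma, depparse and natlog as dependencies of openie, so the server presumably needs the whole chain rather than openie alone. Below is a minimal sketch of building such a command. `server_command` and `OPENIE_CHAIN` are hypothetical names, only flags from the command above are reused, and I have not verified this exact invocation:

```python
# Annotators that openie depends on, in pipeline order, followed by openie itself.
OPENIE_CHAIN = ["tokenize", "ssplit", "pos", "lemma", "depparse", "natlog", "openie"]


def server_command(classpath, port=9000, timeout_ms=60000, threads=8):
    """Build an argv list for launching the CoreNLP server (hypothetical helper)."""
    return [
        "java", "-Xmx5G",
        "-cp", classpath,
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-port", str(port),
        "-timeout", str(timeout_ms),
        "-threads", str(threads),
        # Pass the full dependency chain, not just "openie".
        "-annotators", ",".join(OPENIE_CHAIN),
    ]


cmd = server_command("/path/to/stanza_corenlp/*")
print(" ".join(cmd))
```

The list form can be handed to subprocess.Popen without a shell, which leaves the `*` in the classpath for java itself to expand.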

On the client side, I have implemented a rudimentary solution like this:

import pickle

from joblib import Parallel, delayed
from stanza.server import CoreNLPClient

def chunker(iterable, chunk_size):
    return (
        iterable[pos : pos + chunk_size]
        for pos in range(0, len(iterable), chunk_size)
    )

def worker_fn(client, batch):
    annotated = []
    for txt in batch:
        ann = client.annotate(txt)
        annotated.append(ann)
    return annotated

# The server endpoint is the default, http://localhost:9000.
client = CoreNLPClient(annotators=["openie"], start_server=False)
# `files` is defined elsewhere (not shown).
data = pickle.load(files[0].open('rb'))
texts = [d['sent_text'] for d in data]
tasks = (delayed(worker_fn)(client, chunk) for chunk in chunker(texts, 100))
result = Parallel(n_jobs=8, backend='multiprocessing', prefer='processes')(tasks)
client.stop()

Doing this results in a MaybeEncodingError: the message being passed between processes is the complete tokenized, POS-tagged annotation of each sentence shard, whereas I only intend to pass around the extracted relation triples.
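One direction I am considering, as a sketch only: multiprocessing raises MaybeEncodingError when a worker's return value cannot be pickled, so each worker could build its own client and return plain tuples instead of the full annotation objects. The code below assumes the server's JSON output places triples under each sentence's "openie" key with "subject", "relation" and "object" fields, and that annotators can be passed per request. `triples_from_json` and the `endpoint` default are my own illustrative choices:

```python
def triples_from_json(ann):
    """Flatten a CoreNLP JSON annotation into picklable (subject, relation, object) tuples."""
    return [
        (t["subject"], t["relation"], t["object"])
        for sentence in ann.get("sentences", [])
        for t in sentence.get("openie", [])  # assumed key for OpenIE output
    ]


def worker_fn(batch, endpoint="http://localhost:9000"):
    """Annotate one batch of texts and return one list of triples per text."""
    # Imported and constructed inside the worker so that no client object
    # has to be pickled and sent to the child process.
    from stanza.server import CoreNLPClient

    annotators = ["tokenize", "ssplit", "pos", "lemma", "depparse", "natlog", "openie"]
    results = []
    with CoreNLPClient(annotators=annotators, start_server=False, endpoint=endpoint) as client:
        for txt in batch:
            ann = client.annotate(txt, annotators=annotators, output_format="json")
            results.append(triples_from_json(ann))
    return results
```

With joblib this would be dispatched as Parallel(n_jobs=8)(delayed(worker_fn)(chunk) for chunk in chunker(texts, 100)), so only lists of string tuples cross the process boundary.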

My questions then are:

  1. Can multiprocessing be implemented using the stanza.server interface?
  2. How can we make it work with parallel libraries like multiprocessing or joblib?
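To make the second question concrete: since the server is already started with -threads 8 and each request is an HTTP round trip, a thread pool on the client side might be enough, and threads avoid pickling results altogether. Here is a rough sketch with the standard library. `annotate_chunks` is a hypothetical helper, and `annotate_chunk` stands for any callable that sends one chunk to the server and returns a list of results. Whether a single client can safely be shared across threads is something I have not checked:

```python
from concurrent.futures import ThreadPoolExecutor


def chunker(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[pos:pos + chunk_size] for pos in range(0, len(items), chunk_size)]


def annotate_chunks(texts, annotate_chunk, chunk_size=100, max_workers=8):
    """Run annotate_chunk over chunks in a thread pool, preserving input order."""
    chunks = chunker(texts, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input chunks.
        per_chunk = list(pool.map(annotate_chunk, chunks))
    # Flatten the per-chunk result lists back into one list.
    return [item for chunk_result in per_chunk for item in chunk_result]
```

Because pool.map keeps the input order, the flattened output lines up one-to-one with the original texts.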

I look forward to your suggestions, hints, etc.

Thanks
