Distributed Processing with Stanza CoreNLP Interface · stanfordnlp/stanza#720


Labels: enhancement, help wanted, question

Description

Hi,

I've really liked how Stanza just "works" out of the box over the last month or so. However, I have recently hit a wall, and the documentation on the Stanza CoreNLP client is a little sparse. The problem is this: I want to extract relations from a large collection of text (~90k sentences on average). Doing this sequentially on a single machine would be prohibitively time-consuming. Hence I want to develop a distributed interface that runs the extractor on separate cores, each with a different chunk of the data.

I detail my current approach and issues below:

On a machine, I start a java server with the following command:

java -Xmx5G -cp "/path/to/stanza_corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-port 9000 \
-timeout 60000 \
-threads 8 \
-maxCharLength 100000 \
-quiet False \
-annotators openie -preload -outputFormat serialized

Inspecting the logs, I find this message: java.lang.IllegalArgumentException: annotator "openie" requires annotation "IndexAnnotation". Apparently openie cannot run on its own and needs the annotators it depends on to be loaded as well.
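For reference, the CoreNLP documentation lists tokenize, ssplit, pos, lemma, depparse and natlog as prerequisites of openie, so the error above is likely a matter of listing them explicitly. Below is a minimal sketch that assembles the same server command with the full annotator chain. The helper name `server_command` and its parameters are hypothetical; only the flags with explicit values from the command above are reproduced, and the remaining flags can be appended unchanged.

```python
import shlex

# Prerequisites of openie as listed in the CoreNLP annotator documentation,
# followed by openie itself. Treat the exact list as an assumption to verify
# against the CoreNLP version in use.
OPENIE_PIPELINE = ["tokenize", "ssplit", "pos", "lemma", "depparse", "natlog", "openie"]


def server_command(classpath, port=9000, threads=8, timeout_ms=60000, memory="5G"):
    """Build the argument list for launching the CoreNLP server (hypothetical helper)."""
    return [
        "java", f"-Xmx{memory}", "-cp", classpath,
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-port", str(port),
        "-timeout", str(timeout_ms),
        "-threads", str(threads),
        "-maxCharLength", "100000",
        # Full dependency chain instead of "openie" alone.
        "-annotators", ",".join(OPENIE_PIPELINE),
    ]


# Print a shell-quoted version of the command for copy and paste.
print(shlex.join(server_command("/path/to/stanza_corenlp/*")))
```

The same list could presumably be passed as `annotators=` on the client side so that requests ask for the full chain rather than openie alone.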

On the client side, I have implemented a rudimentary solution like this:

import pickle  # needed for pickle.load below

from stanza.server import CoreNLPClient
from joblib import Parallel, delayed

def chunker(iterable, chunk_size):
    return (
        iterable[pos : pos + chunk_size]
        for pos in range(0, len(iterable), chunk_size)
    )

def worker_fn(client, batch):
    annotated = []
    for txt in batch:
        ann = client.annotate(txt)
        annotated.append(ann)
    return annotated

# the server end point is the default localhost:9000
client = CoreNLPClient(annotators=["openie"], start_server=False)
data = pickle.load(files[0].open('rb'))
texts = [d['sent_text'] for d in data]
tasks = (delayed(worker_fn)(client, chunk) for chunk in chunker(texts, 100))
result = Parallel(n_jobs=8, backend='multiprocessing', prefer='processes')(tasks)
client.stop()

Running this raises a MaybeEncodingError. The message being passed between processes is the complete annotation of each chunk of sentences (tokenization, POS tags and so on), whereas I only intend to pass around the extracted relation triplets.
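One possible way around this, sketched below under several assumptions, is to let each worker open its own connection to the already running server and return only plain tuples, so that neither the client nor the full annotation object has to be pickled. The field names `sentence`, `openieTriple`, `subject`, `relation` and `object` are assumed to match the protobuf document returned by `client.annotate`; the helper names `triples_from_annotation` and `annotate_in_parallel` are hypothetical. This is not a confirmed fix for the error above.

```python
OPENIE_PIPELINE = ["tokenize", "ssplit", "pos", "lemma", "depparse", "natlog", "openie"]


def chunker(items, chunk_size):
    """Yield consecutive slices of `items` with at most `chunk_size` elements."""
    return (items[pos:pos + chunk_size] for pos in range(0, len(items), chunk_size))


def triples_from_annotation(ann):
    """Reduce an annotated document to plain (subject, relation, object) tuples.

    Assumes `ann.sentence` is iterable and each sentence exposes `openieTriple`.
    """
    return [
        (t.subject, t.relation, t.object)
        for sentence in ann.sentence
        for t in sentence.openieTriple
    ]


def worker_fn(batch, endpoint="http://localhost:9000"):
    """Annotate one batch inside a worker process and return only the triples."""
    # Imported and constructed here so that no client object crosses processes.
    from stanza.server import CoreNLPClient

    with CoreNLPClient(annotators=OPENIE_PIPELINE, endpoint=endpoint,
                       start_server=False) as client:
        return [triples_from_annotation(client.annotate(txt)) for txt in batch]


def annotate_in_parallel(texts, n_jobs=8, chunk_size=100):
    """Fan batches out to worker processes; returns one list of results per batch."""
    # Imported here so the helpers above stay usable without joblib installed.
    from joblib import Parallel, delayed

    tasks = (delayed(worker_fn)(chunk) for chunk in chunker(texts, chunk_size))
    return Parallel(n_jobs=n_jobs)(tasks)
```

With the server from above running, `annotate_in_parallel(texts)` would return, for each batch, a list of triple lists (one per input text). Only strings and tuples travel between processes in this design.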

My questions then are:

Can multiprocessing be implemented using the stanza.server interface?
How can we make it work with parallel libraries like multiprocessing or joblib?
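Since the server is already started with -threads 8 and each client call is an HTTP request, a thread pool might be a simpler alternative to separate processes, because threads share memory and nothing needs to be pickled. The sketch below is generic: `annotate_threaded` is a hypothetical helper that takes any callable, and whether a single CoreNLPClient can safely be shared across threads is an unverified assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate_threaded(annotate, texts, max_workers=8):
    """Apply `annotate` to every text concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `texts`.
        return list(pool.map(annotate, texts))
```

A call such as `annotate_threaded(client.annotate, texts)` would then keep up to eight requests in flight against the server at once.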

I look forward to your suggestions, hints, etc.

Thanks

Contributor guide

Tech stack: python, java, rest api
Domain: backend, performance
Issue type: research
Difficulty: 4
Estimated time: 3-5 days
Activity status: stale
Clarity: clear
Prerequisites: Python multiprocessing, Java CoreNLP, Stanza CoreNLPClient, joblib
Beginner friendliness: 30
Research direction: The issue involves implementing distributed annotation with Stanza's CoreNLP interface. First, review the source file `stanza/server/client.py` to understand the `CoreNLPClient` implementation and the `annotate` method. Consider investigating the Java server startup command to resolve the `IndexAnnotation` error by loading all required annotators. Explore using `joblib` with a shared client per process to avoid serialization of large objects, possibly by passing annotation IDs instead of full text. Examine the existing example scripts in the repository for multiprocessing patterns.