Description
Hi,
I've really liked how Stanza just "works" out of the box over the past month or so. However, I have recently hit a wall, and the documentation on the Stanza CoreNLP client is a little sparse. The problem is this: I want to extract relations from a large collection of text (~90k sentences on average). Doing this sequentially on a single machine would be prohibitively time-consuming, so I want to develop a distributed interface that runs the extractor on separate cores, each with a different chunk of the data.
I detail my current approach and issues below:
On one machine, I start a Java server with the following command:
java -Xmx5G -cp "/path/to/stanza_corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-port 9000 \
-timeout 60000 \
-threads 8 \
-maxCharLength 100000 \
-quiet False \
-annotators openie -preload -outputFormat serialized
Upon inspecting the logs, I find this message: java.lang.IllegalArgumentException: annotator "openie" requires annotation "IndexAnnotation". Apparently, this means all the other annotators have to be loaded as well.
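For context, the CoreNLP annotator documentation lists tokenize, ssplit, pos, lemma, depparse and natlog as prerequisites of openie, so the exception is most likely about those missing upstream annotators. Below is a minimal sketch of building the full value for the -annotators flag. The names `OPENIE_CHAIN` and `annotator_flag` are hypothetical helpers, not part of Stanza or CoreNLP.

```python
# Annotators that openie depends on, in pipeline order, with openie last.
# Assumption: this follows the dependency list in the CoreNLP docs.
OPENIE_CHAIN = ["tokenize", "ssplit", "pos", "lemma", "depparse", "natlog", "openie"]


def annotator_flag(annotators):
    """Join annotator names into the comma-separated value -annotators expects."""
    return ",".join(annotators)


print(annotator_flag(OPENIE_CHAIN))
```

The same list could also be passed as `annotators=` to `CoreNLPClient`, so that the client requests the whole chain rather than openie alone.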
On the client side, I have implemented a rudimentary solution like this:
import pickle

from joblib import Parallel, delayed
from stanza.server import CoreNLPClient


def chunker(iterable, chunk_size):
    # Yield consecutive slices of at most chunk_size items.
    return (
        iterable[pos : pos + chunk_size]
        for pos in range(0, len(iterable), chunk_size)
    )


def worker_fn(client, batch):
    # Annotate every text in the batch and return the full annotation objects.
    annotated = []
    for txt in batch:
        ann = client.annotate(txt)
        annotated.append(ann)
    return annotated


# the server endpoint is the default localhost:9000
client = CoreNLPClient(annotators=["openie"], start_server=False)
# `files` is defined elsewhere (assumed to be a list of pathlib.Path objects)
data = pickle.load(files[0].open('rb'))
texts = [d['sent_text'] for d in data]
tasks = (delayed(worker_fn)(client, chunk) for chunk in chunker(texts, 100))
result = Parallel(n_jobs=8, backend='multiprocessing', prefer='processes')(tasks)
client.stop()
Running this raises a MaybeEncodingError. The message being passed between processes is the complete annotated shard of the sentence text (tokenization, POS tags, and so on), whereas I only intend to pass around the annotation triplets.
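One possible direction, sketched under assumptions: if the error comes from the worker returning whole annotation objects, reducing each annotation to plain tuples inside the worker keeps the inter-process payload small and picklable. The sketch assumes the annotation exposes `sentence` and `openieTriple` fields with `subject`, `relation` and `object` attributes, as in the CoreNLP protobuf schema. `extract_triples` and the `endpoint` parameter of `worker_fn` are hypothetical names for illustration, and the stand-in objects at the bottom only demonstrate the output shape without a running server.

```python
from types import SimpleNamespace


def extract_triples(ann):
    """Flatten an annotated document into plain (subject, relation, object) tuples."""
    return [
        (triple.subject, triple.relation, triple.object)
        for sentence in ann.sentence
        for triple in sentence.openieTriple
    ]


def worker_fn(batch, endpoint="http://localhost:9000"):
    """Annotate a batch inside the worker and return only picklable tuples."""
    # Imported here so each worker process builds its own client instead of
    # receiving a pickled one from the parent process.
    from stanza.server import CoreNLPClient

    client = CoreNLPClient(
        annotators=["openie"], endpoint=endpoint, start_server=False
    )
    return [extract_triples(client.annotate(txt)) for txt in batch]


# Stand-in for an annotated document, only to show the shape of the output.
fake_triple = SimpleNamespace(subject="Stanza", relation="wraps", object="CoreNLP")
fake_ann = SimpleNamespace(sentence=[SimpleNamespace(openieTriple=[fake_triple])])
print(extract_triples(fake_ann))
```

With this variant the tasks would be built as `delayed(worker_fn)(chunk)`, so neither the client nor the full annotations cross process boundaries.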
My questions then are:
- Can multiprocessing be implemented using the stanza.server interface?
- How can we make it work with parallel libraries like multiprocessing or joblib?
I look forward to your suggestions, hints, etc.
Thanks