[SOLVED] Performant and scalable strategy for multi-lingual NLP processing with Apache Spark


This content is from Stack Overflow; the question was asked by IAmJustAWizard.


There are message producers that crawl news sites and push articles to a single topic called "news-feed-topic" at a rate of 500 messages per minute. In the future the message rate might be even higher (5,000-10,000 per minute) as other news sources are introduced.

Each message in the topic must be processed by an NLP pipeline that determines the text's sentiment and detects named entities (NER). It might do even more (e.g. named-entity linking, NEL) in the future. This information is attached to the message and forwarded to another topic.
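To make the per-message contract concrete, here is a minimal sketch of the enrichment step. The message shape, field names, and the stub models are all assumptions for illustration; a real pipeline would call trained sentiment and NER models instead of the stubs.

```python
import json

def analyze_sentiment(text):
    # Stub standing in for a real sentiment model; returns a label and score.
    return {"label": "positive", "score": 0.91}

def extract_entities(text):
    # Stub standing in for a real NER model; returns detected entities.
    return [{"text": "Apache Spark", "type": "PRODUCT"}]

def enrich(message: dict) -> dict:
    """Attach NLP results to a news message before forwarding it."""
    text = message["body"]
    enriched = dict(message)
    enriched["nlp"] = {
        "sentiment": analyze_sentiment(text),
        "entities": extract_entities(text),
    }
    return enriched

msg = {"id": "42", "body": "Apache Spark makes streaming pleasant."}
print(json.dumps(enrich(msg), indent=2))
```

The original message fields pass through unchanged; only an `nlp` field is added, so downstream consumers that ignore it keep working.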


If the producers crawled only English news, it would be sufficient to create a single Structured Streaming Spark job that processes each message and pipes it through model(s) trained on English. Simple.

However, the news obviously comes in other languages as well, such as Mandarin, French, and Russian, which complicates matters: each incoming message can be in any language.

This implies that we must first detect the message's language. Once we know the language with high confidence (> 95%), we proceed with the model that matches it.
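The thresholding step can be sketched independently of any particular detector. The `(language_code, probability)` pair format below mirrors what detectors such as langdetect's `detect_langs` return, but the function itself and the `"und"` fallback code are illustrative assumptions:

```python
def route_language(detections, threshold=0.95, fallback="und"):
    """Pick a language only when the top detection clears the
    confidence threshold; otherwise mark the message undetermined.

    `detections`: list of (language_code, probability) pairs, as a
    detector like langdetect would produce them.
    """
    if not detections:
        return fallback
    lang, prob = max(detections, key=lambda d: d[1])
    return lang if prob > threshold else fallback

print(route_language([("en", 0.99), ("fr", 0.01)]))  # en
print(route_language([("en", 0.60), ("de", 0.40)]))  # und
```

Messages falling below the threshold should still go somewhere (a dead-letter or "undetermined" topic) rather than being dropped, so they can be inspected later.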

Now this is where things get complicated. How do you architect a system like this with Apache Spark and a message queue?

Solution #1

The first approach I thought of uses the previously mentioned language-detector job as a router, forwarding each message to a dedicated per-language topic (news-topic-en, news-topic-fr, etc.).

For each language topic there is a Spark job that consumes the data and processes it with the respective language model.
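The router's topic-naming logic might look like the sketch below. The `SUPPORTED` set, topic names, and dead-letter topic are assumptions for illustration; note also that Spark's Kafka sink can route each row by a per-row `topic` column, so the detector can remain a single streaming job even with many output topics.

```python
# Languages with a trained model and a matching downstream job (assumed set).
SUPPORTED = {"en", "fr", "zh", "ru"}

def topic_for(lang_code: str) -> str:
    """Map a detected language code to its Kafka output topic."""
    if lang_code in SUPPORTED:
        return f"news-topic-{lang_code}"
    # Unsupported or undetermined languages go to a dead-letter topic
    # instead of being dropped silently.
    return "news-topic-unrouted"

print(topic_for("fr"))   # news-topic-fr
print(topic_for("xx"))   # news-topic-unrouted
```

In a Structured Streaming job this function would populate the `topic` column of the DataFrame written to the Kafka sink.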

(Diagram: Solution #1 — the language-detector job fanning out to per-language topics and jobs.)

However, the strategy has some downsides:

  1. System-design complexity grows with every new language introduced. What if 25 languages need to be supported? Is that still manageable and understandable for the team?
  2. Operational complexity – if 25 languages need to be supported, then there must be a matching number of Spark jobs, each with its own codebase.
  3. Resource demands – as language support grows, so does the resource demand. Is running 25 Spark jobs for NLP acceptable?

Solution #2

Another solution would be to create a single Spark job that handles all the languages. However, how much memory would it require to hold all 25 language models, and what about performance? Does it scale?
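One way to soften the memory concern in the single-job design is to load models lazily per executor, so each partition only pays for the languages it actually sees. The registry below is a hedged sketch; `ModelRegistry` and the string "models" are hypothetical names, and a real loader would load e.g. a serialized NLP pipeline rather than a placeholder string.

```python
class ModelRegistry:
    """Lazily load and cache one NLP model per language, so a single
    job only holds models for languages it has actually encountered."""

    def __init__(self, loader):
        self._loader = loader   # callable: lang_code -> model
        self._models = {}

    def get(self, lang_code):
        if lang_code not in self._models:
            # First request for this language: load and cache the model.
            self._models[lang_code] = self._loader(lang_code)
        return self._models[lang_code]

# Hypothetical loader; a real one would deserialize a trained pipeline.
registry = ModelRegistry(lambda lang: f"model-{lang}")
print(registry.get("en"))                          # model-en
print(registry.get("en") is registry.get("en"))    # True (cached)
```

In Spark, such a registry would typically live as a per-executor singleton (e.g. inside a `mapPartitions` closure), so models are loaded once per executor rather than once per record.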

Solution ?

What do you think? How would you solve this?




This question and answer were collected from Stack Overflow and are licensed under CC BY-SA 4.0.
