AWS EMR pyspark using lower version of numpy when running pandas_udf function

Issue

emr = 5.33.1
pyspark = 2.4.7
numpy (desired) = 1.21.5
numpy (seems to be running) = 1.16.5

I am trying to use pandas_udf to distribute inference of a TensorFlow model. However, whenever I run a function wrapped in pandas_udf, it fails with "ImportError: numpy.core.multiarray failed to import".
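For context, a minimal sketch of roughly what such a job looks like. The model path, feature column, and return type are assumptions for illustration, not details from the question; the point is that TensorFlow is imported inside the UDF, so it runs in the executor's Python worker, which is where the older numpy is picked up and the ImportError surfaces.

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("tf-inference").getOrCreate()

# Hypothetical model location; assumed to be present on every executor node.
MODEL_PATH = "/mnt/models/my_model"

@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def predict(features):
    # Import and load inside the UDF so it happens in the executor's Python
    # worker -- this is where numpy 1.16.5 gets used and the
    # "numpy.core.multiarray failed to import" error is raised.
    import tensorflow as tf
    model = tf.keras.models.load_model(MODEL_PATH)
    x = np.stack(features.to_numpy())
    return pd.Series(model.predict(x).ravel())

df = spark.createDataFrame([([0.1, 0.2, 0.3],), ([0.4, 0.5, 0.6],)], ["features"])
df.withColumn("prediction", predict("features")).show()
```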

This seems to be a problem with the numpy version (according to multiple SO questions, and the same code runs fine on my local machine with numpy==1.21.5).

When the EMR cluster is created, bootstrap.sh runs first and the applications are installed afterwards, which overrides the libraries that bootstrap.sh already installed.

Several hacks have been suggested to avoid this error.

These hacks work for plain Python: importing numpy and printing its version gives the correct version. However, a function run through pyspark's pandas_udf still seems to use numpy 1.16.5, which again results in "ImportError: numpy.core.multiarray failed to import".
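A small diagnostic sketch that makes this mismatch visible, assuming a SparkSession running on the cluster: it prints the numpy version (and interpreter) the driver sees, and the version the executors' Python workers actually import inside a pandas_udf. The function and column names are only illustrative.

```python
import sys
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Driver-side check: this is the one that reports 1.21.5 after the hacks.
print("driver numpy:", np.__version__, "interpreter:", sys.executable)

@pandas_udf(StringType(), PandasUDFType.SCALAR)
def executor_numpy_version(batch):
    # Import inside the UDF so the version reflects the executor's Python
    # worker, not whatever the driver has loaded.
    import sys
    import numpy
    info = "{} ({})".format(numpy.__version__, sys.executable)
    return pd.Series([info] * len(batch))

# Expectation based on the question: the driver prints 1.21.5 while the
# executors report 1.16.5.
spark.range(8).select(executor_numpy_version("id")).distinct().show(truncate=False)
```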

How can I make pyspark on AWS EMR use the numpy version installed by the bootstrap action when executing a pandas_udf?
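For reference, the setting that normally controls which Python (and therefore which numpy) the executor worker processes use is PYSPARK_PYTHON. Below is a sketch of setting it from the application side; the interpreter path is an assumption (it should point at whichever Python the bootstrap action installed numpy 1.21.5 into), and this is not a confirmed fix for this question. On YARN these values may also need to go into spark-defaults.conf or be passed via spark-submit --conf so they take effect before the executors start.

```python
from pyspark.sql import SparkSession

# Assumption: the Python interpreter that the bootstrap action installed
# numpy 1.21.5 into. Adjust to the real path on the cluster nodes.
PYTHON_WITH_NEW_NUMPY = "/usr/bin/python3"

spark = (
    SparkSession.builder
    .appName("pandas-udf-numpy-pin")
    # Python used by the YARN executors' worker processes (where pandas_udf runs).
    .config("spark.executorEnv.PYSPARK_PYTHON", PYTHON_WITH_NEW_NUMPY)
    # Python used by the YARN application master in cluster mode.
    .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", PYTHON_WITH_NEW_NUMPY)
    # Spark's own property for the Python binary used by PySpark.
    .config("spark.pyspark.python", PYTHON_WITH_NEW_NUMPY)
    .getOrCreate()
)
```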



Solution

This question has not yet been answered.




Source: question asked by haneulkim.


This post is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.