AWS EMR pyspark uses a lower version of numpy when running a pandas_udf function


numpy(desired) = 1.21.5
numpy(seems to be running) = 1.16.5

I am trying to use pandas_udf to distribute inference of a TensorFlow model. However, whenever I run a function wrapped in pandas_udf, it fails with ImportError: numpy.core.multiarray failed to import

This seems to be a problem with the numpy version: multiple SO questions point to it, and the same code runs fine on my local machine with numpy==1.21.5.

When an EMR cluster is created, the bootstrap step runs first and the cluster's applications are installed afterwards, which overrides the libraries the bootstrap step had already installed.

There are a lot of hacks for avoiding this override.

These hacks work for plain Python: importing numpy and printing its version gives the correct version. However, a pyspark pandas_udf function still seems to use numpy 1.16.5, which results in ImportError: numpy.core.multiarray failed to import
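To narrow down where the two versions diverge, a small diagnostic along these lines may help. It is only a sketch: the helper name describe_module is made up for illustration, and the commented-out executor call assumes an active SparkSession named spark. It reports which interpreter is running and which copy of a module that interpreter imports, so the driver and the executors can be compared.

```python
import importlib
import sys


def describe_module(name="numpy"):
    """Report the running interpreter and the version/location of a module.

    Hypothetical helper for illustration; not part of pyspark or EMR.
    """
    info = {"executable": sys.executable, "module": name}
    try:
        mod = importlib.import_module(name)
        # Not every module defines __version__, so fall back to None.
        info["version"] = getattr(mod, "__version__", None)
        info["file"] = getattr(mod, "__file__", None)
    except ImportError as exc:
        # Record the failure instead of raising, so the result can be collected.
        info["error"] = str(exc)
    return info


if __name__ == "__main__":
    # On the driver: shows the interpreter and numpy copy the driver sees.
    print(describe_module("numpy"))

    # On the executors (assumes an active SparkSession named `spark`):
    # rdd = spark.sparkContext.parallelize(range(2), 2)
    # print(rdd.map(lambda _: describe_module("numpy")).collect())
```

If the executable or file paths differ between the driver output and the executor output, the workers are likely running a different Python environment from the one the bootstrap step installed into.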

How can I make AWS EMR pyspark use the version of numpy installed by the bootstrap step when running a pandas_udf?


Question asked by haneulkim

