[SOLVED] Millisecond behind latest jumps after Flink version upgrade

Issue

This Content is from Stack Overflow. Question asked by 781850685

I noticed a very strange behavior with a recent version bump from Flink 1.14.4 to 1.15.2. My project consumes around 30K records per second from a sharded kinesis stream, and during the version upgrade, it will follow the best practice to first trigger a savepoint from the running job, start the new job from the savepoint and then remove the old job. So far so good, and the above logic has been tested multiple times without any issue for 1.14.4. Usually, after the version upgrade, our job will have a few minutes delay for millisecond behind latest, but it will catch up with the speed quickly(within 30mins). Our savepoint is around one hundred MBs big, and our job DAG will become 90 – 100% busy with some backpressure when we redeploy but after 10-20 minutes it goes back to normal.

Then the strange thing happened, when I tried to redeploy with 1.15.2 upgrade from a running 1.14.4 job, I can see a savepoint has been created and the new job is running, all the metrics look fine, except suddenly millisecond behind the latest jumps to 10 hours!! and it takes days for my application to catch up with the kinesis stream latest record. I don’t understand why it jumps from 0 second to 10+ hours when we restart the new job. The only main change I introduced with version bump is to change failOnError from true to false, but I don’t think this is the root cause.

I have one assumption, I tried to redeploy the new 1.15.2 job by changing our parallelism, redeploying a job from 1.15.2 does not introduce a big delay, so I assume the issue above only happens when we bump version from 1.14.4 to 1.15.2? I did try to bump it twice and I see the same 10hrs+ jump in delay.

Any insights are welcome, thank you.



Solution

While looking through the Flink 1.15 changes related to the Kinesis consumer, there’s nothing obvious that stands out for me. I would recommend filing a Jira ticket with the Flink community on this issue. See https://issues.apache.org/jira/projects/FLINK/issues


This Question was asked in StackOverflow by 781850685 and Answered by Martijn Visser It is licensed under the terms of CC BY-SA 2.5. - CC BY-SA 3.0. - CC BY-SA 4.0.

people found this article helpful. What about you?