[🐛 BUG]: Downscale workers didn't work correctly #2092

Smolevich · 2024-12-16T16:59:00Z

No duplicates 🥲.

I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

TEMPORAL_ROADRUNNER_MAX_WORKERS_COUNT=20
TEMPORAL_ROADRUNNER_WORKERS_COUNT=5

When starting the RoadRunner server with the configuration below, I observed a problem with worker scaling:
• The initial number of workers in the configuration file is set to 5.
• The maximum number of workers allowed under load for Temporal Activities is set to 20.

During a heavy-load scenario involving tens of thousands of workflows and hundreds of thousands of activities, the worker count scaled up as expected, although the total number of Activity workers reached 28, above the configured value of 20.

However, once the load dropped to zero (all activities were fully processed), the number of workers did not scale back down to the original value of 5.
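To make the expectation concrete, here is a toy model of the behaviour I expected. It is only a sketch and not RoadRunner's implementation: the `ToyPool` class is hypothetical, and it assumes that `max_workers` caps the dynamically spawned workers and that all of them are released once no allocation has happened for `idle_timeout` seconds.

```python
from dataclasses import dataclass


@dataclass
class ToyPool:
    """Toy model of the expected behaviour; this is not RoadRunner's implementation."""
    num_workers: int = 5       # base pool size (num_workers in the config)
    max_workers: int = 20      # ASSUMPTION: cap on the dynamically spawned workers
    spawn_rate: int = 5        # workers added per allocation attempt
    idle_timeout: float = 10   # seconds without allocations before extras are released
    extra: int = 0             # dynamically spawned workers currently alive
    last_alloc: float = 0.0    # time of the most recent dynamic allocation

    def allocate(self, now: float) -> None:
        """Called when no free worker is available: spawn up to spawn_rate extras."""
        self.extra = min(self.extra + self.spawn_rate, self.max_workers)
        self.last_alloc = now

    def tick(self, now: float) -> None:
        """Release all extras once the pool has been idle for idle_timeout seconds."""
        if self.extra and now - self.last_alloc >= self.idle_timeout:
            self.extra = 0

    @property
    def total(self) -> int:
        return self.num_workers + self.extra


pool = ToyPool()
pool.allocate(now=0)
pool.allocate(now=1)
busy_total = pool.total      # 15 workers while under load
pool.tick(now=5)
still_busy = pool.total      # still 15: only 4s since the last allocation
pool.tick(now=11)
idle_total = pool.total      # back to 5 after 10s without allocations
```

In the real run the count never returned to the base value, which is the behaviour this issue is about.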

Version (rr --version)

2024.3.0 (build time: 2024-12-05T18:39:32+0000, go1.23.4), OS: linux, arch: amd64

How to reproduce the issue?

version: "3"

rpc:
    listen: tcp://0.0.0.0:6001

server:
    command: "php ../bin/console app:temporal-worker-run"

temporal:
    namespace: ${TEMPORAL_NAMESPACE}
    address: ${TEMPORAL_ADDRESS}
    activities:
        max_jobs: 100
        allocate_timeout: 360s
        command: "php ../bin/console app:temporal-worker-run"
        num_workers: ${TEMPORAL_ROADRUNNER_WORKERS_COUNT}
        destroy_timeout: 1s
        dynamic_allocator:
          max_workers: ${TEMPORAL_ROADRUNNER_MAX_WORKERS_COUNT}
          spawn_rate: 5
          idle_timeout: 10s
    metrics:
        address: ${TEMPORAL_ADDRESS_METRICS}
        prefix: "mars"
        type: "summary"

metrics:
    address: ${RR_ADDRESS_METRICS}

logs:
    mode: production
    level: debug
    output: stderr

otel:
  resource:
    service_name: "roadrunner"
    service_version: "1.0.0"
    service_namespace: "${OTEL_RESOURCE_NAMESPACE}"
    service_instance_id: "${HOSTNAME}"
  exporter: otlp
  endpoint: "${OTEL_EXPORTER_OTLP_ENDPOINT}"
  headers:
    api-key: "${NEW_RELIC_API_KEY}"

Steps to Reproduce

1. Configure RoadRunner with an initial worker count of 5 and a maximum scaling limit of 20.
2. Create a single Workflow with at least one Activity inside it.
3. Submit approximately 10,000 Workflows with the corresponding Activities to Temporal.
4. Observe that the worker count scales up under the load (e.g., reaching 28 workers).
5. Allow all Activities to be processed so that the load drops completely to zero.
6. Check the worker count after load reduction.
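The check in step 6 can be made explicit with a small helper. This is a sketch under assumptions: `scaled_down` is a hypothetical function, the `(time, worker_count)` samples are collected by whatever means is available (for example, by running `rr workers` periodically or by reading the metrics endpoint), and the 5-second `grace` period is an arbitrary allowance on top of `idle_timeout`.

```python
from typing import Iterable, Tuple


def scaled_down(
    samples: Iterable[Tuple[float, int]],
    load_end: float,
    baseline: int = 5,          # num_workers from the config
    idle_timeout: float = 10.0, # idle_timeout from the config, in seconds
    grace: float = 5.0,         # ASSUMPTION: extra slack before we call it a failure
) -> bool:
    """Return True if every sample taken after the deadline is at or below baseline."""
    deadline = load_end + idle_timeout + grace
    late = [count for t, count in samples if t >= deadline]
    # Without any sample past the deadline we cannot claim the pool scaled down.
    return bool(late) and all(count <= baseline for count in late)


# Illustrative (made-up) timelines: seconds since start, observed worker count.
stuck = [(0, 5), (30, 28), (60, 28), (100, 28)]
healthy = [(0, 5), (30, 20), (60, 20), (80, 5)]

stuck_ok = scaled_down(stuck, load_end=60)      # False: still 28 workers at t=100
healthy_ok = scaled_down(healthy, load_end=60)  # True: back to 5 workers at t=80
```

The `stuck` timeline mirrors what is described in this report: the count stays at its peak long after the load has ended.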

Relevant log output

{"level":"info","ts":1734352476002676546,"logger":"temporal","msg":"Activity complete after timeout.","Namespace":"stage-10","TaskQueue":"default","WorkerID":"default:20f10cbd-3161-4b99-98da-390fc6e94208","WorkflowID":"grouping_auto_pso_new_bi_main_10e6e20e-03fe-45a5-aa40-cdea46e79880_20241216","RunID":"c4af45d9-149a-4028-a2e7-110e9a91e925","ActivityType":"grouping_candidates.isOperBrandPlanLimitExceeded","Attempt":1,"Result":"<nil>","Error":"activity_pool_execute_activity:\n\tstatic_pool_exec:\n\tallocate_dynamically: failed to reset the TTL listener"}
{"level":"info","ts":1734352476002767912,"logger":"temporal","msg":"Task processing failed with client side error","Namespace":"stage-10","TaskQueue":"default","WorkerID":"default:20f10cbd-3161-4b99-98da-390fc6e94208","WorkerType":"ActivityWorker","Error":"context deadline exceeded"}
{"level":"debug","ts":1734352476002978390,"logger":"server","msg":"No free workers, trying to allocate dynamically","idle_timeout":10,"max_workers":20,"spawn_rate":5}
{"level":"debug","ts":1734352476003002985,"logger":"server","msg":"dynamic allocator listener already started, trying to allocate worker immediately with 2s timeout"}
{"level":"debug","ts":1734352476010849713,"logger":"temporal","msg":"workflow task started","time":1}
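Since the log is one JSON object per line, the allocator-related entries can be pulled out with a few lines of Python. This is a minimal sketch: the sample lines are trimmed copies of three entries above, `summarize` and `allocator_events` are hypothetical helper names, and `ts` is assumed to be nanoseconds since the Unix epoch (inferred from its magnitude).

```python
import json
from datetime import datetime, timezone

# Trimmed copies of three entries from the log output above (most fields dropped).
SAMPLE_LINES = [
    r'{"level":"info","ts":1734352476002676546,"logger":"temporal","msg":"Activity complete after timeout.","Error":"activity_pool_execute_activity:\n\tstatic_pool_exec:\n\tallocate_dynamically: failed to reset the TTL listener"}',
    r'{"level":"info","ts":1734352476002767912,"logger":"temporal","msg":"Task processing failed with client side error","Error":"context deadline exceeded"}',
    r'{"level":"debug","ts":1734352476002978390,"logger":"server","msg":"No free workers, trying to allocate dynamically","idle_timeout":10,"max_workers":20,"spawn_rate":5}',
]


def summarize(line: str) -> dict:
    """Reduce one JSON log line to the fields relevant for this issue."""
    rec = json.loads(line)
    seconds = rec["ts"] // 1_000_000_000  # ASSUMPTION: ts is in nanoseconds
    return {
        "time": datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat(),
        "logger": rec.get("logger", ""),
        "msg": rec.get("msg", ""),
        "error": rec.get("Error", ""),
    }


def allocator_events(lines):
    """Keep only entries whose message or error mentions allocation."""
    events = [summarize(line) for line in lines if line.strip()]
    return [e for e in events if "allocat" in (e["msg"] + " " + e["error"]).lower()]


events = allocator_events(SAMPLE_LINES)
for e in events:
    print(e["time"], e["logger"], e["msg"], "|", e["error"].replace("\n\t", " > "))
```

Filtering this way puts the `failed to reset the TTL listener` error next to the `No free workers` message from the `server` logger, which makes the two easier to compare.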


rustatian · 2024-12-23T15:13:01Z

Hey @Smolevich 👋🏻
As I mentioned in the Discord discussion, could you please create a repo (or attach working worker, activity, and workflow code) that reproduces the issue?

Smolevich added the B-bug (Bug: bug, exception) and F-need-verification labels Dec 16, 2024

Smolevich assigned rustatian Dec 16, 2024

rustatian mentioned this issue Dec 23, 2024

[🧹 CHORE]: Autoscale. First look, bugs, proposals #2086

