[🐛 BUG]: Downscale workers didn't work correctly #2092

Open
1 task done
Smolevich opened this issue Dec 16, 2024 · 1 comment
Labels
B-bug Bug: bug, exception F-need-verification

Comments

@Smolevich
Contributor

No duplicates 🥲.

  • I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

TEMPORAL_ROADRUNNER_MAX_WORKERS_COUNT=20
TEMPORAL_ROADRUNNER_WORKERS_COUNT=5

When running the RoadRunner server with the configuration below, I observed a problem with worker scaling:
• The initial number of workers in the configuration file was set to 5.
• The maximum number of workers allowed under load for Temporal Activities was set to 20.

During a heavy load scenario involving tens of thousands of workflows and hundreds of thousands of activities, the scaling correctly increased the worker count. The total number of Activity workers reached 28.

However, once the load dropped to zero (all activities were fully processed), the number of workers did not scale back down to the original value of 5.
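
To make the expected behaviour concrete, here is a toy model of the scaling described above. It is only a sketch and not RoadRunner's allocator: the `ToyPool` class and its `tick` method are hypothetical, and it assumes that `max_workers` caps the total worker count, which is how this report reads the setting. RoadRunner itself may interpret the value differently.

```python
from dataclasses import dataclass


@dataclass
class ToyPool:
    """Toy model of the scaling this report expects; NOT RoadRunner's allocator."""

    base: int = 5           # num_workers from the config
    max_workers: int = 20   # assumption: treated as a cap on the total
    spawn_rate: int = 5     # workers added per allocation attempt
    idle_timeout: int = 10  # seconds without load before extra workers go away
    workers: int = 0
    idle_for: int = 0

    def __post_init__(self) -> None:
        self.workers = self.base

    def tick(self, busy: bool, seconds: int = 1) -> int:
        """Advance the model; `busy` means no free worker was available."""
        if busy:
            self.idle_for = 0
            self.workers = min(self.max_workers, self.workers + self.spawn_rate)
        else:
            self.idle_for += seconds
            if self.idle_for >= self.idle_timeout:
                # Expected behaviour: drop back to the configured baseline.
                self.workers = self.base
        return self.workers


pool = ToyPool()
peak = max(pool.tick(busy=True) for _ in range(10))
after_idle = [pool.tick(busy=False) for _ in range(10)][-1]
print(peak, after_idle)
```

Under these assumptions the model never exceeds 20 workers and returns to 5 once `idle_timeout` has elapsed, whereas the observed pool reached 28 and stayed there.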

Version (rr --version)

2024.3.0 (build time: 2024-12-05T18:39:32+0000, go1.23.4), OS: linux, arch: amd64

How to reproduce the issue?

version: "3"

rpc:
    listen: tcp://0.0.0.0:6001

server:
    command: "php ../bin/console app:temporal-worker-run"

temporal:
    namespace: ${TEMPORAL_NAMESPACE}
    address: ${TEMPORAL_ADDRESS}
    activities:
        max_jobs: 100
        allocate_timeout: 360s
        command: "php ../bin/console app:temporal-worker-run"
        num_workers: ${TEMPORAL_ROADRUNNER_WORKERS_COUNT}
        destroy_timeout: 1s
        dynamic_allocator:
          max_workers: ${TEMPORAL_ROADRUNNER_MAX_WORKERS_COUNT}
          spawn_rate: 5
          idle_timeout: 10s
    metrics:
        address: ${TEMPORAL_ADDRESS_METRICS}
        prefix: "mars"
        type: "summary"

metrics:
    address: ${RR_ADDRESS_METRICS}

logs:
    mode: production
    level: debug
    output: stderr

otel:
  resource:
    service_name: "roadrunner"
    service_version: "1.0.0"
    service_namespace: "${OTEL_RESOURCE_NAMESPACE}"
    service_instance_id: "${HOSTNAME}"
  exporter: otlp
  endpoint: "${OTEL_EXPORTER_OTLP_ENDPOINT}"
  headers:
    api-key: "${NEW_RELIC_API_KEY}"

Steps to Reproduce

1. Configure RoadRunner with an initial worker count of 5 and a maximum scaling limit of 20.
2. Create a single Workflow with at least one Activity inside it.
3. Submit approximately 10,000 Workflows with the corresponding Activities to Temporal.
4. Observe that the worker count scales up under the load (e.g., reaching 28 workers).
5. Allow all Activities to be processed so that the load drops completely to zero.
6. Check the worker count after load reduction.
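
The check in the last step can be made repeatable with a small helper like the one below. This is a minimal sketch: `seconds_until_downscale` is a hypothetical name, and the samples are assumed to be `(seconds since start, worker count)` pairs collected by whatever means is available. The sample values are illustrative, not measurements.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, int]  # (seconds since start, observed worker count)


def seconds_until_downscale(samples: Sequence[Sample], base: int) -> Optional[float]:
    """Time from the last sample at peak size to the first later sample
    at or below `base`; None if the pool never returned to `base`."""
    if not samples:
        return None
    peak = max(count for _, count in samples)
    last_peak = max(i for i, (_, count) in enumerate(samples) if count == peak)
    peak_time = samples[last_peak][0]
    for t, count in samples[last_peak + 1:]:
        if count <= base:
            return t - peak_time
    return None


# Illustrative data only: the shape reported here versus the shape expected.
reported = [(0, 5), (30, 28), (60, 28), (120, 28)]
expected = [(0, 5), (30, 20), (60, 20), (75, 5)]
print(seconds_until_downscale(reported, base=5))
print(seconds_until_downscale(expected, base=5))
```

A `None` result long after the load has drained, with `idle_timeout: 10s` configured, is the behaviour this issue describes.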

Relevant log output

{"level":"info","ts":1734352476002676546,"logger":"temporal","msg":"Activity complete after timeout.","Namespace":"stage-10","TaskQueue":"default","WorkerID":"default:20f10cbd-3161-4b99-98da-390fc6e94208","WorkflowID":"grouping_auto_pso_new_bi_main_10e6e20e-03fe-45a5-aa40-cdea46e79880_20241216","RunID":"c4af45d9-149a-4028-a2e7-110e9a91e925","ActivityType":"grouping_candidates.isOperBrandPlanLimitExceeded","Attempt":1,"Result":"<nil>","Error":"activity_pool_execute_activity:\n\tstatic_pool_exec:\n\tallocate_dynamically: failed to reset the TTL listener"}
{"level":"info","ts":1734352476002767912,"logger":"temporal","msg":"Task processing failed with client side error","Namespace":"stage-10","TaskQueue":"default","WorkerID":"default:20f10cbd-3161-4b99-98da-390fc6e94208","WorkerType":"ActivityWorker","Error":"context deadline exceeded"}
{"level":"debug","ts":1734352476002978390,"logger":"server","msg":"No free workers, trying to allocate dynamically","idle_timeout":10,"max_workers":20,"spawn_rate":5}
{"level":"debug","ts":1734352476003002985,"logger":"server","msg":"dynamic allocator listener already started, trying to allocate worker immediately with 2s timeout"}
{"level":"debug","ts":1734352476010849713,"logger":"temporal","msg":"workflow task started","time":1}
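
For anyone triaging longer logs, a sketch along these lines pulls out the innermost error message and the allocator settings from JSON lines like the ones above. It assumes `ts` is a Unix timestamp in nanoseconds, which matches the values shown. `summarize` is a hypothetical helper, and the two sample lines are abbreviated copies of the entries above.

```python
import json
from datetime import datetime, timezone

ALLOCATOR_KEYS = ("idle_timeout", "max_workers", "spawn_rate")


def summarize(lines):
    """Collect (time, innermost error) pairs and allocator settings from JSON log lines."""
    errors, allocator = [], []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        entry = json.loads(raw)
        # Assumption: `ts` is nanoseconds since the Unix epoch.
        seconds, _ = divmod(entry["ts"], 1_000_000_000)
        when = datetime.fromtimestamp(seconds, tz=timezone.utc)
        if "Error" in entry:
            # Errors are nested one per line; the last line is the innermost cause.
            errors.append((when, entry["Error"].splitlines()[-1].strip()))
        if all(key in entry for key in ALLOCATOR_KEYS):
            allocator.append({key: entry[key] for key in ALLOCATOR_KEYS})
    return {"errors": errors, "allocator": allocator}


# Abbreviated copies of two log lines from this report.
SAMPLE = [
    r'{"level":"info","ts":1734352476002676546,"logger":"temporal","Error":"activity_pool_execute_activity:\n\tstatic_pool_exec:\n\tallocate_dynamically: failed to reset the TTL listener"}',
    r'{"level":"debug","ts":1734352476002978390,"logger":"server","msg":"No free workers, trying to allocate dynamically","idle_timeout":10,"max_workers":20,"spawn_rate":5}',
]

summary = summarize(SAMPLE)
for when, message in summary["errors"]:
    print(when.isoformat(), message)
print(summary["allocator"])
```

In the excerpt above, the innermost error is `allocate_dynamically: failed to reset the TTL listener`, logged while the allocator reports `max_workers` of 20.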
@rustatian
Member

Hey @Smolevich 👋🏻
As I mentioned in the Discord discussion, could you please create a repo (or attach working worker, activity, and workflow code) that reproduces the issue?
