Bug: Store-gateway is stuck in starting phase #10279

Open
riteshsonawane1372 opened this issue Dec 19, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@riteshsonawane1372

What is the bug?

I recently uploaded around 90GB of TSDB data directly to S3, and since then my store-gateway has been stuck in the starting phase. I have enabled debug logs but don't see any errors (sharing screenshots). I have used this approach more than 7 times before, but now it is causing this problem. [Context: doing an InfluxDB-to-Mimir migration, using promtool to generate the TSDB blocks.]

[Two screenshots of the store-gateway debug logs attached.]

How to reproduce it?

Push TSDB blocks to S3 and query the data using Grafana for timestamps older than 24 hours.
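
For what it's worth, the query step can also be reproduced without Grafana. A minimal sketch, assuming Mimir's default /prometheus API prefix and the 9009 listen port from the config shared further down in the thread; the query and time range are only examples:

# Query a range older than 24h directly against Mimir's Prometheus-compatible API.
# Port and API prefix are assumptions (defaults / the config below); adjust as needed.
curl -s 'http://localhost:9009/prometheus/api/v1/query_range' \
  --data-urlencode 'query=up' \
  --data-urlencode "start=$(date -d '48 hours ago' +%s)" \
  --data-urlencode "end=$(date -d '25 hours ago' +%s)" \
  --data-urlencode 'step=300'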

What did you think would happen?

I don't know why it is taking so long to load the TSDB blocks. It was working previously.

What was your environment?

Deployment was done using Puppet on a VM. Currently running a single instance on 1 VM.

Any additional context to share?

No response

@riteshsonawane1372 added the bug label on Dec 19, 2024
@riteshsonawane1372
Author

@narqo can you help here?

@narqo
Contributor

narqo commented Dec 19, 2024

From the store-gateway's logs in the screenshot, can you tell whether the system managed to load all the discovered blocks, or whether it got stuck loading some of the remaining ones (e.g. compare the IDs of the blocks in the "loaded new block" lines with those you have in the object store)?

You may want to collect a goroutine profile and a Go runtime trace to explore where exactly it is stuck.
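
For example, a minimal sketch of collecting both, assuming the store-gateway's HTTP server is reachable on the http_listen_port from the config below (9009) and that the standard Go pprof endpoints are exposed under /debug/pprof:

# Goroutine dump (full stacks) and a 15-second runtime trace; the port is an assumption.
curl -s 'http://localhost:9009/debug/pprof/goroutine?debug=2' -o goroutines.txt
curl -s 'http://localhost:9009/debug/pprof/trace?seconds=15' -o trace.out

# Inspect the trace locally with the Go toolchain.
go tool trace trace.out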

What was your environment?

Also, could you share more details about the system? Which version of Mimir is that? Can you share the configuration options you're running with (please make sure to redact any sensitive information from the config)? Can you attach the whole log file from the point of the store-gateway's start (as text, not a screenshot)?

@riteshsonawane1372
Author

mimir --version
Mimir, version 2.14.1 (branch: , revision: )
  go version:       go1.23.2
  platform:         linux/amd64

Config File

multitenancy_enabled: false
common:
  storage:
    backend: s3
    s3:
      endpoint: ""
      region: us-east-1
      access_key_id: ""
      secret_access_key: ""
frontend:
  port: 9001
  address: 0.0.0.0

limits:
  compactor_block_upload_enabled: true
  creation_grace_period: 0
  past_grace_period: 0
  compactor_blocks_retention_period: 6y
  out_of_order_time_window: 42720h
  max_global_series_per_user: 0
  ingestion_burst_size: 260000
  ingestion_rate: 200000
  max_label_names_per_series: 100
  max_metadata_length: 2048
  ignore_ooo_exemplars: true
  max_fetched_chunks_per_query: 0

server:
  log_level: debug
  http_listen_port: 9009
  http_listen_address: 0.0.0.0
  http_server_read_timeout: 6000s
  http_server_write_timeout: 30m
  http_server_idle_timeout: 30m
  # Configure the server to allow messages up to 500MB.
  grpc_server_max_recv_msg_size: 524288000
  grpc_server_max_send_msg_size: 524288000
  grpc_server_max_concurrent_streams: 2000

distributor:
  ring:
    kvstore:
      store: memberlist
  pool:
    health_check_ingesters: true
ingester:

  read_circuit_breaker:
    enabled: true
    initial_delay: 45s
    cooldown_period: 20s
    failure_threshold_percentage: 800
    request_timeout: 50000s
  push_circuit_breaker:
    enabled: true
    cooldown_period: 20s
    initial_delay: 45s
    request_timeout: 50000s
  instance_limits:
    max_series: 0
    max_inflight_push_requests: 0
  ring:
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512
    kvstore:
      store: inmemory
    replication_factor: 1

blocks_storage:
  backend: s3
  s3:
    endpoint: ""
    access_key_id: ""
    insecure: true
    region: us-east-1
    secret_access_key: ""
    bucket_name: ""

  tsdb:
    dir: /media/storage/mimir/tsdb

  bucket_store:
    bucket_index:
      idle_timeout: 24h
      max_stale_period: 720h
    sync_interval: 15m
    ignore_deletion_mark_delay: 8h
    ignore_deletion_mark_while_querying_delay: 7h
    sync_dir: /media/storage/mimir/tsdb-sync
    block_sync_concurrency: 24
    tenant_sync_concurrency: 4
    meta_sync_concurrency: 30

compactor:
  block_ranges: [2h0m0s,12h0m0s,24h0m0s,168h0m0s]
  data_dir: /media/storage/mimir/compactor
  compaction_concurrency: 4

store_gateway:
  sharding_ring:
    wait_stability_max_duration: 1m
    replication_factor: 1 # Change this back to 3 
ruler_storage:
  backend: filesystem
  filesystem:
    dir: /media/storage/mimir/rules

Log

dia/storage/mimir/tsdb-sync/anonymous/01JEAB0BMBA9QYY1T0Q30RRTFK/sparse-index-header
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.04501268Z caller=bucket.go:452 level=info user=anonymous msg="loaded new block" elapsed=209.793899ms id=01JEATD323G2APVKHP5NNK4YZE
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.045042897Z caller=bucket.go:443 level=debug user=anonymous msg="loading new block" id=01JF22PTVSN1447970YM3HMTKB
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.045155145Z caller=spanlogger.go:111 user=anonymous method=indexheader.NewStreamBinaryReader caller=log.go:168 level=debug msg="index header file size" bytes=13453
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.045367867Z caller=spanlogger.go:111 user=anonymous method=indexheader.NewStreamBinaryReader caller=log.go:168 level=info msg="loading sparse index-header from disk" id=01JF22PTVSN1447970YM3HMTKB path=/media/storage/mimir/tsdb-sync/anonymous/01JF22PTVSN1447970YM3HMTKB/sparse-index-header
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.045435572Z caller=spanlogger.go:111 user=anonymous method=indexheader.NewStreamBinaryReader caller=log.go:168 level=info msg="loaded sparse index-header from disk" id=01JF22PTVSN1447970YM3HMTKB path=/media/storage/mimir/tsdb-sync/anonymous/01JF22PTVSN1447970YM3HMTKB/sparse-index-header elapsed=69.567µs
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.057445191Z caller=push.go:211 level=error user=anonymous msg="detected an error while ingesting Prometheus remote-write request (the request may have been partially ingested)" httpCode=500 err="distributor is unavailable (current state: Terminated)"
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.057486624Z caller=logging.go:128 level=warn trace_id_unsampled=148d21a492a2c6e1 msg="POST /api/v1/push (500) 82.065µs Response: \"distributor is unavailable (current state: Terminated)\\n\" ws: false; Accept-Encoding: gzip; Content-Encoding: snappy; Content-Length: 36053; Content-Type: application/x-protobuf; User-Agent: Telegraf/1.32.3 Go/1.23.3; X-Prometheus-Remote-Write-Version: 0.1.0; "
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.05929124Z caller=bucket.go:452 level=info user=anonymous msg="loaded new block" elapsed=215.225264ms id=01JEP98PZMYW3XT54KMDTXC01R
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.059311234Z caller=bucket.go:443 level=debug user=anonymous msg="loading new block" id=01JDYM57NQ68RYJ0E469N0SX75
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.059401735Z caller=spanlogger.go:111 user=anonymous method=indexheader.NewStreamBinaryReader caller=log.go:168 level=debug msg="index header file size" bytes=6610
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.059603143Z caller=spanlogger.go:111 user=anonymous method=indexheader.NewStreamBinaryReader caller=log.go:168 level=info msg="loading sparse index-header from disk" id=01JDYM57NQ68RYJ0E469N0SX75 path=/media/storage/mimir/tsdb-sync/anonymous/01JDYM57NQ68RYJ0E469N0SX75/sparse-index-header
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.059655761Z caller=spanlogger.go:111 user=anonymous method=indexheader.NewStreamBinaryReader caller=log.go:168 level=info msg="loaded sparse index-header from disk" id=01JDYM57NQ68RYJ0E469N0SX75 path=/media/storage/mimir/tsdb-sync/anonymous/01JDYM57NQ68RYJ0E469N0SX75/sparse-index-header elapsed=53.336µs
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.068255828Z caller=bucket.go:452 level=info user=anonymous msg="loaded new block" elapsed=215.735913ms id=01JF0WHVQFA3ATXY1PJYYT2XNC
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.068275417Z caller=bucket.go:443 level=debug user=anonymous msg="loading new block" id=01JE6P94DXXKFY2G5QN6GE4BDG
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.0683482Z caller=spanlogger.go:111 user=anonymous method=indexheader.NewStreamBinaryReader caller=log.go:168 level=debug msg="index header file size" bytes=4139
Dec 19 12:59:07 "" mimir[23503]: ts=2024-12-19T12:59:07.069191484Z caller=spanlogger.go:111 user=anonymous method=indexheader.NewStreamBinaryReader caller=log.go:168 level=info msg="loading sparse index-header from disk" id=01JE6P94DXXKFY2G5QN6GE4BDG path=/media/storage/mimir/tsdb-sync/anonymous/01JE6P94DXXKFY2G5QN6GE4BDG/sparse-index-header

@riteshsonawane1372
Author

It is continuously loading new blocks. I've been unable to query anything older than the past 24 hours; previously I was able to query 90 days of data. But after pushing the last 80GB of TSDB data it is stuck in the starting phase. On the local disk it has loaded around 6GB of index-headers at the tsdb-sync location. For testing, I have also disabled lazy loading, but that didn't help.

@riteshsonawane1372
Author

@pracucci can you help here?

@pracucci
Collaborator

But after pushing the last 80GB of TSDB data it is stuck in the starting phase. On the local disk it has loaded around 6GB of index-headers at the tsdb-sync location.

If the store-gateway is stuck in the starting phase but the local disk utilization is still growing, then the store-gateway is loading the new blocks (maybe very slowly, but loading). On the contrary, if the disk utilization is not growing, then it looks stuck as you say. Which of the two is it?
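
As a rough way to tell the two cases apart, something like the sketch below could be run periodically (the paths assume the sync_dir from the config above):

# Watch whether the store-gateway is still making progress: the directory size
# and the number of locally synced block directories should keep growing.
watch -n 60 "du -sh /media/storage/mimir/tsdb-sync; \
  find /media/storage/mimir/tsdb-sync/anonymous -mindepth 1 -maxdepth 1 -type d | wc -l"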

@riteshsonawane1372
Author

The disk space is growing. In the current scenario we saw something interesting: the number of blocks at /media/storage/tsdb-sync/anonymous stops at 319910 and never exceeds that number, while counting all the blocks in S3 gives 416138. Even if we delete everything at tsdb-sync, the store-gateway fetches headers for only 319910 blocks again, not 416138. I don't know why we don't see any error.
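
For reference, a rough sketch of how the two counts could be compared; the bucket name is a placeholder, and this relies on each block having exactly one meta.json in the object store:

# Blocks in S3 for the anonymous tenant (YOUR_BUCKET is a placeholder).
aws s3 ls s3://YOUR_BUCKET/anonymous/ --recursive | grep -c '/meta.json'

# Block directories synced locally by the store-gateway.
find /media/storage/mimir/tsdb-sync/anonymous -mindepth 1 -maxdepth 1 -type d | wc -l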

@riteshsonawane1372
Author

We tested the same thing on a K8s cluster with the default config (we just added the S3 credentials), and the store-gateway is still loading new blocks.

@riteshsonawane1372
Author

I counted the TSDB blocks: there are 416138 blocks in total, with a combined size of 1.083TB.

@narqo
Contributor

narqo commented Dec 24, 2024

Deployment was done using Puppet on a VM. Currently running a single instance on 1 VM.
...
The number of blocks at /media/storage/tsdb-sync/anonymous stops at 319910 and never exceeds that number.

Based on the points above, it seems that a single store-gateway instance gets stuck after downloading ~320K block headers (there should be 2x that many files, one index-header and one sparse-index-header per block) to the file system attached to /media/storage.

  1. You also mentioned there is no single "bad" block that leads the store-gateway into getting stuck. Is that right?
  2. Can you check the inode usage on the attached file system? Does the file system have enough capacity to accommodate this number of local files? (See the sketch below.)
  3. Do you also run the compactor on this VM host, with this same disk attached? Did the compactor manage to update the bucket index successfully? How does the disk utilisation look in general?
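
For point 2, a quick check could look like this, assuming /media/storage is the mount point of the attached file system:

# Inode and space usage on the file system that holds sync_dir; each block
# contributes several small files (index-header, sparse-index-header, ...).
df -i /media/storage
df -h /media/storage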

@riteshsonawane1372
Author

riteshsonawane1372 commented Dec 24, 2024

  1. Yes. Today we found that the real issue is not with the store-gateway but with the compactor. There was one block marked for deletion because it was older than the retention period (6y), but the compactor didn't delete it. From the markers/ prefix in S3 we saw that the compactor is stuck and unable to delete any blocks; it is also unable to update the bucket index in S3. To test, we took a backup of bucket-index.json.gz from S3 and deleted everything in it. Now we're getting the error "Failed to read bucket index for tenant" from the compactor. We decreased the cleanup interval, but still the same. We manually deleted the bad block, but now everything is crashing and we're facing CPU throttling issues. Locally the compactor is able to fetch all the TSDB meta.json files with no issue. (See the metrics sketch after this list.)

  2. Yes.

  3. We don't see any issue with the disk. The same thing was tested on a K8s cluster on a different node: we used the default config and only added the S3 credentials, but faced the same issue.
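
To double-check the compactor side from metrics, a sketch like the one below could help, assuming Prometheus scrapes Mimir and is reachable at localhost:9090; the metric names are the standard Mimir compactor/bucket-index ones, to the best of my knowledge:

# Seconds since the compactor last completed a run, and since the bucket
# index was last successfully updated (Prometheus address is a placeholder).
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=time() - cortex_compactor_last_successful_run_timestamp_seconds'
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=time() - cortex_bucket_index_last_successful_update_timestamp_seconds'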

@riteshsonawane1372
Author

Is there any tool that can validate all the blocks in S3? We had around 1TB of blocks, so we pushed them directly to S3. (We tested ingester backfill, but a single instance was unable to handle this volume.) I have tested with the mimirtool bucket validation command and I don't see any errors there.
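
Not a full-bucket validator, but as a sketch you could download a suspect block locally and open it with promtool (which ships with Prometheus); the bucket name and block ID below are placeholders:

# Copy one block out of S3 and open/analyze it; a corrupted block should fail to open.
aws s3 cp s3://YOUR_BUCKET/anonymous/01EXAMPLEBLOCKULID ./blocks/01EXAMPLEBLOCKULID --recursive
promtool tsdb analyze ./blocks 01EXAMPLEBLOCKULID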

@riteshsonawane1372
Author

Also, if you need any metrics from Mimir, let me know; we have Prometheus scraping Mimir.
