Bug: Store-gateway is stuck in starting phase #10279
Comments
@narqo can you help here?
From the store-gateway's logs in the screenshot, can you tell whether the system managed to load all the discovered blocks, or whether it got stuck loading some of the remaining ones (e.g. compare the IDs of the blocks in the …)? You may want to collect a goroutine profile and a Go runtime trace to explore where exactly it is stuck.
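As a starting point for collecting those, Mimir's HTTP server exposes the standard Go net/http/pprof endpoints under /debug/pprof. A minimal sketch, assuming the store-gateway serves HTTP on localhost:8080 (adjust to your setup):

```shell
# Assumption: the store-gateway's HTTP API is reachable on localhost:8080.
MIMIR_HOST="localhost:8080"

# Full text dump of every goroutine and where each one is blocked
curl -s "http://${MIMIR_HOST}/debug/pprof/goroutine?debug=2" -o goroutines.txt

# 10-second runtime execution trace; inspect with `go tool trace trace.out`
curl -s "http://${MIMIR_HOST}/debug/pprof/trace?seconds=10" -o trace.out
```

The goroutine dump is usually enough to see whether block loading is blocked on object-storage reads, disk I/O, or a lock.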
Also, could you share more details about the system? Which version of Mimir is this? Can you share the configuration options you're running with (please make sure to redact any sensitive information from the config)? Can you attach the whole log file from the point of the store-gateway's start (as text, not a screenshot)?
Config File
Log
It is continuously loading new blocks. I've been unable to query anything in the past 24 hours. Previously I was able to query 90 days of data, but after pushing the last 80 GB of TSDB data it is stuck in the starting phase.
@pracucci can you help here?
If the store-gateway is stuck in the starting phase and the local disk utilization is also growing, then it means the store-gateway is loading the new blocks (maybe very slowly, but loading). On the contrary, if the disk utilization is not growing, then it looks stuck as you say. Which one of the two?
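One way to tell those two cases apart from the shell: sample the size of the store-gateway's local sync directory twice and compare. A sketch assuming the directory path mentioned later in this thread and GNU `du`:

```shell
# Path taken from this thread; adjust to your -blocks-storage configuration.
SYNC_DIR="/media/storage/tsdb-sync"

dir_bytes() {
  # Total apparent size of a directory tree, in bytes
  du -sb "$1" 2>/dev/null | awk '{print $1}'
}

if [ -d "$SYNC_DIR" ]; then
  before=$(dir_bytes "$SYNC_DIR")
  sleep 60
  after=$(dir_bytes "$SYNC_DIR")
  if [ "$after" -gt "$before" ]; then
    echo "still downloading: ${before} -> ${after} bytes"
  else
    echo "no growth in 60s: looks stuck"
  fi
fi
```

A single 60-second window can be misleading for very large blocks, so repeat the check a few times before concluding the process is stuck.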
The disk space is growing. In the current scenario we saw something interesting: the blocks at /media/storage/tsdb-sync/anonymous are at …
We tested the same thing on a K8s cluster with a default config (we just added S3 credentials), and the store-gateway is still loading new blocks.
I counted the TSDB blocks; there are a total of …
Based on these points above, it seems that one single instance of …
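For the block count itself, TSDB block directories are named with 26-character Crockford-base32 ULIDs, so they can be counted mechanically over a local copy (or anything synced from the bucket). A sketch using GNU `find`; the path is a placeholder taken from this thread:

```shell
# Count TSDB block directories: their names are 26-character
# Crockford-base32 ULIDs (digits plus A-Z excluding I, L, O, U).
count_blocks() {
  find "$1" -maxdepth 2 -type d -regextype posix-extended \
    -regex '.*/[0-9A-HJKMNP-TV-Z]{26}' 2>/dev/null | wc -l
}

# Placeholder: point this at your local block directory.
count_blocks "/media/storage/tsdb-sync/anonymous"
```

Comparing this count over time against the total number of blocks in the bucket shows how far the sync has progressed.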
Is there any tool that can validate all the blocks in S3? We had around 1 TB of blocks, so we pushed them to S3 directly. (We tested backfilling through the ingester, but a single instance was unable to handle this volume.) I have tested with mimirtool's bucket validation and I don't see any errors there.
Also, if you need any metrics from Mimir, let me know; we have Prometheus scraping Mimir.
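Since Prometheus is already scraping Mimir, the store-gateway's own metrics can show whether block loading is progressing. `cortex_bucket_store_blocks_loaded` and `cortex_bucket_store_block_loads_total` are the metric names I'd check (names taken from the Cortex/Mimir codebase; verify them against your version). A sketch querying them through the Prometheus HTTP API, with a placeholder address:

```shell
# Assumption: Prometheus is reachable at localhost:9090.
PROM="http://localhost:9090"

# Number of blocks currently loaded, per store-gateway instance
curl -sG "${PROM}/api/v1/query" \
  --data-urlencode 'query=cortex_bucket_store_blocks_loaded'

# Rate of block loads: a growing counter means slow-but-progressing,
# a flat zero rate while still in the starting phase suggests truly stuck
curl -sG "${PROM}/api/v1/query" \
  --data-urlencode 'query=rate(cortex_bucket_store_block_loads_total[5m])'
```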
What is the bug?
I recently uploaded around 90 GB of TSDB data directly to S3. After that, my store-gateway got stuck in the starting phase. I have enabled debug logs but don't see any errors (sharing a screenshot). I have used this approach more than seven times before, but now it is causing this problem. [Context: doing an InfluxDB-to-Mimir migration, using promtool to generate the TSDB blocks.]
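For reference, the direct-upload step described above amounts to copying each block directory into the tenant's prefix in the bucket (Mimir lays blocks out as `<bucket>/<tenant-id>/<block-ulid>/`). A hypothetical sketch; the bucket name, tenant ID, and source directory are placeholders, and the command is printed rather than executed so it can be reviewed first:

```shell
# Placeholders: substitute your own bucket, tenant, and source directory.
BUCKET="my-mimir-blocks"
TENANT="anonymous"
SRC_DIR="./tsdb-output"

# Print the copy command instead of running it, for review
echo aws s3 cp --recursive "$SRC_DIR" "s3://${BUCKET}/${TENANT}/"
```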
How to reproduce it?
Push TSDB blocks to S3 and query the data using Grafana for timestamps older than 24 hours.
What did you think would happen?
I don't know why it is taking so long to load the TSDB blocks. It was working previously.
What was your environment?
Deployment was done using Puppet on a VM. Currently running a single instance on one VM.
Any additional context to share?
No response