Steps to reproduce:
Reboot the InfluxDB node.
Expected behaviour:
Queries should still show metrics from all shards.
Actual behaviour:
After restarting the influxdb service, we seem to lose random 24h chunks of data that were previously present in InfluxDB. The gaps correspond to the shard duration, so it looks like whole shards lose data: Influx queries return no results for the periods covered by those shards. The gaps appear in several places across the time range, but the issue never affects the latest data.
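Because the gaps line up with the 24h shard duration, one way to list exactly which shard windows are affected is to bucket the timestamps returned by a query into shard windows and report the empty ones. The Python sketch below assumes 24h shard groups that start at UTC midnight; `shard_window_start` and `empty_shard_windows` are hypothetical helpers, not part of any InfluxDB client library.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 24h shard groups whose boundaries fall on UTC midnight.
SHARD_DURATION = timedelta(hours=24)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def shard_window_start(ts, duration=SHARD_DURATION):
    """Round a timezone-aware timestamp down to the start of its shard window."""
    elapsed = ts - EPOCH
    return EPOCH + (elapsed // duration) * duration


def empty_shard_windows(timestamps, duration=SHARD_DURATION):
    """Return start times of windows between the first and last point that hold no data."""
    if not timestamps:
        return []
    occupied = {shard_window_start(t, duration) for t in timestamps}
    current, last = min(occupied), max(occupied)
    gaps = []
    while current <= last:
        if current not in occupied:
            gaps.append(current)
        current += duration
    return gaps


if __name__ == "__main__":
    # Example timestamps as they might come back from a query (no points on 2025-01-03).
    points = [
        datetime(2025, 1, 1, 12, tzinfo=timezone.utc),
        datetime(2025, 1, 2, 3, tzinfo=timezone.utc),
        datetime(2025, 1, 4, 18, tzinfo=timezone.utc),
    ]
    for start in empty_shard_windows(points):
        print(f"no data in shard window {start.isoformat()} .. {(start + SHARD_DURATION).isoformat()}")
```

The RFC3339 window bounds printed this way can be reused as time ranges when exporting or querying a single suspect shard.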
After some investigation, we have made the following observations:
The data appears to still be present in the shards (it can be extracted with the influxd inspect export-lp command).
Tags are missing from the fields.idx files of the faulty shards.
It can be fixed by inserting a single point through the Line Protocol upload in the Web UI, using the field that is missing from fields.idx and a timestamp inside the faulty shard's time range. The field is then added to fields.idxl, and InfluxDB returns the metrics for the whole shard. This works, but it is a very manual process that results in fake data.
It can also be fixed by removing all fields.idx files. The files are then regenerated when InfluxDB starts, with all the tags in the shard included in fields.idx. However, for some reason, fields.idx is not regenerated for some shards.
It can also be fixed by copying the fields.idx file from “neighboring” shards, since the metrics are very similar between shards. This has been the most reliable fix, and we were able to restore (make searchable again) all data in the database using this approach.
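To find candidate shards without inspecting each directory by hand, a sketch like the following lists every shard directory that holds TSM files together with the size of its fields.idx, or flags the file as missing. A fields.idx that is noticeably smaller than its neighbors' may hint at missing entries, though that is only a heuristic. It assumes the 2.x on-disk layout `<engine-path>/data/<bucket-id>/<retention-policy>/<shard-id>/`; `fields_index_report` is a hypothetical helper.

```python
import sys
from pathlib import Path


def fields_index_report(engine_path):
    """Return (shard_dir, size) pairs for shard directories that contain TSM files.

    size is the byte size of fields.idx, or None when the file is absent.
    Assumed layout: <engine-path>/data/<bucket-id>/<retention-policy>/<shard-id>/
    """
    report = []
    for shard_dir in sorted((Path(engine_path) / "data").glob("*/*/*")):
        # Skip stray files and shard directories without any TSM data.
        if not shard_dir.is_dir() or not any(shard_dir.glob("*.tsm")):
            continue
        index_file = shard_dir / "fields.idx"
        size = index_file.stat().st_size if index_file.is_file() else None
        report.append((shard_dir, size))
    return report


if __name__ == "__main__":
    # Default mirrors engine-path from the reporter's config.toml; override with argv[1].
    engine = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/influxdb/engine"
    for shard_dir, size in fields_index_report(engine):
        label = "MISSING" if size is None else f"{size} bytes"
        print(f"{shard_dir}: fields.idx {label}")
```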
Any idea why the tags were not in fields.idx after the service was restarted? Could they have been present in fields.idxl before the reboot and failed to be copied for some reason?
Why does fields.idx fail to regenerate for some of the shards? Is there a way to “force” regeneration of the fields.idx file other than deleting it and restarting InfluxDB? Can we enable extra logging to find out why it is not regenerated?
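The delete-and-restart workaround can at least be made reversible. This sketch, with hypothetical helper names, renames fields.idx aside in the chosen shard directories instead of deleting it, and after a restart puts the backup back for any shard where the file was not regenerated. It assumes influxd is stopped whenever the helpers run, and it touches only fields.idx (not fields.idxl), mirroring the workaround described in the observations.

```python
import time
from pathlib import Path


def move_fields_index_aside(shard_dirs, suffix=None):
    """Rename fields.idx in each shard directory so it can be rebuilt on the next start.

    Run only while influxd is stopped. Files are renamed, not deleted, so they can
    be put back if a shard does not regenerate its index. Returns the backup paths.
    """
    suffix = suffix or time.strftime(".bak-%Y%m%d%H%M%S")
    backups = []
    for shard_dir in map(Path, shard_dirs):
        index_file = shard_dir / "fields.idx"
        if not index_file.is_file():
            continue  # nothing to move for this shard
        backup = index_file.with_name(index_file.name + suffix)
        index_file.rename(backup)
        backups.append(backup)
    return backups


def restore_if_not_regenerated(backups):
    """After influxd has been started and stopped again, restore each backup whose
    shard still has no fields.idx, i.e. the shards where regeneration did not happen."""
    restored = []
    for backup in backups:
        index_file = backup.with_name("fields.idx")
        if not index_file.exists():
            backup.rename(index_file)
            restored.append(index_file)
    return restored
```

The intended sequence is: stop influxd, call `move_fields_index_aside` with the affected shard directories, start influxd and let it open all shards, stop it again, then call `restore_if_not_regenerated` with the returned paths.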
Environment info:
Linux 4.18.0-553.36.1.el8_10.x86_64 x86_64
InfluxDB v2.7.11
$ free -h
              total        used        free      shared  buff/cache   available
Mem:          1.5Ti        49Gi       916Gi        52Mi       545Gi       1.4Ti
Swap:         4.0Gi          0B       4.0Gi
$ df -h /var/lib/influxdb/
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/vg01-influxdb   2.0T  896G 1018G  47% /var/lib/influxdb
Config:
$ cat /etc/influxdb/config.toml
bolt-path = "/var/lib/influxdb/influxd.bolt"
engine-path = "/var/lib/influxdb/engine"
storage-cache-max-memory-size = "4g"
storage-compact-throughput-burst = "1700M"
storage-compact-throughput = "700M"