Discussion: [influxdb] Disk usage when storing Prometheus-style time series data in InfluxDB
Julius Volz
2015-03-29 19:48:32 UTC
Hi,

After initially test-driving InfluxDB[0][1] a year ago as an option for
long-term storage of Prometheus data, I just gave it another try with the
new tag support in InfluxDB 0.9.0. Paul had mentioned in
https://news.ycombinator.com/item?id=9001808 that this would fit the
Prometheus data model much better, and I think it does improve things:
selecting by tags is now more efficient (previously, tag-style data was
stored in columns).

However, I'm still getting very high disk usage when I live-replicate all
metrics from a Prometheus server into InfluxDB. After running some standard
metric ingestion tests for about an hour, I get:

Prometheus: 13MB
InfluxDB: 180MB

Blowup factor: ~14x.

These are some typical example "/write" requests I'm sending to InfluxDB:

https://gist.github.com/juliusv/d3a430d1ef943c73f0ef
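
In case you don't want to click through, this is roughly the shape of what
I'm POSTing (a minimal sketch in Python using the "requests" package; the
metric name and tag values here are made up, and the field names of the
0.9.0-era JSON write format are from memory of the RC docs, so please
treat the gist above as the authoritative version):

import requests

# One Prometheus sample translated into a 0.9.0-style JSON write:
# measurement name, tag set, timestamp, and a single "value" field.
point = {
    "name": "http_requests_total",  # hypothetical metric name
    "tags": {"job": "node", "instance": "host1:9100"},
    "timestamp": "2015-03-29T19:48:00Z",
    "fields": {"value": 1027.0},
}

payload = {
    "database": "prometheus",
    "retentionPolicy": "default",
    "points": [point],
}

resp = requests.post("http://localhost:8086/write", json=payload)
resp.raise_for_status()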

Is there something I'm doing wrong or inefficiently that could negatively
impact disk usage? Am I abusing InfluxDB's data model too much by trying
to squeeze Prometheus-style metrics[2] into it? Essentially, our data
model is OpenTSDB-like, except that we'd prefer not to run OpenTSDB
itself but something more modern, without the Hadoop/HBase dependency. I
guess one problem for our case is that InfluxDB is, at its heart, still a
log store for arbitrary sets of key/value columns per entry, rather than
being optimized around purely numeric time series data? Would there be
any way of making this work better?

I built InfluxDB from current HEAD (caf3259) and started it simply by
running "./influxd" as a single node, without any further configuration.

Cheers,
Julius

BCC: prometheus-***@googlegroups.com

[0]
http://prometheus.io/docs/introduction/comparison/#prometheus-vs.-influxdb
[1]
https://docs.google.com/document/d/1OgnI7YBCT_Ub9Em39dEfx9BuiqRNS3oA62i8fJbwwQ8/edit#heading=h.e32xcwnzxp3e
[2] http://prometheus.io/docs/concepts/data_model/
dashesy
2015-03-29 19:54:13 UTC
Your values are all strings. Have you tried numeric values to see if that
makes any difference?
Julius Volz
2015-03-30 14:08:07 UTC
Hey Ehsan,

I was still sending values as strings because I wasn't sure whether I could
send NaN/Inf/-Inf otherwise (JSON doesn't support them). For my latest test
results I changed it to bare floats (ignoring special float values), but it
doesn't seem to make a difference.
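
To illustrate why I went through strings in the first place (a quick
sketch; standard JSON simply has no literal for these values):

import json

# Python's encoder emits the non-standard tokens NaN/Infinity by default,
# which the JSON spec and most strict parsers reject:
print(json.dumps(float("nan")))   # -> NaN
print(json.dumps(float("inf")))   # -> Infinity

# Asking for strictly spec-compliant output fails outright:
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError as err:
    print(err)  # "Out of range float values are not JSON compliant"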

Cheers,
Julius
Paul Dix
2015-03-29 20:01:26 UTC
Hi Julius,
How big is the data directory? Not the one with raft, but the other raw
data directory. Currently, log truncation isn't wired up so there are two
copies of the data on disk: the log in raft and the indexed data in the
data directory.
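
(If it's easier to measure the two copies separately, here's a quick
Python sketch; the ~/.influxdb path is just a guess at the default
single-node location, so adjust it to wherever your data actually lives.)

import os

def dir_size_mb(path):
    """Sum the sizes of all regular files below path, in MB."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024 ** 2

base = os.path.expanduser("~/.influxdb")  # assumed default location
for sub in ("data", "broker"):
    print(sub, round(dir_size_mb(os.path.join(base, sub)), 1), "MB")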

Also, how many unique series (measurement + tag set) do you have? How many
points per series across what span of time? Depending on how that looks,
there are disk space costs that are amortized by having at least a few
thousand points per series.

We definitely take up more space, since we don't assume a fixed sampling
interval and we can store data types other than just a number. We'll be
working over time to make storage more efficient, but for the time being,
something designed specifically for regular time series at set sampling
intervals with a single float value will get much better compression.

How many series/datapoints is the 180MB?

Thanks,
Paul
Julius Volz
2015-03-30 14:05:45 UTC
Hi Paul,

Thanks, that's good to know! I just ran some similar tests again, since I
don't currently have exactly the same environment; the basic results
still hold. Some info on the data I'm ingesting into both systems:

Total number of metrics (measurement names?): 877
Total number of unique (tagged) time series: 11,749
Total number of ingested samples/data points: 6,561,374
Total time span of data: ~1h
Rate of samples per series: 1 every 5s, or 1 every 10s, depending on the
scraped job

This was for a ~1h test run, scraping some host system metrics as well as
container (CPU/RAM/OOM, etc.) information from our internal cluster
scheduler.

I'm getting:

# Prometheus
$ du -hcs /tmp/prometheus-metrics
40M /tmp/prometheus-metrics

# InfluxDB
$ du -hc .influxdb
476M .influxdb/data/shards
480M .influxdb/data
16K .influxdb/broker/raft
158M .influxdb/broker/1
1.4M .influxdb/broker/0
160M .influxdb/broker
640M .influxdb
640M total

That's a factor of ~16x including the "broker" dir, ~12x without it.
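
Back-of-the-envelope, from the numbers above (rough arithmetic only,
ignoring index and metadata overhead):

samples = 6561374
series = 11749

print(samples / series)         # ~558 points per series, well below the
                                # few thousand points per series Paul mentioned
print(40 * 1024**2 / samples)   # Prometheus: ~6.4 bytes per sample
print(480 * 1024**2 / samples)  # InfluxDB data dir: ~77 bytes per sample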

Prometheus also doesn't care about regular intervals between samples, but
it does double-delta encoding (for both timestamps and values) within
sample chunks to optimize disk usage.
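
To make that concrete, here is a toy sketch of the delta-of-delta idea
for timestamps (illustration only; the real chunk encoding also covers
the values and packs things much more tightly):

def double_delta_encode(timestamps):
    """Store the first timestamp, then the first delta, then only the
    change of each successive delta. For samples scraped at a roughly
    constant interval, most stored values are 0 or close to 0 and
    compress down to a few bits each."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], None
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta if prev_delta is None else delta - prev_delta)
        prev, prev_delta = t, delta
    return out

# Samples scraped every ~5s with a bit of jitter:
print(double_delta_encode([1000, 1005, 1010, 1016, 1021]))
# -> [1000, 5, 0, 1, -1]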

I'm now letting this run longer to see how this amortizes over time.

Cheers,
Julius
Julius Volz
2015-03-30 19:47:01 UTC
So I let this run a couple of hours longer and am getting:

Prometheus: 134MB
InfluxDB: 2.4GB (1.8GB data only)

Unfortunately the factor only increased (14x for data only now, 18x total),
so I think it's not about amortization in this case.

I mean, it makes sense in some ways, since InfluxDB is simply geared
towards a different use case. Still, the factor seems quite high even given
that.

Just to make sure: does InfluxDB still store all tags for each
sample/point, or are these only stored once?

Cheers,
Julius
p***@gmail.com
2017-05-31 03:12:13 UTC
Hi Julius,

Can you please provide some steps for writing Prometheus data to
InfluxDB? For now, I have configured the Prometheus .yml file and
installed InfluxDB in another VM. I am wondering if there is an easy
configuration document available.

Thanks,
Pradip