Today we are announcing the release of Metric Aggregate Pruning, a new tool for managing dimensional metric cardinality in New Relic One.
Our new metric aggregate pruning capability allows you to create rules that tell New Relic One which attributes of your metric data are important for longer-term trending and analysis, and which attributes are only relevant for short-term troubleshooting. Configuring New Relic One to “prune” attributes you don’t need when querying over longer time ranges helps you avoid hitting cardinality limits that reduce the usefulness of your data.
Metric Aggregate Pruning is the latest addition to our suite of NerdGraph data management APIs, joining our data dropping rules. If you already know our cardinality limits and data dropping rules API and want to get started right away with pruning metric aggregates, see our documentation or skip ahead to “How to use the new metric aggregate pruning.”
If you’re unfamiliar with metric cardinality, we’ll briefly explain what it is, how to tell if you’ve reached a cardinality limit, and what you can do about it.
What is Metric Cardinality?
Metrics are a key part of any observability platform. A metric can be a point-in-time measurement that represents a single data point, such as the percentage of CPU used on a system, or an aggregate, such as the number of bytes sent over a network in the last minute. This makes metrics ideal for identifying trends over time or observing rates of change.
In observability, cardinality refers to the number of unique combinations of key-value pairs that exist for the attributes you include in a metric. For example, you may have a percentage of CPU used, called “cpuPercent”, that you report from each of your systems. If you include a hostname attribute with this metric so that you can identify the system from which the metric was reported, and you have two hosts reporting, called “host1” and “host2”, cardinality is low: there is only one attribute with two unique values, “host1” and “host2”. We would say the metric has a cardinality of 2. If instead your “cpuPercent” metric carries both a host attribute and a process ID attribute reported for every process on a system, and you have thousands of systems, cardinality is high: there could be millions of unique combinations of process and host IDs.
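One way to get a feel for the cardinality in your own account is with NRQL’s `uniqueCount()` function. This is a sketch, not a prescription: the `hostname` and `processId` attribute names are assumptions for illustration, and the multi-attribute form of `uniqueCount()` may not be available in all accounts.

```sql
// Unique hosts reporting cpuPercent (the low-cardinality case above)
SELECT uniqueCount(hostname) FROM Metric WHERE metricName = 'cpuPercent' SINCE 1 day ago

// Unique host/process combinations (the high-cardinality case above)
SELECT uniqueCount(hostname, processId) FROM Metric WHERE metricName = 'cpuPercent' SINCE 1 day ago
```

In the two-host example, the first query would return 2; in the per-process example, the second query could return millions.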
Low-cardinality and (especially) high-cardinality data is important when trying to answer questions such as “Show me all users who encountered a 504 error today between 9:30 and 10:30,” as discussed in this article on the importance of high-cardinality data in observability. Answering that question requires correlating many dimensions of data, but you don’t want the sheer volume of data to hamper your ability to search it.
When you send metric data to New Relic One, it not only stores all the raw data points you send, but also aggregates those metrics into new data points, called rollups, at different time intervals. Rollups are more efficient to store and faster to query, especially over long time windows. To ensure queries come back quickly, New Relic One decides whether to use raw data points or a rollup, depending on the query.
How to know when to prune metric aggregates
Using the power and scale of the underlying data platform, New Relic easily manages accounts with millions of unique metric time series every day. But as with all systems, we must impose certain limits to protect the platform and keep it performing well for everyone.
If you happen to encounter one of the cardinality limits, New Relic records an event in your account, which you can query using NRQL. You’ll also see it in the Limits section of the Data Management Hub in New Relic One. Once a limit is reached, New Relic continues to process and store all of your data, but stops creating rollups for the rest of the day. If you run a query that spans more than an hour when this happens, it might lead you to believe that the data has stopped reporting, because the rollups used for these longer-time-window queries are not available.
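As a sketch of what that check might look like: elsewhere in New Relic, limit breaches surface as `NrIntegrationError` events, so a query along these lines can help you spot them. The event and attribute names here are assumptions; confirm the exact names in the documentation for your account.

```sql
// List recent limit-related events recorded in the account
SELECT * FROM NrIntegrationError SINCE 1 day ago
```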
Never fear: as mentioned, the raw data you sent is still available. To access it, query shorter time windows (one hour or less) or add the RAW keyword to your query to extract the data you need for troubleshooting. To avoid hitting limits altogether, you can use the new metric aggregate pruning.
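For example, both of the following sketches reach the raw data points; the `cpuPercent` metric name is the hypothetical one from earlier, and exact RAW keyword placement is worth confirming against the NRQL documentation:

```sql
// Option 1: keep the time window at one hour or less
SELECT average(cpuPercent) FROM Metric SINCE 1 hour ago

// Option 2: force raw data points for a longer window with the RAW keyword
SELECT average(cpuPercent) FROM Metric SINCE 1 day ago RAW
```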
How to use the new metric aggregate pruning
Metric aggregate pruning allows you to specify one or more attributes to exclude from metric rollups.
Pruning metric aggregates is ideal for high-cardinality attributes sometimes included in metrics, such as a container ID or other unique identifiers. These high-cardinality attributes contain important detail when troubleshooting during an incident (narrow time window), but they lose value over time and are irrelevant when looking for longer-term trends.
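Before creating a pruning rule, it can help to confirm which attributes are actually driving cardinality. A minimal sketch, assuming a hypothetical `containerId` attribute like the one mentioned above:

```sql
// Estimate how much a single high-cardinality attribute contributes
SELECT uniqueCount(containerId) FROM Metric SINCE 1 day ago
```

If an attribute like this accounts for most of the unique combinations but only matters inside narrow troubleshooting windows, it’s a good candidate for pruning from rollups.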
The following screenshot shows a fictional example of setting up attribute pruning for a hypothetical metric.