canton-network-docs
Optimize payload queries
Optimize payload queries - Canton Network Docs
Skip to main content
After being ingested by PQS, a representative contract may look as follows:
Users can derive the following information from these plans:
Users can derive the following improvements from this plan:
It can be observed that PostgresSQL is able to combine both indexes to further reduce the amount of data it needs to sift through (
Suppose that there is a relationship between the
It might be surprising to a human observer that PostgreSQL estimates
The most fascinating part of the new plan is that PostgreSQL has chosen to not waste computations processing the second index altogether since the estimates are now more accurate and predictable (
Entire statistics output not shown for brevity. However, please observe the change of values with increased accuracy:
After the change, PostgreSQL has enough buckets to capture the data distribution characteristics more accurately, for example most common values are spot on with reality.
This page discusses techniques and considerations for optimizing queries that involve payload contents in the Participant Query Store (PQS). It is assumed that these pages have already been consumed as background knowledge:Documentation Index
Fetch the complete documentation index at: https://docs.canton.network/llms.txt
Use this file to discover all available pages before exploring further.
pqs-how-to-query-contracts-and-transactions-using-sqlpqs-references-sql-api
Metadata indexes
PQS data retrieval functions already come indexed for efficient access that involves metadata-level queries (for example, querying for contracts created at particular ledger offset or lookup of a contract by its contract ID). However, most business workflow-level queries need to select data based on contracts’ actual payload contents. Default indexes that come with PQS cannot help with this since efficient querying depends on knowing Daml model’s details. In other words, each application team is responsible for creating and maintaining appropriate indexes to ensure optimal performance to match their data access patterns.Necessary tools
We will be employing PostgreSQL’spsql1 application and EXPLAIN2 SQL command to analyze query plans and rectify potential performance bottlenecks. Please, refer to the references listed at the end of this page for to learn more about these tools.
Daml model
For the purpose of this page, we are utilizing a simple Daml model which represents a register of tokens created by issuers and held by holders. The actual workflow is not relevant for this discussion so it’s not provided in its entirety, however it is beneficial to see how aToken model translates into a matching JSONB payload representation.
Test data setup
To illustrate the ideas, we have ingested a PQS instance from a Daml ledger that resulted in the following data storage footprint with a variety of data distribution characteristics:Payload indexes
Status quo (no custom indexes)
If we take no care to create any custom indexes, we can still query for contracts based on their payload contents, however the performance may be suboptimal. Let’s explore a few examples.Index Scanare in use, however the indexes are of general nature (for example default metadata-level ones)- PostgreSQL could parallelize execution (
Workers Launched: 2) - queries differ in selectivity (
rows=9vsrows=3972), however this was irrelevant for optimal execution since PostgreSQL did not forecast estimates accurately (rows=976) (being orders of magnitude off) - PostgreSQL performs quite a bit of work to be later discarded (
Rows Removed by Filter:)
payload column is a black box to it. It cannot make any assumptions about its contents (with content being a complex non-scalar structure) and therefore cannot maintain any meaningful statistics to help it make sound decisions. As a result, PostgreSQL consumes more-than-necessary quantity of CPU, memory, and I/O resources to produce equivalent amount of output.
Improving performance with a targeted index
When we know our data access patterns, we can create custom indexes to help PostgreSQL optimize its query plans. The query above seems to be able to benefit if thequantity field is indexed. With JSONB columns PostgreSQL allows to create indexes on expressions3 that refer to particular fields from the JSONB document.
- correct custom-tailored index is in use (
Index Scan using __contracts_14_token_quantity_idx) - PostgreSQL estimates the number of rows much more accurately (
rows=14vsrows=9) - no wasted work performed (
Rows Removed by Filteris gone) - no need to parallelize (
Workers Launchedis gone) so PostgreSQL can conserve its resources to serve other workloads
Understanding interplay of multiple indexes
Independent criteria
Let’s assume that we want to query for tokens held by a particular holder and whose quantity is within a certain range. In case condition criteria are independent, PostgreSQL can combine multiple indexes using bitmap operations to further optimize query plans. Let’s see how this works in practice.Heap Blocks: exact=200 vs Heap Blocks: exact=10483 + Rows Removed by Filter: 9132) to produce the same output faster and cheaper (no need to parallelize query processing).
Dependent criteria
However, if condition criteria are dependent, PostgreSQL tends to greatly underestimate the number of rows that will be returned and therefore may choose a suboptimal query plan. This usually happens when 2 attributes are separately indexed (for example, to support querying over each field in its own right by different code paths) although there exists an intrinsic relationship between the attributes. In such a case, simply having indexes defined is only half of the equation. The other half is to ensure that PostgreSQL has access to relevant statistics to make sound decisions. Let’s see how PostgreSQL can be taught to act right in such situations.wallet.holder and wallet.label fields such that wallet.label is derived from wallet.holder. Refer to the example JSONB payload above for the demonstration.
rows=820 and rows=976 to be returned for each index’s data set, however when combined the estimate is rows=4 which is orders of magnitude off from reality (rows=974).
Recognizing this fact and instructing PostgreSQL to collect relevant statistics45 will in turn optimize the query execution plan.
rows=1028 vs rows=974). While this approach might not necessarily result in dramatic query speed-up, it does significantly reduce the chance of PostgreSQL mispredicting and choosing a suboptimal execution plan.
There exist other multi-variate statistics types that PostgreSQL can employ, please refer to the documentation6 to explore if they can help you with your particular data and queries.
Statistics precision
Oftentimes, the default statistics target (100) is not sufficient to capture the data distribution characteristics for your particular case (for instance, your data exhibits a large skew towards a particular numeric spectrum). In such cases, it might be highly efficient to increase the statistics target for particular expressions7 to capture your data’s peculiarities. We can demostrate the benefit of increased statistics precision by examining the real top 10 token quantity values vs the statistics PostgeSQL collects.
Conclusion
To summarize, PostgreSQL has all necessary tools to optimize queries that involve payload contents stored inJSONB columns. However, it is up to the application developers to create, maintain and monitor appropriate indexes and statistics to ensure optimal query performance. This can be achieved by:
- creating custom indexes on expressions that reflect your data access patterns
- ensuring that PostgreSQL has access to relevant statistics to make sound decisions
- creating multi-variate statistics for dependent criteria where appropriate
- increasing statistics target for expressions that require higher precision to capture data distribution nuances
VACUUM ANALYZE runs). Therefore, it is recommended to strike a balance between query performance and maintenance overhead.
Remember that adding an index is not a panacea for all performance woes. The key to fast queries is controlling their selectivity. For a query whose predicate matches 20%+ of the table, a sequential scan is likely to be faster than an index scan. Master the use of EXPLAIN command to identify and eliminate performance bottlenecks.
Don’t underestimate the pragmatism of keeping your Daml models simple and flat. Deeply nested structures are harder to index and query efficiently. Consider denormalizing your Daml model if it helps achieving better query performance for downstream applications.
Footnotes
- https://www.postgresql.org/docs/current/app-psql.html ↩
- https://www.postgresql.org/docs/current/using-explain.html ↩
- https://www.postgresql.org/docs/current/indexes-expressional.html ↩
- https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED ↩
- https://www.postgresql.org/docs/current/sql-createstatistics.html ↩
- https://www.postgresql.org/docs/current/planner-stats-details.html ↩
- https://www.postgresql.org/docs/current/sql-alterindex.html ↩