There are several challenges that a modern blockchain indexing startup may face, including:
In this article, we review the evolution of Footprint Analytics’ technology architecture in stages as a case study to explore how the Iceberg-Trino technology stack addresses the challenges of on-chain data.
Footprint Analytics has indexed data from about 22 public blockchains, 17 NFT marketplaces, and 1,900 GameFi projects, along with over 100,000 NFT collections, into a semantic abstraction data layer. It is the most comprehensive blockchain data warehouse solution in the world.
Blockchain data includes over 20 billion rows of financial transaction records, which data analysts query frequently; this is quite different from the ingestion logs found in traditional data warehouses.
We have gone through 3 major upgrades in the past several months to meet growing business requirements:
At the beginning of Footprint Analytics, we used Google BigQuery as our storage and query engine. BigQuery is a great product: it is blazingly fast, easy to use, and provides dynamic compute power and a flexible UDF syntax that helps us get the job done quickly.
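As a small illustration of that UDF flexibility, here is a minimal sketch of a temporary SQL UDF in BigQuery; the function, dataset, table, and column names are hypothetical rather than our production schema:

    -- define a throwaway helper that converts raw wei amounts into ETH
    CREATE TEMP FUNCTION wei_to_eth(wei FLOAT64) RETURNS FLOAT64 AS (wei / 1e18);

    SELECT tx_hash, wei_to_eth(value) AS value_eth
    FROM `my_project.chain_data.transactions`  -- assumed dataset and table names
    LIMIT 10;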
However, BigQuery also has a number of problems.
We became very interested in some of the OLAP products that had grown very popular. The most attractive advantage of OLAP is its query response time: it typically returns query results over massive amounts of data in under a second, and it can also support thousands of concurrent queries.
We picked one of the best OLAP databases, Doris, to give it a try. The engine performed well. However, we soon ran into some other issues:
Unfortunately, we could not replace BigQuery with Doris, so we had to periodically synchronize data from BigQuery to Doris and use Doris as a query engine only. This synchronization process had a number of issues, one of which was that update writes piled up quickly when the OLAP engine was busy serving queries to front-end clients. The speed of the write process was then affected, and synchronization took much longer and sometimes even became impossible to finish.
We came to realize that while OLAP could solve several of the issues we were facing, it could not become the turnkey solution for Footprint Analytics, especially for the data processing pipeline. Our problem is bigger and more complex; OLAP as a query engine alone was not enough for us.
Welcome to Footprint Analytics architecture 3.0, a complete overhaul of the underlying architecture. We redesigned the entire architecture from the ground up to separate the storage, computation, and querying of data into three different pieces, taking lessons from the two earlier Footprint Analytics architectures and learning from the experience of other successful big data projects such as Uber, Netflix, and Databricks.
We first turned our attention to the data lake, a new type of data storage for both structured and unstructured data. A data lake is perfect for on-chain data storage, as on-chain data formats range widely from unstructured raw data to the structured abstraction data Footprint Analytics is well known for. We expected the data lake to solve the problem of data storage, and ideally it would also support mainstream compute engines such as Spark and Flink, so that integrating with different types of processing engines would not be a pain as Footprint Analytics evolves.
Iceberg integrates very well with Spark, Flink, Trino and other computational engines, and we can choose the most appropriate computation for each of our metrics. For example:
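An Iceberg table can be created and back-filled with Spark SQL for heavy batch computation, and the very same table can then be queried interactively through Trino’s Iceberg connector. The sketch below assumes a catalog named iceberg_catalog and a hypothetical token_transfers table; none of these names are our real schema.

    -- Spark SQL: create an Iceberg table with a daily partition transform
    CREATE TABLE iceberg_catalog.footprint.token_transfers (
        block_time    timestamp,
        token_address string,
        amount        decimal(38, 0)
    )
    USING iceberg
    PARTITIONED BY (days(block_time));

    -- Trino: read the same table through the Iceberg connector
    SELECT date(block_time) AS day, count(*) AS transfers
    FROM iceberg.footprint.token_transfers
    GROUP BY 1
    ORDER BY 1;

This is what lets us pick the engine per metric: Spark or Flink for backfills and pipelines, Trino for ad-hoc analytical queries, all over the same copy of the data.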
With Iceberg solving the storage and computation problems, we then had to think about how to choose a query engine. There are not many options available; the alternatives we considered were:
Once we had decided on our direction, we ran a performance test on the Trino + Iceberg combination to see if it could meet our needs, and to our surprise, the queries were incredibly fast.
Given that Presto + Hive has been the worst comparator for years in all the OLAP hype, the combination of Trino + Iceberg completely blew our minds.
Here are the results of our tests.
Case 1: join a large dataset
An 800 GB table1 joins another 50 GB table2 and performs complex business calculations.
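For reference, the test query had roughly the following shape; the column names and aggregations below are placeholders standing in for our real business logic:

    SELECT t1.wallet_address,
           sum(t1.amount_usd)          AS volume,         -- the complex business calculations go here
           count(DISTINCT t2.token_id) AS tokens_traded
    FROM table1 t1                                         -- ~800 GB fact table
    JOIN table2 t2 ON t1.tx_hash = t2.tx_hash              -- ~50 GB table
    GROUP BY t1.wallet_address;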
Case 2: use a big single table to do a distinct query
Test SQL: select distinct(address) from table group by day
The Trino+Iceberg combination is about 3 times faster than Doris in the same configuration.
In addition to this, there was another surprise: because Iceberg can use data formats such as Parquet and ORC, which compress the data before storing it, Iceberg’s table storage takes only about 1/5 of the space of other data warehouses. The storage size of the same table in the three databases is as follows:
Note: The above tests are individual examples that we have encountered in actual production and are for reference only.
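For readers who want to reproduce the storage effect, the file format and compression codec are ordinary Iceberg table properties; a minimal sketch in Spark SQL, reusing the hypothetical table from earlier:

    ALTER TABLE iceberg_catalog.footprint.token_transfers SET TBLPROPERTIES (
        'write.format.default' = 'parquet',            -- store data files as columnar Parquet
        'write.parquet.compression-codec' = 'zstd'     -- compress the column chunks
    );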
・Upgrade effect
The performance test reports gave us enough confidence to commit, and it took our team about 2 months to complete the migration. Here is a diagram of our architecture after the upgrade.
Since its launch in August 2021, the Footprint Analytics team has completed three architectural upgrades in less than a year and a half, thanks to its strong desire and determination to bring the benefits of the best database technology to its crypto users, and to its solid execution on implementing and upgrading its underlying infrastructure and architecture.
The Footprint Analytics architecture upgrade 3.0 has brought a new experience to its users, allowing users from different backgrounds to get insights in more diverse usage and applications: