[2025 Annual Guide] Crypto Data Engineering Guide


From cryptodatabytes by Andrew Hong

Backfills, and the Increasing Cost of Raw Data

EVM data has three core raw datasets (a quick fetch sketch follows the list):

  • blocks/transactions (top level transaction from/to/gas/data)
  • logs/receipts (events emitted from contracts, tied to transactions)
  • traces (full call traces of each transaction)
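
As a quick illustration, here’s roughly how you could pull all three for a single block with viem against any archive-capable RPC. This is only a sketch: the RPC URL is a placeholder, and the trace call assumes your provider exposes the debug_traceBlockByNumber method, which not every endpoint does.

```typescript
import { createPublicClient, http, toHex } from "viem";
import { base } from "viem/chains";

// Replace with your own RPC endpoint (Quicknode, Alchemy, a self-hosted reth node, etc.)
const client = createPublicClient({ chain: base, transport: http("https://YOUR_RPC_URL") });

async function fetchRawBlock(blockNumber: bigint) {
  // 1. blocks/transactions: top-level from/to/gas/data
  const block = await client.getBlock({ blockNumber, includeTransactions: true });

  // 2. logs/receipts: events emitted from contracts, tied to each transaction
  const receipts = await Promise.all(
    block.transactions.map((tx) => client.getTransactionReceipt({ hash: tx.hash }))
  );

  // 3. traces: full call traces of each transaction (needs a debug/trace-enabled node)
  const traces = await client.request({
    method: "debug_traceBlockByNumber",
    params: [toHex(blockNumber), { tracer: "callTracer" }],
  } as any);

  return { block, receipts, traces };
}
```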

Shovel has a guide covering these basic datasets fairly well, and Cryo provides almost 40 raw datasets in its CLI tool:

Link to Cryo Datasets. These are great for data analysts and scientists to quickly extract and leverage, but less of a fit for an app data pipeline.

Downloading all the data from past blocks is called a “backfill”, with continuous syncs called a “frontfill.” Backfilling blocks + receipts on Base (EVM) using Quicknode RPC costs almost $10k:

From Quicknode Streams, priced for Blocks + Receipts on Base from block 1.

After discounts, most platforms quote around $4k for a Base backfill of traces and logs, and monthly frontfill cost of about $2k. You can optimize some costs here by running your own reth node on a service like Blockjoy for $650/mo (though keeping a node running and client up to date has its own labor costs).

Many teams try to backfill directly into their own database and then run transforms on top after (ELT), as it may take many transformation model attempts to get to the final dataset you want. This was much cheaper to do in the past with smaller chains like Ethereum mainnet (I remember when backfills only cost $200).
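
A minimal sketch of that ELT flow, assuming viem and a stand-in insertRawLogs function for whatever database you load into - raw rows land first, and the transform models come later in SQL:

```typescript
import { createPublicClient, http } from "viem";
import { base } from "viem/chains";

const client = createPublicClient({ chain: base, transport: http(process.env.RPC_URL) });

// Stand-in for your own loader (Postgres COPY, BigQuery insert, parquet writer, etc.)
async function insertRawLogs(rows: unknown[]) {
  /* write the raw rows to your warehouse here */
}

// Extract + Load: walk the chain in fixed-size chunks and persist raw logs untouched.
// The Transform step happens afterwards, as SQL/dbt models over these raw tables.
async function backfillLogs(fromBlock: bigint, toBlock: bigint, chunkSize = 2_000n) {
  for (let start = fromBlock; start <= toBlock; start += chunkSize) {
    const end = start + chunkSize - 1n > toBlock ? toBlock : start + chunkSize - 1n;
    const logs = await client.getLogs({ fromBlock: start, toBlock: end });
    await insertRawLogs(logs);
  }
}
```

Every tweak to your transform logic then only touches the SQL models, not this loop - which is exactly why ELT gets painful once the raw data itself is tens of terabytes.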

Blockchains have scaled wildly in the last year, with Solana and Base (EVM) leading the way. To give you a sense of scale, core Solana data takes up roughly 500TB and Base data takes up 50TB. With this current storage scale, you can add another $4-10k a month for storing, transforming, and querying all this data on GCP/AWS - then multiply this all by the number of chains you’re trying to support. These costs start to add up quickly if you aren’t careful!

Twitter Link (archive is the historical state of the chain; clients like GETH/RETH can have different requirements based on implementation. This chart mainly serves to show you the rate of data growth.)

Keep in mind that all chains are starting to hyper-scale, with Base growing its gas limit exponentially (read: number of transactions/state) and Solana aiming for 1 million TPS. The devs are only working to minimize the size of state that nodes need, not to help the poor souls who have to index and work with the data. So if you plan to store all raw data from a chain as part of your data strategy, you’re in for a lot of $$$ pain.

We’re also quickly moving away from a block-by-block processing model (goodbye 2 second block times) to a pure event streaming model (100-200ms per block, and even faster with preconfs). Again, this means that some of the slower data pipelines and delayed transforms you’ve been relying on will start to fall apart - make an error, and who knows when your pipeline will catch up to the latest block/event again.
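
On the frontfill side, that streaming model looks less like polling block ranges and more like subscribing to blocks as they land - a minimal viem sketch over a websocket RPC (the URL is a placeholder):

```typescript
import { createPublicClient, webSocket } from "viem";
import { base } from "viem/chains";

const client = createPublicClient({
  chain: base,
  transport: webSocket("wss://YOUR_WS_RPC_URL"),
});

// React to each new block as it arrives; if your handler errors or falls behind,
// you need your own catch-up logic to get back to the chain tip.
const unwatch = client.watchBlocks({
  onBlock: (block) => {
    console.log(`block ${block.number}: ${block.transactions.length} txs`);
    // decode/transform/insert here - keep it fast, or hand off to a queue
  },
  onError: (error) => console.error("stream dropped, plan a catch-up backfill", error),
});
// call unwatch() to stop the subscription
```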

So what can you do? It’s time to start studying different methods of backfilling and transforming crypto data!

A Mental Model for Crypto Data Transforms

Usually, the first iteration of the data pipeline involves tracking a few events from a set of contracts and then performing basic transforms on them before inserting them into a database.

Subgraphs on theGraph created the best developer experience (DX) around this pattern, letting you track events and factories while also saving necessary related parameters, such as lending pool reserves and user params at the time of liquidation on Aave. The process is as follows:

  1. List a set of contracts and ABIs in a yaml/schema
  2. Define a set of mapping (transform logic) TypeScript files that process each event into columns on the schema (sketched after this list)
  3. Run these mappings over the whole history of the chain and persist these schemas to a database block-by-block
  4. theGraph handles the raw data so you don’t have to deal with the RPC and storage issues of a full backfill
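
For reference, those step-2 mapping files are AssemblyScript (a TypeScript subset). Here is a minimal sketch of a handler, assuming a Swap event and a Swap entity generated from your own ABI and schema.graphql - the field names are illustrative:

```typescript
// src/mapping.ts - compiled by graph-cli; the imports below are code-generated
// from the contract ABI and schema.graphql declared in step 1.
import { Swap as SwapEvent } from "../generated/Pool/Pool";
import { Swap } from "../generated/schema";

export function handleSwap(event: SwapEvent): void {
  // one entity row per event, keyed by tx hash + log index
  const id = event.transaction.hash.concatI32(event.logIndex.toI32()).toHexString();
  const swap = new Swap(id);
  swap.sender = event.params.sender; // field names assume your schema/ABI
  swap.amount0 = event.params.amount0;
  swap.amount1 = event.params.amount1;
  swap.blockNumber = event.block.number;
  swap.save();
}
```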

However, over the years these initial subgraphs ran into several limitations:

  • Backfills can run into performance issues (and rerun every time you edit a subgraph), and indexing can also lag behind on faster chains.
  • If you require some enhanced decoding, read functions, or RPC calls, then you can’t use viem or other helper packages.
  • Subgraphs only support GraphQL, so aggregation queries and further transformations on top are painful.

Realizing these limitations, many of the data products from 2022/2023 contained some sort of “subgraph migration” feature, such as Satsuma (acq. Alchemy), Goldsky, Ponder, Subquery, Subsquid, and StreamingFast. All of these take different approaches to keeping the simple DX of subgraphs while solving some of the problems mentioned earlier. theGraph is of course also trying to tackle some of these, choosing StreamingFast as their core partner.

We’ll get into all the core data engineering problems in the following sections, but at this stage we can define our transform needs as follows:

Hook onto an event/function (like a Swap) and then gather “periphery” information at the time of the hook using a combination of packages and read functionality. This is also a good time to gather a filtered version of a larger dataset, such as full traces only when there was a Swap in the transaction.


Post-Execution Transforms

So we know we can combine the extract and transform steps with some subgraph-esque hooks. Let’s call that a “post-execution transform,” defined as transforming the default raw data from an RPC after a transaction/block is committed. Below are some tools that I’ve found most useful for a specific use case.

Remember that it’s very unlikely a single platform will fulfill all your needs, so don’t get too attached to always using the same tool.

Let’s start with the tools that let you directly build from the RPC:

Easiest Self-Hosted Transform Solution: Ponder (Typescript Library)

The best way to do this without relying on a cloud vendor is by using ponder.sh. You can hook onto any ABI event, and then use viem to do any onchain calls/reads you need to enhance the data before inserting it into your database. Here’s a simple example of combining a hook and a read function together. You can imagine hooking onto a Uniswap V3 Swap event, then calling their TWAP oracle and combining that with reads on the tick movements at the start and end of the block.
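
A rough sketch of what that handler could look like in Ponder - the contract name, ABI import, and database write are illustrative, and the exact imports/db API depend on your Ponder version:

```typescript
import { ponder } from "ponder:registry"; // older versions import from "@/generated"
import { UniswapV3PoolAbi } from "./abis/UniswapV3Pool"; // your ABI file

ponder.on("UniswapV3Pool:Swap", async ({ event, context }) => {
  // Periphery read at the time of the hook: query the pool's built-in TWAP oracle
  // (observe) over the last 60 seconds, using Ponder's cached viem client.
  const [tickCumulatives] = (await context.client.readContract({
    abi: UniswapV3PoolAbi,
    address: event.log.address,
    functionName: "observe",
    args: [[60, 0]],
  })) as [readonly bigint[], readonly bigint[]];

  const twapTick = (tickCumulatives[1] - tickCumulatives[0]) / 60n;

  // Persist the enriched row to your schema here, e.g. something like:
  // await context.db.insert(swaps).values({ id: event.log.id, twapTick, ... });
});
```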

I personally use ponder to power the Bytexplorer mint page to track ERC6551, ERC20, ERC721, ERC1155, and some factory contracts.

If you’re tracking a few specific contracts, need to make various API calls during your transformation, and want to store this data in your own database, then Ponder will be the simplest and lowest-cost option when combined with an RPC provider like Quicknode or Alchemy.

You can check out envio for a similar solution.

Lightweight Cloud Solution: Quicknode (Typescript/Python)

If you want a more lightweight solution than Ponder with a simple key-value store (like MongoDB), then check out Quicknode Functions (they also provide an RPC). I would pick this if you don’t need to join the data heavily with some internal database table, but still want something more flexible than a subgraph.

They have a great guide here with Aave. You can think of functions like a lambda on AWS, with three key steps:

  1. Set up a “stream” of backfill data to run through the function
  2. Write some arbitrary TypeScript/Python scripts as a “function” that can be chained together to transform data (roughly sketched after this list)
  3. Store outputs in a key-value store that can be queried via API
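
To make step 2 concrete, here is a rough, generic sketch of the kind of transform body you’d write - the entrypoint name, payload shape, and topic constant are placeholders, so check Quicknode’s Functions docs for the exact signature:

```typescript
// Hypothetical Functions-style transform: receive a batch of raw logs from a stream,
// keep only the event you care about, and return rows shaped for a key-value store.
type StreamLog = {
  transactionHash: string;
  address: string;
  topics: string[];
  blockNumber: number;
};

// Placeholder: topic0 (keccak256 of the event signature) for your target event,
// e.g. Aave's LiquidationCall.
const TARGET_TOPIC = "0xTOPIC0_OF_YOUR_EVENT";

export function main(payload: { data: StreamLog[] }) {
  const rows = payload.data
    .filter((log) => log.topics[0] === TARGET_TOPIC)
    .map((log) => ({
      key: `${log.transactionHash}:${log.address}`,
      value: { block: log.blockNumber, contract: log.address },
    }));

  // What you return is what gets stored / handed to the next chained function.
  return { rows };
}
```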

I believe indexing.co is the only other platform to support functions as well; you can reach out to their team to test it out.

This is the most flexible and lightweight option out there, useful if you are working with object specific data (sets of wallets, tokens, and contracts that don’t require many aggregations or joins).

Quickest Self-Hosted Bulk Analytics Solution: Cryo (Rust, Python)

I already mentioned them earlier, but Cryo supports almost 40 raw datasets that are basic transforms on top of raw RPC data. Storm (the maintainer) gave a great high level talk on self-hosted data and where Cryo fits in:

You can find a more recent talk here that goes more in depth.

If you just want generic raw datasets for one-time analytics purposes, you should check to see if there are any supported datasets on cryo that you can easily pull with their CLI (run node or get an RPC → pull with cryo → saves in parquet files → query with duckdb). This is NOT for building a continuous pipeline, or for adding transforms during extraction though.
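
As a tiny example of that last step, you can query cryo’s parquet output straight from Node with the duckdb package (the file glob and column below are illustrative - point them at whatever cryo actually wrote for you):

```typescript
import duckdb from "duckdb";

const db = new duckdb.Database(":memory:");

// DuckDB reads parquet directly; aim the glob at cryo's output directory.
db.all(
  `SELECT block_number, count(*) AS txs
     FROM read_parquet('./cryo_output/*transactions*.parquet')
    GROUP BY 1
    ORDER BY 1 DESC
    LIMIT 10`,
  (err, rows) => {
    if (err) throw err;
    console.table(rows);
  }
);
```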

Now onto the SQL tools that let you build on top of basic tables built out of RPC data…

Transform/Query Without Your Own Data Warehouse: Dune (SQL)

My recommendation is to always use a cloud SQL data provider like Dune first to test out your transformation logic without cost, before jumping straight into the raw data with some of the tools I will talk about later. You should think of your contract data as a small subset of all the available data, and using a platform like Dune will help you build familiarity with transforms across the post-execution space. Everything from creation_traces (deployed contracts), ERC20 transfers, prices, DEX trades, ENS names, and more will be tables you likely create or use down the line. You can create your own query views and matviews for a pseudo DAG like this:

Every query has a “lineage” button. Note that you can’t programmatically execute the queries by order of inheritance, so the lineage is only for show.

I often utilize the Dune API filters to test these transforms in my backend before ever loading large tables into my database - just remember to apply filters to stored query results rather than re-executing the query on each call (which will get really expensive). You can also upload small sets of your own data to Dune for testing against some of your internal schemas. It’s also great for daily or weekly jobs you may need to run, like Cowswap does for distributing solver rewards.
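
A minimal sketch of that pattern: hit a saved query’s stored results endpoint and let the filters/limit parameters slice rows server-side, instead of triggering a fresh execution (the query ID and filter columns are placeholders):

```typescript
const QUERY_ID = 1234567; // placeholder: the ID of your saved Dune query

const url = new URL(`https://api.dune.com/api/v1/query/${QUERY_ID}/results`);
url.searchParams.set("limit", "1000");
// Filter on the already-materialized result columns rather than re-running the query.
url.searchParams.set("filters", "blockchain = 'base' AND block_time > '2024-12-01'");

const res = await fetch(url, {
  headers: { "X-Dune-API-Key": process.env.DUNE_API_KEY! },
});
const { result } = await res.json();
console.log(result?.rows?.length, "rows from the latest stored execution");
```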

Remember that Dune data is not real-time, with large variations in freshness across tables (anywhere from 1 minute to 1 hour to 1 day), and the execution engine can run into queues/variance during busy times. With Dune, do not expect to be able to pull raw datasets into your own database without paying a lot of money. The ideal setup is needing to pull already deeply abstracted tables like dex.trades, where you can pull the last 100 million trades (100m rows, 25 columns, at 25,000 datapoints per credit with 100,000 credits) using the premium plan of $1k/mo. 100 million trades is roughly 2 months of trades on Base (as of December 2024). You can use functions like array_agg to pack as many values as you can into one cell to save on credits.

Allium has a SQL offering with a more powerful API and data extraction that is real-time within 5 seconds and has a dedicated engine - albeit with fewer community example queries and tables.

Choose these options if you want a data solution where you don’t need a data warehouse of your own for onchain data. Your complexity here will be in finding and creating the SQL tables you need within the platform. You can expect to pay at least $1k a month at scale for immediate access to the most powerful data pipelines in crypto.

Easiest SQL-to-your-Warehouse Transforms: Goldsky (SQL)

Unfortunately, Dune does not have good data engineering DX. It’s not easy to manage pipelines of query views to sink into an external database. I’ve found that Goldsky has the best self-serve experience for this, with a DAG table builder:

Goldsky Mirror / Pipeline

They support many sinks (pushing data from queries into your database) for a large number of data destinations with one click, coming close to the ease of use of a tool like Fivetran. I believe they have some of the most startup-friendly pricing for first-time backfills or academic usage as well, if you reach out nicely to the team.

You can create a DAG of queries in Goldsky the same way you can in Dune; however, Goldsky actually supports stateful transforms through streaming SQL with Flink (instead of batch like dbt). You can also interweave data from external handlers using HTTP requests or webhooks, letting you externally manage some custom functions before pushing and joining with other Goldsky datasets.

Re-Execution Transforms

If you’re a beginner data engineer, I highly recommend using one of the earlier solutions for whatever you are trying to do. When you get into more advanced and data heavy pipelines, you may need to study re-execution transforms.

Forking a chain with custom execution has been a common practice in Solidity testing workflows for years. The basic idea is that you can take a chain and its state, then copy, edit, and run it locally so that you can do anything you want to it:

  • Submitting transactions at any block from any wallet (without the key)
  • Editing contract code to add events or even new variables to enhance the data
  • Adding opcodes or other low level functionality for more visibility (cheatcodes)

Once you’ve made your edits, you can replay a previous/future block and then use the same RPC endpoints to extract your enhanced data.

There has been an ongoing trend to build data tooling closer to the execution level, since any edits you make here will save you time doing expensive joins and queries post-execution.

The only downside here is that you have to learn Solidity and some lower-level VM lifecycle concepts to leverage these tools.

Easiest Events Editor: Shadow (Cloud or Self-Hosted)

Sometimes you will want to enhance the data around a key emitted event (like a Swap). Shadow makes it easy to add new events that let you capture more data at the exact time of execution. Note that getting data at transaction-level granularity requires a re-execution environment; the post-execution tools can really only get you data at block-level granularity. This is very important for data like prices, liquidity, balances, health factors, oracle math, and more.

Here’s an example of their added events to Uniswap V3, namely making it easier to grab liquidity and price data during the swap. There is a registry of examples they’ve built so you can easily learn and run them alongside their shadow reth node, which takes advantage of some fancy reth client ExEx features. Don’t worry if that’s all word soup to you - it basically just means you can edit events and run it yourself fairly simply.

Emily Hsia, CTO of Shadow, gave a great presentation talking about using Shadow in your data pipeline below (2:39:18):

The RPC standard support doesn’t change at all when you edit events, so you can then use Ponder with a Shadow RPC to create a subgraph/hook data engineering experience on top.

Other options include ghostlogs and sim patch, but these are not open source or self-hostable (if that matters to you).

Most Powerful Re-Execution Hooks: Sim (smlXL, acq. Dune)

For more complex transforms, you may need data from different points in time of execution and not just a single event. You can create hooks inside of a smart contract using Sim, such as hooking on to every time a storage variable like a balance mapping has changed, every transfer event, or every call trace in a transaction/block:

You can easily get real-time balances at the latest block using something like Dune Echo API, but my use case required being able to get a balance at any historical block - even within a specific transaction. So one fun Sim canvas I created is getting balance changes of all addresses and tokens involved in a transaction (pre and post) using a combination of storage hooks and block hooks.

The logic in my lambda (set of hooks in solidity) is as follows:

  1. For a given transaction, before the balance mapping is changed, store the owner, token address, and current balance in a data mapping called “balances” with a struct of “BalanceInfo”
  2. Each time the balance is changed, update the final balance value in “BalanceInfo”.
  3. We keep only the last update, and at the end of the block we return the whole “balances” mapping with the complete before and after “BalanceInfo” data (shown in the test logs).
  4. Nothing is saved to a database; I get to process this per API call using an “external trigger” component.

Link to Canvas with Code

This kind of data extraction requires RPC tracing, storage layouts, and low-level memory diff tracking to produce - the kind of expertise very few people in crypto have. Meanwhile, I can create that pipeline using Sim in just 100 lines. Getting dynamic data like balances at transaction-level granularity was previously impossible, but now it can be done.

Sim is the data tool that I use the most for my new data engineering stack, covering 90% of my use cases. The other 10% is covered by Dune and Quicknode.

After these transformations, you can then write SQL queries on top and execute those as API endpoints, or sink the results directly to your database as you would with Goldsky.

Sim also supports stateful solidity transforms, meaning you can add a storage variable whose state carries on through all blocks. With a simple mapping variable, you essentially can create a key value store in your lambda!

Lightweight Custom Read Functions: TEVM (Typescript Library)

A common data gathering utility in crypto is the “multicall,” where you essentially deploy a contract with a single function that calls a set of other read functions. This way you can gather a bunch of data across contracts while only making one RPC call (saving you time and money). Viem actually has a powered-up version of this in their simulate call that lets you override state too.
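
For example, viem’s multicall action batches many reads into a single Multicall3 call under the hood - a quick sketch (the wallet and token addresses are placeholders):

```typescript
import { createPublicClient, http, erc20Abi } from "viem";
import { mainnet } from "viem/chains";

const client = createPublicClient({ chain: mainnet, transport: http() });

const WALLET = "0x0000000000000000000000000000000000000001" as const; // placeholder
const TOKENS = [
  "0x0000000000000000000000000000000000000002", // placeholder token addresses
  "0x0000000000000000000000000000000000000003",
] as const;

// One RPC round trip for many reads: balanceOf across several tokens at once.
const balances = await client.multicall({
  contracts: TOKENS.map((address) => ({
    address,
    abi: erc20Abi,
    functionName: "balanceOf" as const,
    args: [WALLET],
  })),
});

console.log(balances); // [{ status: "success", result: 0n }, ...]
```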

We can now take it a step further and rewrite read functions on any existing contract completely. This is like editing events, but it is written directly in TypeScript, making it even more lightweight since it skips the full compile/deployment process that Shadow and Sim have.

Twitter Link

To be honest, I have not built many examples with TEVM yet, but I have some fun public experiments coming up soon. I’m very excited about it, so I’m including this tool here now for others to check out, too.

Concluding Thoughts: Tradeoffs on Transforms

All in all, your data pipeline will depend on when and where you handle your transforms (thanks captain obvious). Below is a little chart of the tools and concepts I’ve mentioned today and when they can be used:

Remember these are only a subset of all the crypto data tools; there are many others, but they largely fall under the same usage archetypes shown above.

As a rule of thumb, I usually try to get a basic pipeline working on a cloud platform provider first, and then decide later if I want to pull it out into an external pipeline/database. Doing every dataset yourself from scratch each time will slow you down immensely and cost you both in infra and engineers.

In the end, you will most likely need a combination of tools here to meet both your app and analytics data requirements. The open source and open data nature of crypto means we’re given a lot of powerful tools and platforms, instead of just closed APIs.

Many analysts live in the world of Dune, and only know SQL. Learning the rest of the data stack with some of the tools here will be a way to take your skills and career to the next level, because as the space evolves, so will the number and quality of apps. Those apps will need good data. And good data requires great data engineers and engineering - I hope that will be driven by you.

Best of luck to all of you in 2025!
