Anton Bryzgalov
21 Nov 2022 • 2 min read
Blockchain technologies are no longer a geeky niche, and many companies run their business around crypto assets. A popular scenario is to build a Data Platform that operates on blockchain data and to launch analytical and real-time services on top of it. This is what I have done recently, and I want to share my experience with you.
Hi 👋 I am Anton Bryzgalov (aka @bryzgaloff), an Expert Data Engineer and Architect in Cloud & Blockchain.
The solution I have built is described in detail in my blog post on Medium. This post is a short summary written exclusively for Blockchain Works.
I have exported the data using a number of public node providers. This is fine for an MVP. However, in the long term, I strongly advise you to run your own node. Such a node takes several weeks to sync and adds hosting costs, but the subsequent data extraction speed is significantly higher than with a public node.
Also, using free public nodes requires a solid fallback mechanism due to rate limits and other restrictions. With your own node, the rate is limited only by your VM's resources.
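To illustrate the fallback idea, here is a minimal Python sketch. The endpoint list and error handling are my own assumptions, not the exact mechanism from the article: the client simply tries the next public node whenever one is rate-limited or fails.

```python
# Illustrative fallback across public Ethereum JSON-RPC endpoints.
# The endpoint list below is hypothetical -- use the providers you rely on.
import requests

PUBLIC_NODES = [
    "https://cloudflare-eth.com",
    "https://rpc.ankr.com/eth",
    "https://eth.llamarpc.com",
]

def rpc_call(method: str, params: list):
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    last_error = None
    for url in PUBLIC_NODES:
        try:
            response = requests.post(url, json=payload, timeout=10)
            response.raise_for_status()  # e.g. HTTP 429 when rate-limited
            body = response.json()
            if "error" in body:
                raise RuntimeError(body["error"])
            return body["result"]
        except Exception as exc:  # timeout, rate limit, malformed response
            last_error = exc  # fall through to the next provider
    raise RuntimeError(f"All public nodes failed, last error: {last_error}")

latest_block = int(rpc_call("eth_blockNumber", []), 16)
```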
Using a paid provider like Alchemy or QuickNode is not an option at all: the full history is too big, and exporting it would cost tens of thousands of dollars.
For a quick launch, I have used the public BigQuery datasets with Ethereum data. They are populated using the open-source Ethereum ETL tool. Exporting data from BigQuery into S3 is a cheap option.
A set of optimizations described in the article allowed me to export the data in 24 hours and for ≈$100 (BigQuery → S3).
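A rough sketch of that export path, with assumed bucket and table names (the article's actual pipeline includes more optimizations): extract a public Ethereum table from BigQuery to Google Cloud Storage, then copy the files over to S3.

```python
# Sketch of a BigQuery -> GCS -> S3 export; bucket names are hypothetical.
import boto3
from google.cloud import bigquery, storage

bq = bigquery.Client()
extract_job = bq.extract_table(
    "bigquery-public-data.crypto_ethereum.transactions",
    "gs://my-staging-bucket/ethereum/transactions/*.json.gz",
    job_config=bigquery.ExtractJobConfig(
        destination_format="NEWLINE_DELIMITED_JSON",
        compression="GZIP",
    ),
)
extract_job.result()  # wait for the export to finish

# Copy every exported object from GCS to S3.
gcs_bucket = storage.Client().bucket("my-staging-bucket")
s3 = boto3.client("s3")
for blob in gcs_bucket.list_blobs(prefix="ethereum/transactions/"):
    with blob.open("rb") as source:
        s3.upload_fileobj(source, "my-data-lake", blob.name)
```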
However, relying on BigQuery means losing the original data structure, so to operate on raw JSON I have had to switch back to the public nodes.
🔥 As an outcome, I have ingested the full history of Ethereum into S3. The data is stored in the raw JSON format as returned by the nodes. This allows me to easily rerun any calculations on the full history without touching the faulty public nodes.
Querying is performed using AWS Athena, which supports JSON brilliantly. But the full dataset is now 4.5 TB, and Athena charges for the volume of data scanned: reading the full raw history costs ≈$25. So, to reduce the costs, I convert the data from raw JSON into the cost-efficient Parquet format (with Snappy compression). For specific use cases, this has reduced querying costs by 97%!
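The conversion can be done with an Athena CTAS query; the sketch below uses assumed database, table, and bucket names rather than the ones from the article.

```python
# Hypothetical CTAS query rewriting the raw JSON table into Snappy-compressed
# Parquet so that subsequent Athena queries scan far less data.
import boto3

athena = boto3.client("athena")

CTAS_QUERY = """
CREATE TABLE ethereum.transactions_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-data-lake/ethereum/transactions_parquet/'
) AS
SELECT * FROM ethereum.transactions_raw_json
"""

athena.start_query_execution(
    QueryString=CTAS_QUERY,
    QueryExecutionContext={"Database": "ethereum"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
```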
Our product-level goal was to build a crypto wallet balances API: given a wallet address, it should respond with a list of all tokens held by the wallet, with 100% accurate balances. The latency from the latest mined block should be minimal.
Following the KISS ("keep it simple, stupid!") principle, I have implemented a polling script which "listens" for new blocks on top of Ethereum and, once a new block appears, updates a DynamoDB table containing aggregated balances for each wallet.
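Here is a stripped-down sketch of that polling loop. It tracks plain ETH transfers only (token transfers would require decoding event logs), and the node URL, table name, and key schema are assumptions rather than the article's actual setup.

```python
# Minimal polling sketch: watch for new blocks and bump per-wallet ETH
# balances in a DynamoDB table (all names below are hypothetical).
import time
from decimal import Decimal

import boto3
import requests

NODE_URL = "https://cloudflare-eth.com"  # assumed public node
table = boto3.resource("dynamodb").Table("wallet_balances")  # assumed table

def rpc(method: str, params: list):
    body = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    return requests.post(NODE_URL, json=body, timeout=10).json()["result"]

last_seen = int(rpc("eth_blockNumber", []), 16)
while True:
    latest = int(rpc("eth_blockNumber", []), 16)
    for number in range(last_seen + 1, latest + 1):
        block = rpc("eth_getBlockByNumber", [hex(number), True])  # True = full txs
        for tx in block["transactions"]:
            value_wei = int(tx["value"], 16)
            if value_wei == 0 or tx["to"] is None:
                continue  # skip zero-value txs and contract creations
            # Atomically increment the recipient's aggregated balance.
            table.update_item(
                Key={"wallet": tx["to"]},
                UpdateExpression="ADD balance :v",
                ExpressionAttributeValues={":v": Decimal(value_wei)},
            )
    last_seen = latest
    time.sleep(5)  # poll a bit faster than the ~12s average block time
```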
Calculating balances using Athena and uploading the data into DynamoDB was quite a challenge: for example, uint256-based calculations are not natively supported in Athena, whose biggest number type is a 128-bit Decimal.

⚡️ As a result, we are able to recalculate the full balances history and upload it to DynamoDB in ≤3 hours (and for <$1). The API's latency is <1s from the latest block. Dive deeper into the details in the article.
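Since a full uint256 does not fit into Athena's Decimal, one possible workaround (my own illustration, not necessarily the article's exact approach) is to keep the exact arithmetic outside Athena, relying on Python's arbitrary-precision integers and storing the results as strings.

```python
# Hypothetical sketch: exact uint256 arithmetic outside Athena.
# Token amounts arrive from the node as 0x-prefixed hex strings; Python's int
# is arbitrary-precision, so balances can be summed without overflow and then
# stored as decimal strings to sidestep numeric-type limits.

def to_int(hex_value: str) -> int:
    """Convert a 0x-prefixed hex string (uint256) to a Python int."""
    return int(hex_value, 16)

transfers = ["0xde0b6b3a7640000", "0x1bc16d674ec80000"]  # 1 ETH + 2 ETH in wei
balance = sum(to_int(v) for v in transfers)
print(str(balance))  # "3000000000000000000" -> safe to store as a string
```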
Working on something similar? I am happy to discuss it further with you; let's get in touch on LinkedIn! 🤝
Anton Bryzgalov
I am an Expert Cloud Data Engineer with 7+ years of production experience, including 5 years as a Solutions Architect. I have designed 25+ projects across various domains, including finance, marketing, retail, and blockchain.