Institutional-grade, full-stack blockchain analytics pipeline for Cronos EVM. Designed to produce investment-quality research on network health, token economics, DeFi ecosystem depth, user quality, and valuation — from first principles, directly from raw on-chain data.
- Overview
- Research Philosophy
- Architecture
- Data Infrastructure & Ingestion Layer
- Analytical Framework
- Chain Health & Network Vitality
- Transaction Quality & Organic Activity Assessment
- Automated Account Detection & Behavioral Classification
- Wash Trading & Artificial Volume Detection
- DeFi Ecosystem Analysis
- Token Flow & Holder Behavior Analysis
- Wallet Clustering & Address Relationship Detection
- Validator Decentralization & Staking Economics
- Smart Contract Ecosystem Depth
- Social Sentiment & Narrative Analysis
- Governance Participation Analysis
- Catalyst Event Modeling
- Peer Chain Benchmarking Framework
- Valuation Framework
- Narrative Synthesis Layer
- Visualization & Dashboard Infrastructure
- Reporting Pipeline
- Technical Stack
- Cloud Infrastructure (Akash Network)
- Repository Layout
- Quick Start
This repository contains the full research infrastructure used to conduct a comprehensive, data-driven investigation of the Cronos EVM blockchain network. The pipeline spans from raw RPC ingestion through multi-dimensional signal generation, statistical analysis, cross-market benchmarking, and investment-grade narrative synthesis — producing both interactive dashboards and structured research reports.
The framework treats on-chain data as a primary signal source and applies institutional research standards to every layer of analysis: rigorous data validation, reproducible aggregation logic, defensible statistical methodology, and transparent assumption documentation. Where market data or off-chain signals supplement the analysis, sources are logged and assumptions are parameterized for sensitivity testing.
This is not a surface-level metrics aggregator. Every reported figure, from daily active addresses to annualized gas revenue to exchange flow directionality, is computed from first principles using raw block, transaction, receipt, and event log data, with explicit logic for handling data quality issues, rounding artifacts, and protocol-specific idiosyncrasies.
Blockchain analytics suffers from a pervasive problem: most publicly available metrics are either superficially aggregated, easily gamed, or presented without adjustment for the quality of the underlying activity. A chain reporting 100,000 daily active addresses may in practice have 60,000 automated accounts, 15,000 wash-trading bots, and a further 10,000 address-splitting artifacts, leaving roughly 15,000 organic users and a materially different picture of engagement.
This research framework is built around three foundational principles:
1. Reported vs. Adjusted Metrics Every surface-level metric is paired with an adjusted version that accounts for known distortions: bot activity, wash volume, circular flows, and dormant contract interactions. The delta between reported and adjusted is itself an analytical signal — large divergence indicates a chain relying on artificial activity to maintain headline numbers.
2. Economic First Principles Network value is ultimately anchored to economic output. We prioritize gas fee revenue (a direct measure of blockspace demand), stablecoin supply dynamics (a proxy for real settlement usage), and DEX volume quality (filtered for wash trades) over vanity metrics. Every valuation analysis flows through a revenue-based framework with explicit peer comparisons.
3. Reproducibility and Parameterization All thresholds, weights, and assumptions are externalized to configuration files, enabling scenario sensitivity testing without code changes. Researchers using this framework can stress-test conclusions by adjusting classification thresholds, valuation multiples, or peer universe composition.
The pipeline is organized into five conceptual layers:
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 1: INGESTION │
│ Async RPC downloader → Schema-normalized Parquet tables │
├──────────────────────────────────────────────────────────────────┤
│ LAYER 2: AGGREGATION │
│ Raw Parquet → Daily aggregate tables (Polars streaming) │
├──────────────────────────────────────────────────────────────────┤
│ LAYER 3: SIGNAL GENERATION │
│ 12 specialized analytical modules generating investment signals │
├──────────────────────────────────────────────────────────────────┤
│ LAYER 4: SYNTHESIS & VALUATION │
│ Cross-signal integration, peer benchmarking, scenario modeling │
├──────────────────────────────────────────────────────────────────┤
│ LAYER 5: OUTPUT │
│ Interactive dashboard (12 pages) · Markdown/HTML reports │
│ Slide-deck-ready chart exports · Structured data exports │
└──────────────────────────────────────────────────────────────────┘
Each layer is designed to be independently testable and replaceable. Swapping the ingestion source (e.g., from direct RPC to an archive node provider) does not require changes to any analytical module, as long as the Parquet schema contract is maintained.
The ingestion layer is built around an asynchronous batch downloader that interfaces directly with Cronos EVM JSON-RPC endpoints. The architecture is designed for reliability and throughput over long historical backtests.
Concurrency model: The downloader issues requests concurrently across multiple RPC endpoints with configurable per-endpoint concurrency limits. Endpoint health is tracked in real time: endpoints exhibiting elevated error rates or rate-limit responses are placed into a cooldown queue and rotated out in favor of healthier alternatives. This approach maximizes throughput while maintaining data completeness.
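The cooldown-and-rotate behavior described above can be sketched as a small round-robin pool. This is a minimal illustration, not the downloader's actual code: the `EndpointPool` class, the `max_errors` and `cooldown_s` parameters, and their default values are all assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class Endpoint:
    url: str
    errors: int = 0              # consecutive failures since last success
    cooldown_until: float = 0.0  # clock value before which the endpoint is benched


class EndpointPool:
    """Round-robin pool that benches endpoints after repeated failures (hypothetical)."""

    def __init__(self, urls, max_errors=3, cooldown_s=30.0, clock=time.monotonic):
        self.endpoints = [Endpoint(u) for u in urls]
        self.max_errors = max_errors
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._next = 0

    def pick(self):
        """Return the next endpoint not in cooldown, or None if all are benched."""
        now = self.clock()
        for _ in range(len(self.endpoints)):
            ep = self.endpoints[self._next]
            self._next = (self._next + 1) % len(self.endpoints)
            if ep.cooldown_until <= now:
                return ep
        return None

    def report_failure(self, ep):
        """Count an error or rate-limit response; bench the endpoint at the limit."""
        ep.errors += 1
        if ep.errors >= self.max_errors:
            ep.cooldown_until = self.clock() + self.cooldown_s
            ep.errors = 0

    def report_success(self, ep):
        ep.errors = 0
```

Injecting the clock keeps the pool deterministic under test; a production version would also need per-endpoint concurrency limits, which are omitted here.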
Batch download semantics: Block ranges are divided into configurable batch spans. Each batch atomically retrieves block headers, full transaction lists, transaction receipts, event logs, contract creation records, and optionally internal transaction traces. Batches are validated after write — checksums on record counts are verified before a batch is marked complete.
Checkpointing and fault tolerance: The downloader maintains per-worker checkpoint files recording the last successfully completed batch boundary. On restart, workers resume from their checkpoint rather than re-downloading completed ranges. Failed individual blocks are tracked in a separate manifest and retried in a subsequent pass rather than blocking forward progress.
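A minimal sketch of the checkpoint-and-resume logic follows. The JSON layout, the `last_completed_block` key, and the function names are hypothetical; the source only states that each worker records its last completed batch boundary.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path, last_block):
    """Atomically record the last successfully completed batch boundary."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump({"last_completed_block": last_block}, fh)
    os.replace(tmp, path)  # rename is atomic on the same filesystem


def load_checkpoint(path, default_start):
    """Return the first block still to download."""
    path = Path(path)
    if not path.exists():
        return default_start
    return json.loads(path.read_text())["last_completed_block"] + 1


def pending_batches(start, end, span):
    """Yield inclusive (lo, hi) block ranges of at most `span` blocks."""
    lo = start
    while lo <= end:
        hi = min(lo + span - 1, end)
        yield lo, hi
        lo = hi + 1
```

Writing to a temporary file and renaming it means a crash mid-write cannot leave a truncated checkpoint behind.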
Schema normalization: Raw RPC JSON responses are parsed and type-coerced into a fixed Parquet schema before write. This step handles common RPC inconsistencies: hex-to-integer conversion for block numbers and gas fields, address checksum normalization, null handling for missing fields (e.g., to address on contract creation transactions), and value precision handling for Wei-denominated amounts.
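As an illustration of the coercion step, the sketch below normalizes a single JSON-RPC transaction object. The output field names follow the `transactions` table, but the function itself is hypothetical. Two simplifications are assumptions of this example: addresses are lowercased rather than EIP-55 checksummed (which needs a Keccak-256 implementation outside the standard library), and Wei values are stored as decimal strings so they cannot overflow a 64-bit column.

```python
def hex_to_int(value):
    """Convert a 0x-prefixed hex quantity to int; pass None through."""
    if value is None:
        return None
    return int(value, 16)


def normalize_transaction(raw):
    """Coerce one raw JSON-RPC transaction dict into fixed-schema fields."""
    to_addr = raw.get("to")               # None for contract creations
    input_data = raw.get("input") or "0x"
    return {
        "hash": raw["hash"].lower(),
        "block_number": hex_to_int(raw["blockNumber"]),
        "from": raw["from"].lower(),
        "to": to_addr.lower() if to_addr else None,
        "value": str(hex_to_int(raw["value"])),  # Wei as a decimal string
        "gas_price": hex_to_int(raw.get("gasPrice")),
        # "0x" plus 8 hex characters is the 4-byte function selector
        "method_id": input_data[:10] if len(input_data) >= 10 else None,
        "is_contract_creation": to_addr is None,
    }
```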
| Table | Granularity | Key Fields |
|---|---|---|
| `blocks` | One row per block | number, timestamp, gas_used, gas_limit, miner, tx_count |
| `transactions` | One row per transaction | hash, from, to, value, gas_price, gas_used, method_id, status, is_contract_creation |
| `receipts` | One row per transaction | hash, gas_used, status, contract_address |
| `event_logs` | One row per log entry | tx_hash, contract_address, topic_0–3, data, log_index |
| `token_transfers` | Decoded ERC20 transfers | token_contract, from, to, value, tx_hash |
| `dex_swaps` | Decoded swap events | pair_address, sender, amount0_in/out, amount1_in/out |
| `liquidity_events` | Mint/burn events | pair_address, sender, amount0, amount1, event_type |
| `contract_creations` | Deployed contracts | contract_address, deployer, tx_hash, block_number |
Raw tables are processed by a streaming aggregation engine (built on Polars) that produces daily summary tables without materializing the full dataset in memory. Daily aggregates form the primary input to all analytical modules and enable efficient time-series analysis over multi-month to multi-year horizons.
The chain health module produces a multidimensional assessment of network activity across engagement, economic output, and blockspace demand dimensions.
Daily Active Addresses (DAA) Raw DAA is computed as the count of unique transaction senders on a given day. This is supplemented by a decomposition into new addresses (first-ever on-chain appearance) vs. returning addresses (previously seen). The new-to-returning ratio tracks whether growth is driven by genuine user onboarding or by address churn, a pattern common in incentivized ecosystems where users create fresh addresses to re-qualify for rewards. Seven-day and thirty-day moving averages smooth day-of-week seasonality. Peak DAA and percentage decline from peak are reported as summary statistics.
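The new-versus-returning decomposition can be illustrated in plain Python. This is a sketch under simplifying assumptions: the input is an in-memory iterable of `(day, sender)` pairs with ISO date strings, whereas the pipeline itself aggregates Parquet tables. The function names are hypothetical.

```python
from collections import defaultdict


def daily_active_addresses(transactions):
    """transactions: iterable of (day, sender), with day as an ISO date string.

    Returns {day: {"daa": n, "new": n, "returning": n}}.
    """
    senders_by_day = defaultdict(set)
    for day, sender in transactions:
        senders_by_day[day].add(sender)

    seen = set()  # every sender observed on an earlier day
    result = {}
    for day in sorted(senders_by_day):
        active = senders_by_day[day]
        new = active - seen
        result[day] = {
            "daa": len(active),
            "new": len(new),
            "returning": len(active) - len(new),
        }
        seen |= active
    return result


def trailing_mean(values, window):
    """Trailing moving average; early points use the days available so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Calling `trailing_mean` on the daily `daa` series with `window=7` or `window=30` gives the smoothed series.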
Transaction Quality Composition Daily transactions are classified into: simple native transfers, smart contract interactions, contract deployments, and failed transactions. The share of each type over time reveals ecosystem maturity — a chain where the majority of activity is simple transfers lacks DeFi depth, while a high failure rate can indicate congestion, MEV bot activity, or poorly written contracts. Both absolute counts and percentage share are tracked with moving averages.
Gas Revenue Analysis Gas fee revenue (denominated in CRO, converted to USD using historical price data) is computed daily and annualized using a 30-day trailing window. This annualized figure serves as the primary economic output measure for valuation purposes. We explicitly distinguish between fee revenue (the economic value captured by the network) and transaction volume (a vanity metric easily inflated by self-transfers and wash trades).
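One way to read the annualization is to scale the mean daily USD fee over the trailing window to 365 days, as sketched below. The function names, the 18-decimal assumption for the native token, and the use of `gas_used * gas_price` as the per-transaction fee are assumptions of this example rather than details confirmed by the source.

```python
WEI_PER_CRO = 10**18  # assumption: 18-decimal native token on the EVM side


def daily_fee_cro(receipts):
    """Sum gas_used * gas_price (in wei) over one day's receipts, returned in CRO."""
    total_wei = sum(r["gas_used"] * r["gas_price"] for r in receipts)
    return total_wei / WEI_PER_CRO


def annualized_gas_revenue_usd(daily_fees_cro, daily_price_usd, window=30):
    """Mean daily USD fee revenue over the trailing window, scaled to 365 days."""
    if len(daily_fees_cro) != len(daily_price_usd):
        raise ValueError("fee and price series must be aligned by day")
    daily_usd = [fee * px for fee, px in zip(daily_fees_cro, daily_price_usd)]
    recent = daily_usd[-window:]
    if not recent:
        return 0.0
    return sum(recent) / len(recent) * 365
```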
Blockspace Demand Block utilization (gas used / gas limit) and empty block frequency are tracked as direct measures of organic blockspace demand. Low utilization and high empty block rates are structural indicators of insufficient demand, regardless of what headline transaction counts suggest. Comparison to Ethereum's baseline utilization provides context for how aggressively blocks are being filled.
Stablecoin Supply Dynamics Native issuance and redemption of major stablecoins (USDC, USDT, DAI) are tracked via ERC20 Transfer events from/to the zero address. Net stablecoin supply on-chain is a high-quality signal for real settlement usage — unlike gas volume, stablecoin flows are difficult to inflate artificially at scale without economic cost.
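The mint and burn accounting can be sketched as follows. The function name and the tuple layout are hypothetical, and the `decimals` default of 6 is an assumption that fits USDC and USDT on most chains but not DAI, which uses 18.

```python
ZERO_ADDRESS = "0x" + "0" * 40


def net_supply_change(transfers, decimals=6):
    """transfers: iterable of (from_addr, to_addr, raw_value) for one token.

    Transfers from the zero address count as issuance, transfers to it as
    redemption; everything else leaves supply unchanged.
    """
    minted = burned = 0
    for sender, recipient, value in transfers:
        if sender.lower() == ZERO_ADDRESS:
            minted += value
        elif recipient.lower() == ZERO_ADDRESS:
            burned += value
    scale = 10**decimals
    return {
        "minted": minted / scale,
        "burned": burned / scale,
        "net": (minted - burned) / scale,
    }
```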
Contract Liveness Rate The fraction of all historically deployed contracts that received at least one user interaction within a rolling window is computed as a measure of ecosystem vitality. A chain where the majority of deployed contracts are inactive suggests either a failed DeFi ecosystem or one propped up by short-term incentive programs that have since expired.
Raw transaction counts overstate organic activity in virtually every EVM chain. This module decomposes reported metrics into quality-adjusted equivalents.
Reality Gap Quantification For each key metric (DAA, transaction count, DEX volume, smart contract interactions), we produce both the reported figure and an adjusted figure accounting for identified artificial activity. The delta — the "reality gap" — is reported as a separate signal. This synthesis table is a core deliverable of the research, providing a single view of how much reported activity is genuine.
Transaction Value Distribution Transactions are bucketed by value (micro, small, medium, large, whale tiers) to understand the distribution of economic activity. Chains dominated by micro-transactions often reflect bot activity or incentive farming. The presence of consistent large-value transactions indicates genuine institutional or high-net-worth usage.
Method ID Analysis Transaction input data method IDs (the first four bytes of the calldata, representing the function selector) are aggregated to identify which contract functions are being called most frequently. Ecosystem depth is characterized by diversity across method IDs — a chain where a single method ID accounts for a large share of non-transfer interactions indicates activity concentration that may not be organic.
Bot detection is one of the most consequential and methodologically complex components of on-chain research. Automated accounts systematically inflate user engagement metrics, and their removal often reveals a materially smaller genuine user base.
Behavioral Feature Extraction For each address with sufficient transaction history, a behavioral feature vector is computed across multiple dimensions: transaction frequency relative to active days, consistency of gas price selection, breadth of contract interactions, temporal distribution across hours of the day, inter-transaction timing statistics, and diversity of function calls issued. No single feature reliably distinguishes bots from humans — the signal is in the combination.
Ensemble Classification Logic Classification uses a multi-criterion ensemble rather than a single threshold. An address is flagged as likely automated if its behavioral profile simultaneously satisfies criteria across multiple feature dimensions. The specific combination and thresholds are configurable and were calibrated against known automated address patterns. The output is a three-way classification: likely automated, likely human, and uncertain — preserving epistemic honesty for borderline cases.
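A multi-criterion vote of this kind might look like the sketch below. Every feature name, threshold value, and vote cutoff here is a placeholder invented for illustration; the source states only that the combination and thresholds are configurable.

```python
from dataclasses import dataclass


@dataclass
class AddressFeatures:
    tx_per_active_day: float       # transaction frequency relative to active days
    gas_price_unique_ratio: float  # distinct gas prices / transaction count
    active_hours: int              # distinct hours of day with activity (0 to 24)
    interval_cv: float             # coefficient of variation of inter-tx gaps
    distinct_methods: int          # distinct function selectors called


# Placeholder thresholds, not calibrated values
THRESHOLDS = {
    "tx_per_active_day": 100,
    "gas_price_unique_ratio": 0.05,
    "active_hours": 20,
    "interval_cv": 0.2,
    "distinct_methods": 2,
}


def classify(f, t=THRESHOLDS, min_bot_votes=4, max_human_votes=1):
    """Count bot-like criteria and map the vote total to a three-way label."""
    votes = sum([
        f.tx_per_active_day >= t["tx_per_active_day"],
        f.gas_price_unique_ratio <= t["gas_price_unique_ratio"],
        f.active_hours >= t["active_hours"],
        f.interval_cv <= t["interval_cv"],
        f.distinct_methods <= t["distinct_methods"],
    ])
    if votes >= min_bot_votes:
        return "likely_automated"
    if votes <= max_human_votes:
        return "likely_human"
    return "uncertain"
```

Keeping a band between the two cutoffs is what produces the uncertain class for borderline profiles.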
Daily Activity Decomposition Once addresses are classified, daily transaction counts and DAA figures are decomposed by classification category. The artificial DAA share — the percentage of daily active addresses classified as automated — is tracked over time. Trend in this share is meaningful: rising artificial share in the presence of flat raw DAA indicates organic user atrophy masked by growing bot participation.
Temporal Pattern Analysis Automated accounts exhibit characteristic temporal signatures: activity uniformly distributed across all 24 hours (not concentrated in human waking hours), minimal weekend suppression, and highly regular inter-transaction intervals. These patterns are visualized as hourly and day-of-week heatmaps, with bot vs. human overlays, to communicate the behavioral divergence intuitively.
Volume figures — both native token transfer volume and DEX trading volume — are among the most gamed metrics in crypto. Two complementary detection methodologies are applied.
Circular Native Transaction Detection A directed graph is constructed from daily native token transactions, with nodes representing addresses and edges representing value flows. Graph cycle detection algorithms identify circular flow patterns: sequences of transfers that return value to the originating address within a bounded path length. Such cycles — where value moves A→B→C→A — are economically purposeless transfers that exist solely to inflate volume and transaction count metrics. Circular volume is isolated and reported separately from genuine directional flows. The share of daily transfer volume attributable to detected circular patterns is a direct measure of synthetic activity.
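Bounded-length cycle detection can be illustrated with a depth-first search that only explores addresses sorting after the start node, so each directed cycle is reported once. This is a sketch rather than the module's algorithm: `find_cycles` and `max_len` are hypothetical names, edge values and timestamps are ignored, and exhaustive search like this is only practical on small daily subgraphs.

```python
from collections import defaultdict


def find_cycles(edges, max_len=4):
    """edges: iterable of (sender, recipient) pairs for one day.

    Returns a sorted list of simple directed cycles with at most max_len
    addresses, each written starting from its smallest address.
    """
    graph = defaultdict(set)
    for sender, recipient in edges:
        if sender != recipient:  # self-transfers are not treated as cycles here
            graph[sender].add(recipient)

    cycles = set()

    def dfs(start, node, path):
        for nxt in graph.get(node, ()):
            if nxt == start and len(path) >= 2:
                cycles.add(tuple(path))  # value has returned to the originator
            elif nxt > start and nxt not in path and len(path) < max_len:
                dfs(start, nxt, path + [nxt])

    for start in sorted(graph):
        dfs(start, start, [start])
    return sorted(cycles)
```

For example, the edges A→B, B→C, C→A yield the single cycle `("A", "B", "C")`. Summing the transfer values along detected cycles would give the circular volume share.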
DEX Wash Volume Detection On decentralized exchanges, wash trading manifests as rapid direction reversals on the same trading pair by the same sender. The detection logic identifies swap sequences where the same address executes trades of opposite direction on the same pair within a narrow block window, consistent with wash trading mechanics. Flagged volume is separated from the reported DEX volume figure, yielding an adjusted DEX volume that better reflects genuine economic activity. The wash share — detected wash volume as a percentage of gross DEX volume — is tracked as a separate signal.
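The direction-reversal rule can be sketched as below. The swap record layout, the `"buy"`/`"sell"` direction labels, and the `max_block_gap` default are assumptions for illustration; the source does not specify the window size.

```python
def flag_wash_swaps(swaps, max_block_gap=5):
    """swaps: list of dicts with keys sender, pair, block, direction, usd.

    Flags a swap and its predecessor when the same sender reverses direction
    on the same pair within max_block_gap blocks. Returns flagged list indices.
    """
    last = {}        # (sender, pair) -> index of that sender's latest swap
    flagged = set()
    for i in sorted(range(len(swaps)), key=lambda k: swaps[k]["block"]):
        s = swaps[i]
        key = (s["sender"], s["pair"])
        j = last.get(key)
        if j is not None:
            prev = swaps[j]
            reversed_dir = prev["direction"] != s["direction"]
            if reversed_dir and s["block"] - prev["block"] <= max_block_gap:
                flagged.update((i, j))
        last[key] = i
    return flagged


def wash_share(swaps, flagged):
    """Flagged USD volume as a fraction of gross USD volume."""
    gross = sum(s["usd"] for s in swaps)
    wash = sum(swaps[i]["usd"] for i in flagged)
    return wash / gross if gross else 0.0
```

Adjusted DEX volume is then gross volume times `1 - wash_share(...)`.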
These two methodologies are complementary: circular native flows primarily capture point-to-point wash activity, while DEX wash detection targets exchange-specific patterns. Neither is exhaustive, meaning adjusted volumes represent a lower bound on artificial activity rather than a precise measurement.
DEX Volume Quality Assessment Raw DEX swap volume is computed by joining swap event logs with token transfer records to identify underlying asset amounts. USD value is assigned using historical price feeds for major tokens (native asset, wrapped ETH, wrapped BTC) and $1.00 for stablecoins. The resulting USD volume series is then filtered through the wash detection layer to produce quality-adjusted DEX volume. Trends in both gross and adjusted volume — and the spread between them — are core research outputs.
Liquidity Depth Trends Liquidity provision (mint) and removal (burn) events are tracked separately to reconstruct net liquidity flow over time. A sustained net outflow of liquidity is a leading indicator of DeFi ecosystem contraction — liquidity providers are signaling that yield opportunities no longer justify capital commitment. The ratio of cumulative liquidity additions to removals over time reveals whether the ecosystem is attracting or shedding capital.
Protocol Activity Ranking Individual smart contracts are ranked by daily interactions and unique user counts. This ranking reveals ecosystem concentration: how much of total activity flows through a handful of protocols vs. being distributed across a diverse ecosystem. Heavy concentration in one or two protocols makes the chain fragile — the departure or failure of a single protocol materially impacts headline metrics.
Protocol Lifecycle Tracking Each protocol's weekly unique user count is tracked, and protocols whose current usage has declined below a threshold of their historical peak are classified as dying. The share of the protocol ecosystem in declining states is a measure of ecosystem health — a chain with many dying protocols is contracting regardless of what aggregate figures suggest.
Exchange Flow Analysis Exchange addresses are identified through behavioral heuristics: addresses receiving funds from a large number of unique counterparties at scale are consistent with exchange deposit patterns. Net daily flows (inbound minus outbound) from identified exchange addresses constitute the primary exchange flow signal. Sustained net inflows to exchanges are historically associated with selling pressure, as users deposit tokens to sell; net outflows suggest accumulation or withdrawal to self-custody.
Holder Distribution & Concentration Address balances are estimated from lifetime inflow and outflow of the native token, then bucketed into wealth tiers. The resulting distribution — percentage of addresses and percentage of supply in each tier — characterizes holder concentration. The Gini coefficient of the distribution provides a summary scalar for inequality. High concentration in a small number of addresses increases tail risk from large-holder selling.
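A minimal sketch of the balance estimate and the Gini coefficient follows. The function names are hypothetical. Addresses whose estimated balance is negative (for example, sources whose inflows predate the observed transfer history) are dropped before computing the coefficient, which is a simplifying assumption of this example.

```python
from collections import defaultdict


def estimate_balances(transfers):
    """transfers: iterable of (from_addr, to_addr, value). Returns inflow minus outflow."""
    balances = defaultdict(int)
    for sender, recipient, value in transfers:
        balances[sender] -= value
        balances[recipient] += value
    return dict(balances)


def gini(balances):
    """Gini coefficient of non-negative balances: 0 is equal, near 1 is concentrated."""
    xs = sorted(b for b in balances if b >= 0)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

As a check on the formula, four equal balances give 0, while `[0, 0, 0, 10]` gives 0.75, the maximum for four holders.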
Top Holder Behavioral Monitoring The top tier of holders is monitored for behavioral signals: net change in estimated balance over a trailing window, classified as accumulating, distributing, or stable. Consistent distribution by large holders is a structural bearish signal that does not always appear immediately in price action.
Token Velocity The ratio of daily transferred value to estimated circulating supply provides a velocity measure — how many times the supply "turns over" daily through economic activity. Extremely low velocity may indicate a store-of-value dynamic or illiquid markets; unusually high velocity (especially when combined with high detected wash share) suggests artificial inflation of transfer figures.
Dormancy Cohort Analysis Addresses are bucketed by days since last transaction into dormancy cohorts. Tracking the migration of addresses across cohorts over time reveals whether the user base is actively engaging with the chain or gradually going dormant. A rising share of addresses in the 90-365 day dormancy bucket indicates that users who tried the chain are not returning — a user retention failure.
Multiple addresses under common control distort individual-address metrics: DAA, unique user counts, and holder distributions all overstate diversity when related addresses are counted independently.
Funding Source Heuristic Addresses funded by the same source address within a short time window are likely under common control — the funding pattern is consistent with batch wallet creation for farming, airdrop hunting, or other programmatic address generation strategies. Such addresses are linked into clusters.
Gas Price Fingerprinting Addresses that consistently use identical, non-standard gas prices share infrastructure — they are configured by the same deployment scripts or bot frameworks. Addresses with matching rare gas price signatures are linked as related.
Union-Find Cluster Merging Individual pairwise links from both heuristics are merged using a Union-Find (disjoint-set) algorithm, producing final cluster assignments. The cluster size distribution (number of clusters of size 1, 2-5, 6-20, and more than 20) is itself an analytical signal: a chain with many large clusters has a user base that is less diverse than raw address counts suggest.
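The merge step can be sketched with a standard disjoint-set structure using path compression and union by size. The class and function names are hypothetical, and the bucket edges are an assumption chosen so that the size bands do not overlap.

```python
from collections import Counter


class UnionFind:
    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        """Return the cluster representative of x, registering x if unseen."""
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        """Merge the clusters containing a and b, attaching smaller to larger."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def cluster_size_distribution(addresses, links):
    """Merge pairwise links and count clusters per size band."""
    uf = UnionFind()
    for addr in addresses:
        uf.find(addr)
    for a, b in links:
        uf.union(a, b)
    sizes = Counter(uf.find(a) for a in list(uf.parent)).values()
    buckets = {"1": 0, "2-5": 0, "6-20": 0, "21+": 0}
    for s in sizes:
        if s == 1:
            buckets["1"] += 1
        elif s <= 5:
            buckets["2-5"] += 1
        elif s <= 20:
            buckets["6-20"] += 1
        else:
            buckets["21+"] += 1
    return buckets
```

Links from the funding-source and gas-price heuristics can be passed together in `links`; the structure does not care which heuristic produced a pair.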
Validator Production Analysis Block production is tracked by miner (validator) address, computing each validator's share of blocks produced over time. The Nakamoto coefficient — the minimum number of validators whose combined block share exceeds 50% — is the primary decentralization metric. A Nakamoto coefficient of 3 means three colluding validators could theoretically reorg or censor the chain. Trend in Nakamoto coefficient over time reveals whether the chain is centralizing or decentralizing.
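Using the definition above (smallest set of validators whose combined block share exceeds 50%), the coefficient can be computed from the `miner` column as sketched below. The function name and the `threshold` parameter are illustrative assumptions.

```python
from collections import Counter


def nakamoto_coefficient(miners, threshold=0.5):
    """miners: iterable with one producer address per block in the window.

    Returns the smallest number of validators whose combined share of blocks
    strictly exceeds the threshold.
    """
    counts = Counter(miners)
    total = sum(counts.values())
    if total == 0:
        return 0
    cumulative = 0
    # most_common() yields validators from largest to smallest block count
    for k, (_, n) in enumerate(counts.most_common(), start=1):
        cumulative += n
        if cumulative / total > threshold:
            return k
    return len(counts)
```

Running this over rolling block windows gives the trend series.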
Staking Flow Monitoring Token flows into and out of staking contracts are tracked to measure net staking dynamics. Net staking outflows can indicate that validator confidence is declining or that staking rewards no longer justify the opportunity cost of lockup. Staking flows often lead price action: validators reducing exposure is a bearish signal.
Emission Sustainability Analysis Annual staking emissions (the total CRO distributed to validators and delegators per year, estimated from reward structures) are compared against annualized gas fee revenue. Where emissions substantially exceed fee revenue, the chain operates on an inflationary subsidy model — staking yields are funded by dilution rather than real economic output. This sustainability gap is a structural valuation concern.
Contract Lifecycle Classification Every deployed contract is tracked from its creation block forward, with metrics for total lifetime interactions and unique user count. Contracts are classified by recency of last interaction into active (interacted within the past month), cooling (one to three months since last interaction), and dead (more than three months dormant). The distribution of the deployed contract base across these categories is a measure of ecosystem vitality — a healthy chain continuously onboards new active contracts while retaining usage of existing ones.
Ecosystem Concentration (HHI Equivalent) The share of total interactions accounted for by the top N contracts is computed across rolling windows. High concentration indicates thin ecosystem depth — a few dominant protocols account for most activity, with a long tail of inactive deployments. This creates fragility: ecosystem metrics are dominated by a small number of actors whose decisions (fee changes, migrations, shutdowns) can swing aggregate figures materially.
NFT Market Activity ERC721 transfer events are identified from event log topic patterns. Daily active NFT contracts and unique NFT transfer counts are tracked as measures of collector and speculative market activity. NFT markets are often a leading indicator of broader retail engagement.
On-chain data is a lagging signal — user sentiment shifts before behavior does. The social layer monitors Reddit discourse around the Cronos ecosystem and its parent exchange to identify emerging narrative themes before they manifest in on-chain metrics.
Feature-Level Discourse Tracking Subreddit posts are analyzed for mentions of specific product and ecosystem features: card rewards, exchange utility, staking programs, application ecosystem, fees, and customer support quality. Each post is classified by feature topic using keyword pattern matching, enabling topic-level trend detection rather than undifferentiated sentiment aggregation.
Sentiment Polarity by Topic Within each feature topic, posts are classified by polarity: positive sentiment (satisfaction with rewards, endorsement of products) vs. negative sentiment (complaints about reward cuts, unfavorable comparisons to alternatives). Daily polarity aggregates by topic enable the detection of sentiment regime changes, such as a shift from net positive to net negative in a specific feature category (e.g., card rewards), which may precede behavioral changes in the on-chain data.
Narrative Timeline Integration Known external events (reward program changes, regulatory announcements, product updates) are overlaid on the sentiment time series as event markers. This enables assessment of whether sentiment changes were event-driven or reflected gradual organic deterioration — a distinction relevant to both causality analysis and forward projection.
Voting Power Concentration On-chain governance participation is analyzed for concentration of voting power across validators and large token holders. The share of total governance power controlled by the top N participants is computed alongside a Nakamoto equivalent for governance (minimum participants needed to pass a proposal unilaterally). Low governance participation and high concentration are governance risk factors — protocol changes can be pushed through by a small coalition without broad community input.
Participation Rate Trends Governance participation rate (fraction of circulating supply that votes on proposals) is tracked over time. Declining participation may indicate apathy or investor disengagement, while sustained high participation signals an active community. The combination of high governance power concentration and low participation rate represents a particularly concerning governance structure.
Off-Chain Catalyst Framework The framework supports integration of off-chain catalyst data — events external to the blockchain that materially affect the ecosystem's outlook. These are modeled as annotated timelines that can be overlaid on any signal time series to assess impact.
Developer Activity Proxies GitHub commit activity across core ecosystem repositories is tracked as a proxy for developer engagement and protocol maintenance investment. Compression in commit frequency over time — particularly in core infrastructure repositories — suggests declining developer interest or resource allocation, a leading indicator of protocol stagnation.
Event Impact Attribution Known catalyst events (product changes, regulatory developments, macroeconomic events) are marked on signal time series and an event window analysis is applied to assess pre- and post-event signal behavior. This systematic approach to impact attribution enables defensible claims about whether specific catalysts caused observed behavioral changes.
Absolute metrics are only interpretable in context. The benchmarking module positions Cronos against a curated peer universe of EVM and non-EVM chains.
Peer Universe The default peer universe includes Ethereum, Solana, Arbitrum, Base, Optimism, Polygon, and Avalanche — chains selected for market significance, data availability, and structural comparability to Cronos as an EVM chain with a large parent exchange ecosystem (analogous comparisons: BNB Chain / Binance, OKB / OKX).
Valuation Ratio Benchmarking Three primary valuation ratios are computed for each chain in the peer universe:
- FDV / Annualized Gas Revenue: The primary revenue multiple. Comparable to a P/E ratio. Chains trading at higher multiples relative to peers must justify the premium through superior growth expectations, security properties, or ecosystem depth.
- FDV / Daily Active Addresses: A per-user valuation metric that normalizes for ecosystem size. Useful for identifying chains trading at a per-user premium or discount to peers.
- FDV / TVL: A measure of how much capital is locked in the ecosystem relative to fully diluted valuation. Very high FDV/TVL may indicate thin DeFi ecosystem relative to market expectations.
Data Sourcing Peer chain data is sourced from DeFi Llama (TVL) and CoinGecko (FDV, circulating supply); Cronos gas revenue and DAA are self-computed. Manual override capability is provided for cases where public data sources are stale, inconsistent, or unavailable for specific chains, enabling researchers to substitute validated manual observations without code changes.
Relative Positioning For each ratio, Cronos is positioned relative to peer median and range. Outlier positioning (significantly above peer median on revenue multiples, significantly below on user counts) is flagged as a research finding requiring explanation.
The valuation module implements a scenario-based intrinsic value framework anchored to revenue generation.
Revenue-Based Valuation The primary valuation methodology applies market-observed revenue multiples from peer chains to Cronos's annualized gas fee revenue. Three scenarios are parameterized:
- Bull case: Applies an optimistic revenue multiple consistent with the upper end of the peer distribution, reflecting expectations of ecosystem recovery and growth.
- Base case: Applies the median peer revenue multiple, reflecting no significant premium or discount to peers.
- Bear case: Applies a below-median multiple, reflecting structural concerns about ecosystem quality, user retention, and competition.
Each scenario produces an implied fully diluted valuation that can be compared against current market FDV to compute upside or downside. The parameterization (multiples and revenue multipliers) is externalized to configuration and can be adjusted for sensitivity testing.
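The scenario arithmetic reduces to revenue times a revenue multiplier times a valuation multiple, compared against market FDV. In the sketch below the function name and the configuration keys `revenue_multiplier` and `multiple` are hypothetical stand-ins for whatever the configuration files actually define, and no real multiples are implied.

```python
def implied_valuations(annual_revenue_usd, current_fdv_usd, scenarios):
    """scenarios: {name: {"revenue_multiplier": float, "multiple": float}}.

    Returns {name: {"implied_fdv": float, "upside": float}}, where upside is
    the fractional gap between implied FDV and current market FDV.
    """
    out = {}
    for name, params in scenarios.items():
        implied = annual_revenue_usd * params["revenue_multiplier"] * params["multiple"]
        out[name] = {
            "implied_fdv": implied,
            "upside": implied / current_fdv_usd - 1,
        }
    return out
```

With purely illustrative inputs of 1,000,000 USD annualized revenue, 100,000,000 USD market FDV, and a scenario using a 1.0 revenue multiplier and a 50x multiple, the implied FDV is 50,000,000 USD and the upside is -50%.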
Emission-Adjusted Sustainability A separate analysis compares annual staking and validator emissions against annualized gas revenue. Where emissions exceed revenue, the protocol operates in deficit — token holders fund staking yields through dilution rather than protocol income. This sustainability metric is incorporated as a qualitative risk factor in the valuation narrative.
Scenario Sensitivity Scenario outputs are presented as a matrix, allowing readers to trace how implied valuation changes across combinations of revenue growth assumptions and multiple assumptions. This prevents anchoring to a single point estimate and communicates the range of plausible outcomes given parameter uncertainty.
Individual signals are integrated into a coherent research narrative in the synthesis module.
Reported vs. Adjusted Summary Table The primary synthesis output is a structured comparison table presenting six to eight key metrics in two columns: the surface-level reported figure and the analysis-adjusted figure. For each metric, the adjustment methodology is described and the percentage divergence is computed. This table is the most actionable single output of the research — it communicates, in compact form, the gap between how the chain presents itself and what the underlying data supports.
Investment Thesis Parameterization The framework supports explicit parameterization of the research thesis — bull, bear, or neutral — with associated assumptions documented in configuration. This enables reproducibility: another researcher can reproduce the analysis with different assumptions and see how conclusions change. Catalyst events (product changes, regulatory developments, competitive dynamics) that inform the thesis are explicitly enumerated rather than embedded implicitly in chart narratives.
Cross-Signal Consistency Check The synthesis layer validates that signals across modules are internally consistent — for example, that bot-adjusted DAA and wash-adjusted DEX volume tell the same directional story, and that exchange flow directionality aligns with on-chain holder behavior trends. Inconsistencies between modules are flagged for further investigation rather than suppressed.
A twelve-page Streamlit dashboard provides interactive exploration of all analytical outputs.
| Page | Contents |
|---|---|
| Executive Summary | KPI cards, reported vs. adjusted summary table, key charts |
| Chain Health | DAA trends, transaction composition, gas revenue, blockspace demand |
| DeFi Ecosystem | DEX volume (gross and adjusted), unique DEX users, liquidity trends, protocol rankings |
| Token Flows | Exchange inflows/outflows, top holder behavior, holder distribution, dormancy cohorts |
| Wash Trading & Bots | Circular flow metrics, wash volume separation, bot activity decomposition, temporal patterns |
| Comparative | Cross-chain valuation ratios vs. peer universe |
| Valuation | Scenario analysis matrix, emission sustainability |
| Data Explorer | Raw Parquet table browser for direct data inspection |
| Catalyst Charts | Slide-deck-ready annotated time series for catalyst narrative |
| Reddit Sentiment | Subreddit sentiment trends by feature topic |
| Governance | Validator analysis, Nakamoto coefficient, voting power concentration |
| Appendix | Supporting charts, deep-dive analyses |
In addition to the interactive dashboard, all analytical charts are produced in a slide-deck-ready static format with institutional typography, consistent brand colors, and pre-computed annotation callouts (current value, peak, percentage decline from peak). This export layer is designed to support direct integration into research reports or investor presentations without post-processing.
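The pre-computed annotation callouts can be derived from a plain time series. The sketch below is an assumption about how such values might be computed; the function name and output keys are hypothetical.

```python
def annotation_callouts(values: list) -> dict:
    """Compute current value, peak, and percentage decline from peak."""
    if not values:
        raise ValueError("values must not be empty")
    current = values[-1]
    peak = max(values)
    # Guard against a zero peak to avoid division by zero.
    decline = (peak - current) / peak * 100.0 if peak else 0.0
    return {
        "current": current,
        "peak": peak,
        "peak_index": values.index(peak),  # position of the first peak
        "decline_from_peak_pct": decline,
    }


# Illustrative series only.
callouts = annotation_callouts([10.0, 40.0, 30.0])
```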
The reporting module synthesizes all analytical outputs into structured research documents.
Automated Narrative Generation: Key metrics are pulled from Parquet outputs and inserted into a narrative template. The template guides readers from data context (what the chain reports) through analysis (what the data shows after adjustment) to implications (what the findings mean for valuation and risk assessment). This structure mirrors the format of institutional equity research.
Tabular Output: Comparison tables (peer benchmarking, valuation scenarios, reported vs. adjusted metrics) are auto-generated from analysis outputs, so tables and text stay in sync. Drift between the two is a common failure mode in manually assembled research.
HTML/Markdown Export: Reports are rendered in both Markdown (for version control and programmatic processing) and styled HTML (for distribution and presentation). The HTML rendering uses professional typography and responsive table formatting suitable for direct distribution to counterparties or sponsors.
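Template-driven narrative generation can be sketched with the standard library's string.Template. The template wording, metric keys, and function name are hypothetical; the actual report template is not shown in this document.

```python
from string import Template

# Hypothetical template: data context first, then the adjusted view.
NARRATIVE = Template(
    "The chain reports $reported_daa daily active addresses. "
    "After adjustment the figure is $adjusted_daa, "
    "a divergence of $divergence_pct%."
)


def render_narrative(metrics: dict) -> str:
    """Fill the template from one metrics dict so text and tables share a source."""
    reported = metrics["reported_daa"]
    adjusted = metrics["adjusted_daa"]
    divergence = (adjusted - reported) / reported * 100.0
    return NARRATIVE.substitute(
        reported_daa=f"{reported:,}",
        adjusted_daa=f"{adjusted:,}",
        divergence_pct=f"{divergence:+.1f}",
    )


# Placeholder values for illustration only.
text = render_narrative({"reported_daa": 1000, "adjusted_daa": 750})
```

Because the sentence and any generated table would read from the same dict, a changed input updates both at once.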
| Layer | Technology |
|---|---|
| Data ingestion | Python · asyncio · aiohttp |
| Data storage | Apache Parquet · PyArrow |
| Aggregation | Polars (streaming) |
| Analysis | Pandas · NumPy · NetworkX |
| Machine learning / classification | Custom heuristic ensemble |
| Visualization | Plotly · Matplotlib |
| Dashboard | Streamlit |
| NLP / sentiment | Regex pattern matching · custom lexicon |
| Market data | CoinGecko API · DeFi Llama API |
| Configuration | JSON · Python dataclasses |
| Testing | pytest |
| Compute infrastructure | Akash Network (decentralized cloud) |
The full pipeline — from raw blockchain ingestion through proprietary signal analysis — was executed entirely on Akash Network, a decentralized cloud compute marketplace. Akash was chosen for its permissionless GPU access, competitive pricing relative to centralized cloud providers, and alignment with the decentralized ethos of the research subject.
The pipeline ran across two distinct phases and five total nodes:
| Phase | Nodes | Purpose | Duration |
|---|---|---|---|
| Ingestion | 4 × standard compute VMs | Parallel blockchain data download (independent block range per worker) | ~2 days |
| Analysis | 1 × H100 GPU node | Proprietary signal generation, classification models, graph analytics | ~2 days (overnight run) |
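Assigning an independent block range to each ingestion worker amounts to splitting one interval into contiguous pieces. The sketch below treats both bounds as inclusive, which is an assumption; the source does not say whether BLOCK_END is inclusive, and the function name is hypothetical.

```python
def split_block_range(start: int, end: int, workers: int) -> list:
    """Split [start, end] (inclusive) into contiguous, non-overlapping ranges."""
    if workers < 1 or end < start:
        raise ValueError("need workers >= 1 and end >= start")
    total = end - start + 1
    base, extra = divmod(total, workers)
    ranges = []
    cursor = start
    for i in range(workers):
        # The first `extra` workers take one additional block each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more workers than blocks
        ranges.append((cursor, cursor + size - 1))
        cursor += size
    return ranges


# Tiny illustrative range; real deployments would use chain-scale bounds.
ranges = split_block_range(0, 9, 4)
```

Each tuple could then be supplied as the per-worker start and end values.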
Ingestion phase: Four independent worker VMs downloaded non-overlapping block ranges in parallel, each writing to its own Parquet partition. Once all workers completed, their outputs were merged into a single unified dataset using a join step that reconciled schema and deduplicated any overlapping boundary blocks before handoff to the aggregation layer.
Analysis phase: The merged dataset was transferred to a single high-memory H100 GPU node, which ran the full analytical suite — bot detection, wash trading graph analysis, wallet clustering, and all signal generation modules — in a single overnight session.
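The merge step from the ingestion phase, which deduplicates boundary blocks before handoff, can be illustrated with plain Python records. The real pipeline operates on Parquet partitions; the row shape (a dict with a "number" key) and both function names are assumptions made for this sketch.

```python
def merge_partitions(partitions: list) -> list:
    """Merge per-worker row lists, keeping the first row seen per block number."""
    merged = {}
    for rows in partitions:
        for row in rows:
            # setdefault keeps the earliest copy of a duplicated boundary block.
            merged.setdefault(row["number"], row)
    return [merged[n] for n in sorted(merged)]


def find_gaps(rows: list) -> list:
    """Return (first_missing, last_missing) ranges in a sorted merged dataset."""
    numbers = [row["number"] for row in rows]
    return [(a + 1, b - 1) for a, b in zip(numbers, numbers[1:]) if b - a > 1]


# Two hypothetical worker outputs that both contain block 1.
merged = merge_partitions([
    [{"number": 0}, {"number": 1}],
    [{"number": 1}, {"number": 2}],
])
```

Running a gap check after the merge is a cheap way to confirm that the worker ranges covered the full interval.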
Each of the four download workers was deployed from the following SDL, with WORKER_ID, BLOCK_START, and BLOCK_END overridden per deployment to assign non-overlapping block ranges:
```yaml
---
version: "2.0"
services:
  downloader:
    image: cronos-research/downloader:latest
    env:
      - WORKER_ID=1
      - BLOCK_START=0
      - BLOCK_END=12500000
      - RPC_ENDPOINTS=https://evm.cronos.org
      - BATCH_SIZE=500
      - CONCURRENCY=8
    expose:
      - port: 8080
        as: 80
        to:
          - global: false
profiles:
  compute:
    downloader:
      resources:
        cpu:
          units: 4.0
        memory:
          size: 8Gi
        storage:
          - size: 250Gi
  placement:
    akash:
      pricing:
        downloader:
          denom: uakt
          amount: 1000
deployment:
  downloader:
    akash:
      profile: downloader
      count: 1
```
The analysis node was deployed from a separate SDL requesting a single H100 GPU:

```yaml
---
version: "2.0"
services:
  analysis:
    image: cronos-research/analysis:latest
    env:
      - DATA_PATH=/data/merged
      - OUTPUT_PATH=/data/outputs
      - ENABLE_GPU=true
    expose:
      - port: 8501
        as: 80
        to:
          - global: false
profiles:
  compute:
    analysis:
      resources:
        cpu:
          units: 32.0
        memory:
          size: 512Gi
        storage:
          - size: 512Gi
          - size: 512Gi
            class: beta3
        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: h100
  placement:
    akash:
      attributes:
        host: akash
      pricing:
        analysis:
          denom: uakt
          amount: 10000
deployment:
  analysis:
    akash:
      profile: analysis
      count: 1
```

| Node | CPU | Memory | Storage | GPU | Runtime |
|---|---|---|---|---|---|
| Downloader ×4 | 4 cores each | 8 GB each | 250 GB each | — | ~2 days |
| H100 Analysis | 32 cores | 512 GB | 512 GB persistent + 512 GB ephemeral | 1× NVIDIA H100 | ~2 days (overnight) |
Total elapsed wall-clock time: ~4 days (two sequential phases of ~2 days each), or roughly 10 node-days of compute (4 workers × 2 days + 1 GPU node × 2 days).
| Component | Cost |
|---|---|
| 4× ingestion worker VMs (~2 days each) | ~$28.00 |
| 1× H100 GPU node (~2 days) | ~$100.42 |
| Total | ~$128.42 |
All pricing settled in AKT via the Akash on-chain marketplace. Costs reflect actual lease bids accepted by providers during the research window.
```
cronos_research/
├── config/                                 Configuration and assumptions
│   ├── config.py                           Runtime settings loader
│   ├── chart_style.py                      Visual style constants
│   ├── thesis_assumptions.json             Valuation parameters and catalysts
│   ├── catalyst_off_chain.json             Off-chain catalyst data
│   ├── known_addresses.json                Address labels (exchanges, protocols)
│   ├── peer_comparison_overrides.json      Manual peer chain overrides
│   └── workers/                            Per-worker download configurations
│
├── src/
│   ├── downloader/                         Async blockchain data ingestion
│   │   ├── core.py                         Multi-endpoint RPC batch downloader
│   │   ├── parsers.py                      RPC JSON → Parquet schema normalization
│   │   └── storage.py                      Parquet write/validate/append
│   │
│   ├── analysis/                           Analytical modules
│   │   ├── aggregator.py                   Raw → daily aggregate tables
│   │   ├── chain_health.py                 Network vitality metrics
│   │   ├── defi_analysis.py                DEX and liquidity analysis
│   │   ├── token_flows.py                  Holder behavior and exchange flows
│   │   ├── bot_detection.py                Automated account classification
│   │   ├── wash_detection.py               Circular flows and wash volume
│   │   ├── wallet_clustering.py            Address relationship detection
│   │   ├── staking.py                      Validator and staking economics
│   │   ├── ecosystem_analysis.py           Contract lifecycle and ecosystem depth
│   │   ├── comparative.py                  Peer chain benchmarking
│   │   ├── synthesis.py                    Cross-signal integration
│   │   └── calibration.py                  Threshold calibration utilities
│   │
│   └── gpu/                                Optional GPU-accelerated entrypoints (CPU fallback)
│
├── visualization/                          Chart generation modules
│   ├── interactive_charts.py               Plotly time-series charts
│   ├── static_charts.py                    Matplotlib export-ready charts
│   ├── catalyst_charts.py                  Slide-deck catalyst visuals
│   ├── governance_charts.py                Validator and governance charts
│   ├── reddit_narrative_charts.py          Social sentiment visuals
│   ├── reddit_sentiment_analysis.py        NLP and sentiment aggregation
│   └── network_graphs.py                   Graph visualizations
│
├── dashboard/                              Streamlit interactive dashboard
│   ├── app.py                              Application entry point
│   ├── helpers.py                          Shared data loading utilities
│   ├── components/                         Reusable UI components
│   └── pages/                              One file per dashboard page (12 pages)
│
├── reports/
│   └── report_generator.py                 Markdown/HTML report synthesis
│
├── scripts/                                Runnable pipeline entrypoints
│   ├── download.py                         Blockchain data download
│   ├── run_aggregation.py                  Daily aggregate computation
│   ├── run_analysis.py                     Full analysis suite
│   ├── run_reddit_sentiment_analysis.py    Social data analysis
│   ├── generate_*.py                       Individual chart generators
│   └── run_*.sh                            Shell pipeline orchestration
│
├── tests/                                  Test suite
├── requirements.txt                        Python dependencies
├── pyproject.toml                          Package configuration
└── Makefile                                Common task aliases
```
Prerequisites: Python 3.11+ and at least 50 GB of free disk space for historical data.
```bash
# 1. Create and activate virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# 2. Install dependencies
pip install -r requirements.txt

# 3. Download a block range (adjust range as needed)
python scripts/download.py --start-block 0 --end-block 10000

# 4. Run aggregation
python scripts/run_aggregation.py

# 5. Run full analysis suite
python scripts/run_analysis.py

# 6. Launch dashboard
streamlit run dashboard/app.py
```

For large historical backtests, the parallel worker configuration in config/workers/ enables distributing the download across multiple concurrent workers targeting different block ranges.
This research infrastructure was developed as part of a structured investment analysis process. The methodology is designed to be rigorous, reproducible, and skeptical of reported metrics — treating on-chain data as adversarial rather than trustworthy at face value.