
Discovery Node

An Audius Discovery Node is a service that indexes the metadata and availability of data across the protocol for Audius users to query. The indexed content includes user, track, and album/playlist information along with social features. The data is stored for quick access, updated on a regular interval, and made available for clients via a RESTful API.

The Discovery Node source code is hosted on GitHub, and the registered Discovery Nodes can be viewed on the Audius Protocol Dashboard.


Design Goals

  1. Expose queryable endpoints which listeners/creators can interact with
  2. Reliably store relevant blockchain events
  3. Continuously monitor the blockchain and ensure stored data is up to date with the network

Legacy Terminology

The "Discovery Node" may be referred to as the "Discovery Provider". These services are the same.


Database

The Discovery Node uses PostgreSQL. The Postgres database is managed through SQLAlchemy, an object relational mapper, and Alembic, a lightweight database migration tool. The data models are defined in src/models and are used by Alembic to automatically generate the migrations under alembic/versions. The connection between Alembic and the data models is defined in alembic/env.py.
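A minimal sketch of this pattern, assuming an illustrative Track model rather than the actual Discovery Node schema, looks like the following:

```python
# A minimal sketch of the SQLAlchemy/Alembic pattern described above.
# Model and column names are illustrative, not the actual Discovery Node schema.
from sqlalchemy import Boolean, Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Track(Base):
    __tablename__ = "tracks"

    track_id = Column(Integer, primary_key=True)
    owner_id = Column(Integer, nullable=False, index=True)
    title = Column(String)
    is_delete = Column(Boolean, default=False)

# alembic/env.py points Alembic at the same metadata so that
# `alembic revision --autogenerate` can diff the models against the database:
#
#   from src.models import Base
#   target_metadata = Base.metadata
```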


Flask

The Discovery Node web server serves as the entry point for reading data through the Audius Protocol. All queries return JSON objects built from SQLAlchemy query results; the query implementations can be found in src/queries.

Some examples of queries include user-specific feeds, track data, playlist data, etc.
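As an illustration of the read path, a query endpoint looks roughly like the sketch below. The route, model, and connection URL are assumptions, not the actual Discovery Node routes.

```python
# A sketch of the Flask read path: run a SQLAlchemy query against Postgres and
# return the result as JSON. Route, model, and connection URL are assumptions.
from flask import Flask, jsonify, request
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from src.models import Track  # assumed import path for the model sketched above

engine = create_engine("postgresql://localhost/audius_discovery")  # assumed URL
Session = sessionmaker(bind=engine)

app = Flask(__name__)

@app.route("/tracks")
def get_tracks():
    limit = min(int(request.args.get("limit", 10)), 100)
    session = Session()
    try:
        tracks = (
            session.query(Track)
            .filter(Track.is_delete == False)  # noqa: E712
            .limit(limit)
            .all()
        )
        return jsonify(
            [{"track_id": t.track_id, "owner_id": t.owner_id, "title": t.title} for t in tracks]
        )
    finally:
        session.close()
```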


Celery

Celery is simply a task queue - it allows us to define a set of tasks that run repeatedly throughout the lifetime of the Discovery Node.

Currently, a single task (src/tasks/index.py:update_task()) handles all database write operations. The Flask application only reads from the database and plays no role in ensuring data correctness.

Celery worker and beat are the key underlying concepts behind Celery usage in the Discovery Node.
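As a rough sketch of this wiring (the task name, module name, and broker URL are assumptions), the Celery application and its single write task look something like:

```python
# A sketch of the Celery wiring: Redis as the broker and a single task that
# owns all database writes. Task name, module name, and broker URL are assumptions.
from celery import Celery

celery_app = Celery("discovery", broker="redis://localhost:6379/0")

@celery_app.task(name="update_discovery_provider")
def update_task():
    # In the real service this is where indexing happens: read new blocks,
    # decode Audius contract events, and write the results to Postgres.
    pass

# The worker process executes tasks pulled from the broker:
#   celery -A discovery worker
# The beat process only schedules them (see the Celery Beat section below):
#   celery -A discovery beat
```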

Celery Worker

Celery worker is the component that actually runs tasks.

The primary driver of data availability on Audius is the 'index_blocks' Celery task. What happens when 'index_blocks' is executed? The task performs the following operations (a rough sketch follows the list):

  1. Check whether the latest block differs from the last processed block in the 'blocks' table.

    If it does, an array of blocks is generated, spanning from the last blockhash present in the database up to the latest block number allowed by the block indexing window.

    The block indexing window is the maximum number of blocks that can be processed in a single indexing operation.

  2. Traverse each block in the array produced by the previous step.

    For each block, check whether any transactions relevant to the Audius smart contracts are present. If so, retrieve the specific event information from the associated transaction hash; examples include creator and track metadata.

    To do so, the Discovery Node must be aware of both the contract ABIs and each contract's address - these are shipped with each Discovery Node image.

  3. For each Audius contract operation found in a given block, the task updates the corresponding table in the database.

    Certain index operations require a metadata fetch from decentralized storage (Audius Storage Protocol). Metadata formats can be found here.
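The following is a rough sketch of those three steps using web3.py; the window size, contract address set, RPC endpoint, and helper names are assumptions rather than the actual implementation:

```python
# A rough sketch of the three indexing steps above using web3.py. The window
# size, contract address set, RPC endpoint, and helper names are assumptions.
from web3 import Web3

BLOCK_INDEXING_WINDOW = 20          # max blocks per indexing pass (assumed value)
AUDIUS_CONTRACT_ADDRESSES = set()   # populated from the shipped contract config

web3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # assumed RPC endpoint

def index_blocks(session, last_indexed_block_number):
    # Step 1: check whether the chain head differs from the last processed block;
    # if it does, build the window of blocks to process.
    latest = web3.eth.block_number
    if latest == last_indexed_block_number:
        return
    stop = min(latest, last_indexed_block_number + BLOCK_INDEXING_WINDOW)

    # Step 2: traverse each block in the window and pick out transactions
    # addressed to Audius contracts.
    for block_number in range(last_indexed_block_number + 1, stop + 1):
        block = web3.eth.get_block(block_number, full_transactions=True)
        for tx in block.transactions:
            if tx["to"] in AUDIUS_CONTRACT_ADDRESSES:
                # Step 3: decode the event data with the contract ABIs and
                # update the corresponding tables via the SQLAlchemy session.
                receipt = web3.eth.get_transaction_receipt(tx["hash"])
                # ... decode receipt.logs and write rows using `session`
```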

Why index blocks instead of using event filters?

This is a great question - the main reason for indexing blocks in this manner is to handle cases of false progress and rollback. Each indexing task opens a fresh database session, which means database transactions can be reverted at the block level - while rollback handling for the Discovery Node has yet to be implemented, block-level indexing will be immediately useful when it becomes necessary.
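A sketch of what that session-per-task, commit-per-block pattern can look like (the helper and connection URL are assumptions):

```python
# A sketch of the session-per-pass pattern: each indexing task gets a fresh
# session, and each block is committed in its own transaction, so a failure
# (or a future rollback handler) only reverts that block's writes.
# `index_single_block` and the connection URL are assumptions.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql://localhost/audius_discovery")  # assumed URL
Session = sessionmaker(bind=engine)

def index_pass(block_numbers):
    session = Session()  # fresh session for this indexing task
    try:
        for block_number in block_numbers:
            try:
                index_single_block(session, block_number)  # hypothetical helper
                session.commit()                           # persist this block only
            except Exception:
                session.rollback()                         # revert just this block
                raise
    finally:
        session.close()
```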

Celery Beat

Celery beat is responsible for periodically scheduling index tasks and is run as a separate container from the worker. Details about periodic task scheduling can be found in the official documentation.

This container is identical to the Celery worker container but is run as a 'beat scheduler' to ensure indexing runs at a periodic interval. By default this interval is 5 seconds.
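A minimal sketch of such a beat schedule, reusing the illustrative Celery app and task name from the earlier sketch:

```python
# A minimal sketch of the beat schedule: the same illustrative Celery app as in
# the worker sketch, configured to enqueue the indexing task every 5 seconds.
from celery import Celery

celery_app = Celery("discovery", broker="redis://localhost:6379/0")

celery_app.conf.beat_schedule = {
    "index-blocks-every-5-seconds": {
        "task": "update_discovery_provider",  # assumed task name from the earlier sketch
        "schedule": 5.0,                      # seconds
    },
}

# Run the scheduler in its own container/process, alongside the worker:
#   celery -A discovery beat
```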


Redis

A Redis client is used for several things in the Discovery Node; a sketch of the locking pattern from item 3 follows the list.

  1. Caching (internally and externally)
  2. As a broker for Celery
  3. As a mechanism for locking to ensure single execution contexts for Celery jobs
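A sketch of that locking pattern using redis-py; the key name and timeouts are assumptions:

```python
# A sketch of the Redis locking pattern from item 3: before running an indexing
# pass, the task takes a short-lived lock so that overlapping schedule ticks
# never run two indexers at once. Key name and timeout are assumptions.
import redis

redis_client = redis.Redis(host="localhost", port=6379, db=0)  # assumed connection

INDEXING_LOCK_KEY = "index_blocks_lock"

def run_indexing_once():
    lock = redis_client.lock(INDEXING_LOCK_KEY, timeout=60)
    if not lock.acquire(blocking=False):
        return  # another execution context already holds the lock
    try:
        pass  # run the indexing pass here
    finally:
        lock.release()
```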

Elasticsearch is used to denormalize data and support certain queries (Feed, Search, Related Artists, etc.). The Elasticsearch data is populated and kept up to date by database triggers that live in the Postgres database.

ETL code for the Elasticsearch layer is found in the es-indexer.
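For illustration only (the actual triggers ship with the Discovery Node migrations and are consumed by the es-indexer), the trigger pattern can be sketched like this, assuming a tracks table like the one in the Database sketch:

```python
# For illustration only: the shape of a Postgres trigger that lets an external
# indexer know a row changed so it can be re-indexed into Elasticsearch.
# Table, function, and trigger names and the NOTIFY channel are assumptions;
# the actual triggers ship with the Discovery Node migrations.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://localhost/audius_discovery")  # assumed URL

create_function = text("""
CREATE OR REPLACE FUNCTION on_track_change() RETURNS trigger AS $$
BEGIN
    -- tell listeners (e.g. the es-indexer) that this track needs re-indexing
    PERFORM pg_notify('track_changed', NEW.track_id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
""")

create_trigger = text("""
CREATE TRIGGER track_change_trigger
AFTER INSERT OR UPDATE ON tracks
FOR EACH ROW EXECUTE FUNCTION on_track_change();
""")

with engine.begin() as conn:
    conn.execute(create_function)
    conn.execute(create_trigger)
```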