Data Collection

Listening history of users since the year 2005 is stored in HDFS and amounts to over 200 million listens. A listen in HDFS is stored as:

| Column         | Description                                                     |
|----------------|-----------------------------------------------------------------|
| artist_mbids   | A list of MusicBrainz ID(s) of the artist(s) of the track       |
| artist_msid    | The MessyBrainz ID of the artist of the track                   |
| artist_name    | The name of the artist of the track                             |
| listened_at    | The timestamp at which the listening event happened             |
| recording_mbid | The MusicBrainz ID of the track                                 |
| recording_msid | The MessyBrainz ID of the track                                 |
| release_mbid   | The MusicBrainz ID of the release to which the track belongs    |
| release_msid   | The MessyBrainz ID of the release to which the track belongs    |
| release_name   | The name of the release to which the track belongs              |
| tags           | A list of user-defined tags associated with this track          |
| track_name     | The name of the track listened to                               |
| user_name      | The name of the user                                            |
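The schema above can be illustrated with a minimal sketch in Python. All field values below are made up for illustration (they are not real MusicBrainz or MessyBrainz IDs), and the helper function is a hypothetical example, not actual ListenBrainz code:

```python
# A single listen following the schema above; every value here is fabricated.
listen = {
    "artist_mbids": ["11111111-1111-1111-1111-111111111111"],
    "artist_msid": "22222222-2222-2222-2222-222222222222",
    "artist_name": "Example Artist",
    "listened_at": 1514764800,  # Unix timestamp of the listening event
    "recording_mbid": "33333333-3333-3333-3333-333333333333",
    "recording_msid": "44444444-4444-4444-4444-444444444444",
    "release_mbid": "55555555-5555-5555-5555-555555555555",
    "release_msid": "66666666-6666-6666-6666-666666666666",
    "release_name": "Example Release",
    "tags": ["rock", "favourite"],
    "track_name": "Example Track",
    "user_name": "example_user",
}

def recording_id(listen):
    """Hypothetical helper: prefer the authoritative MusicBrainz ID,
    falling back to the MessyBrainz ID when no MBID has been matched."""
    return listen["recording_mbid"] or listen["recording_msid"]
```

The MBID/MSID split exists because not every submitted listen can be matched to a MusicBrainz entity; MessyBrainz IDs identify the raw, unmatched metadata.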

The ListenBrainz recommendation engine uses Apache Spark to train models with Spark's built-in collaborative filtering algorithm, generating recommendations from its users' listening history. The listening data dump is approximately 16 GB in size. ListenBrainz uses a cluster of 4 nodes (computers), each with the following specifications:

CPU(s): 8

Thread(s) per core: 1

Core(s) per socket: 8

Socket(s): 1

Storage available: 250GB

Memory available: 32GB

One of the nodes is the master and the remaining three are workers, with two executors on each worker. The Spark configuration values are:

spark.scheduler.listenerbus.eventqueue.capacity: 100000

spark.cores.max: 18

spark.executor.cores: 3

spark.executor.memory: 10GB

spark.driver.memory: 8GB

spark.network.timeout: 240s

spark.driver.maxResultSize: 2GB
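As a sanity check, the resource arithmetic behind these settings can be sketched as follows (a rough model only; Spark's own memory overheads and reserved fractions are ignored):

```python
# Cluster layout from the text: 1 master + 3 workers, 2 executors per worker.
workers = 3
executors_per_worker = 2
cores_per_executor = 3    # spark.executor.cores
executor_memory_gb = 10   # spark.executor.memory

total_executors = workers * executors_per_worker              # 6 executors
total_executor_cores = total_executors * cores_per_executor   # 18, matching spark.cores.max

# Each worker has 8 cores and 32 GB: 2 executors use 6 of 8 cores
# and 20 of 32 GB, leaving headroom for the OS and Spark overhead.
cores_per_worker_used = executors_per_worker * cores_per_executor      # 6
memory_per_worker_gb = executors_per_worker * executor_memory_gb       # 20
```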

Collaborative filtering is commonly used in recommender systems. The technique aims to fill in the missing entries of a user-item association matrix. Instead of explicit preferences, such as ratings given by users to items, we use implicit feedback: the number of times a user has listened to a track. The first step is to collect the relevant data so that it can be preprocessed. The sub-steps of data collection are as follows:

| Dataframe     | Time (min) |
|---------------|------------|
| users_df      | 0.52       |
| recordings_df | 0.58       |
| playcounts_df | 1.65       |

Total time elapsed in data collection: 3.66 min
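In production these dataframes are built with Spark, but the core transformation behind playcounts_df, aggregating raw listens into the per-user play counts that serve as implicit feedback, can be sketched in plain Python (the function and field names here are illustrative, not the actual ListenBrainz code):

```python
from collections import Counter

def build_playcounts(listens):
    """Aggregate raw listens into (user, recording) play counts,
    the implicit-feedback signal fed to collaborative filtering."""
    counts = Counter(
        (listen["user_name"], listen["recording_msid"]) for listen in listens
    )
    return [
        {"user_name": user, "recording_msid": rec, "count": n}
        for (user, rec), n in counts.items()
    ]

# Three listens by one user collapse into two playcount rows.
listens = [
    {"user_name": "rob", "recording_msid": "r1"},
    {"user_name": "rob", "recording_msid": "r1"},
    {"user_name": "rob", "recording_msid": "r2"},
]
playcounts = build_playcounts(listens)
```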

Note: The number of rows in a dataframe (or the number of elements in an RDD) is not included, because computing such counts may result in unnecessary computation time.