Users' listening history from the year 2005 onwards is stored in HDFS and amounts to over 200 million listens. A listen is stored in HDFS as:
Column | Description |
---|---|
artist_mbids | A list of MusicBrainz ID(s) of the artist(s) of the track |
artist_msid | The MessyBrainz ID of the artist of the track |
artist_name | The name of the artist of the track |
listened_at | The timestamp at which the listening event happened |
recording_mbid | The MusicBrainz ID of the track |
recording_msid | The MessyBrainz ID of the track |
release_mbid | The MusicBrainz ID of the release to which the track belongs |
release_msid | The MessyBrainz ID of the release to which the track belongs |
release_name | The name of the release to which the track belongs |
tags | A list of user-defined tags associated with this track |
track_name | The name of the track listened to |
user_name | The name of the user |
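The schema above can be mirrored as a small Python dataclass. The field names come from the table; the types are assumptions for illustration (the actual Spark schema may differ, e.g. in how the timestamp is stored):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Listen:
    """One listening event, mirroring the HDFS schema above.

    Types are assumptions for illustration; the actual Spark
    schema may use different types (e.g. for listened_at).
    """
    listened_at: int                 # timestamp of the listening event
    track_name: str
    recording_msid: str              # MessyBrainz ID of the track
    recording_mbid: str              # MusicBrainz ID of the track
    artist_name: str
    artist_msid: str
    artist_mbids: List[str] = field(default_factory=list)
    release_name: str = ""
    release_msid: str = ""
    release_mbid: str = ""
    tags: List[str] = field(default_factory=list)
    user_name: str = ""
```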
The ListenBrainz recommendation engine uses Apache Spark to train models with Spark's built-in collaborative filtering algorithm, generating recommendations from its users' listening history. The listening data dump is ~16 GB in size. ListenBrainz uses a cluster of 4 nodes (computers), each with the following specifications:
CPU(s): 8
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
Storage available: 250GB
Memory available: 32GB
One of the nodes is the master and the remaining three are the workers. There are two executors on each worker. Here are the Spark configuration values:
spark.scheduler.listenerbus.eventqueue.capacity: 100000
spark.cores.max: 18
spark.executor.cores: 3
spark.executor.memory: 10GB
spark.driver.memory: 8GB
spark.network.timeout: 240s
spark.driver.maxResultSize: 2GB
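These values would typically be passed when submitting the job; a `spark-submit` sketch with the configuration above (the script name and master URL are placeholders, not from the source):

```bash
spark-submit \
  --master spark://master-node:7077 \
  --driver-memory 8g \
  --conf spark.scheduler.listenerbus.eventqueue.capacity=100000 \
  --conf spark.cores.max=18 \
  --conf spark.executor.cores=3 \
  --conf spark.executor.memory=10g \
  --conf spark.network.timeout=240s \
  --conf spark.driver.maxResultSize=2g \
  train_model.py
```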
Collaborative filtering is commonly used for recommender systems. This technique aims to fill in the missing entries of a user-item association matrix. Instead of using explicit preferences such as ratings given by users to items, we use implicit feedback (the number of times a track has been listened to by a user). The first step is to collect relevant data so that it can be preprocessed. The sub-steps of data collection, along with the time each took, are as follows:
users_df time (min) | recordings_df time (min) | playcounts_df time (min) |
---|---|---|
0.52 | 0.58 | 1.65 |
Total time elapsed in data collection: 3.66 min
users_df: A user dataframe is created to select distinct user names and assign each user a unique integer ID. The schema for the user dataframe is as follows:
user_id | user_name |
---|---|
1 | rob |
2 | iliekcomputers |
3 | Vansika Pareek |
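In Spark this amounts to a `distinct()` on `user_name` followed by an ID assignment; the same logic can be sketched in plain Python (user names taken from the table above, ID order here is simply alphabetical and need not match the table):

```python
def assign_user_ids(listens):
    """Map each distinct user_name to a unique integer id,
    mimicking the users_df construction described above."""
    # Distinct user names, sorted for a deterministic id order.
    names = sorted({listen["user_name"] for listen in listens})
    # Assign ids starting from 1, as in the users_df table.
    return [{"user_id": i, "user_name": name}
            for i, name in enumerate(names, start=1)]

listens = [
    {"user_name": "rob"},
    {"user_name": "iliekcomputers"},
    {"user_name": "rob"},            # duplicate: counted once
    {"user_name": "Vansika Pareek"},
]
users_df = assign_user_ids(listens)
```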
listens_df: A listen dataframe is created to select all the listens in a given time frame. The schema for the listen dataframe is as follows:
listened_at | track_name | recording_msid | user_name |
---|---|---|---|
2017-01-31 16:52:22 | You are the reason | b34e1496-e898-4ebd-99e9-fff2d4fc98d2 | rob |
2017-01-31 17:52:02 | Sunflower | da3bc0a3-fd1c-4afd-a0de-e21f757fac43 | iliekcomputers |
2018-01-31 17:00:20 | You are the reason | b34e1496-e898-4ebd-99e9-fff2d4fc98d2 | rob |
2017-01-03 14:00:10 | Sunflower | da3bc0a3-fd1c-4afd-a0de-e21f757fac43 | Vansika Pareek |
2018-01-03 10:00:10 | Stand by me | b9da2ed1-6291-4b05-9e5e-b87551a8e75f | iliekcomputers |
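The time-frame selection amounts to a filter on `listened_at`; a plain-Python sketch of that logic (the window bounds below are illustrative, not from the source):

```python
from datetime import datetime

def listens_in_window(listens, start, end):
    """Keep only listens whose listened_at falls inside [start, end),
    mimicking the listens_df time-frame selection described above."""
    return [l for l in listens if start <= l["listened_at"] < end]

listens = [
    {"listened_at": datetime(2017, 1, 31, 16, 52, 22), "user_name": "rob"},
    {"listened_at": datetime(2018, 1, 31, 17, 0, 20), "user_name": "rob"},
]
# Select only the listens from 2017.
window_2017 = listens_in_window(
    listens,
    start=datetime(2017, 1, 1),
    end=datetime(2018, 1, 1),
)
```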
recordings_df: A recording dataframe is created to select the distinct recordings/tracks listened to and assign each recording a unique integer ID, along with other relevant information (track_name, artist_name, etc.). The schema for the recording dataframe is as follows:
track_name | recording_msid | artist_name | artist_msid | release_name | release_msid | recording_id |
---|---|---|---|---|---|---|
You are the reason | b34e1496-e898-4ebd-99e9-fff2d4fc98d2 | Justin Nozuka | f8471a38-0575-4dcf-b657-401377ad293c | Beautiful | b94a34da-5500-4203-b621-aca10b3cb9f0 | 1 |
Sunflower | da3bc0a3-fd1c-4afd-a0de-e21f757fac43 | Havana Black | 4859879e-d85a-441d-b2e5-12b42c2ea752 | Family Collection 1987-2007 | 2c5b73e4-d8ab-47d9-bb21-5e320cf347e9 | 2 |
Stand by me | b9da2ed1-6291-4b05-9e5e-b87551a8e75f | Ben E. King | 837555ba-012e-45f1-9a9c-9628da13ee54 | Billboard Presents: Family Friendship Classics | 0fede8ca-7b38-455b-bde3-8a098c430310 | 3 |
playcounts_df: The playcounts dataframe is the output of the first step, i.e. data collection. It is obtained by joining the above-mentioned dataframes, and contains, for each user, the recording IDs of all the distinct tracks that user has listened to, along with the listen count. The schema for the playcounts dataframe is as follows:
user_id | recording_id | count |
---|---|---|
1 | 1 | 2 |
2 | 2 | 1 |
3 | 2 | 1 |
2 | 3 | 1 |
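The join-and-count that produces playcounts_df can be sketched in plain Python; in Spark it would be two joins (listens with users on `user_name`, then with recordings on `recording_msid`) followed by a group-by count. The sample data below is taken from the tables above:

```python
from collections import Counter

def build_playcounts(listens, user_ids, recording_ids):
    """Count how many times each (user_id, recording_id) pair occurs,
    mirroring the playcounts_df construction described above.

    user_ids: dict mapping user_name -> user_id (from users_df)
    recording_ids: dict mapping recording_msid -> recording_id (from recordings_df)
    """
    counts = Counter(
        (user_ids[l["user_name"]], recording_ids[l["recording_msid"]])
        for l in listens
    )
    return [
        {"user_id": u, "recording_id": r, "count": c}
        for (u, r), c in sorted(counts.items())
    ]

user_ids = {"rob": 1, "iliekcomputers": 2, "Vansika Pareek": 3}
recording_ids = {
    "b34e1496-e898-4ebd-99e9-fff2d4fc98d2": 1,  # You are the reason
    "da3bc0a3-fd1c-4afd-a0de-e21f757fac43": 2,  # Sunflower
    "b9da2ed1-6291-4b05-9e5e-b87551a8e75f": 3,  # Stand by me
}
listens = [
    {"user_name": "rob", "recording_msid": "b34e1496-e898-4ebd-99e9-fff2d4fc98d2"},
    {"user_name": "iliekcomputers", "recording_msid": "da3bc0a3-fd1c-4afd-a0de-e21f757fac43"},
    {"user_name": "rob", "recording_msid": "b34e1496-e898-4ebd-99e9-fff2d4fc98d2"},
    {"user_name": "Vansika Pareek", "recording_msid": "da3bc0a3-fd1c-4afd-a0de-e21f757fac43"},
    {"user_name": "iliekcomputers", "recording_msid": "b9da2ed1-6291-4b05-9e5e-b87551a8e75f"},
]
playcounts = build_playcounts(listens, user_ids, recording_ids)
```

This reproduces the rows of the playcounts table above, e.g. user 1 (rob) listened to recording 1 twice.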
Note: The number of rows in a dataframe or the number of elements in an RDD (count information) is not included because computing it may result in unnecessary computation time.