Users' listening history since 2005 is stored in HDFS and accounts for over 200 million listens. A listen in HDFS is stored as:
Column | Description |
---|---|
artist_mbids | A list of MusicBrainz ID(s) of the artist(s) of the track |
artist_msid | The MessyBrainz ID of the artist of the track |
artist_name | The name of the artist of the track |
listened_at | The timestamp at which the listening event happened |
recording_mbid | The MusicBrainz ID of the track |
recording_msid | The MessyBrainz ID of the track |
release_mbid | The MusicBrainz ID of the release to which the track belongs |
release_msid | The MessyBrainz ID of the release to which the track belongs |
release_name | The name of the release to which the track belongs |
tags | A list of user defined tags to be associated with this track |
track_name | The name of the track listened to |
user_name | The name of the user |
The listening history is fed to Spark's built-in algorithm for collaborative filtering. Collaborative filtering is commonly used in recommender systems; it aims to fill in the missing entries of a user-item association matrix. Instead of explicit preferences such as ratings given by users to items, we use implicit feedback: the number of times a user has listened to a track. The first step is to collect the relevant data so that it can be preprocessed. The sub-steps of data collection are as follows:
Note: Listens from 2005-01 to 2019-12 have been fed to the recommender system. Total listens used to train, validate, and test the model: 200,272,949.
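To make the "missing entries" idea concrete, here is a minimal illustration in plain Python (not Spark) of the sparse user-item matrix that collaborative filtering operates on when the feedback is implicit play counts; all names and numbers below are illustrative:

```python
def build_matrix(plays, n_users, n_items):
    """Build a dense user-item matrix from (user_id, item_id, count) triples.

    A 0 marks a missing association, i.e. an entry the model will predict."""
    matrix = [[0] * n_items for _ in range(n_users)]
    for user_id, item_id, count in plays:
        matrix[user_id][item_id] = count
    return matrix

# Implicit feedback: play counts, not explicit ratings.
plays = [(0, 0, 2), (1, 1, 1), (1, 2, 1), (2, 1, 1)]
matrix = build_matrix(plays, n_users=3, n_items=3)
# Entries such as matrix[0][1] remain 0: these are the missing
# user-item associations the recommender estimates.
```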
users_df: A user dataframe is created by selecting distinct user names and assigning each user a unique integer id. The query to create the user dataframe takes 0.90 minutes to process and yields 3,917 rows. The schema of the user dataframe is as follows:
user_id | user_name |
---|---|
1 | rob |
2 | iliekcomputers |
3 | Vansika Pareek |
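The users_df step amounts to a distinct-plus-index over user names. A sketch of that logic in plain Python (the actual job is a Spark query; sample names are taken from the table above, and the concrete ids are arbitrary):

```python
# Input rows as they would appear in the raw listens.
listens = [
    {"user_name": "rob"},
    {"user_name": "iliekcomputers"},
    {"user_name": "rob"},
    {"user_name": "Vansika Pareek"},
]

# Distinct user names, each assigned a unique integer id
# in order of first appearance.
users = {}
for row in listens:
    name = row["user_name"]
    if name not in users:
        users[name] = len(users) + 1

users_df = [{"user_id": uid, "user_name": name} for name, uid in users.items()]
```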
listens_df: A listen dataframe is created by selecting all listens from 2005-01 to 2019-12. The query to create the listen dataframe takes 0.45 minutes to process and yields 200,272,949 rows. The schema of the listen dataframe is as follows:
listened_at | track_name | recording_msid | user_name |
---|---|---|---|
2017-01-31 16:52:22 | You are the reason | b34e1496-e898-4ebd-99e9-fff2d4fc98d2 | rob |
2017-01-31 17:52:02 | Sunflower | da3bc0a3-fd1c-4afd-a0de-e21f757fac43 | iliekcomputers |
2018-01-31 17:00:20 | You are the reason | b34e1496-e898-4ebd-99e9-fff2d4fc98d2 | rob |
2017-01-03 14:00:10 | Sunflower | da3bc0a3-fd1c-4afd-a0de-e21f757fac43 | Vansika Pareek |
2018-01-03 10:00:10 | Stand by me | b9da2ed1-6291-4b05-9e5e-b87551a8e75f | iliekcomputers |
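The listens_df selection is a date-range filter plus a column projection. The same logic sketched in plain Python (the real job runs as a Spark query over HDFS; the sample rows are hypothetical):

```python
from datetime import datetime

START = datetime(2005, 1, 1)
END = datetime(2020, 1, 1)  # exclusive upper bound, i.e. through 2019-12

def select_listens(rows):
    """Keep listens in [START, END) with only the columns the model needs."""
    selected = []
    for row in rows:
        ts = datetime.strptime(row["listened_at"], "%Y-%m-%d %H:%M:%S")
        if START <= ts < END:
            selected.append({k: row[k] for k in
                             ("listened_at", "track_name",
                              "recording_msid", "user_name")})
    return selected

rows = [
    {"listened_at": "2017-01-31 16:52:22", "track_name": "You are the reason",
     "recording_msid": "b34e1496-e898-4ebd-99e9-fff2d4fc98d2", "user_name": "rob"},
    # This listen predates the 2005-01 cutoff and is dropped.
    {"listened_at": "2004-06-01 09:00:00", "track_name": "Sunflower",
     "recording_msid": "da3bc0a3-fd1c-4afd-a0de-e21f757fac43", "user_name": "rob"},
]
listens_df = select_listens(rows)
```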
recordings_df: A recording dataframe is created by selecting the distinct recordings/tracks listened to and assigning each recording a unique integer id, along with other relevant information (track_name, artist_name, etc.). The query to create the recording dataframe takes 2.70 minutes to process and yields 22,335,149 rows. The schema of the recording dataframe is as follows:
track_name | recording_msid | artist_name | artist_msid | release_name | release_msid | recording_id |
---|---|---|---|---|---|---|
You are the reason | b34e1496-e898-4ebd-99e9-fff2d4fc98d2 | Justin Nozuka | f8471a38-0575-4dcf-b657-401377ad293c | Beautiful | b94a34da-5500-4203-b621-aca10b3cb9f0 | 1 |
Sunflower | da3bc0a3-fd1c-4afd-a0de-e21f757fac43 | Havana Black | 4859879e-d85a-441d-b2e5-12b42c2ea752 | Family Collection 1987-2007 | 2c5b73e4-d8ab-47d9-bb21-5e320cf347e9 | 2 |
Stand by me | b9da2ed1-6291-4b05-9e5e-b87551a8e75f | Ben E. King | 837555ba-012e-45f1-9a9c-9628da13ee54 | Billboard Presents: Family Friendship Classics | 0fede8ca-7b38-455b-bde3-8a098c430310 | 3 |
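As with users_df, the recordings_df step is a distinct-plus-index, this time keyed on recording_msid. A plain-Python sketch of the logic (the actual job is a Spark query; only a subset of the schema's columns is carried here for brevity):

```python
def build_recordings(listens):
    """Distinct recordings keyed by recording_msid, each given a unique id."""
    recordings = {}
    for row in listens:
        msid = row["recording_msid"]
        if msid not in recordings:
            recordings[msid] = {
                "track_name": row["track_name"],
                "recording_msid": msid,
                "artist_name": row["artist_name"],
                "recording_id": len(recordings) + 1,
            }
    return list(recordings.values())

listens = [
    {"track_name": "You are the reason",
     "recording_msid": "b34e1496-e898-4ebd-99e9-fff2d4fc98d2",
     "artist_name": "Justin Nozuka"},
    {"track_name": "Sunflower",
     "recording_msid": "da3bc0a3-fd1c-4afd-a0de-e21f757fac43",
     "artist_name": "Havana Black"},
    # Repeated listen: collapses into the existing recording row.
    {"track_name": "You are the reason",
     "recording_msid": "b34e1496-e898-4ebd-99e9-fff2d4fc98d2",
     "artist_name": "Justin Nozuka"},
]
recordings_df = build_recordings(listens)
```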
playcounts_df: The playcounts dataframe is the output of the first step, i.e. data collection, and is obtained by joining the dataframes above. The query that produces it, yielding the recording ids of all tracks each user has listened to along with the listen count, takes 6.57 minutes to process and produces 56,011,220 rows. The schema of the playcounts dataframe is as follows:
user_id | recording_id | count |
---|---|---|
1 | 1 | 2 |
2 | 2 | 1 |
3 | 2 | 1 |
2 | 3 | 1 |
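The join-and-count that produces playcounts_df can be sketched in plain Python (the real job is a Spark join over the three dataframes above). Using the sample users, recordings, and listens from the earlier tables, this reproduces the four rows shown:

```python
from collections import Counter

def playcounts(listens, user_ids, recording_ids):
    """Map each listen to integer ids and count plays per (user, recording)."""
    counts = Counter(
        (user_ids[row["user_name"]], recording_ids[row["recording_msid"]])
        for row in listens
    )
    return [{"user_id": u, "recording_id": r, "count": c}
            for (u, r), c in counts.items()]

user_ids = {"rob": 1, "iliekcomputers": 2, "Vansika Pareek": 3}
recording_ids = {
    "b34e1496-e898-4ebd-99e9-fff2d4fc98d2": 1,  # You are the reason
    "da3bc0a3-fd1c-4afd-a0de-e21f757fac43": 2,  # Sunflower
    "b9da2ed1-6291-4b05-9e5e-b87551a8e75f": 3,  # Stand by me
}
listens = [
    {"user_name": "rob", "recording_msid": "b34e1496-e898-4ebd-99e9-fff2d4fc98d2"},
    {"user_name": "iliekcomputers", "recording_msid": "da3bc0a3-fd1c-4afd-a0de-e21f757fac43"},
    {"user_name": "rob", "recording_msid": "b34e1496-e898-4ebd-99e9-fff2d4fc98d2"},
    {"user_name": "Vansika Pareek", "recording_msid": "da3bc0a3-fd1c-4afd-a0de-e21f757fac43"},
    {"user_name": "iliekcomputers", "recording_msid": "b9da2ed1-6291-4b05-9e5e-b87551a8e75f"},
]
playcounts_df = playcounts(listens, user_ids, recording_ids)
# rob played "You are the reason" twice, so (user_id=1, recording_id=1)
# gets count 2; every other pair appears once.
```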