Published on June 29, 2016
1. Music Streams: Running a social network on an event-based architecture
2. Who are we? • Stefano Galarraga: Lead Developer at Crowdmix • ~20 years in software engineering, mostly working in middleware, messaging and most recently Big Data • Open source contributor: Scalding, AKKA • https://github.com/galarragas • Michal Dziemianko: Big Data Engineer and Data Scientist at Crowdmix • Background in AI and Distributed Computing • ~10 years in software engineering and research • Tiago Palma: Big Data Engineer and Data Warehouse Developer at Crowdmix • Data Warehousing and ETL experience, working with traditional MPP systems such as Teradata and with Big Data technologies
3. Crowdmix: Who are we? • A social network focused on music • The model is based on crowds • People can share different types of content in the crowds they have joined • Music is obviously the most interesting content • We are aiming for large scale (millions of users) => our system is designed for scalability • We don’t own any music content but we allow people to share and listen to tracks across different streaming services
4. Main App Features (1/2) • Classical social network interaction: • Users can build their social graph • Started with a follower/followee model, now moving to a friends model • No content shared to followers • Friendship enables direct communication • Both concepts leveraged for recommendations and content prioritisation • Users join crowds • P2P communication • Unlimited size for a crowd • Limited number of crowds to join (enforced only by the backend, for performance reasons)
5. Main App Features (2/2) • Content: • Music (obviously) • No music is streamed by CM • Tracks from different streaming providers (more about this later) • Videos • From YouTube • Small videos from camera • Images and Pictures • System Generated Content: • Recommendations • Various forms, from crowd/people suggestion to surfacing content in the home page • Charts • Notifications
6. CM Architecture diagram
7. Tech Stack • AWS hosted infrastructure • Docker for services deployment (not for Kafka and Cassandra) • Mesos for resource management, Marathon for container orchestration • Kafka 0.8.2 (thinking about 0.9) • Cassandra to store Materialized Views • Elasticsearch to index the searchable content • Spark running on EMR for batch processing • Stream processing done using: • Confluent’s Kafka consumer/producer for the “legacy” Java based microservices • AKKA Streams for all new microservices
8. Event Based System • Kafka content retention is two weeks (started with 6 months). Topics are stored in S3 using Secor • QoS: we accept losing content for most of the use cases. Focus on response times over accuracy • CQRS • Batch Processing • Recommendation jobs • Data Warehouse/Analytics • Track Metadata acquisition and matching (more on this later) • Event replay for data migration, evolution and recovery • Backup/Restore • Using the event replay batch
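The CQRS and event replay points above can be illustrated with a minimal sketch: a read-side view is rebuilt from scratch by applying stored events in order. The event types (`CrowdJoined`, `CrowdLeft`) and field names are hypothetical, chosen only for illustration; the real events are Avro records read from Kafka or from the S3 archive.

```python
from collections import defaultdict

# Hypothetical event shapes, for illustration only.
events = [
    {"type": "CrowdJoined", "ts": 2, "userId": "u2", "crowdId": "c1"},
    {"type": "CrowdJoined", "ts": 1, "userId": "u1", "crowdId": "c1"},
    {"type": "CrowdLeft", "ts": 3, "userId": "u1", "crowdId": "c1"},
]


def replay(events):
    """Rebuild a crowd-membership view by applying events in timestamp order."""
    members = defaultdict(set)
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["type"] == "CrowdJoined":
            members[event["crowdId"]].add(event["userId"])
        elif event["type"] == "CrowdLeft":
            members[event["crowdId"]].discard(event["userId"])
        # Unknown event types are ignored so old views survive new events.
    return dict(members)


view = replay(events)
print(view)
```

Because the view is derived only from the events, it can be dropped and rebuilt after a data migration or a bug fix, which is the property the backup/restore bullet relies on.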
9. Use Cases • Charts • Stream processing • Music Matching, Home Page • Lambda processing • Analytics • Traditional batch processing
10. Charts • Clients generate a TrackListened event whenever a track is listened to for more than 5s • The TrackListened event contains information about the context of the in-app listens, such as: • The comment where the track was embedded • The stack where the track was embedded or the comment was referring to • A preview from music search • Listening from the Chart itself • The crowd where the event occurred, if available • Tracks are resolved by the Music Matching Service (more below) • Everything is duplicated for videos
11. Charts • Chart service listens to the listens topic • Fan out based on track sources • Increment counters for each known source • Fan in to avoid duplicates • Triggers chart re-computation based on the context (near real time) • Adds extra information to the chart • Some of the factors that contributed to the calculation are surfaced (cheers, listens, shares) • Some of the “top” stacks where the track was shared are listed
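One possible reading of the fan-out/fan-in steps above, as a rough sketch: listens are counted per provider-specific source, then merged onto a canonical track so the same song shared from two services is not charted twice. The `CANONICAL` mapping stands in for the output of the Music Matching Service, and all IDs and field names are hypothetical.

```python
from collections import Counter

# Hypothetical matching result: provider-specific ID -> canonical track ID.
CANONICAL = {
    "spotify:1": "track-a",
    "itunes:9": "track-a",
    "spotify:2": "track-b",
}


def chart(listens, top_n=10):
    """Count listens per source, then fan in to one entry per canonical track."""
    per_source = Counter()
    for listen in listens:
        per_source[listen["sourceId"]] += 1

    per_track = Counter()
    for source_id, count in per_source.items():
        # Unmatched sources fall back to their own ID instead of being dropped.
        per_track[CANONICAL.get(source_id, source_id)] += count
    return per_track.most_common(top_n)


listens = [
    {"sourceId": "spotify:1"},
    {"sourceId": "itunes:9"},
    {"sourceId": "spotify:2"},
    {"sourceId": "spotify:1"},
]
top = chart(listens)
print(top)
```

A real service would keep the counters in a store and update them incrementally per event; the batch form here only shows the counting logic.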
12. Music Matching (1/2) • Tracks from different streaming providers. Implications: • Playback • A user is sharing a stack with her favourite track. She is a Spotify Premium user and lives in the US • The stack is going to be seen by people: • Without a Spotify Premium account but with an iTunes subscription • Without any streaming service subscription • Preferring to see the video associated with the track, if available
13. “Across different streaming services”? (2/2) • Charts • “Hello” by Adele is the favourite track in the “Romantic Dudes” crowd • We need to count shares/listens/likes of the track even if it is shared from different sources • User behaviour analysis (e.g. recommendations …) • You listened to “Hello” by Adele 15 times yesterday • ... but you’re not in the “Romantic Dudes” crowd yet: • The system needs to understand your tastes even if you liked/listened to that track in different streaming services • You always used your iTunes version, but another user had the same pattern on Spotify • You should be connected even if you don’t share the same streaming service
14. Ok, it is important. But why is it difficult? • Identify the track • Tracks are identified differently across different services • There is a “standard” track ID called ISRC but it is not always available • Other shared IDs such as the MusicBrainz ID are also only partially supported • There are sources (YouTube) not providing content by ISRC • Track metadata are not super-consistent either: • The same track title might contain the name of the featured artist in one service while the same info is stored in the artist-name field of another service • Retrieve the ID for the track in the right country in a scalable way • Need to search for the track across different sources in a scalable way • Handle time constraints • Handle missing results or connection problems
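To make the identification problem concrete, here is a minimal sketch of a matching key: use the ISRC when a source provides it, otherwise fall back to normalised title and artist with "feat." credits stripped from both fields. This is an illustrative heuristic, not the Music Matching Service's actual logic; the field names, the regular expression and the placeholder ISRC are all assumptions.

```python
import re

# Matches "(feat. X)", "[ft X]", "featuring X" and similar credits.
FEAT = re.compile(
    r"\s*[\(\[]?\b(?:feat\.?|ft\.?|featuring)\s+[^\)\]]*[\)\]]?",
    re.IGNORECASE,
)


def normalise(text):
    """Drop featured-artist credits, punctuation, case and extra spaces."""
    text = FEAT.sub("", text)
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(text.split())


def match_key(track):
    """Prefer the ISRC; fall back to normalised metadata when it is missing."""
    if track.get("isrc"):
        return ("isrc", track["isrc"].upper())
    return ("meta", normalise(track["title"]), normalise(track["artist"]))


# The same hypothetical track as described by two different services.
a = {"title": "Song A (feat. Artist B)", "artist": "Artist A"}
b = {"title": "Song A", "artist": "Artist A feat. Artist B"}
print(match_key(a) == match_key(b))
```

A key like this still collides for remixes and live versions with identical titles, which is one reason metadata matching alone is not enough.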
15. Is that a common problem? There are other companies/projects doing the same work: • Project Rosetta Stone by EchoNest (http://blog.echonest.com/post/66963888889/the-echo-nests-rosetta-stone-unlocking-social), now owned by Spotify and no longer openly available • Spotify API can provide IDs for other services • BoP (http://bop.fm/, now shut down) offered a web-based music matching service
16. And now … a diagram
17. Data Warehouse / Analytics Motivations - Know what the users are doing in the app - Validate new features (A/B testing) - Measure user retention (Sticky Factor) - Calculate our revenue stream - Some “vanity” metrics: - Total number of users - Number of new users (Day / Week)
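As a worked example of the retention metric above: the sticky factor is commonly defined as daily active users divided by monthly active users. The sketch below assumes that definition and a 30-day window; the function name and the input shape (a list of `(user_id, date)` activity pairs) are hypothetical.

```python
from datetime import date, timedelta


def sticky_factor(activity, day, window_days=30):
    """Return DAU / MAU for `day`, given a list of (user_id, date) pairs."""
    window_start = day - timedelta(days=window_days - 1)
    dau = {user for user, d in activity if d == day}
    mau = {user for user, d in activity if window_start <= d <= day}
    return len(dau) / len(mau) if mau else 0.0


activity = [
    ("u1", date(2016, 6, 29)),
    ("u2", date(2016, 6, 10)),
    ("u1", date(2016, 6, 1)),
    ("u3", date(2016, 5, 1)),  # outside the 30-day window
]
ratio = sticky_factor(activity, date(2016, 6, 29))
print(ratio)
```

Here one of the two users active in the window was active on the day itself, so the ratio is 0.5.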
18. Why not use Mixpanel? • No simple way to query the raw data • Security requirements • We need the ability to correlate the data sent to Mixpanel directly from the mobile apps with the backend data (Kafka) • Grow the number of active users with lower costs • Our Business Analysts were familiar with SQL
19. Building the Data Warehouse from Kafka - Secor consumes data from Kafka and dumps it into S3 every 30 minutes - A Spark job reads the data, replays the events and applies business logic transformations - After the transformations, the data is loaded into Redshift - Mode Analytics is our BI tool of choice for querying the data in Redshift (SQL or Python) and generating D3 reports
20. Reliable Data Warehouse - Kafka is the source of truth - Building the warehouse by replaying the events from the source of truth means that the data is highly reliable for reporting, and also for finding and recovering from nasty bugs in the app that were not detected by the QA process
21. What we did right? • Schema based events • Using AVRO as serialization format • Common Event Model • Enforcing common fields in events (timed uuid based eventId, ts, correlationId) • Event Schema Registry (using one) • Replicated/independent data views built from events • Secor (plus some contribution for compaction)
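A small sketch of what enforcing the common fields might look like: every event is wrapped in an envelope carrying a time-based UUID `eventId`, a `ts` and a `correlationId` that follow-up events inherit. The three field names come from the slide; the `type` and `payload` fields, the function names and the millisecond unit for `ts` are assumptions for illustration.

```python
import time
import uuid

REQUIRED_FIELDS = ("eventId", "ts", "correlationId")


def new_event(event_type, payload, correlation_id=None):
    """Wrap a payload in the common envelope (hypothetical layout)."""
    return {
        "eventId": str(uuid.uuid1()),  # version 1 UUIDs are time-based
        "ts": int(time.time() * 1000),  # assumed: epoch milliseconds
        "correlationId": correlation_id or str(uuid.uuid4()),
        "type": event_type,
        "payload": payload,
    }


def has_common_fields(event):
    """Check that every mandatory envelope field is present and non-null."""
    return all(event.get(field) is not None for field in REQUIRED_FIELDS)


listened = new_event("TrackListened", {"trackId": "t1"})
# A follow-up event reuses the correlationId so both can be traced together.
chart_updated = new_event("ChartUpdated", {}, listened["correlationId"])
print(has_common_fields(listened), has_common_fields(chart_updated))
```

With Avro, the same constraint would normally live in the schemas themselves rather than in a runtime check.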
22. What we did wrong? • Schema Registry (implementing one) • We didn’t use the Schema Registry from Confluent from the start, we built our own • No enforcement of schema compatibility on write • Some contortions to support schema download • The model is compatible with adopting the Schema Registry (some parts were copied too) and we want to move there • Event Sourcing • Not the idea per se, but OUR implementation was wrong • We confused an event based system with event sourcing • Events were not directly replayable but needed batch based processes to build the system view • We would migrate to a proper Event Sourcing framework like AKKA Persistence • Still keeping the event based infrastructure in Kafka
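To illustrate what enforcing compatibility on write could mean, here is a deliberately simplified backward-compatibility check over Avro-style record schemas held as Python dicts. It encodes one Avro rule (a field added in the new schema needs a default so that old data can still be read) and is stricter than Avro on type changes, since it rejects them all, including legal promotions. It is a sketch, not the check performed by Confluent's Schema Registry; the function name and schemas are hypothetical.

```python
def is_backward_compatible(old_schema, new_schema):
    """True if data written with old_schema can be read using new_schema."""
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    for field in new_schema["fields"]:
        previous = old_fields.get(field["name"])
        if previous is None:
            # Added field: old records lack it, so a default is required.
            if "default" not in field:
                return False
        elif previous["type"] != field["type"]:
            # Simplification: any type change is treated as incompatible.
            return False
    # Fields removed in new_schema are fine: the reader just ignores them.
    return True


v1 = {"type": "record", "name": "TrackListened",
      "fields": [{"name": "trackId", "type": "string"}]}
v2_ok = {"type": "record", "name": "TrackListened",
         "fields": [{"name": "trackId", "type": "string"},
                    {"name": "crowdId", "type": ["null", "string"],
                     "default": None}]}
v2_bad = {"type": "record", "name": "TrackListened",
          "fields": [{"name": "trackId", "type": "string"},
                     {"name": "crowdId", "type": "string"}]}

print(is_backward_compatible(v1, v2_ok), is_backward_compatible(v1, v2_bad))
```

A registry would run a check like this when a producer registers a new schema version and reject the write if it fails.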
23. System Performance • The system was designed from the beginning to support one million active users, with steep growth in the following months • The marketing strategy has now changed • The system has been opened to the public using an invitation model • It is operating with a few thousand users • The performance results mentioned here are from the performance tests done when we still planned to open to millions of users
24. System Performance
25. System Performance Test Cases
27. Speaker Profiles Stefano Galarraga is currently working as Lead Developer at Crowdmix. He started his professional career in 1997 and has been working mostly in middleware and message-based systems, most recently moving to Big Data. He is a contributor to Twitter’s Scalding and Typesafe’s AKKA projects, plus some of his own, which you can find at https://github.com/galarragas Michal Dziemianko is currently working as Big Data Engineer and Data Scientist at Crowdmix. He has a background in AI and Distributed Computing. He has a PhD in Machine Learning from the University of Edinburgh and has been working for around 10 years in software engineering and research. Tiago Palma is currently working as Big Data Engineer and Data Warehouse Developer at Crowdmix. He has several years of experience in Data Warehousing and ETL, working with traditional MPP systems such as Teradata and with Big Data technologies. He also has DevOps experience.