You're invited to the second annual Big Data: Perspectives @ 1454 ft, an invite-only event hosted at LinkedIn's offices on the 22nd floor of the Empire State Building. This event will bring together professionals from the Big Data Ecosystem along with engineers from LinkedIn's Data teams to discuss recent developments and the many challenges and opportunities that lie ahead.
The event takes place on Wednesday, September 30th. Space is limited, and you must pre-register to gain admission.
Food and drinks will be served at the event.
When: Wednesday, September 30th, 2015. Doors open at 6:00pm; talks start at 7:00pm.
Where: Empire State Building, New York City
Confirmed talks include:
Moving Forward (Monotonically) Toward Better Data Systems by Joe Hellerstein
I’m optimistic that we are coming to the end of a decade of relatively nonsensical debate about how best to build data-centric systems. Arguments like “ACID vs. Eventual Consistency”, “MapReduce vs. SQL”, “Performance vs. Scalability” are giving way to better-informed discussions and more interesting designs. I see this in the research community, but also in emerging scalable open source systems such as Kafka, Spark, and Cassandra, as well as in new high-performance designs like Microsoft Hekaton. I believe we can identify a single key design pattern in these new, more intelligent systems, namely monotonicity: the accumulation of “experiences” (events, versions, logic) that move forward over time and cause system state and behavior to evolve.
This pattern did not appear by coincidence. To illustrate, I’ll talk about how we’re forced to dance around some (non-monotonic) mines we laid for ourselves along the way, including (in increasing order of significance and generality) the MapReduce Barrier, the distributed Log, the Lock, and the Variable. Finally, I’ll present the CALM Theorem, which helps explain why some of the most interesting systems today involve monotonic concepts like diffs, versions, and streams.
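As a concrete illustration of the monotonicity pattern (our own minimal sketch, not taken from the talk): a grow-only set only ever accumulates state, so replicas can merge their “experiences” in any order and still converge, with no coordination required.

```python
class GrowOnlySet:
    """A monotonic, grow-only set (a G-Set CRDT sketch).

    State only ever accumulates: there is no remove operation, so
    replicas can merge in any order and still converge.
    """

    def __init__(self, items=None):
        self._items = set(items or [])

    def add(self, item):
        # Monotonic update: state only grows, never retracts.
        self._items.add(item)

    def merge(self, other):
        # Set union is commutative, associative, and idempotent,
        # so replicas converge regardless of delivery order.
        return GrowOnlySet(self._items | other._items)

    def __contains__(self, item):
        # Membership answers never flip from True back to False.
        return item in self._items


# Two replicas observe events in different orders...
a, b = GrowOnlySet(), GrowOnlySet()
a.add("event-1"); a.add("event-2")
b.add("event-2"); b.add("event-3")

# ...yet merging in either order yields the same state.
assert a.merge(b)._items == b.merge(a)._items
```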
Building a real-time analytics data store: The story of Pinot@LinkedIn
While generic storage systems can be used to support single point use cases, doing this at scale demanded specialized distributed infrastructure. Two years ago, we embarked on a new project to solve our real-time analytics problems, codenamed Pinot. Pinot enables us to slice, dice, and scan through massive quantities of data in real time across a wide variety of products, and it has established itself as our de facto online analytics platform. Today it serves as the backend for more than 25 analytics products that provide insights to our members and customers.
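The workload Pinot is built for is, in miniature, a filtered group-by aggregation over event records. Here is a toy sketch of that query shape, using a hypothetical profile-view schema (illustrative Python, not Pinot's query API):

```python
from collections import defaultdict

# Hypothetical profile-view events; Pinot stores records like these
# in a columnar format and answers queries over them in real time.
events = [
    {"viewer_industry": "Software", "country": "US", "views": 3},
    {"viewer_industry": "Software", "country": "DE", "views": 1},
    {"viewer_industry": "Finance",  "country": "US", "views": 2},
]

def slice_and_dice(rows, filter_key, filter_val, group_key, metric):
    """Filter ("slice") on one dimension, group ("dice") on another,
    and aggregate a metric: the shape of a typical analytics query."""
    totals = defaultdict(int)
    for row in rows:
        if row[filter_key] == filter_val:
            totals[row[group_key]] += row[metric]
    return dict(totals)

# e.g. "profile views by industry, for US members"
print(slice_and_dice(events, "country", "US", "viewer_industry", "views"))
# {'Software': 3, 'Finance': 2}
```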
Real-time stream processing at LinkedIn
At LinkedIn, events pertaining to application and system monitoring, user behavior tracking, and more are all ingested into our pub-sub system, Kafka. One of our engineers has described it as LinkedIn’s "circulatory system" for data, enabling a loosely coupled set of services to operate together. Another critical source of events is the stream of updates happening on our databases: our Databus system captures changes in database transaction logs and makes them available for downstream processing. Together, our tracking and change-capture systems generate 1.4 trillion events a day. Processing this volume of traffic in real time is critical to our business, and we built Samza, a scalable stream processing platform, to meet this challenging requirement. In this presentation we will discuss how we solve this large-scale stream processing problem at LinkedIn using Apache Samza and Kafka.
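Samza's actual API is Java, but the core pattern is simple: a task is invoked once per message from a partitioned Kafka stream, updates some local state, and emits results to an output stream. A rough Python sketch of that pattern (hypothetical names, not Samza's API):

```python
class PageViewCounterTask:
    """Sketch of a Samza-style stream task: consume one message at a
    time from a partitioned input stream, update local state, and
    emit results to an output stream. (Samza's real API is Java.)"""

    def __init__(self):
        # Local state; in Samza this would be checkpointed by the framework.
        self.counts = {}

    def process(self, message, collector):
        # Each message is one tracking event consumed from Kafka.
        page = message["page"]
        self.counts[page] = self.counts.get(page, 0) + 1
        # Emit the running count to a downstream output stream.
        collector.send("page-view-counts", {"page": page, "count": self.counts[page]})


class PrintCollector:
    """Stand-in for the framework's message collector."""
    def send(self, stream, payload):
        print(stream, payload)


task = PageViewCounterTask()
for event in [{"page": "/jobs"}, {"page": "/feed"}, {"page": "/jobs"}]:
    task.process(event, PrintCollector())
```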
Bridging batch and streaming data movement with Gobblin
Over the years, LinkedIn's data infrastructure team built custom solutions for ingesting diverse data entities into our Hadoop ecosystem. At one point, we were running more than a dozen types of ingestion pipelines, which created significant data quality, metadata management, development, and operational challenges.
Our experiences and challenges motivated us to build Gobblin, a highly scalable data ingestion framework. Gobblin supports ingestion from diverse data sources such as databases (Oracle, MySQL, SQL Server), file systems, REST APIs, streaming systems (Kafka), and custom protocols (Salesforce). A collection of primitives in the framework provides capabilities like data governance, legal compliance, data format and layout conversion, cleansing, and quality validation. The talk will focus on Gobblin's architecture, features, and extension points. We will share our learnings from operating Gobblin in production at LinkedIn, and preview ongoing work to support streaming data ingestion in an environment with thousands of datasets.
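Gobblin assembles each ingestion job from pluggable stages: an extractor pulls records from a source, converters normalize format and layout, quality checkers enforce policy, and writers stage data into the sink. A compressed Python sketch of that composition (illustrative only; Gobblin itself is a Java framework, and these stage names are simplified):

```python
def extract(source_records):
    """Extractor stage: pull raw records from some upstream system."""
    yield from source_records

def convert(record):
    """Converter stage: normalize format/layout (here, lower-case keys)."""
    return {k.lower(): v for k, v in record.items()}

def passes_quality(record):
    """Quality-checker stage: reject records that violate the policy."""
    return record.get("member_id") is not None

def run_pipeline(source_records, writer):
    """Compose the stages the way an ingestion job wires its constructs."""
    for raw in extract(source_records):
        record = convert(raw)
        if passes_quality(record):
            writer.append(record)  # writer stage: stage into the sink

raw = [{"Member_ID": 42, "Country": "US"}, {"Country": "DE"}]
sink = []
run_pipeline(raw, sink)
print(sink)  # only the record that passed the quality check survives
```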
Scaling out to 10 Clusters, 1000 Users, and 10,000 Flows: The Dali Experience at LinkedIn
Over the past couple of years, our experience has demonstrated that Hadoop does an admirable job of scaling out to thousands of nodes and petabytes of data. However, we found ourselves far less satisfied with the platform's ability to scale out in other dimensions: the number of users, the myriad frameworks and languages those users employ in their daily tasks, and the tens of thousands of data applications they write and have to maintain. To address these problems we built Dali, a collection of libraries, services, and development tools united by the common goal of providing a dataset API for Hadoop. In this presentation we will give an overview of the project's components, discuss recent successes, and conclude with a detailed discussion of Dali Views, a new addition to the project that makes it easier to share logic and manage the contracts that exist between data producers and data consumers.
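The core idea of a dataset API can be sketched briefly: consumers read a logical dataset name, and a producer-owned view definition is applied behind it, so raw paths and layouts can evolve without breaking consumers. A hypothetical Python sketch of that idea (not Dali's actual API):

```python
class DatasetCatalog:
    """Sketch of a dataset-API idea: consumers ask for a logical
    dataset name instead of hard-coding raw paths and layouts.
    (Illustrative only; Dali's real interfaces differ.)"""

    def __init__(self):
        self._tables = {}  # logical name -> raw records
        self._views = {}   # view name -> (base table, transform)

    def register_table(self, name, records):
        self._tables[name] = records

    def register_view(self, name, base, transform):
        # The producer owns the transform; consumers only see its
        # output, so the view's schema is the contract between them.
        self._views[name] = (base, transform)

    def read(self, name):
        if name in self._views:
            base, transform = self._views[name]
            return [transform(r) for r in self._tables[base]]
        return list(self._tables[name])


catalog = DatasetCatalog()
catalog.register_table("raw_logins", [{"member": 1, "ts_ms": 1443600000000}])
# The producer can evolve the raw layout; the view keeps consumers stable.
catalog.register_view("logins", "raw_logins",
                      lambda r: {"member": r["member"], "ts": r["ts_ms"] // 1000})
print(catalog.read("logins"))  # [{'member': 1, 'ts': 1443600000}]
```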