Please Note: This schedule is in EST (Eastern Standard Time, UTC-5)
Data Engineering Summit 2023
LIVE: Wednesday, 18th January
10:00 - 10:35
Beyond Monitoring: The Rise of Data Observability

Broken data is costly, time-consuming, and nowadays an all-too-common reality for even the most advanced data teams. In this talk, I’ll introduce this problem, called “data downtime” — periods of time when data is partial, erroneous, missing, or otherwise inaccurate — and discuss how to eliminate it in your data ecosystem with end-to-end data observability. Drawing parallels to application observability in software engineering, data observability is a critical component of the modern DataOps workflow and the key to ensuring data trust at scale. I’ll share why data observability matters when it comes to building a better data quality strategy and highlight tactics you can use to address it today.
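As a taste of what such monitoring looks like in practice, here is a minimal sketch of an automated data downtime check in Python. The thresholds, column handling, and pandas-based approach are illustrative assumptions, not Monte Carlo's implementation.

```python
import pandas as pd

def detect_downtime(df: pd.DataFrame, ts_col: str,
                    max_lag_hours: float = 24.0,
                    max_null_rate: float = 0.05) -> list[str]:
    """Return a list of data-downtime findings for one table."""
    issues = []
    # Freshness: assumes ts_col holds tz-aware UTC timestamps.
    lag_h = (pd.Timestamp.now(tz="UTC") - df[ts_col].max()).total_seconds() / 3600
    if lag_h > max_lag_hours:
        issues.append(f"stale: newest record is {lag_h:.1f}h old")
    # Completeness: flag columns with an unexpected share of nulls.
    for col, rate in df.isna().mean().items():
        if rate > max_null_rate:
            issues.append(f"nulls: {col} is {rate:.0%} null")
    return issues
```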

Shane Murray
Field CTO | Monte Carlo
10:40 - 11:15
Streaming Featurization with Ibis, Substrait and Apache Arrow

In this talk, you’ll learn how Two Sigma and Voltron Data are collaborating to improve the performance of featurization workflows using the Ibis, Substrait, and Arrow software stack. Wes McKinney and David Palaitis have been working together since 2016 on the design and implementation of high-performance data engines for processing unstructured, high-volume, streaming datasets for use in machine learning algorithms. While Palaitis has focused on using these tools to support machine learning at Two Sigma, Wes has built out a new business to support the open-source computing libraries that are critical to supporting high-performance featurization for quant finance workloads.
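For readers unfamiliar with the stack, the sketch below shows the general shape of the workflow: define a backend-agnostic Ibis expression once, then execute it. The toy feature logic is hypothetical, and the commented-out ibis_substrait usage is an assumption that may vary by version.

```python
import ibis

# Toy tick data; a real workflow would stream Arrow batches instead.
t = ibis.memtable({"symbol": ["A", "A", "B"], "price": [10.0, 11.0, 7.0]})

# Define the featurization once, independent of any execution engine.
features = t.group_by("symbol").aggregate(
    mean_price=t.price.mean(),
    n_ticks=t.count(),
)

print(features.execute())  # runs locally (DuckDB by default), Arrow underneath

# The same expression can be serialized to a Substrait plan for other engines;
# exact ibis_substrait usage varies by version (assumption):
# from ibis_substrait.compiler.core import SubstraitCompiler
# plan = SubstraitCompiler().compile(features)
```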

Wes McKinney
CTO and Co-Founder | Voltron Data
David Palaitis
Managing Director | Two Sigma
11:20 - 11:55
Applying Engineering Best Practices in Data Lakes Architectures

Engineering best practices are methods and tools that enable high-quality, high-velocity software development. Among the most common application development best practices are the agile methodology, continuous integration and continuous deployment (CI/CD), and production monitoring. Do those practices apply to data engineers working over high-scale data lakes?
In this talk, we will show why adopting those practices for data lakes is a must: it provides a safe environment to operate in and produces higher-quality data in less time. Our time can then be spent on actual data engineering rather than on manual plumbing of data pipelines.

The tools that allow us to implement these engineering best practices, such as data version control, orchestration, data quality platforms, and data monitoring, are right at our fingertips. We will show how to combine those tools to create robust data products over data lakes.
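As one illustration of these practices, here is a minimal sketch of a CI-style data quality gate over a lake table using pyarrow; the path, column names, and checks are hypothetical.

```python
import pyarrow.parquet as pq

def quality_gate(path: str) -> None:
    """Fail the CI job if the staged table violates basic expectations."""
    table = pq.read_table(path)
    assert table.num_rows > 0, "empty table"
    assert "order_id" in table.column_names, "missing key column"
    nulls = table.column("order_id").null_count
    assert nulls == 0, f"{nulls} null order_id values"

# In CI, run against the staging branch/prefix before promoting to production:
# quality_gate("staging/orders.parquet")
```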

Einat Orr
CEO and Co-Founder | Treeverse
11:20 - 11:55
Thrive in the Data Tooling Tornado: Lead, Hire, and Execute Better by Escaping Older Industrial Antipatterns

The data tooling landscape has exploded to hundreds of products. New ones emerge almost daily. In this environment, firms struggle just to meet legacy business commitments. And the exciting next-gen projects, leveraging analytics and machine learning (AI), are notoriously unsuccessful: their failure rate is estimated as high as 85 percent!

In this talk, we’ll learn why many of these challenges result from outdated anti-patterns held over from 20th-century industry. These older patterns emphasize efficiency over effectiveness and are not appropriate for 2023 — leading to results both ineffective and inefficient.

We’ll look at adjustments in approach that make it easier for data teams to hire, manage, retain, and execute effectively using modern data tooling — all while gaining that sought-after efficiency.

This session is aimed at medium and large businesses and will be especially useful *outside* of the “big tech” software/SaaS industry.

Adam Breindel
Independent Consultant
12:00 - 12:35
From BI to AI: Lakehouse is the modern data architecture

Industries are constantly evolving their infrastructure to handle the large amounts of data they collect from multiple sources, and they are increasingly adopting a new architecture pattern in which a single system can house data for many different workloads. Teams already store all their data in cloud data lakes, but when it comes to decision making, they have to move that data somewhere else. This creates complexity, multiple copies, delays, and new failure modes. Moreover, enterprise use cases include machine learning and AI, for which neither data lakes nor warehouses are ideal.

The lakehouse combines the strengths of data warehouses and data lakes into a single system, allowing data teams to accelerate their use cases by working in one system rather than accessing several, eliminating data silos and duplication of data while offering reliability and cost efficiency. The lakehouse is based on open formats such as Delta Lake, which provide support for advanced analytics and AI with performance and reliability guarantees. In this talk, we will cover the evolution of the modern data architecture, its foundational principles, and some production examples of lakehouses.
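To make the open-format point concrete, here is a minimal sketch using the deltalake (delta-rs) Python package to get ACID versioning and time travel over plain files; the paths and data are illustrative.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"user_id": [1, 2], "spend": [9.5, 3.2]})
write_deltalake("./lake/events", df)                 # creates version 0
write_deltalake("./lake/events", df, mode="append")  # commits version 1

dt = DeltaTable("./lake/events")
print(dt.version())  # ACID transaction log over plain Parquet files
print(DeltaTable("./lake/events", version=0).to_pandas())  # time travel
```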

Vini Jaiswal
Developer Advocate | Databricks
12:00 - 12:35
Demystifying Data Mesh – Tackling common misconceptions about Data Mesh

Data Mesh is moving forward on its hype cycle. More and more vendors are branding their products as data mesh solutions, which is inherently wrong: data mesh is about federating responsibilities by acknowledging a distributed landscape, not about a product. In this session, Wannes will address more of these misconceptions and return to the core concepts of data mesh.

Wannes Rosiers
Product Manager | Ratio
12:40 - 13:15
Reliable Pipelines and High Quality Data Without the Toil

Bad data sucks. But it’s a struggle keeping data fresh and high quality as pipelines get complicated. Data observability is the ability to understand what’s happening to, and within, your pipelines at all times. It enables data engineers to identify pipeline issues sooner, spot pipeline performance opportunities more easily, and reduce toilsome maintenance work.

Data observability techniques were pioneered by large-scale data teams at companies like Uber, Airbnb, and Intuit. But today they’re accessible to teams of nearly any size.

In this talk you’ll hear about the history of data quality testing and data observability inside Uber, the differences between data observability and other methods like data pipeline tests, how techniques developed there can be applied by data engineers anywhere, and an overview of both commercially available and open source tools available today.
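One simple flavor of such observability, distinct from a hand-written pipeline test, is anomaly detection on pipeline metrics. The sketch below applies a z-score check to daily row counts; the numbers and threshold are illustrative.

```python
import statistics

def is_anomalous(history: list[int], today: int, z_max: float = 3.0) -> bool:
    """Flag today's metric if it sits more than z_max stdevs from the mean."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    if std == 0:
        return today != mean
    return abs(today - mean) / std > z_max

daily_row_counts = [10_120, 9_980, 10_340, 10_055, 10_210]
print(is_anomalous(daily_row_counts, 4_200))  # True: alert before consumers notice
```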

Kyle Kirwan
Co-Founder and CEO | Bigeye
12:40 - 13:15
Data: Planning to Implementation

How can businesses leverage big data, fast data, traditional data and modern data for decision making? How can businesses realize value from data? What are the capabilities needed for enterprise data management? “Data: Planning to Implementation” will provide a strategic perspective to the “why, what, where, when, how and whom” of data management across industry.

Balaji Raghunathan
VP of Digital Experience | ITC Infotech
13:20 - 13:55
Building Trusted Platforms by Protecting Data

In an age of unparalleled innovation, our technical computing power has far outpaced our moral processing power. The challenge in the coming decade will be sustaining the engagement and monetization of data while also building for safety, trust, and transparency. Protecting user data is not just about compliance with regulations but also about business maturity. Ensuring sound data governance is critical to help avoid the excesses that have hurt the reputation of the tech industry. In this talk, using data privacy as a vehicle, Nishant Bhajaria will help connect the global regulatory landscape and high-scale technical operations to privacy tools, processes, and metrics. The end goal is to help evolve technical innovation to preserve the gains of the bottom-up agile revolution while protecting consumers and growing the business.

Nishant Bhajaria
Director of Privacy Engineering | Uber
13:20 - 13:55
Automated Data Classification

Automating Data Classification is key to a successful data privacy program. Data privacy policies apply to specific types of data and without knowing which datasets contain this regulated data, it is impossible to protect it. In any vast and dynamic data estate, manual labeling or classification of data is impractical. This talk will cover the challenges and different approaches for automating data classification.
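As a small illustration of one approach, the sketch below classifies column samples with regular-expression rules; the patterns and hit-rate threshold are toy assumptions, and production systems typically combine such rules with validation logic and ML-based scanners.

```python
import re

# Toy detectors; real classifiers add checksums, context, and ML models.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def classify_column(samples: list[str], min_hit_rate: float = 0.8) -> list[str]:
    """Label a column by the share of sampled values matching each pattern."""
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(bool(pattern.search(s)) for s in samples)
        if samples and hits / len(samples) >= min_hit_rate:
            labels.append(label)
    return labels

print(classify_column(["a@x.com", "b@y.org", "c@z.net"]))  # ['EMAIL']
```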

Alex Gorelik
Distinguished Engineer | LinkedIn
14:00 - 14:35
Getting into Data Engineering

Curious about becoming a data engineer? This talk will cover the key things you should consider about a career in data engineering, particularly against the backdrop of 2023’s economic climate.

Joe Reis
CEO | Ternary Data
14:00 - 14:35
Leveraging Data in Motion in a Cloud-first World

Apache Kafka has emerged as the de facto standard event streaming platform in enterprise architectures. Many business applications are moving away from data at rest to an event-driven architecture so that they can leverage data in real time as new events occur. More than 80% of the Fortune 100 are building their businesses on this new platform. In this talk, I will first share the story behind Kafka: how it was invented, what problem it was trying to solve, and how it has been evolving. Then, I will talk about how making Kafka cloud native creates new opportunities for building one system of record, and cover some real-world use cases.
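For the unfamiliar, the sketch below shows the event-driven pattern at its smallest: publishing an event to Kafka with the confluent-kafka Python client. The broker address, topic, and payload are placeholders.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

def on_delivery(err, msg):
    if err is not None:
        print(f"delivery failed: {err}")

# Publish each business event as it occurs (data in motion), rather than
# batching it into a store to be queried later (data at rest).
producer.produce("orders", key="user-42",
                 value=b'{"item": "book", "qty": 1}',
                 callback=on_delivery)
producer.flush()  # block until outstanding deliveries are acknowledged
```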

Jun Rao
Co-Founder | Confluent
14:40 - 15:15
Spark, Cloud, DBT, Data Warehouses

We are going to discuss current data engineering trends and how the industry is moving toward a new data stack. I will first discuss the current tech stack that most companies are using, why there is a need for a shift, and how the current tech stack is moving toward data warehouses and delta lakes.

Navdeep Kaur
Founder | Techno Avengers
14:40 - 15:15
Assessing Data Quality: The 3 Facets of Data “Fitness”

While most of us are used to assessing the quality of data for gaps, errors, and other data integrity problems, understanding whether the information we have is “fit” for our intended purpose can be a little trickier. In this session, I’ll cover the three essential facets of data “fitness” that can help you ensure that your data can really give you the answers you want.

Susan McGregor
Associate Research Scholar | Columbia University's Data Science Institute
15:20 - 15:55
Building a Data Mesh: Strategies and Best Practices for Navigating the Implementation of a Data Mesh

Data mesh is a new approach to thinking about data, based on a distributed architecture for data management that promotes decentralized ownership and control of data assets. It emphasizes the use of domain-driven design and self-service access to data, with the goal of improving the quality and usability of data for business decision-making. In this talk, we will explore the principles and practices of data mesh and how to implement it in an organization.

Hajar Khizou
Lead Data Engineer | SustainCERT
15:20 - 15:55
Using Compression, Deduplication & Encryption for Better Data Management

Cloud storage footprints are in exabytes and growing exponentially, and companies pay billions of dollars to store and retrieve this data. In this talk, we will cover some of the space and time optimizations that have historically been applied to on-premise file storage and how they can be applied to objects stored in the cloud.

Deduplication and compression are techniques that have traditionally been used to reduce the amount of storage used by applications. Data encryption is table stakes for any remote storage offering, and today cloud providers support both client-side and server-side encryption.

Combining compression, encryption, and deduplication for object stores in the cloud is challenging due to the nature of overwrites and versioning, but the right strategy can save millions for an organization. We will cover some strategies for employing these techniques depending on whether an organization prefers client-side or server-side encryption, and discuss online and offline deduplication of objects.
Companies such as Box and Netflix employ a subset of these techniques to reduce their cloud footprint and provide agility in their cloud operations.
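To make the interaction between the three techniques concrete, here is a minimal sketch that dedupes on a content hash, then compresses, then encrypts on the client side. The in-memory dict stands in for an object store, and nothing here reflects Box's or Netflix's actual systems.

```python
import hashlib
import zlib

from cryptography.fernet import Fernet

key = Fernet.generate_key()
store: dict[str, bytes] = {}  # stand-in for a cloud object store

def put(data: bytes) -> str:
    """Dedupe on a content hash, then compress, then encrypt client-side."""
    digest = hashlib.sha256(data).hexdigest()  # content-addressed dedup key
    if digest not in store:                    # identical content stored once
        store[digest] = Fernet(key).encrypt(zlib.compress(data))
    return digest

def get(digest: str) -> bytes:
    return zlib.decompress(Fernet(key).decrypt(store[digest]))

# Ordering matters: encrypting first would randomize the bytes, defeating
# both compression and content-based deduplication.
ref1 = put(b"hello world" * 1000)
ref2 = put(b"hello world" * 1000)
assert ref1 == ref2 and len(store) == 1  # second put was deduplicated
```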

Tejas Chopra
Senior Software Engineer | Netflix