Broken data is costly, time-consuming, and nowadays, an all-too-common reality for even the most advanced data teams. In this talk, I’ll introduce this problem, called “data downtime” — periods when data is partial, erroneous, missing, or otherwise inaccurate — and discuss how to eliminate it in your data ecosystem with end-to-end data observability. Drawing parallels to application observability in software engineering, data observability is a critical component of the modern DataOps workflow and the key to ensuring data trust at scale. I’ll share why data observability matters when it comes to building a better data quality strategy and highlight tactics you can use to address it today.


In this talk, you’ll learn how Two Sigma and Voltron Data are collaborating to improve the performance of featurization workflows using the Ibis, Substrait, and Arrow software stack. Wes McKinney and David Palaitis have been working together since 2016 on the design and implementation of high-performance data engines for processing unstructured, high-volume streaming datasets for use in machine learning algorithms. While Palaitis has focused on using these tools to support machine learning at Two Sigma, Wes has built out a new business to support the open source computing libraries that are critical to high-performance featurization for quant finance workloads.
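
To give a concrete flavor of what this stack looks like from the user’s side, here is a minimal sketch: an Ibis expression is built lazily, executed by a backend engine, and the result comes back as Arrow data. The table, columns, and backend below are illustrative assumptions, not taken from the talk, and the real Two Sigma/Voltron Data pipelines are far more involved, with Substrait serving as the plan format between Ibis and the execution engine.

```python
import ibis

# Connect to an embedded DuckDB backend (one of several Ibis backends).
con = ibis.duckdb.connect()

# Register a hypothetical Parquet file of trade records as a table.
trades = con.read_parquet("trades.parquet", table_name="trades")

# Build a featurization expression lazily; nothing executes yet.
features = trades.group_by("symbol").aggregate(
    avg_price=trades.price.mean(),
    total_volume=trades.volume.sum(),
)

# Execute on the backend and pull the result back as a PyArrow table.
arrow_table = features.to_pyarrow()
print(arrow_table.schema)
```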




Engineering best practices are methods and tools that enable high-quality, high-velocity software development. Among the most common application development best practices are the agile methodology, continuous integration and continuous deployment, and production monitoring. Do those practices apply to data engineers working on high-scale data lakes?
In this talk, we will show how adapting those practices to data lakes is a must, as it provides a safe environment to operate in and produces higher-quality data in less time. Our time will be spent on actual data engineering rather than on manual plumbing of data pipelines.
The tools that allow us to implement engineering best practices, such as data version control, orchestration, data quality platforms, and data monitoring, are right at our fingertips. We will show how to combine those tools to create robust data products over data lakes.
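
To make the idea concrete, here is a minimal sketch of one such practice, a write-audit-publish gate, in plain Python with pandas. The paths, columns, and checks are hypothetical and not from the talk; a real setup would lean on data version control, orchestration, and a data quality platform rather than file copies.

```python
import shutil
from pathlib import Path

import pandas as pd  # reading Parquet also requires pyarrow or fastparquet

STAGING = Path("lake/staging/orders.parquet")      # hypothetical staging path
PUBLISHED = Path("lake/published/orders.parquet")  # hypothetical production path


def audit(df: pd.DataFrame) -> list[str]:
    """Run simple data quality checks; return a list of failure messages."""
    failures = []
    if df.empty:
        failures.append("dataset is empty")
    if df["order_id"].isna().any():
        failures.append("null order_id values found")
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values found")
    return failures


def publish_if_clean() -> None:
    """Write-audit-publish: only promote staged data that passes the checks."""
    df = pd.read_parquet(STAGING)
    failures = audit(df)
    if failures:
        raise ValueError(f"refusing to publish: {failures}")
    PUBLISHED.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(STAGING, PUBLISHED)


if __name__ == "__main__":
    publish_if_clean()
```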


The data tooling landscape has exploded to hundreds of products. New ones emerge almost daily. In this environment, firms struggle just to meet legacy business commitments. And the exciting next-gen projects, leveraging analytics and machine learning (AI), are notoriously unsuccessful: their failure rate is estimated as high as 85 percent!
In this talk, we’ll learn why many of these challenges result from outdated anti-patterns held over from 20th-century industry. These older patterns emphasize efficiency over effectiveness and are not appropriate for 2023 — leading to results both ineffective and inefficient.
We’ll look at adjustments in approach that make it easier for data teams to hire, manage, retain, and execute effectively using modern data tooling — all while gaining that sought-after efficiency.
This session is aimed at medium and large businesses and will be especially useful *outside* of the “big tech” software/SaaS industry.


Industries are constantly evolving their infrastructure to handle the large amounts of data they collect from multiple sources, and they are increasingly adopting a new data architecture pattern in which a single system can house data for many different workloads. They already store all their data in cloud data lakes, but when it comes to decision making they have to move that data somewhere else, which creates complexity, multiple copies, delays, and new failure modes. Moreover, enterprise use cases include machine learning and AI, for which neither data lakes nor warehouses are ideal. The lakehouse combines the strengths of data warehouses and data lakes in a single system, allowing data teams to accelerate their use cases by working in one system rather than accessing several, eliminating data silos and duplication of data while offering reliability and cost efficiency. The lakehouse is built on open formats such as Delta Lake, which provide support for advanced analytics and AI with performance and reliability guarantees. In this talk, we will cover the evolution of modern data architecture, the foundational principles of the lakehouse, and some production examples of lakehouses.
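
As a rough illustration of the pattern (not the speakers’ production code), the sketch below uses PySpark with the open source delta-spark package: a single Delta table is written once and then serves both analytical queries and version-pinned reads for ML. The paths and schema are made up, and the session assumes the Delta Lake jars are on the classpath.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with delta-spark available
# (e.g. spark.jars.packages=io.delta:delta-core_2.12:<version>).
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical raw events landing in the lake.
events = spark.createDataFrame(
    [("u1", "click", "2023-01-01"), ("u2", "purchase", "2023-01-01")],
    ["user_id", "event_type", "event_date"],
)

# One Delta table serves BI queries, ML feature reads, and incremental appends.
events.write.format("delta").mode("append").save("/lake/events")

# ACID reads for analytics...
spark.read.format("delta").load("/lake/events").groupBy("event_type").count().show()

# ...and time travel for reproducible ML training sets.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/lake/events")
```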


Data Mesh is moving forward on its hype cycle. More and more vendors are branding their products as data mesh solutions, which is inherently wrong: data mesh is about federating responsibilities by acknowledging a distributed landscape. In this session, Wannes will address more of these misconceptions and return to the core concepts of data mesh.


Bad data sucks. But it’s a struggle keeping data fresh and high quality as pipelines get complicated. Data observability is the ability to understand what’s happening to, and within, your pipelines at all times. It enables data engineers to identify pipeline issues sooner, spot pipeline performance opportunities more easily, and reduce toilsome maintenance work.
Data observability techniques were pioneered by large-scale data teams at companies like Uber, Airbnb, and Intuit. But today they’re accessible to teams of nearly any size.
In this talk, you’ll hear about the history of data quality testing and data observability inside Uber, the differences between data observability and other methods such as data pipeline tests, how the techniques developed there can be applied by data engineers anywhere, and an overview of both commercial and open source tools available today.
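
This is not Uber’s system, but a toy sketch of two of the most common observability signals, freshness and volume, computed here against a stand-in SQLite database. Real deployments read warehouse metadata and query logs; the table name, schema, and thresholds below are invented for illustration.

```python
import datetime as dt
import sqlite3  # stand-in for a real warehouse connection
import statistics

conn = sqlite3.connect("warehouse.db")  # hypothetical metadata store

# Freshness: when did the table last receive data?
(last_loaded_str,) = conn.execute("SELECT MAX(loaded_at) FROM orders").fetchone()
last_loaded = dt.datetime.fromisoformat(last_loaded_str)
if dt.datetime.utcnow() - last_loaded > dt.timedelta(hours=6):  # arbitrary SLA
    print("ALERT: orders table is stale")

# Volume: compare today's row count against the recent daily pattern.
daily_counts = [
    n for (n,) in conn.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE loaded_at >= date('now', '-7 day') "
        "GROUP BY date(loaded_at) ORDER BY date(loaded_at)"
    )
]
if daily_counts:
    mean = statistics.mean(daily_counts)
    stdev = statistics.pstdev(daily_counts)
    today = daily_counts[-1]
    if stdev and abs(today - mean) > 3 * stdev:
        print(f"ALERT: unusual row volume today ({today} vs mean {mean:.0f})")
```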


How can businesses leverage big data, fast data, traditional data and modern data for decision making? How can businesses realize value from data? What are the capabilities needed for enterprise data management? “Data: Planning to Implementation” will provide a strategic perspective to the “why, what, where, when, how and whom” of data management across industry.


In an age of unparalleled innovation, our technical computing power has far outpaced our moral processing power. The challenge in the coming decade will be sustaining the engagement and monetization of data while also building for safety, trust, and transparency. Protecting user data is not just about compliance with regulations but also about business maturity. Ensuring sound data governance is critical to avoiding the excesses that have hurt the reputation of the tech industry. In this talk, using data privacy as a vehicle, Nishant Bhajaria will connect the global regulatory landscape and high-scale technical operations to privacy tools, processes, and metrics. The end goal is to evolve technical innovation in a way that preserves the gains of the bottom-up agile revolution while protecting consumers and growing the business.


Automating Data Classification is key to a successful data privacy program. Data privacy policies apply to specific types of data and without knowing which datasets contain this regulated data, it is impossible to protect it. In any vast and dynamic data estate, manual labeling or classification of data is impractical. This talk will cover the challenges and different approaches for automating data classification.
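
As a simple illustration of one approach, the sketch below classifies columns by matching sampled values against regular-expression rules. The rule set, threshold, and data are hypothetical, and production classifiers typically combine patterns with column-name heuristics and ML models.

```python
import pandas as pd

# Hypothetical rule set: map a data class to a regex matched against sample values.
RULES = {
    "email": r"^[\w.+-]+@[\w-]+\.[\w.]+$",
    "us_ssn": r"^\d{3}-\d{2}-\d{4}$",
    "phone": r"^\+?\d[\d\s().-]{7,}$",
}


def classify_column(values: pd.Series, threshold: float = 0.8) -> str | None:
    """Label a column if most sampled non-null values match one rule."""
    sample = values.dropna().astype(str).head(100)
    if sample.empty:
        return None
    for label, pattern in RULES.items():
        if sample.str.match(pattern).mean() >= threshold:
            return label
    return None


# Toy dataset: one column holds regulated data, the other does not.
df = pd.DataFrame({
    "contact": ["alice@example.com", "bob@example.org"],
    "notes": ["call later", "sent invoice"],
})
print({col: classify_column(df[col]) for col in df.columns})
# {'contact': 'email', 'notes': None}
```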


Curious about becoming a data engineer? This talk will cover the key things you should consider about a career in data engineering, particularly against the backdrop of 2023’s economic climate.


Apache Kafka has emerged as the de facto standard event streaming platform in enterprise architectures. Many business applications are moving away from data at rest to an event-driven architecture so that they can leverage data in real time as new events occur. More than 80% of the Fortune 100 are building their businesses on this new platform. In this talk, I will first share the story behind Kafka: how it was invented, what problem it was trying to solve, and how it has been evolving. Then, I will talk about how making Kafka cloud native creates new opportunities for building a single system of record, and cover some real-world use cases.
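
For readers new to Kafka, here is a minimal produce-and-consume round trip using the confluent-kafka Python client. The broker address, topic, and consumer group are placeholders, and a real event-driven service would poll in a loop rather than once.

```python
import json

from confluent_kafka import Consumer, Producer

BROKER = "localhost:9092"  # hypothetical local broker
TOPIC = "orders"           # hypothetical topic

# Produce an event as soon as it happens (data in motion, not data at rest).
producer = Producer({"bootstrap.servers": BROKER})
producer.produce(TOPIC, key="order-123", value=json.dumps({"amount": 42.0}))
producer.flush()

# A downstream application consumes and reacts to the event in real time.
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "billing-service",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print("received:", json.loads(msg.value()))
consumer.close()
```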


We are going to discuss current data engineering trends and how the industry is moving toward a new data stack. I will first discuss the tech stack that most companies use today, why there is a need for a shift, and how the stack is moving toward data warehouses and delta lakes.


While most of us are used to assessing the quality of data for gaps, errors, and other data integrity problems, understanding whether the information we have is “fit” for our intended purpose can be a little trickier. In this session, I’ll cover the three essential facets of data “fitness” that can help you ensure that your data can really give you the answers you want.


Data mesh is a new approach to thinking about data, based on a distributed architecture for data management that promotes decentralized ownership and control of data assets. It emphasizes the use of domain-driven design and self-service access to data, with the goal of improving the quality and usability of data for business decision-making. In this talk, we will explore the principles and practices of data mesh and how to implement it in an organization.


Cloud storage footprints are measured in exabytes and growing exponentially, and companies pay billions of dollars to store and retrieve data. In this talk, we will cover some of the space and time optimizations that have historically been applied to on-premise file storage, and how they can be applied to objects stored in the cloud.
Deduplication and compression are techniques that have traditionally been used to reduce the amount of storage used by applications. Data encryption is table stakes for any remote storage offering, and today cloud providers support both client-side and server-side encryption.
Combining compression, encryption, and deduplication for object stores in the cloud is challenging due to the nature of overwrites and versioning, but the right strategy can save an organization millions. We will cover strategies for employing these techniques depending on whether an organization prefers client-side or server-side encryption, and discuss online and offline deduplication of objects.
Companies such as Box and Netflix employ a subset of these techniques to reduce their cloud footprint and provide agility in their cloud operations.
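
To show why the ordering of these steps matters, here is a toy client-side sketch: content is hashed for deduplication first, then compressed, then encrypted, since encrypting first would destroy both compressibility and content-based dedup. An in-memory dict stands in for the object store, the key handling is simplified, and real systems must also handle versioning and overwrites as discussed in the talk.

```python
import hashlib
import zlib

from cryptography.fernet import Fernet  # pip install cryptography

object_store = {}  # in-memory stand-in for a cloud bucket
fernet = Fernet(Fernet.generate_key())  # simplified key management


def put_object(data: bytes) -> str:
    """Hash plaintext for dedup, then compress, then encrypt before 'upload'."""
    digest = hashlib.sha256(data).hexdigest()
    if digest not in object_store:  # identical content is stored only once
        object_store[digest] = fernet.encrypt(zlib.compress(data))
    return digest


def get_object(digest: str) -> bytes:
    """Reverse the pipeline: decrypt, then decompress."""
    return zlib.decompress(fernet.decrypt(object_store[digest]))


k1 = put_object(b"a,b,c\n1,2,3\n" * 1000)
k2 = put_object(b"a,b,c\n1,2,3\n" * 1000)  # duplicate upload adds no new blob
assert k1 == k2 and len(object_store) == 1
assert get_object(k1).startswith(b"a,b,c")
```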

