Explore the full two-day session lineup, and stay tuned for additional exciting sessions and speakers.

Day 1 - June 07

12:55 PM PDT
3:55 PM EDT

Poster Competition: JoinBoost: Tree Training with just SQL

Zachary Huang

PhD Student
Columbia University

Data and machine learning (ML) are crucial for enterprise operations: enterprises store data in databases for management and use ML to gain business insights. However, ML expects data to be organized as a single table, whereas databases organize it as a join graph of multiple tables. Current specialized ML libraries (e.g., LightGBM, XGBoost) run as separate programs outside the database, so they require data denormalization, export, and import. This workflow increases operational complexity, limits scalability, slows performance, and introduces security risks. But what if there were a way to achieve competitive tree training performance with just SQL? We present JoinBoost, a lightweight Python library that rewrites tree training algorithms over normalized databases into pure SQL queries. JoinBoost is compatible with any DBMS and data stack and avoids data denormalization, export, and import, and its performance and scalability track the capabilities of the underlying DBMS. In our experiments, JoinBoost is 3x faster than LightGBM for random forests and 1.1x faster for gradient boosting, and it scales well beyond LightGBM in the number of features, database size, and join graph complexity.
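To make the idea of tree training "with just SQL" concrete, here is a minimal sketch of how one regression-tree split could be chosen from SQL aggregates computed over a join. It is not JoinBoost's API and does not reproduce its actual queries. The tables `sales` and `stores`, their columns, and the helper names `make_demo_db` and `best_split` are all hypothetical, and SQLite (via Python's standard `sqlite3` module) stands in for the DBMS. The split criterion is the standard variance-reduction score expressed through per-group counts and sums.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stores (store_id INTEGER PRIMARY KEY, size INTEGER)")
    conn.execute("CREATE TABLE sales (store_id INTEGER, y REAL)")
    conn.executemany("INSERT INTO stores VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(1, 1.0), (1, 2.0), (2, 1.5), (3, 10.0), (3, 11.0)],
    )
    return conn


def best_split(conn, fact, dim, key, feature, target):
    """Return (threshold, gain) for the best split of the form `feature <= threshold`.

    The database computes COUNT and SUM of the target per feature value over the
    join; the joined rows are never exported. Identifiers are interpolated into
    the query text, so they must come from trusted code, not user input.
    """
    query = f"""
        SELECT d.{feature} AS value, COUNT(*) AS cnt, SUM(f.{target}) AS total
        FROM {fact} AS f
        JOIN {dim} AS d ON f.{key} = d.{key}
        GROUP BY d.{feature}
        ORDER BY d.{feature}
    """
    rows = conn.execute(query).fetchall()
    n = sum(cnt for _, cnt, _ in rows)
    s = sum(total for _, _, total in rows)

    best = (None, 0.0)
    left_n, left_s = 0, 0.0
    # The largest value is skipped: splitting there would leave the right side empty.
    for value, cnt, total in rows[:-1]:
        left_n += cnt
        left_s += total
        right_n, right_s = n - left_n, s - left_s
        # Reduction in the sum of squared errors, written with counts and sums only.
        gain = left_s**2 / left_n + right_s**2 / right_n - s**2 / n
        if gain > best[1]:
            best = (value, gain)
    return best
```

Calling `best_split(make_demo_db(), "sales", "stores", "store_id", "size", "y")` on the demo data selects the threshold `size <= 20`, which separates the low-value stores from the high-value one. Only one small aggregate row per distinct feature value leaves the database, which is the property that lets this style of training avoid exporting denormalized data.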


Watch on demand

Watch all of the live sessions on demand and discover the latest developments in data-centric AI.