Day 1 - June 07
Poster Competition: JoinBoost: Tree Training with just SQL
Zachary Huang
Data and machine learning (ML) are crucial for enterprise operations. Enterprises store data in databases for management and use ML to gain business insights. However, there is a mismatch between the way ML expects data to be organized (a single table) and the way data is organized in databases (a join graph of multiple tables). Current specialized ML libraries (e.g., LightGBM, XGBoost) require data denormalization, export, and import, because they run as separate programs that are incompatible with databases. This workflow not only increases operational complexity but also limits scalability, slows performance, and introduces security risks. But what if there were a way to achieve competitive tree training performance with just SQL? We present JoinBoost, a lightweight Python library that rewrites tree training algorithms over normalized databases into pure SQL queries. Compatible with any DBMS and data stack, JoinBoost is a simplified, all-in-one solution that avoids data denormalization, export, and import. Its performance and scalability follow the capabilities of the underlying DBMS. Our experiments show that, compared to LightGBM, JoinBoost is 3x faster for random forests and 1.1x faster for gradient boosting, and that it scales well beyond LightGBM in terms of features, DB size, and join graph complexity.
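To make the idea of "tree training with just SQL" concrete, the following is a minimal sketch of how a single regression-tree split could be chosen by one SQL query over a normalized schema. It is an illustration under stated assumptions, not JoinBoost's actual API or query plan: the tables `sales` and `stores`, the columns `y` and `size`, and the helpers `build_demo_db` and `best_split` are all hypothetical, and SQLite (with window-function support, version 3.25 or later) stands in for the DBMS. The query scores each candidate threshold by the standard variance-reduction quantity S_left^2/C_left + S_right^2/C_right, where S and C are the sum and count of the target on each side of the split.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny normalized database: a fact table and a dimension table.

    Hypothetical schema, used only for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE stores (store_id INTEGER PRIMARY KEY, size INTEGER);
        CREATE TABLE sales  (store_id INTEGER, y REAL);
        INSERT INTO stores VALUES (1, 10), (2, 20), (3, 30), (4, 40);
        INSERT INTO sales  VALUES (1, 1.0), (1, 1.0), (2, 1.0),
                                  (3, 5.0), (3, 5.0), (4, 5.0);
        """
    )
    return conn


# One query finds the best threshold on stores.size for predicting sales.y.
# The join result is aggregated per feature value, so no denormalized table
# is exported from the database.
BEST_SPLIT_SQL = """
WITH agg AS (              -- count and sum of the target per feature value
    SELECT s.size AS v, COUNT(*) AS c, SUM(f.y) AS s
    FROM sales AS f JOIN stores AS s ON f.store_id = s.store_id
    GROUP BY s.size
),
pre AS (                   -- prefix sums give the statistics for "size <= v"
    SELECT v,
           SUM(c) OVER (ORDER BY v) AS c_left,
           SUM(s) OVER (ORDER BY v) AS s_left,
           SUM(c) OVER ()           AS c_tot,
           SUM(s) OVER ()           AS s_tot
    FROM agg
)
SELECT v,
       1.0 * s_left * s_left / c_left
     + 1.0 * (s_tot - s_left) * (s_tot - s_left) / (c_tot - c_left) AS score
FROM pre
WHERE c_left < c_tot       -- skip the split that leaves the right side empty
ORDER BY score DESC
LIMIT 1;
"""


def best_split(conn: sqlite3.Connection) -> tuple:
    """Return (threshold, score) for the split `size <= threshold`."""
    return conn.execute(BEST_SPLIT_SQL).fetchone()


if __name__ == "__main__":
    threshold, score = best_split(build_demo_db())
    print(f"best split: size <= {threshold} (score {score})")
```

In this sketch the only data leaving the database is a single row holding the chosen threshold and its score; a full tree would repeat such queries per node and per feature, which is the part a library would automate.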