Some Thoughts on Rethinking Databases for Computational Science

From 2014 to 2015, I was a Database Kernel Engineer on the Distributed Systems team at MongoDB. The team was responsible for designing and implementing protocols for executing database queries over data distributed across multiple machines. The query plan was chosen automatically based on several factors, including read/write throughput, data locality, and data distribution.

A shard key (i.e., a single indexed field or a combination of fields) was used to partition data into chunks spread across different servers. The choice of shard key thus determines the data distribution, and it matters throughout the lifetime of the data as different queries are executed against it. As a result, domain knowledge of the data distribution and of the likely queries can be valuable for query planning. In certain scientific fields (e.g., quantum physics), the generated data can be stored in a flat view. But such a view does not take advantage of the data-generating process to eliminate redundancies, resulting in costly materializations. What if we had a way to design query and storage plans using knowledge of the scientific domain? This could lead to significant improvements in storage and computation costs.
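To make the shard-key choice concrete, here is a minimal sketch using pymongo; the cluster address, database, collection, and field names are all hypothetical:

```python
# A minimal sketch of declaring a shard key with pymongo.
# Assumes a running sharded cluster reachable via a mongos router at
# localhost:27017; the database, collection, and field names are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Enable sharding for the (hypothetical) database.
client.admin.command("enableSharding", "experiments")

# Choose the shard key: documents are partitioned into chunks by this field.
# A key with low cardinality, or one that increases monotonically (e.g., a
# timestamp), can concentrate writes on a single shard, which is why domain
# knowledge of the data distribution and the expected queries matters here.
client.admin.command(
    "shardCollection",
    "experiments.measurements",
    key={"sensor_id": 1},
)
```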

Computational Science (or Scientific Computing) is an emerging discipline that uses computers to simulate or solve scientific problems, whether in the social or the natural sciences. The input of domain knowledge is critical to computational science. Database theory has been crucial to the design and use of database management systems, providing the SQL (Structured Query Language) interface, the relational model and calculus, and related abstractions [1]. It is now well acknowledged that the choice of database management system depends on the type of data to be stored and the queries to be run over it. For example, graph database systems are well suited for storing large volumes of data representing relationships (edges) between entities (nodes). Relationships are first-class citizens in graph databases, so queries are optimized for inference on graphs. The storage and indexing formats implicitly take advantage of domain knowledge about graphs. Can we apply a similar methodology to the design of database abstractions for scientific modeling? I assert that we can!
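As a toy illustration (not any particular graph database's implementation), here is a sketch of why treating relationships as first-class, pre-indexed structures pays off for neighborhood queries:

```python
# A toy contrast between a flat "edge table" view and a graph-native
# adjacency view. The names and data are illustrative only.
from collections import defaultdict

edges = [("alice", "bob"), ("bob", "carol"), ("alice", "dave")]

# Flat relational view: answering "who are alice's friends?" scans every edge.
friends_flat = [dst for (src, dst) in edges if src == "alice"]

# Graph-native view: edges are indexed by endpoint up front, so the same
# query becomes a single dictionary lookup proportional to alice's degree.
adjacency = defaultdict(list)
for src, dst in edges:
    adjacency[src].append(dst)
friends_graph = adjacency["alice"]

assert friends_flat == friends_graph == ["bob", "dave"]
```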

[Image: ChatGPT-generated illustration representing “computational science”]

Task-Based Search for Alignment

Within the database community, there have been several theoretical proposals and implementations of task-based dataset search systems. The idea is as follows: given a set of providers, each contributing a data corpus, the dataset search system identifies augmentable datasets that maximize the utility of a machine-learning or data-analytics task. These datasets are then used to perform the task. For optimization purposes, designing the search system requires domain knowledge about the query (or set of queries) to be performed and about other critical concerns (e.g., privacy and security).

Within the artificial intelligence community focused on large language models, a related notion is AI alignment, the “process of encoding human values and goals into large language models to make them as helpful, safe, and reliable as possible. Through alignment, enterprises can tailor AI models to follow their business rules and policies” [2]. In task-based dataset search, the search system is likewise tailored to follow business rules and policies (with certain costs, of course) for a specific data-analytics task. With AI alignment, however, the goal is to align the system with certain values and goals without specifying a clear objective function! To train such systems, you need an evaluator (i.e., another LLM or a human being) that can expertly judge the effectiveness of the current LLM. The expertise of the evaluator is crucial to the whole process, so the evaluator must provide high-quality data, for example, through a task-based dataset search system! Throughout the alignment process, access to high-quality data and domain knowledge is essential. One could say that we need computational alignment between the sciences (with domain knowledge of the data-generating process) and the systems where the data will be stored and queried.
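To fix ideas, here is a minimal sketch of the core loop of task-based dataset search; the providers, the trivial join semantics, and the holdout-based utility function are all hypothetical simplifications of the systems described above:

```python
# A minimal sketch of task-based dataset search: greedily accept candidate
# augmentations (extra feature columns from hypothetical providers) only if
# they improve the utility of a downstream regression task.
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)
n = 200

# Base table has one feature; the target actually depends on a feature
# held by one of the providers.
base = rng.normal(size=(n, 1))
hidden = rng.normal(size=(n, 1))
y = 2.0 * hidden[:, 0] + 0.1 * rng.normal(size=n)

# Candidate augmentations (same row order, so the "join" is trivial).
candidates = {
    "useful_provider": hidden,
    "noise_provider": rng.normal(size=(n, 1)),
}

def utility(X, y):
    """Negative validation error of a least-squares fit (higher is better)."""
    m = len(y) // 2  # train on the first half, validate on the second
    coef, *_ = lstsq(X[:m], y[:m], rcond=None)
    pred = X[m:] @ coef
    return -float(np.mean((pred - y[m:]) ** 2))

# Greedy search: keep an augmentation only if it clearly improves utility.
X, chosen = base, []
for name, cols in candidates.items():
    X_aug = np.hstack([X, cols])
    if utility(X_aug, y) > utility(X, y) + 1e-3:  # tolerance avoids chasing noise
        X, chosen = X_aug, chosen + [name]

print("selected augmentations:", chosen)
```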

Database Theory for Science Tasks

We need new theory that shows the limits and capabilities of domain knowledge and task-based search for the computational sciences. Relying on such theory, we can then design more effective systems for simulating, illuminating, or solving scientific problems. To give a specific problem: the quantum many-body physics subfield is concerned with exploring the physical properties of many interacting quantum particles. The interactions between the particles carry information that is encoded in the wave function of the entire complex system. Storing and accounting for all interactions quickly becomes infeasible, as the dimension of the state space scales exponentially with the number of particles. Is there a way to take advantage of database approximations when performing such complex simulations?
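To see how quickly the blow-up bites, here is a back-of-the-envelope computation, assuming spin-1/2 particles and 16 bytes per double-precision complex amplitude:

```python
# Back-of-the-envelope arithmetic: n spin-1/2 particles have a 2**n-dimensional
# state space, so a dense complex wave function (16 bytes per complex128
# amplitude) quickly exceeds any realistic storage budget.
for n in (10, 30, 50):
    dim = 2 ** n                  # Hilbert-space dimension
    gigabytes = 16 * dim / 1e9    # dense storage for the wave function
    print(f"n={n:2d}: dimension 2^{n} = {dim:.2e}, dense state ~ {gigabytes:.2e} GB")
```

At just 50 particles, the dense representation already requires roughly 18 petabytes, which is why approximate representations are unavoidable.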

References

[1] Raghu Ramakrishnan and Johannes Gehrke. 2002. Database Management Systems (3rd ed.). McGraw-Hill, Inc., USA.

[2] IBM Research. What is AI alignment? https://research.ibm.com/blog/what-is-alignment-ai

Annual Report for 2023

Every year, I receive annual reports from some of the organizations I’m affiliated with (e.g., the Simons Foundation). An annual report summarizes what the organization accomplished through grant-making, in-house research, outreach activities, and so on; it is both reflective and forward-looking. In this final blog post of the year, I will write a brief annual report for myself.

Harvard Commencement 2023

Technically, I earned my Ph.D. in 2022: I defended and submitted my dissertation in the 2022 calendar year. But I did not attend the 2022 commencement ceremony, as I knew it would take significant effort to have a subset of my extended family attend. In 2023, the Alabi family showed up, and it was a glorious way to officially culminate my graduate career. Sometimes I think the general public underestimates how much support from friends and family is necessary to succeed in academia (especially during the latter years of a Ph.D., when the dissertation writing process is likely to be more isolating than the earlier years). I am blessed to have a strong support network of friends and family.

NaijaCoder

The NaijaCoder 2023 summer camp took place in Abuja, Nigeria. Alida Monaco and I were the main instructors for the class of 2023. The camp ran for two weeks, with six hours of lectures every day plus a lunch break. Despite the intense schedule, the students were really engaged in class 🔥🧠.

The 2023 camp was physically hosted on the premises of Lifegate Academy in Abuja. Mr. Anywanwu Ebere, the head of schools, was pivotal in getting the academy’s board to approve our use of their premises for our activities. We had guest lectures from EducationUSA (from the U.S. Embassy in Nigeria) and GIEVA (Global Integrated Education Volunteers Association), during which they discussed study-abroad opportunities mostly targeted at U.S. schools. We also wrote up some results of our research on early algorithms education in Nigeria. Check it out: https://arxiv.org/abs/2310.20488 (to appear at SIGCSE 2024).

Planning is underway for the 2024 iteration. Owing to increased demand, we will host the program in both Abuja and Lagos, so that, hopefully, more students from the southern parts of Nigeria can attend. There will be more instructors, more participants, and more food. As we scale instruction to more participants, we would like to maintain the same rigor.

Simons Foundation Junior Fellowship

I am in the middle of my second year as a Junior Fellow in the Simons Society of Fellows. Fellows are expected to attend the weekly dinners, and when I’m in town, I go. It is always fun to hang out with fellow Junior Fellows and gain wisdom from the Senior Fellows. In March 2023, the Simons Society of Fellows held a retreat at the Ritz-Carlton in Sarasota, Florida. For me, the highlight of the trip was going birdwatching!

The Simons Junior Fellowship is a grant, in the applicant’s name, given to an institution in NYC; as such, I am hosted at Columbia University as a post-doc. This year, most of the papers I published centered on data privacy and graph-generation algorithms. I have also begun exploring some topics in quantum information. It is nice to have a post-doc that affords me the opportunity to explore interests outside my dissertation topic.

This year, I also spent some time learning from scientists at the Flatiron Institute at the Simons Foundation. I have one ongoing project, with a friend at Flatiron, which I hope to continue in 2024.

2024

Next year, I will continue, at the same pace, with my research and non-profit work. I also plan to read more books that are not directly related to my research (e.g., I just started reading “Surely You’re Joking, Mr. Feynman!”). Finally, I’m looking forward to the 2024 Simons Society of Fellows retreat in San Juan, Puerto Rico.

Data Markets for Federated Learning

The Database Community (e.g., see this symposium on data markets) has recently been championing frameworks for data access, search, commodification, manipulation, extraction, refinement, and storage. I first heard about data markets from Eugene Wu; it seems like a research area and market opportunity that is ripe for exploration.

In recent work presented in person by Jerry at VLDB 2023, we describe a data search platform (called Saibot) that satisfies differential privacy. Essentially, the main algorithm identifies augmentations (join- or union-compatible via the group operations + and ×) that lead to highly accurate models (the evaluation objective is the ℓ2 metric, but the approach extends to other objectives as well). This has implications for improving data quality (e.g., one could identify the right augmentations that lead to better outcomes) and for heterogeneous collaboration of all kinds. We evaluate our algorithms on over 300 datasets and compare against leading alternative mechanisms.
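To give a flavor of the privacy machinery (this is a simplified sketch, not the exact Saibot mechanism), one standard approach is to perturb the sufficient statistics of a linear model with Gaussian noise and then score an augmentation using only the privatized quantities; the data and noise scale below are illustrative:

```python
# A minimal sketch of differentially private augmentation scoring via noisy
# sufficient statistics (X^T X, X^T y, y^T y) for ridge regression. The noise
# scale sigma would, in practice, be calibrated to an (epsilon, delta) budget.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3

X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

sigma = 1.0   # illustrative Gaussian noise scale
lam = 1e-3    # ridge term keeps the noisy Gram matrix well-conditioned

# Gaussian-mechanism perturbation; X^T X is symmetric, so noise is symmetrized.
noise = rng.normal(scale=sigma, size=(d, d))
XtX = X.T @ X + (noise + noise.T) / 2
Xty = X.T @ y + rng.normal(scale=sigma, size=d)
yty = float(y @ y) + rng.normal(scale=sigma)

theta = np.linalg.solve(XtX + lam * np.eye(d), Xty)

# The l2 training error, expressed purely in terms of the noisy statistics,
# can serve as the (private) score of a candidate augmentation.
score = (yty - 2 * theta @ Xty + theta @ XtX @ theta) / n
print("private model:", np.round(theta, 2), "l2 score:", round(float(score), 4))
```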

Opportunities and Challenges in Data Markets

  1. Data Quality and Accuracy: In my opinion, the biggest challenge to the proliferation of data markets is the availability of high-quality data. No amount of analytical sophistication can compensate for the lack of it. For example, certain subgroups in America (e.g., African-American females) are under-represented in datasets about academia. In fact, most academic departments in the U.S. do not have any African-American females at all. So if a social-science researcher wishes to study the academic progression of women and observe trends, the researcher cannot make broad claims based on departments that do not have a single Black woman; the researcher must first seek out higher-quality data sources, e.g., by including data from HBCUs (Historically Black Colleges and Universities).
  2. Privacy and Security Concerns: Suppose a hospital has data on patient check-ins, health characteristics, and disorders. If released, the data could help researchers gain valuable insights about diseases in specific areas. Unfortunately, it is known that releasing exact aggregate information about individuals (even from datasets that are “anonymized”) can lead to de-anonymization/re-identification attacks. Our work on Saibot provides mechanisms to ensure that data search platforms satisfy certain notions of differential privacy; a minimal sketch of the basic mitigation appears after this list.
  3. Collaboration and Knowledge Sharing: Data markets encourage collaboration between organizations and industries. They facilitate the sharing of knowledge and expertise, breaking down silos (especially within academia) and fostering a culture of collective problem-solving. However, one could ask: how much collaboration between industries is needed to solve a problem or achieve a certain level of accuracy for statistical models? This question needs further study.
  4. Economic Value: Some technology companies (e.g., Netflix and Facebook) derive their value proposition (almost) entirely from having lots of users and interactions on their platforms. Having more specific forms of data (e.g., data on African-American females) could give companies a competitive advantage in data markets, and access to data markets can create new revenue streams. I would personally like to see more economic analysis of the value of data markets!
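To make the mitigation referenced in item 2 concrete, here is a minimal sketch of the standard Laplace mechanism for releasing an aggregate count; the patient records and the privacy budget ε are illustrative:

```python
# A minimal sketch of the Laplace mechanism for releasing a count with
# epsilon-differential privacy. The records and epsilon are hypothetical.
# Adding or removing one patient changes the count by at most 1, so the
# sensitivity of the query is 1.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical records: 1 if the patient has the disorder, else 0.
records = rng.integers(0, 2, size=1000)

epsilon = 0.5
sensitivity = 1.0
true_count = int(records.sum())

# Laplace noise with scale sensitivity/epsilon masks any single patient's
# contribution while keeping the aggregate useful to researchers.
noisy_count = true_count + rng.laplace(scale=sensitivity / epsilon)

print(f"true count: {true_count}, released count: {noisy_count:.1f}")
```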