Postdoc MIT, CSAIL
Cambridge, MA, USA
raulcf at csail.mit.edu
Data has the potential to substantially improve many areas of our lives, from advancing scientific research to making people more productive to helping us understand society better. However, most data is sequestered in silos or stored in formats that are hard for machines to understand, which limits its value. To extract data's maximum potential, we must improve the ways in which we discover, prepare, and process it.
In my research I build high-performance systems for discovering, preparing, and processing data. I often use techniques from data management, statistics, and machine learning. At MIT I work with professors Sam Madden and Mike Stonebraker. Before MIT, I completed my PhD at Imperial College London with Peter Pietzuch.
Organizations store data in hundreds of different data sources, including relational databases, files, and large data lake repositories. These sources contain valuable information and insights that can benefit many aspects of modern data-driven organizations. However, as more data is produced, our ability to use it diminishes dramatically, because no single person knows about all the existing data sources. One big challenge is to discover the data sources that are relevant to answering a particular question. Aurum is a data discovery system that answers such "discovery queries" over large volumes of data.
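To make the idea of a discovery query concrete, here is a minimal sketch (not Aurum's actual API) of one kind of query such a system supports: given a query column, find other columns across sources whose values overlap. The column names, data, and threshold below are made up for illustration; Aurum builds far richer indexes over many signals.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of column values."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similar_columns(columns, query, threshold=0.5):
    """A toy 'discovery query': columns related to `query` by value overlap."""
    q = columns[query]
    return [name for name, values in columns.items()
            if name != query and jaccard(q, values) >= threshold]

# Hypothetical columns drawn from different data sources.
columns = {
    "employees.dept":   {"sales", "hr", "eng"},
    "budgets.division": {"sales", "hr", "eng", "legal"},
    "cities.name":      {"boston", "cambridge"},
}

print(similar_columns(columns, "employees.dept"))  # → ['budgets.division']
```

The point is that relatedness between columns can be computed from content alone, without anyone having documented that the two sources are connected.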
In addition to structured sources such as relational tables, organizations also hold unstructured sources such as PDFs, text files, and emails. Integrating both kinds of sources has been a cornerstone of multiple research communities for decades. It is challenging because it demands extracting structure from the unstructured sources and then finding a common schema to represent both. In this line of research, we advocate a different approach: rather than trying to infer a common schema, we aim to find a common representation for both structured and unstructured data. Specifically, we argue for an embedding (i.e., a vector space) in which all entities, rows, columns, and paragraphs are represented as points. In the embedding, the distance between points indicates their degree of relatedness, and we learn the embedding so that it serves different downstream applications, from filling in missing values to data discovery and verification, among many others.
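The following toy sketch illustrates the shared-representation idea: rows, columns, and paragraphs all map to points in one vector space, and relatedness is measured by distance (here, cosine similarity). The 3-dimensional vectors and item names are invented for illustration; in practice the embedding is high-dimensional and learned from data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Structured items (rows, columns) and unstructured items (paragraphs)
# live in the SAME space, so they can be compared directly.
embedding = {
    "row:employee/42":        (0.9, 0.1, 0.0),
    "column:salary":          (0.8, 0.2, 0.1),
    "paragraph:hiring-memo":  (0.7, 0.3, 0.0),
    "paragraph:vacation-faq": (0.0, 0.1, 0.9),
}

def most_related(key):
    """Nearest neighbor of `key` in the embedding, by cosine similarity."""
    return max((k for k in embedding if k != key),
               key=lambda k: cosine(embedding[key], embedding[k]))

print(most_related("row:employee/42"))
```

Because both kinds of data share one space, downstream tasks such as filling a missing value reduce to nearest-neighbor lookups like the one above, with no common schema ever inferred.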
Large-scale data processing systems depend on stateless dataflows to extract data parallelism and execute programs with fault tolerance. Many applications that require explicit access to state cannot be executed efficiently in such systems. Stateful data-parallel processing makes it possible to execute stateful programs efficiently while keeping the data parallelism and fault tolerance properties of traditional dataflow systems. In addition, once state is explicit in applications, we can translate imperative programs into stateful dataflow graphs that execute on a stateful data-parallel processing system.
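A minimal sketch of what "explicit state" buys a dataflow operator, under an assumed toy API (not any particular system's): the operator's state is partitioned by key for parallelism and can be checkpointed so a restarted worker resumes where it left off.

```python
class CountPerKey:
    """A stateful operator: maintains a running count per key."""

    def __init__(self):
        self.state = {}  # explicit, checkpointable per-key state

    def process(self, key):
        """Update state for one input event and emit (key, count)."""
        self.state[key] = self.state.get(key, 0) + 1
        return key, self.state[key]

    def checkpoint(self):
        """Snapshot the state for fault tolerance."""
        return dict(self.state)

    def restore(self, snapshot):
        """Load a snapshot into a fresh operator instance."""
        self.state = dict(snapshot)

op = CountPerKey()
events = ["a", "b", "a", "a"]
results = [op.process(e) for e in events]
print(results)           # [('a', 1), ('b', 1), ('a', 2), ('a', 3)]

# Simulate a failure: a new worker restores the checkpoint and resumes.
snap = op.checkpoint()
op2 = CountPerKey()
op2.restore(snap)
print(op2.process("a"))  # ('a', 4)
```

In a stateless dataflow, the same count would have to be reconstructed by reprocessing or carried through external storage; with explicit state, recovery is a snapshot restore.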
SIGMOD'19 PC Member
Ecana: Ecana built a data acquisition system based on wireless sensor networks that monitored several stages of the winemaking process, from the temperature and humidity of the vines to conditions in the wineries. The acquired data was integrated with external data sources such as weather forecasts from local stations. Finally, the data was visualized through a dashboard that highlighted KPIs of interest, helping winemakers understand the process in more detail, end to end.
contxt.in: Long before "fake news", we had a big information overload problem: the same piece of news appears in many different feeds and sources, from social networks such as Twitter, Facebook, or LinkedIn to traditional media outlets. The competition for attention was fierce, and people often ended up missing part of the story. Harder still was starting meaningful conversations around topics of interest; opinion formation was getting harder and harder. With contxt.in, we built a system that aggregated news from multiple heterogeneous sources and used machine learning models to relate those news items around the specific topics each user cared about. Users had a platform to chat about the news they were most interested in.
MIT Postdoc: At MIT I work on making the swaths of available data in organizations easy to find and use.
Microsoft Research Intern: At Microsoft I worked on the design, implementation, and evaluation of a new distributed data processing system, Quill, which uses the cloud to give users an almost transparent way to deploy their jobs. The goal was to make data processing easier for more people.
PhD Imperial College London: At Imperial I worked on distributed data processing.
LinkedIn: Software Engineer Intern. At LinkedIn I did systems work in the data infrastructure group on Kafka and Samza, two key technologies at the backbone of many organizations today.
UC3M: Researcher on an FP7 project. I designed and implemented distributed real-time services for critical infrastructure.