Academic Mentors by Year:
Elahe Soltanaghai

Visualizing Invisible Wireless Signals
We are developing a mixed-reality app that visualizes wireless signals between WiFi devices in homes and buildings, and their interaction with objects and people in the environment. Desired expertise: Blender, Unity, and familiarity with LiDAR and mesh-reconstruction algorithms. If you’re interested in this project, please take a look at this paper.
Gaze-based Immersive Interactions
This project explores a new gaze interaction technique for virtual and augmented reality applications that integrates focal depth into gaze input dimensions, allowing users to actively shift their focus along the depth dimension to interact. Familiarity with Unity and some mixed-reality programming experience is required for this project.
Bruno Abreu

Neutral-atom simulations as a platform for distributed quantum computing
Neutral-atom-based quantum computing technologies are a promising avenue for scalable quantum computing that can leverage existing capabilities, increase the return on investment of existing noisy devices, and ultimately lead to real-world quantum applications beyond quantum simulations. In this context, distributed computing frameworks, such as the Message Passing Interface (MPI) in classical computing, are essential and must be conceptualized and prototyped in concurrence with hardware and network advancements. This project aims to understand how to use analog quantum computing with geometric control of qubits to identify the core components of a quantum state passing interface that operates over hybrid classical and quantum networks.
Sever Tipei

Music on HPC
The project centers on DISSCO, software for composition, sound design, and music notation/printing developed at UIUC, NCSA, and Argonne National Laboratory. Written in C++, it includes a graphical user interface built with gtkmm. A parallel version has been developed at the San Diego Supercomputer Center with support from XSEDE (Extreme Science and Engineering Discovery Environment). DISSCO has a directed-graph structure and uses stochastic distributions, sieves (from number theory), Markov chains, and elements of information theory to produce musical compositions. Presently, efforts are directed toward adding new features, refining a system for the notation of music, and realizing an Evolving Entity: a composition whose aspects change when computed recursively over long periods of time, mirroring the way living organisms are transformed over time (artificial life).
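As a rough illustration of one of these mechanisms, a Markov chain over pitches can drive note selection. The sketch below is a toy Python example; the pitch names and transition probabilities are invented, and DISSCO's actual C++ implementation is far richer:

```python
import random

# Toy Markov-chain pitch selection (illustrative only, not DISSCO code):
# each pitch maps to a list of (next_pitch, probability) pairs.
TRANSITIONS = {
    "C": [("E", 0.5), ("G", 0.3), ("C", 0.2)],
    "E": [("G", 0.6), ("C", 0.4)],
    "G": [("C", 0.7), ("E", 0.3)],
}

def next_pitch(current, rng=random):
    """Sample the next pitch from the current pitch's transition row."""
    pitches, weights = zip(*TRANSITIONS[current])
    return rng.choices(pitches, weights=weights, k=1)[0]

def generate_phrase(start="C", length=8, seed=42):
    """Walk the chain to produce a deterministic (seeded) pitch sequence."""
    rng = random.Random(seed)
    phrase = [start]
    for _ in range(length - 1):
        phrase.append(next_pitch(phrase[-1], rng))
    return phrase
```

A composition system would layer rhythm, dynamics, and the sieve/stochastic machinery on top of such a transition structure.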
Michael Miller

Exploring Quantum Sound and Music
This project seeks to explore the state of the art in using quantum computing concepts to create sound and music. We will explore information available from previous conferences and seek out new developments. We will then explore what is needed to create a framework/interface for users to experiment with, and look toward having a platform available to deploy on quantum resources when acquired by NCSA.
Elhan Ersoz

Multi-sensor Data Integration for High-throughput Plant Phenotyping
More information in the linked deck. This project will explore methods and approaches for integrating 2D image and 3D point-cloud data obtained from a gantry-mounted high-density sensor array.
Henry Priest

Genomic Foundations of Programmable Crop Senescence in the cover crop Chickpea (Cicer arietinum L.)
Cover crops represent a major opportunity to improve the sustainability of modern agricultural production, but have seen limited adoption. Ensuring that cover crops are ready for harvest in a timely manner is an important aspect of developing this technology and driving adoption. Without varieties developed for specific regions, one possible future avenue for closing this gap will be to develop genome-edited crop varieties which are appropriate for each region’s desired growth cycle. This project will use existing gene expression data and graph-based differential co-expression analyses to identify regulatory relationships between expression regulators and genes involved in crop senescence. Ultimately this will enable the development of gene expression strategies to customize the growth cycle of chickpea to coincide with ideal seasonal harvest windows. Familiarity with biological domain concepts of gene expression, transcription, and basic genomic concepts will be a strong enabler. Prior experience with graph concepts, gene co-expression networks, and related concepts is not needed.
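To illustrate the core idea behind differential co-expression, the sketch below computes Pearson correlations between gene expression profiles under two conditions and reports pairs whose correlation shifts. The gene names and values are invented, and a real analysis would use dedicated libraries plus statistical significance testing:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def differential_edges(cond_a, cond_b, threshold=1.0):
    """Report gene pairs whose co-expression changes most between conditions.

    cond_a / cond_b map gene name -> list of expression values per sample.
    """
    genes = sorted(cond_a)
    edges = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            delta = pearson(cond_a[g1], cond_a[g2]) - pearson(cond_b[g1], cond_b[g2])
            if abs(delta) >= threshold:
                edges.append((g1, g2, delta))
    return edges
```

Edges that appear, disappear, or flip sign between conditions are candidates for condition-specific regulatory relationships.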
April Novak

Native Combinatorial Geometry using Unstructured Meshes for Nuclear Energy Applications
The Monte Carlo method is widely considered the state-of-the-art technique for solving the neutron transport equation. However, statistical simulation of billions (or more) of neutrons is required to achieve converged simulations, and hence the Monte Carlo method remains computationally intensive. Faster-running multigroup methods instead discretize the energy phase space into tens or hundreds of energy intervals, a drastic approximation to the eight orders of magnitude in neutron energy observed in fusion and fission systems. However, these multigroup methods require generation of cross sections, or spectrum-averaged reaction rates computed using nuclear data. The gold standard for this cross-section generation is the Monte Carlo method. Therefore, a typical nuclear engineer is faced with a sizable inefficiency in computational analysis: (i) the need to generate computational geometry unique to a Monte Carlo code (combinatorial geometry) in order to generate constitutive models (cross sections), while at the same time (ii) generating computational geometry specific to a deterministic solver (unstructured meshes). In other words, nuclear engineers often expend 50% or more of their effort on model preparation across multiple different software tools in order to calculate the data needed for analysis. This project aims to eliminate this inefficiency by taking a novel approach: generating the combinatorial geometry in-line using geometry data structures in the reactor module of MOOSE, an open-source meshing library. The student will develop translation models from C++ geometry objects in MOOSE to the corresponding C++ combinatorial geometry objects in OpenMC to automate the “high-to-low” data generation needs of nuclear analysts. Experience in C++ and Python is encouraged, but no domain-specific knowledge of nuclear engineering is needed.
Ismini Lourentzou

VLMs for Open Vocabulary SGG
Scene Graph Generation (SGG) involves extracting objects and their relationships from an image, forming a graph where nodes represent objects and edges represent relationships. Conventional SGG approaches are limited to a closed-set vocabulary, restricting their ability to detect and classify unseen objects and relationships. Subsequent works propose models for Open Vocabulary SGG (Ov-SGG), yet these often rely on predefined textual inputs, e.g., region-caption pairs or a set of semantic labels, which limits their ability to detect unseen objects and relations effectively. This project will involve designing new Vision and Language Models (VLMs) for Open Vocabulary SGG and comparing them against existing baselines.
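A minimal sketch of the scene-graph data structure itself (illustrative only; real SGG models predict these subject-predicate-object triples from image features rather than receiving them directly):

```python
# Toy scene-graph container: nodes are detected objects, edges are
# (subject, predicate, object) relationship triples.
class SceneGraph:
    def __init__(self):
        self.objects = set()
        self.triples = []

    def add_relation(self, subj, pred, obj):
        """Record one relationship and register both endpoint objects."""
        self.objects.update([subj, obj])
        self.triples.append((subj, pred, obj))

    def relations_of(self, subj):
        """All (predicate, object) pairs whose subject is `subj`."""
        return [(p, o) for s, p, o in self.triples if s == subj]

g = SceneGraph()
g.add_relation("person", "riding", "horse")
g.add_relation("horse", "on", "grass")
```

In the open-vocabulary setting, the labels "person", "riding", etc. would come from free-form text generated by a VLM instead of a fixed class list.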
Joshua Allen

Development of the NEAT genomic read simulator
NEAT is a simulator of next-generation sequencing (NGS) technology. As the fast-paced world of genomics continues to develop, better software tools are needed to process and analyze the data. However, raw read data can be hard to obtain because of data restrictions, file sizes, or privacy concerns, making calibration of such software difficult. NEAT can simulate variants in silico, NGS-like reads, and accompanying alignments to develop and calibrate genomic processing software, benefiting genomics research and saving money on real-world tests and calibrations.
Rini B. Mehta

Sanskrit grammar engine (Python or JS or Wolfram)
Sanskrit, the classical language of India/South Asia, has a near-perfect grammar composed in the 4th century BCE by Panini. This project will create an engine for combining words, deconstructing compound words, and generating declensions of nouns and conjugations of verbs, based on the rules outlined in Panini’s grammar. Knowledge of Devanagari/Sanskrit characters is needed. This project is part of an interdisciplinary NLP-centric research cluster Rini B. Mehta is developing on campus.
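As a taste of the rule-based mechanics involved, the sketch below applies a few standard vowel-sandhi rules at word boundaries, using IAST transliteration for readability. A real engine would operate on Devanagari text and implement Panini's full rule system; the rule table here is deliberately tiny:

```python
# A handful of classical vowel-sandhi rules (IAST transliteration):
# a/ā + a/ā -> ā,  a/ā + i/ī -> e,  a/ā + u/ū -> o.
SANDHI = {
    ("a", "a"): "ā", ("a", "ā"): "ā", ("ā", "a"): "ā", ("ā", "ā"): "ā",
    ("a", "i"): "e", ("a", "ī"): "e", ("ā", "i"): "e", ("ā", "ī"): "e",
    ("a", "u"): "o", ("a", "ū"): "o", ("ā", "u"): "o", ("ā", "ū"): "o",
}

def join(first, second):
    """Combine two words, applying a vowel-sandhi rule at the boundary."""
    key = (first[-1], second[0])
    if key in SANDHI:
        return first[:-1] + SANDHI[key] + second[1:]
    return first + second  # no rule applies: simple concatenation
```

For example, deva + indra yields devendra, and sūrya + udaya yields sūryodaya; the full grammar adds consonant sandhi, visarga rules, and many exceptions.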
Gautham Narayan, Amanda Wasserman

Building and Testing the LSST Supernova Recommendation System Pipeline
The Legacy Survey of Space and Time (LSST) will observe ~1,000 transients per night, roughly a 100-fold increase over current rates. To process this data and decide which objects are worthy of follow-up observations, we implement an active learning loop in the Dark Energy Science Collaboration time-domain pipeline. This is a modular system that can host multiple algorithms with different science goals. The student will work on the anomaly-detection aspect of the system using the Lightcurve Anomaly Identification and Similarity Search (LAISS, Aleo et al. 2024). The student will incorporate LAISS into the system, implement the active learning loop for anomaly detection, and ensure the system runs end-to-end, recommending new anomalies for follow-up. We will then test the pipeline on simulated LSST data to analyze the types of anomalous objects that will be found in the sample. Python knowledge is required; familiarity with machine learning is strongly preferred.
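A highly simplified sketch of such an active learning loop for anomaly recommendation follows. The scoring, feature vectors, and names here are invented stand-ins; LAISS itself uses learned light-curve features and similarity search, and the "expert" would be a human follow-up decision:

```python
# Toy active-learning loop: recommend the most anomalous-looking object,
# get an expert label, and fold confirmed-normal objects back into the
# reference set so future scores adapt.
def anomaly_score(features, normal_set):
    """Euclidean distance to the nearest known-normal object."""
    return min(sum((a - b) ** 2 for a, b in zip(features, n)) ** 0.5
               for n in normal_set)

def recommend_loop(candidates, normal_set, label_fn, budget=3):
    """Spend `budget` follow-ups on the highest-scoring candidates."""
    anomalies = []
    pool = dict(candidates)  # name -> feature vector
    for _ in range(budget):
        if not pool:
            break
        name = max(pool, key=lambda k: anomaly_score(pool[k], normal_set))
        feats = pool.pop(name)
        if label_fn(name):            # expert says: truly anomalous
            anomalies.append(name)
        else:                         # expert says: normal after all
            normal_set.append(feats)  # future scores reflect this
    return anomalies
```

The key active-learning ingredient is the feedback edge: each expert label changes what the loop recommends next.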
Kevin Chang

LLM-based Knowledge Agent
Large language models (LLMs), such as ChatGPT, have changed the landscape of artificial intelligence and promise to automate how we perform knowledge-intensive work. This project will explore LLMs as the “engines” for building agents for such automation, e.g., to help you find the knowledge you need, synthesize knowledge on a specific topic, answer a technical question, or survey and read papers as you learn the literature.
– Techniques: Machine learning, natural language processing, large language models.
– Requirements:
– Required: CS225 or CS241 or equivalent knowledge.
– Preferred: Have taken one data-related course (CS410, 411, 412, 446, 447). Significant programming experience in Python.
Kaiyu Guan

Using Satellite Data for Large-Scale Crop Monitoring
The project will focus on developing advanced machine learning to interpret satellite remote sensing data to quantify crop growth and yield. The students will work on large-scale satellite data processing, remote sensing algorithm development, and solving real-world questions related to global food production and environmental sustainability. Students with experience in Python programming, big data analytics, deep learning or computer vision are most welcome.
Haohan Wang

LLM Agents for Automating Complicated Tasks
Overview:
In the quest to push the boundaries of AI interaction with the world, our project seeks to harness the transformative power of large language models. The system will employ Large Language Models (LLMs) to automate intricate and labor-intensive tasks traditionally managed by human experts. This project will not only innovate in the realm of AI but also radically enhance efficiency across various domains.
Goals:
The primary objective is to develop a framework of AI agents, each taking on critical roles within a team—ranging from project management to complex data analysis. These roles will work collaboratively to tackle some of the most challenging problems in modern science.
Impact:
By reducing the reliance on manual effort for repetitive tasks, the system aims to speed up the research process, allowing human scientists to focus on creative and strategic aspects of scientific inquiry. The project promises significant contributions to fields requiring extensive data analysis, setting a new standard for AI in scientific research.
Opportunity for Students:
Participants will gain hands-on experience with advanced AI models, participate in the development of a novel AI framework, and contribute to impactful research that could revolutionize how scientific data is processed and utilized. Moreover, students will have the opportunity to publish their findings and present at top international conferences, making significant strides in their academic and professional careers.
Computational Biology
Overview:
The project aims at advancing the frontiers of genetics through the power of computational tools and methodologies. By integrating data science, machine learning, and bioinformatics, the project seeks to unlock new discoveries across various domains of genetics, from molecular genetics to population genetics.
Goals:
One example project aims to create advanced machine learning methods to understand the regulatory structure of different cells while accounting for the spatial position of those cells, offering new understanding of the genetic mechanisms of cells at a granularity not previously attainable.
Opportunity for Students:
Students participating in this project will have the opportunity to engage in cutting-edge research at the intersection of computational biology, machine learning, and genetics. They will develop expertise in advanced machine learning techniques, particularly in the design and application of Transformer models, and gain experience in handling and analyzing complex genomic data. This project offers a unique platform for students to contribute to innovative research that could redefine our understanding of cellular mechanisms and disease development.
Tim McPhillips

Using large language models to decompose research papers into nanopublications
The bulk of existing and newly reported scientific knowledge is embedded in the natural language texts of research papers. Figures and tables in these papers, along with the computational artifacts associated with them, provide one entrypoint to this knowledge, while recent NLP and LLM advances provide another. In this project we will demonstrate how knowledge in the scientific literature can be harvested via a combination of these approaches. The student will build tools to automate extracting figures and tables from PDFs of research papers; train, fine-tune, and use LLMs to detect in the text of these papers references to figures, tables, and computational artifacts used or produced by the research; extract portions of the text that make scholarly claims (assertions) supported by some combination of the figures, tables, and computational artifacts; and then decompose each publication into a set of nanopublications making subsets of the claims asserted in the original paper, supported by corresponding subsets of the figures, tables, and computational artifacts. Finally, the student will explore the potential for building a scholarly knowledge graph from the nanopublications extracted in this way from a set of related papers in a particular research field chosen jointly by the student and project supervisor.
Preferred experience for intern candidates: Python programming experience; interest in LLM and NLP.
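The decomposition step can be pictured with the toy sketch below. The record layout is hypothetical: real nanopublications are RDF named graphs with assertion, provenance, and publication-info subgraphs, not Python dictionaries.

```python
# Hypothetical flat stand-in for a nanopublication's three parts.
def make_nanopub(claim, evidence, source_doi):
    return {
        "assertion": claim,                          # the scholarly claim
        "provenance": {"supported_by": evidence},    # figures/tables/artifacts
        "pubinfo": {"derived_from": source_doi},     # where it came from
    }

def decompose(paper):
    """Split one paper record into one nanopublication per extracted claim."""
    return [make_nanopub(c["text"], c["evidence"], paper["doi"])
            for c in paper["claims"]]
```

In the actual pipeline, the `claims` list would be produced by the LLM extraction stage rather than supplied by hand.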
David Bianchi

Leveraging Human Gut Microbiome Sensors for Development of Personalized Nutrition
Overview: Humans are not equipped to degrade highly complex polysaccharides or the fiber content (e.g., arabinoxylans and pectins) of our diets. The microbes in the colon that metabolize these nutrients are equipped with a large arsenal of genes, which in total are referred to as the microbiome. Our team, using bioinformatics, biochemical and biophysical approaches, has developed the first molecular signatures of the microbial sensors present in the microbiomes of the human gut. By systematically characterizing the gut bacteria, nutrient sensors, metabolites and health benefits, we will be able to provide each individual with their personal fiber/polysaccharide-sensing signatures.
Goals: Thrust 1 – Structural Basis of Sensor Binding Specificity Across 15 Phylogenetic Clusters of Nutrient Sensors: Establish a Generalizable Platform with which to predict/characterize the substrate specificity of each sensor class based on sequence/structural motifs
Thrust 2 – Data Analytics Visualization Interface (Libraries of Metabolomics Data and Sequences of Nutrient Sensors): Create a web interface via which researchers can quickly access a dashboard view detailing the phylogenetic/sequence details of sensors and the metabolomics data related to each nutrient/diet.
Thrust 3 – Pathway Analysis of Beneficial Health Metabolites (Expand WAX analysis (previous proposal) to Pectin, Apple, Soy, other nutritional components) – with RNA-Seq and primer design for each of the clusters: Analyze the pathways by which metabolites of beneficial health properties are generated by the gut microbiome and evidence for these via transcriptomics analysis and wet lab experiments, guided by in silico work.
Impact: By further clarifying the complex interplay of the gut microbiome in response to dietary changes we will develop (with our experimental partners) personalized nutritional fiber/polysaccharide processing signatures. These signatures will then be used by clinicians to propose dietary regimens whereby beneficial (anti-inflammatory etc.) metabolites are generated by the metabolic activity of an individual’s existing bacterial fauna.
Opportunity for Students: Students will have an opportunity to develop a variety of techniques, including: application of bioinformatics/computational biology software for simulation and visualization, large-scale simulations via HPC resources (SLURM, etc.), data analytics techniques, application of machine learning models to large datasets (variational autoencoders, etc.), and scientific communication.
Preferred Experience for Students: introductory-level Python programming.
Jessica Saw

MedTerms: Innovative Software Interfaces for Medical Education
MedTerms, developed by the Visual Analytics Group (NCSA), is a web-based platform that uses visualization design and innovative software user interfaces to support students in learning human disease from a first-principles approach. NCSA technical expertise the student would gain includes: (1) the unique combined skill set of software user interface and experience (UIX) design and medical domain knowledge, and (2) creativity and innovation in software interface design and development.
A good candidate for this project is someone who is interested in frontend development, and user interface and experience design. We also have opportunities for people who have no coding experience and are pre-med (MD or MD/PhD).
Matthew Krafczyk

DRYML: A flexible machine learning framework
Don’t Repeat Yourself Machine Learning (DRYML) (https://github.com/ncsa/dryml) is a new machine learning and artificial intelligence framework meant to simplify working with the disparate ML libraries of the modern day. DRYML objects wrap other ML systems and provide a uniform API for training and evaluating them. Once a model is trained, DRYML aims to make deploying it as simple as possible, with the user only needing to load a file from disk while DRYML takes care of everything else. The student will learn a variety of ML frameworks and make their mark by contributing to an evolving ML library. Candidates for this project will benefit from existing Python experience and experience with major machine learning frameworks such as PyTorch or TensorFlow.
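The wrapping idea can be sketched as follows. The class and method names are hypothetical, not DRYML's actual API; the point is that every backend model sits behind one uniform train/evaluate interface:

```python
# Hypothetical uniform-API sketch: any backend (PyTorch, TensorFlow, ...)
# would be adapted to this interface so downstream code never changes.
class ModelWrapper:
    def train(self, data):
        raise NotImplementedError

    def evaluate(self, data):
        raise NotImplementedError

class MeanModel(ModelWrapper):
    """Stand-in 'backend': predicts the mean of its training targets."""
    def train(self, data):
        # data is a list of (features, target) pairs
        ys = [y for _, y in data]
        self.mean = sum(ys) / len(ys)

    def evaluate(self, data):
        # mean squared error of the constant prediction
        return sum((y - self.mean) ** 2 for _, y in data) / len(data)
```

A real wrapper would also standardize serialization, so a saved model of any backend can be reloaded and evaluated through the same interface.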
Phuong Cao

Security Analytics
This project will apply statistical analysis to identify novel attack patterns in security logs. We will focus on several attack classes, such as cryptographic downgrades, novel zero-day exploits, and self-learning malware. The outcome will be a standard security dataset for use by other researchers, along with publication of tools and/or an analytics paper.
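One simple statistical building block for such analysis is an outlier test over event frequencies. The sketch below flags log-event counts that deviate strongly from a sliding historical window; the window size and threshold are illustrative choices, and real detections would combine many such signals:

```python
import math

def zscore(value, history):
    """How many standard deviations `value` sits from the history mean."""
    n = len(history)
    mean = sum(history) / n
    var = sum((x - mean) ** 2 for x in history) / n
    return 0.0 if var == 0 else (value - mean) / math.sqrt(var)

def flag_anomalies(counts, window=5, threshold=3.0):
    """Indices where the event count is a >threshold-sigma outlier
    relative to the preceding `window` observations."""
    return [i for i in range(window, len(counts))
            if abs(zscore(counts[i], counts[i - window:i])) > threshold]
```

For example, a sudden burst of failed-login events in one minute would stand out against a quiet baseline.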
Shirui Luo

Simulating Disease Progression Using a Generative Diffusion Model
Candidates who are interested in physics-embedded generative diffusion models are encouraged to apply. By combining data-driven insights with embedded biophysical principles, the project aims to build a hybrid generative model of disease progression. A case study will focus on brain tumor growth prediction, with radiation therapy as the intervention. At inference time, upon receiving a new observation (e.g., an MRI scan), the model is designed to predict the future trajectory of the tumor along with an associated uncertainty map. It dynamically adjusts its predictions based on new observations and treatment interventions, ensuring that outputs are both current and clinically relevant.
Knowledge Fusion in Generative Diffusion Models
Candidates who are interested in compositional generative diffusion models are encouraged to apply. The project aims to explore how to combine multiple pre-trained diffusion models, each representing a subcomponent of a desired system, to achieve comprehensive inference. Examples of such studies include multimodal information fusion, hybrid data-driven/physics-based modeling, and model ensembling.
Yuxiong Wang

Knowledge-Intensive Multi-modal LLMs
The advancements in foundation models, as exemplified by large language models (LLMs), have enabled more natural and engaging interactions between humans and machines. However, current models often struggle when user queries involve personal context (“my dogs and cats”) or external expert knowledge (“medical diagnosis”). To address this challenge, existing LLM systems leverage retrieval-augmented generation (RAG) techniques to access contexts in external databases, e.g., Perplexity.ai and GPT-4. However, none of these methods have paid attention to visual information, which is a critical aspect of conversations. Therefore, our project will investigate a novel Vision-RAG framework to unlock the potential of knowledge-intensive multi-modal conversation for LLMs. We are actively looking for ambitious undergraduate students with knowledge and skills in deep learning platforms (PyTorch) and computer vision (experience in CS 543 or related CS 598 courses preferred).
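The retrieval step at the heart of any RAG system can be sketched as nearest-neighbor search over embeddings. In the toy example below the vectors and item names are hand-written; a Vision-RAG system would instead obtain them from a vision-language encoder over images and text:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, database, k=2):
    """Return the ids of the k database entries most similar to the query."""
    ranked = sorted(database, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["id"] for item in ranked[:k]]
```

The retrieved entries (here, image ids) would then be injected into the LLM's context to ground its answer.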
Angela Lyons and Aiman Soliman

A Machine Learning and Geospatial Approach to Addressing the Socioeconomic Impacts of Climate Change Among Forcibly Displaced Populations in Developing Countries
Climate change is significantly exacerbating forced displacement in developing countries, intensifying the vulnerability of populations already struggling with socioeconomic challenges. Rising sea levels, increased frequency of extreme weather events, and prolonged droughts directly threaten livelihoods, particularly in agricultural and coastal regions. For instance, small island nations in the Pacific and low-lying areas in Bangladesh face existential risks from rising seas, displacing entire communities. Additionally, erratic weather patterns disrupt food and water security, compelling rural inhabitants to migrate to urban areas or across borders in search of better conditions. This displacement often leads to overburdened urban infrastructures, increased competition for resources, and heightened tensions within host communities. Furthermore, the socioeconomic impacts can be profound, as displaced populations frequently lose access to education, healthcare, and employment opportunities, perpetuating cycles of poverty. The international community’s response, though evolving, remains inadequate in addressing the root causes and providing sufficient support for adaptation and resilience-building in the most affected regions. Hence, the nexus between climate change and forced displacement underscores the urgent need for comprehensive global strategies to mitigate environmental impacts and support vulnerable populations. We use cutting-edge machine learning and geospatial techniques to enhance our understanding of this problem and provide sustainable solutions and recommendations to the international community.
We are looking for students with experience in basic programming in Python and/or R and basic knowledge and skills in machine learning; experience with GIS and geospatial analysis is a plus. Anticipated tasks include assisting the team with: (1) data preprocessing of geospatial and socioeconomic feature data, (2) data modeling, analysis, and predictions, and (3) the creation of mappings and other data visualizations using geospatial and socioeconomic feature data. Students will develop and review code and create documentation for the code. They will also assist in developing machine learning algorithms and then training, validating, and testing the algorithms. Students will also create a GitHub repository for the team, where they will prepare and upload scripts and other documentation for the project. Students will meet with the mentors on a regular basis, participate in team meetings, and actively engage with graduate students.
Preferred Skills:
• Background in data science and statistical modeling
• Programming languages: Python, R, and/or Stata
• Basic knowledge and skills in machine learning and/or geospatial analysis
• Expertise in creating mappings and other data visualizations
• Experience in programming and development of dashboards
Volodymyr Kindratenko

Enhancing Response Accuracy of an LLM-Based Teaching Assistant Tool
We are looking for a dedicated and enthusiastic student to work on an exciting project aimed at improving the response accuracy of an LLM-based teaching assistant tool. Our team has developed a functional version of the tool, accessible at uiuc.chat, and we are eager to enhance its capabilities further. The primary goal of this project is to develop innovative methods to improve the accuracy of the teaching assistant tool’s responses, particularly in the context of factual information. We are inspired by the concepts outlined in the research literature, specifically the Chain-of-Verification methodology. We are interested in creating a novel factual consistency model that will ensure the correctness of fact-based answers provided by the tool.
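The Chain-of-Verification idea can be sketched as a four-step loop. The stub functions below are hypothetical stand-ins for actual LLM calls; the real methodology is described in the research literature the project draws on:

```python
# Schematic Chain-of-Verification loop: draft, plan checks, answer the
# checks independently, then revise the draft in light of the findings.
def chain_of_verification(question, draft_fn, plan_fn, answer_fn, revise_fn):
    draft = draft_fn(question)                 # 1. draft an initial answer
    checks = plan_fn(question, draft)          # 2. plan verification questions
    findings = [answer_fn(q) for q in checks]  # 3. answer them independently
    return revise_fn(draft, findings)          # 4. revise using the findings
```

Answering the verification questions independently of the draft is what lets the loop catch hallucinated facts the draft asserted.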
Eliu Huerta

Foundation Models in Physics
The selected student will participate in the development of foundation models, both large language models and generative AI models, with specific applications in materials science discovery (design of metal-organic frameworks for carbon capture) and in the detection of neutron star mergers through gravitational wave and electromagnetic observations. Requirements: knowledge of Python and Linux is useful; practical knowledge of TensorFlow and PyTorch is also encouraged.
Xin Liu

DeepDISC: Detection, Instance Segmentation, and Classification for Astronomical Surveys with Deep Learning
The next generation of massive astronomical surveys, such as the upcoming Legacy Survey of Space and Time (LSST) on the Rubin Observatory, will deliver unprecedented amounts of images through the 2020s and beyond. As both the sensitivity and depth increase, larger numbers of blended (overlapping) sources will occur. If left unaccounted for, blending would result in biased measurements of sources that are assumed to be isolated, contaminating key cosmological measurements such as photometry, photometric redshifts, morphology, and weak gravitational lensing used to probe the nature of dark matter and dark energy.
In the LSST era, efficient deblending techniques are a necessity and thus have been recognized as a high priority. However, an efficient and robust method to detect, deblend, and classify sources for upcoming massive surveys is still lacking. Leveraging the rapidly developing field of computer vision, this NCSA project will develop a deep learning framework, “DeepDISC”. DeepDISC will efficiently process images and accurately identify blended galaxies with the lowest possible latency to maximize the science returns of upcoming massive astronomical surveys. The approach is fundamentally different from traditional methods. The project is interdisciplinary, combining state-of-the-art astronomy data with the latest deep learning tools from computer science. DeepDISC will efficiently and robustly detect, deblend, and classify sources in upcoming surveys at depths close to the confusion limit. It will also provide accurate estimates of the deblending uncertainty, which can be propagated further down the analysis of galaxy properties for cosmological inferences. The project will have strong implications for a wide range of problems in astronomy, ranging from efficiently detecting transients and solar system objects to probing the nature of dark matter and dark energy. DeepDISC will be directly applicable to LSST as well as other upcoming massive surveys such as NASA’s Roman Space Telescope. The program will reinforce the Illinois brand in big data and survey science.
Preferred Skills:
- Deep Learning
- Python
- PyTorch