Spin – 2020-2021 Academic Year Mentors

Athletic Data Analysis

Loretta Auvil, Kingsley Osei-Asibey

Are you interested in football? Sports? Analytics? We are looking for a student willing to apply their data science skills to sports data. We have football data at the play level and need help finding patterns in it. The data contains various features that describe each play and what happened on the field.

We are looking for a Python programmer with artificial intelligence or machine learning skills to find these patterns. Data transformations will be necessary as well. We would also like to validate the patterns on a second set of data.
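
As a minimal illustration of the kind of pattern-finding we have in mind, the sketch below groups hypothetical play records by situation and averages the outcome. The field names are invented for illustration and are not our actual schema.

```python
from collections import defaultdict

# Hypothetical play records; field names are illustrative, not the real schema.
plays = [
    {"down": 3, "yards_to_go": 8,  "play_type": "pass", "yards_gained": 12},
    {"down": 3, "yards_to_go": 8,  "play_type": "run",  "yards_gained": 2},
    {"down": 1, "yards_to_go": 10, "play_type": "run",  "yards_gained": 4},
    {"down": 1, "yards_to_go": 10, "play_type": "pass", "yards_gained": 0},
]

def average_gain_by_situation(plays):
    """Group plays by (down, play_type) and average the yards gained."""
    totals = defaultdict(lambda: [0.0, 0])
    for p in plays:
        key = (p["down"], p["play_type"])
        totals[key][0] += p["yards_gained"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

A real analysis would replace this grouping with clustering or supervised models, but the same transform-then-summarize shape applies.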

Using Satellite Data for Large-Scale Crop Monitoring

Kaiyu Guan, Jian Peng

Dr. Kaiyu Guan’s lab is conducting research that uses novel data from NASA satellites to study environmental impacts on global and U.S. agricultural productivity, on the Blue Waters supercomputer, the most powerful in scientific research. We are looking for highly motivated, programming-savvy undergraduate students to join the lab for the SPIN program.

The chosen students will be closely mentored by Dr. Guan and will work on processing large satellite datasets, understanding and implementing remote sensing algorithms, and addressing questions related to global food production and food security.

Automatically Testing the Einstein Toolkit Using Github

Roland Haas

Modern scientific simulations have enabled us to study non-linear phenomena that are impossible to study otherwise. Among the most challenging problems is the study of Einstein’s theory of relativity, which predicts the existence of gravitational waves, detected very recently by the LIGO collaboration. The Einstein Toolkit is a community-driven framework for astrophysical simulations. I am interested in recruiting a student to port the Einstein Toolkit’s automated testing framework to GitHub.

This project will involve setting up a continuous integration framework for the Einstein Toolkit using GitHub Actions and creating HTML pages summarizing the result of each test.

 

The successful applicant will be involved with both the Relativity Group at NCSA and the Blue Waters project and will be invited to participate in the weekly group meetings and discussions of their research projects.

 

Details: The Einstein Toolkit contains almost 300 regression tests to ensure that additions to the code do not introduce bugs that affect scientific results. These tests are run each time a modification is pushed into the central code repositories and are handled by a dedicated Jenkins server hosted at NCSA. The goal of this SPIN project is to port the testing system over to GitHub’s Actions framework, thus avoiding the need to maintain a dedicated server. This will involve familiarizing oneself with the GitHub Actions framework as well as understanding how to run the Einstein Toolkit test harness and interpret its results. Before applying for this project, please work through the exercise at: https://wiki.ncsa.illinois.edu/x/WQrqBg
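
As a rough sketch of what the ported system might look like, the hypothetical workflow below checks out the repository, runs a placeholder test command, and uploads an HTML report as a build artifact. The job name, harness command, and report path are illustrative only, not the actual Einstein Toolkit setup.

```yaml
# Hypothetical GitHub Actions workflow; names and paths are placeholders.
name: regression-tests
on: [push, pull_request]

jobs:
  run-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run the test harness
        run: ./run_tests.sh   # placeholder for the real harness invocation
      - name: Upload HTML summary
        uses: actions/upload-artifact@v2
        with:
          name: test-report
          path: report/
```

Running tests on GitHub-hosted runners is what removes the need for a dedicated Jenkins server.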

 

Skills desired: 

Familiarity with the Linux command-line interface, the Git command-line client, scripting languages such as Perl and/or Python, and GitHub is essential, along with basic knowledge of HTML for generating reports.

Deep Learning for Multi-Messenger Astrophysics

Eliu Huerta

Develop novel AI models to identify gravitational wave sources that remain in the LIGO band for several minutes, with a particular focus on early identification of neutron star mergers and black hole-neutron star mergers. No background in astronomy or physics is needed, though knowledge of machine learning and deep learning is highly recommended.

Machine Learning Approach to Computational Fluid Dynamics

Shirui Luo, Volodymyr Kindratenko

Machine learning (ML) has made transformative impacts on modelling many high-dimensional complex dynamical systems. Multiphase flow is one of the promising targets for using ML to improve both the fidelity and efficiency of computational fluid dynamics (CFD) simulations. We are examining the use of ML to fit CFD simulation data in order to develop closure relations for multiphase flow systems.

For example, DNNs can be trained on datasets with flows where the initial velocity and void fraction differ. The trained model is then used to predict other flow evolutions with different initial conditions. More broadly, we are tackling problems at the interplay between learning and multiphase flow, such as: How can learning algorithms be constructed to include physical constraints such as the incompressibility of the fluid? What dimensionality-reduction techniques and coarsening strategies are most applicable for identifying hidden low-dimensional features? How can computational scientists, experimentalists, and theorists collaborate to produce a sufficient training database for multiphase flow simulation?

 

The student will use open-source software packages such as TensorFlow and PyTorch to construct networks that improve predictive capabilities based on a high-fidelity DNS simulation database. The student will have access to HPC platforms at NCSA and learn to analyze CFD data at large scale. Beyond practicing typical ML skills, the student will also learn, more fundamentally, how neural networks can be designed to best incorporate physical constraints while avoiding overfitting to the imposed physics, since typical statistical learning methods can ignore underlying physical principles.
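
As one illustration of how a physical constraint can enter training, the sketch below (plain NumPy, not the lab's actual code) adds an incompressibility penalty, the mean squared divergence of the predicted velocity field, to an ordinary data-fit loss.

```python
import numpy as np

def divergence_2d(u, v, dx=1.0):
    """Central-difference divergence du/dx + dv/dy on a periodic 2D grid."""
    dudx = (np.roll(u, -1, axis=0) - np.roll(u, 1, axis=0)) / (2 * dx)
    dvdy = (np.roll(v, -1, axis=1) - np.roll(v, 1, axis=1)) / (2 * dx)
    return dudx + dvdy

def physics_informed_loss(pred_u, pred_v, true_u, true_v, weight=0.1):
    """Data-fit MSE plus a penalty on violations of incompressibility."""
    mse = np.mean((pred_u - true_u) ** 2 + (pred_v - true_v) ** 2)
    penalty = np.mean(divergence_2d(pred_u, pred_v) ** 2)
    return mse + weight * penalty
```

A divergence-free prediction pays no penalty, so the constraint steers training without requiring a hard projection step.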

Determination of Biomarkers to be Used in the Diagnosis of Cardiac Microvascular Disease in Postmenopausal Women

Zeynep Madak-Erdogan, Justina Zurauskiene

 

Coronary microvascular disease (CMD) is a common form of heart disease in postmenopausal women. CMD is due to dysfunction of the microvessels that feed the heart muscle and is distinct from coronary artery disease (CAD), which is due to plaque formation. The majority of patients do not receive a proper diagnosis and return to the hospital with persistent symptoms. We propose to identify circulating biomarkers of CMD.

We hypothesize that plasma metabolite and protein profiles differ among postmenopausal women with no heart disease, with CAD, or with CMD. We are collaborating with clinicians from Izmir Katip Celebi Research and Training Hospital, Turkey. They have already recruited 75 patients (25 patients per group across 3 groups: healthy, CAD, and CMD) and completed their full health screening and tests. We propose to perform full metabolite and proteomic profiling of plasma samples from these individuals, identify biomarkers using machine learning approaches, and validate our findings in a second cohort of patients that our collaborators are currently recruiting. Our research will identify circulating biomarkers of this debilitating heart disease in postmenopausal women and will have a clinical impact by providing biomarkers that can be used for diagnostic test design in the future.

Deblending and Classifying Astronomical Sources with Deep Learning

Xin Liu

 

The Legacy Survey of Space and Time (LSST) at the Vera C. Rubin Observatory will obtain terabytes of data per night, producing a high-definition “movie” of the night sky. Correctly detecting, identifying, and segmenting astronomical sources efficiently is a top priority. LSST images will be so deep and detailed that sources will be crowded or “blended” together. Drawing on the rapidly developing field of computer vision, our group has developed a proof-of-principle deep learning framework to process images and identify blended sources in them.

However, this work, like most others in the literature, has thus far been limited to simulated images and remains untested on a large, real image dataset. To solve this problem, we propose a SPIN research project to apply our novel image segmentation method to real images taken by the most powerful camera in the world, the Hyper Suprime-Cam (HSC), on one of the largest ground-based telescopes, the 8.2-meter Subaru telescope. As the closest match to LSST in terms of expected cutting-edge image data quality, HSC is the ideal dataset for evaluating and developing this interdisciplinary approach on real astronomical images. The SPIN student will develop and test deep learning architectures within our already proven framework. The student will leverage NCSA resources, such as the HAL GPU cluster, and collaborate with local experts. This is an opportunity to test and develop a competitive new approach combining astronomy big data with machine learning and computer vision.

Automation of Genomic Analyses for the Cloud, Grid, and Analytics Platforms

Liudmila Sergeevna Mainzer

 

Genomic analyses have moved into the arena of big data, requiring full automation for deployment on advanced computing infrastructure. The computational workflows tend to be complex, consisting of multiple steps, fan-outs, merges, and user-level conditionals. Numerous quality control and job monitoring procedures are required.

Deployment and optimization of this large and complex workload is a big challenge in itself. Different strategies are appropriate for running these analyses in the cloud, on analytics platforms, or on traditional grid clusters. NCSA Genomics invites a computationally savvy student to partake in this activity and learn about different workflow management systems, code benchmarking and optimization, cloud computing, and big data analytics. Desired skills: computing, engineering, bioinformatics, genomics.

Resolving Health Disparities by Using Advanced Statistics on Complex Multidimensional Datasets

Liudmila Sergeevna Mainzer

 

Health disparities, be they racial, economic, rural-urban, gender-based, or age-based, have come to the forefront across the world. Understanding their underlying causes and making reliable predictions that drive informed decisions by policy makers and health practitioners requires tackling complex multidimensional datasets that include proteomic, genomic, biometric, geographic, and socioeconomic measurements. These dimensions need to be harmonized, and correct statistical approaches applied, to determine the exact combination of factors that drives health disparities while keeping the problem computationally tractable.

This requires development of advanced statistical and machine learning approaches. For example, we intend to do this when studying the effects of pollution and poverty on rural and racial health disparities in Illinois. Another aspect of our work involves civil infrastructure: the quality of water, sewage, and electricity; proximity to education, transportation, and medical centers; and the quality of buildings where people live and work. We intend to use advanced geostatistical methods to isolate neighborhood clusters and test whether these neighborhoods are more or less likely to exhibit certain soil or water contamination, socioeconomic patterns, or increased health risks.

Machine learning (ML), including deep learning, will be applied to high-resolution satellite images, aerial imagery, and LiDAR data (elevation images) to detect variations in the quality of human environments. ML models will be trained to relate elements that are visible in geospatial data, such as roof conditions, vegetation status, and the degree of land-use mixing, to socioeconomic and health information about individuals who live and work in Champaign, Urbana, and Rantoul. We would like a team of talented students to participate in this important and exciting project; desired skills include statistics, machine learning, computing, and bioinformatics.

Speech-to-Text Auto Captioning

Michael Miller – Event Services

 

This project researches frameworks and workflows for speech-to-text recognition to facilitate live auto captioning and the creation of standard caption files for use in live events and video editing. It utilizes and enhances speech-to-text HPC/cloud services and seeks to advance the state of the art in speech-to-text recognition. A successful candidate needs to have completed CS 125 (Intro to Computer Science) or have equivalent experience.

Implementation of Machine Learning Algorithms on FPGAs

Ashish Misra, Volodymyr Kindratenko

 

Many research domains, such as computer vision and language understanding, have been transformed by novel machine learning (ML) and deep learning (DL) methods and techniques. However, these methods are very compute-intensive and rely on state-of-the-art hardware and large datasets to achieve an acceptable level of performance. The research team at the Innovative Systems Lab (ISL) at NCSA has been investigating how the neural networks at the core of DL algorithms can be implemented on reconfigurable hardware, with the objective of speeding up execution and reducing power requirements for inference algorithms.

FPGAs are a good choice for implementing neural networks since they enable highly customized parallel hardware implementations and provide a great degree of flexibility with regard to numerical data types. Most recently, ISL started to explore a novel platform enabled by IBM’s CAPI 2.0 interface and SNAP API. This platform allows developers to build FPGA applications using a high-level synthesis (HLS) methodology rather than a traditional hardware design approach, and to integrate kernels accelerated on an FPGA with host-side applications running on IBM POWER9 servers.

 

The students working on this project will acquire the skillsets required to develop ML/DL algorithms in hardware using the HLS approach. The students will be involved with a) evaluating the performance of existing ML/DL implementations on reconfigurable hardware platforms and documenting the results, b) developing new ML/DL algorithms for implementation on reconfigurable hardware and preparing datasets for testing and evaluation, and c) helping ISL research staff port the algorithms to reconfigurable hardware. Required skills include completion of ECE 385 and ECE 408 or equivalent courses.
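
One numerical-data-type concern that arises when moving networks to FPGAs can be sketched in a few lines: quantizing values to a signed fixed-point grid with saturation. This toy Python version is illustrative only; the actual project work is done in HLS C/C++.

```python
def quantize(x, frac_bits=12, word_bits=16):
    """Round x to a signed fixed-point grid with `frac_bits` fractional bits,
    saturating at the range representable in a `word_bits`-bit word."""
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    q = max(lo, min(hi, round(x * scale)))
    return q / scale
```

Choosing the split between integer and fractional bits trades dynamic range against precision, which is exactly the flexibility FPGAs offer over fixed floating-point units.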

Improving Champaign County 211 Service Provision

Jorge Rojas Alvarez, Anita Chan

 

The Community Data Clinic and Cunningham Township are seeking a student with programming and web design skills for a project rebuilding the Illinois PATH 211 database from scratch. The new version will be a web-based directory based on a relational database of social service providers in the Champaign County area. 

The student’s main task will be building out this web application to filter results and integrate user feedback. We are collaborating with local partners and stakeholders to ensure that our platform is as accessible and intuitive as possible, so a keen UI sensibility or experience with frontend accessibility would be a plus.

 

For the first year we are imagining a lightweight proof of concept, perhaps just using JavaScript and Firebase, plus HTML and CSS. Future architectures might be designed to work from the PATH servers, but that’s a long-term goal for now.
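
The core filtering logic of such a directory is straightforward; the sketch below (in Python for brevity, with invented field names) shows the kind of multi-criteria filter the web application would implement on the client side.

```python
# Hypothetical provider records; field names are illustrative only.
providers = [
    {"name": "Food Pantry A", "category": "food",    "city": "Urbana"},
    {"name": "Shelter B",     "category": "housing", "city": "Champaign"},
    {"name": "Food Bank C",   "category": "food",    "city": "Champaign"},
]

def filter_providers(providers, category=None, city=None):
    """Return providers matching every filter that was supplied."""
    return [
        p for p in providers
        if (category is None or p["category"] == category)
        and (city is None or p["city"] == city)
    ]
```

Each omitted filter simply passes everything through, so the same function backs both a browse-all view and a narrow search.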

 

Interested applicants should provide links to examples of relevant past projects and briefly explain their interest in community-based work.

Multiscale Modeling of the Cell Membrane-Associated Phenomena

Taras Pogorelov

 

The cell membrane environment is complex and challenging to model. The Pogorelov Lab at Illinois develops workflows that combine computational and experimental molecular data. We work in close collaboration with experimental labs. Modeling approaches include classical molecular dynamics, quantum electronic structure, and quantum nuclear dynamics.

These projects include the development of workflows for modeling and analysis of the lipid interactions with proteins and ions that are vital to the life of the cell. The qualified student should have experience with R/Python programming, the Linux environment, and the NAMD molecular modeling software.

Virtual Reality and Holographic Data Visualization in Materials Science

Andre Schleife

Computational materials science research produces large amounts of static and time-dependent data for atomic positions and electron densities that is rich in information. Determining underlying processes and mechanisms from this data, and visualizing it in a comprehensive way, constitutes an important scientific challenge. 

In this project we will visualize large electron-density datasets using virtual reality hardware (Google Cardboard, Windows Mixed Reality) and a holographic display (Looking Glass + Leap Motion). We will explore both static and time-dependent visualization for immersive movies, and implement user interaction with the data. Experience with virtual-reality SDKs, raytracing, and data visualization is a plus for this project.

Data Storage and Analysis Framework for Semiconductor Nanocrystals

Andre Schleife

The overarching goal of this project is to develop a data-science infrastructure that will enable a new computational/experimental approach to design semiconductor nanocrystals for bioimaging. Experiments and simulation produce a large variety of data from instruments and electronic-structure calculations, respectively. This data needs to be analyzed, post-processed, and shared in order to enable design of optimized nanocrystals. 

We will use and improve a collaborative computational infrastructure needed to facilitate a close feedback loop between experiment and theory, to extract relevant information, and to determine underlying physical processes and mechanisms. In particular, we will develop a descriptor set suitable for capturing the structural complexity of modern nanoscopic materials, in order to relate it to optical properties. This set will be implemented into our web-based infrastructure, and collaborators will be equipped with these tools. For this project, machine learning skills and a basic understanding of materials science or the atomic structure of matter are helpful.

Music on High-Performance Computers

Sever Tipei

The project centers on DISSCO, software for composition, sound design, and music notation/printing developed at Illinois and Argonne National Laboratory. Written in C++, it includes a graphical user interface built with gtkmm; a parallel version is being developed at the San Diego Supercomputer Center. DISSCO has a directed-graph structure and uses stochastic distributions, sieves (from number theory), and elements of information theory to produce musical compositions.
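
As a minimal illustration of one of these ideas, a Xenakis-style sieve is a union of residue classes (modulus, residue) whose integer solutions can be mapped to pitches or rhythmic onsets. The toy version below is in Python for brevity (DISSCO itself is C++) and is not DISSCO's actual implementation.

```python
def sieve(classes, limit):
    """Union of residue classes: keep each n in [0, limit) with
    n % m == r for at least one (modulus m, residue r) pair."""
    return sorted(n for n in range(limit)
                  if any(n % m == r for m, r in classes))
```

Intersections and complements of such classes yield the more intricate scales and rhythms sieve theory is known for.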

Presently, efforts are directed toward refining a system for the notation of music, as well as toward the realization of an evolving entity: a composition whose aspects change when computed recursively over long periods of time, thus mirroring the way living organisms are transformed over time (artificial life).

 

Another possible direction of research is sonification, the aural rendition of computer-generated complex data.

 

Skills desired: Proficiency in C++ programming and familiarity with the Linux operating system; familiarity with music notation is preferred but not required.