Identify the authors of synthetic DNA sequences.

Advances in knowledge and tools in systems and synthetic biology are broadening the applicability of biotechnology. At the same time, these advances lower the barrier to approaches such as genome engineering and enlarge the group of people able to use them. From a biosecurity perspective, possible misuse of this technology poses serious security threats.

Possible countermeasures include the so-called lab-of-origin approaches. The goal of these approaches is to identify the group or lab in which a genetically engineered construct was produced. To this end, machine learning algorithms identify relevant features within DNA sequences that allow the sequences to be attributed to specific labs. While lab-of-origin predictors already exist, new datasets and advances in the field of large language models offer the potential to improve this process further. The goal of this project is to develop a new lab-of-origin predictor from a newly created dataset.
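
To make the idea of attributing a sequence to a lab from sequence features concrete, the following is a minimal toy sketch, not the method this project will use. It assumes k-mer counts as the features and a nearest-profile rule based on cosine similarity as the classifier; the lab names (lab_A, lab_B), the sequences, and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a DNA sequence (assumed feature choice)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_seqs, k=3):
    """Sum the k-mer counts of all sequences of a lab into one profile per lab."""
    profiles = {}
    for lab, seq in labelled_seqs:
        profiles.setdefault(lab, Counter()).update(kmer_counts(seq, k))
    return profiles


def predict(profiles, seq, k=3):
    """Return the lab whose profile is most similar to the query sequence."""
    query = kmer_counts(seq, k)
    return max(profiles, key=lambda lab: cosine(profiles[lab], query))


# Hypothetical toy data: each "lab" reuses its own characteristic motifs.
training = [
    ("lab_A", "GGATCCGGATCCGGATCCAAGCTT"),
    ("lab_A", "GGATCCTTGGATCCGGATCC"),
    ("lab_B", "GAATTCGAATTCCTCGAGGAATTC"),
    ("lab_B", "CTCGAGGAATTCGAATTC"),
]

profiles = train(training)
prediction = predict(profiles, "AAGGATCCGGATCCTT")
print(prediction)
```

A real predictor would replace the hand-made profiles with a learned model, for example a deep learning or language-model-based classifier, and would be evaluated on held-out sequences; the sketch only shows the input and output of the task.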

Interest in working with machine learning and/or deep learning models is important. Programming skills are required, and experience with deep learning frameworks such as PyTorch or TensorFlow is beneficial.

Additional Information

Project Capacity Three IREP students
Project available for Spring, Summer and Fall 2024
Credits 18
Available via Remote No
Project Supervisor Erik Kubaczka