Why this project

Embedding-based property prediction is one of the cleanest applied uses of protein language models. Common targets like stability and solubility have been benchmarked extensively — but less-studied properties like optimum pH, enzyme activity, or protein flexibility could yield a more interesting benchmark, if the right datasets exist.

What a team could build in one day

  • Implement a clean embed → prediction head pipeline.
  • Run a small set of PLMs (ESM2, ESM3, ProtT5, etc.) against one chosen property dataset.
  • Compare a few head architectures: linear, shallow MLP, attention pooling.
  • Produce a benchmark table with held-out evaluation.
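The pipeline above can be sketched end to end. This is a minimal stand-in, not a working benchmark: the random matrix `X` takes the place of real mean-pooled PLM embeddings (e.g. per-residue ESM2 vectors averaged over the sequence), and the synthetic labels `y` take the place of a real property dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for fixed-size protein embeddings: in a real run these would
# be mean-pooled per-residue vectors from a PLM, one row per protein.
n_proteins, dim = 500, 64
X = rng.normal(size=(n_proteins, dim))

# Synthetic property labels with a linear signal plus noise, standing in
# for a measured property such as stability or optimum pH.
w_true = rng.normal(size=dim)
y = X @ w_true + 0.1 * rng.normal(size=n_proteins)

# Held-out split, then a linear prediction head on top of the embeddings.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
head = Ridge(alpha=1.0).fit(X_train, y_train)
print(f"held-out R^2: {r2_score(y_test, head.predict(X_test)):.3f}")
```

Swapping the `Ridge` head for a shallow MLP (e.g. scikit-learn's `MLPRegressor`) keeps the rest of the pipeline unchanged, which is what makes the head comparison cheap.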

Minimum viable demo: at least two PLMs, one property dataset, one head architecture — with held-out results.
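Of the head architectures listed above, attention pooling is the least standard, so here is a sketch of the pooling step in isolation: a softmax over per-residue scores turns a variable-length (L, d) embedding matrix into one fixed d-dimensional vector. The `query` vector is shown as a fixed random vector for illustration; in a trained head it would be a learned parameter.

```python
import numpy as np

def attention_pool(residue_emb, query):
    """Pool per-residue embeddings (L, d) into one d-dim vector via
    softmax attention of each residue against a query vector."""
    scores = residue_emb @ query                     # (L,) one score per residue
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over residues
    return weights @ residue_emb                     # (d,) weighted average

rng = np.random.default_rng(0)
d = 8
query = rng.normal(size=d)       # learned parameter in a real head
emb = rng.normal(size=(21, d))   # hypothetical 21-residue protein
pooled = attention_pool(emb, query)
print(pooled.shape)  # (8,)
```

Unlike mean pooling, this lets the head upweight the residues most informative for the property, at the cost of one extra parameter vector per head.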

Stretch directions

  • More PLMs and more property datasets for a real benchmark.
  • Transfer-learning and multi-task heads.
  • Compare to ESM-IF or structure-aware models on the same task.
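The multi-task direction is cheap to prototype because scikit-learn's `Ridge` already accepts multi-target labels: one shared linear head fit jointly on several properties at once. The embeddings and the two target columns below are synthetic stand-ins (e.g. for stability and optimum pH), not real data.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Shared stand-in embeddings and two hypothetical property targets.
X = rng.normal(size=(200, 32))
W_true = rng.normal(size=(32, 2))
Y = X @ W_true + 0.1 * rng.normal(size=(200, 2))

# One head, two outputs: Ridge fits all targets jointly.
multi_head = Ridge(alpha=1.0).fit(X, Y)
preds = multi_head.predict(X)
print(preds.shape)  # (200, 2)
```

Whether joint fitting beats per-property heads is exactly the kind of question the benchmark table could answer.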

Resources

  • Protein language models: ESM2 / ESM3, ProtT5, Ankh, ProGen.
  • Datasets: TAPE / FLIP / PEER suites, plus more specific datasets for optimum pH or enzyme activity if suitable ones can be found.