ptshen

I'm currently building infrastructure for agent frameworks at DAPLab at Columbia, working on AI forks and developing multi-agent supervised fine-tuning methods. Stay tuned for updates!

Terminal-Bench tasks

As part of my onboarding, I was tasked with creating Terminal-Bench tasks to evaluate agents on real-world data manipulation. Taking inspiration from Spider2, I created 65 tasks. Terminal-Bench tasks are meant to be difficult: an agent has to perform real-world database tasks, rather than just write SQL queries.

After making the Spider2 tasks, I also made some SWE-bench tasks (difficult software engineering tasks) to test agents on.

View my Spider2 tasks on GitHub here!

View my SWE-bench tasks on GitHub here!
