Currently building infrastructure for agent frameworks at DAPLab at Columbia. Working on AI forks and developing multi-agent supervised fine-tuning methods. Stay tuned for updates!
Terminal-Bench tasks
As part of my onboarding, I was tasked with creating Terminal-Bench tasks that evaluate agents on real-world data manipulation. Taking inspiration from Spider2, I created 65 tasks. Terminal-Bench tasks are meant to be difficult: agents should be able to perform real-world database tasks, rather than just write SQL queries.
After making the Spider2 tasks, I also made some SWE-bench tasks (difficult software engineering tasks) to test agents on.
View my Spider2 tasks on GitHub here!
View my SWE-bench tasks on GitHub here!