Verifiable on-chain Large Language Model (LLM) drift benchmarking
This project was inspired by the paper "How Is ChatGPT's Behavior Changing over Time?": https://arxiv.org/abs/2307.09009
The paper documents how ChatGPT's behavior and performance have changed over time. Because these changes are unpredictable, it is difficult for companies to integrate the models into their pipelines. To quantify the drift, the researchers developed a set of benchmarks and ran them on two snapshots of OpenAI's models (March and June 2023). This project makes that benchmarking process recurrent and stores the results on-chain for immutability and transparency.
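To make the idea concrete, below is a minimal sketch of what one recurrent benchmarking round could look like. It is illustrative only: `query_model`, `run_benchmark_round`, `result_digest`, the prompt set, and the model name are hypothetical placeholders, not this project's actual API. The point it shows is that a round's results can be serialized deterministically and hashed, so that a digest committed on-chain lets anyone verify a published run.

```python
import hashlib
import json
import time

# Hypothetical fixed benchmark set. Prime-checking is one of the task types
# used in the paper; the exact prompts here are illustrative.
BENCHMARK = [
    {"prompt": "Is 17077 a prime number? Answer yes or no.", "expected": "yes"},
    {"prompt": "Is 17078 a prime number? Answer yes or no.", "expected": "no"},
]

def query_model(prompt: str) -> str:
    # Stand-in for a real API call to the model snapshot under test.
    return "yes"

def run_benchmark_round(model_id: str) -> dict:
    """Run every benchmark prompt once; record the raw answers and accuracy."""
    answers = [query_model(item["prompt"]) for item in BENCHMARK]
    correct = sum(
        answer.strip().lower() == item["expected"]
        for answer, item in zip(answers, BENCHMARK)
    )
    return {
        "model": model_id,
        "timestamp": int(time.time()),
        "answers": answers,
        "accuracy": correct / len(BENCHMARK),
    }

def result_digest(result: dict) -> str:
    """Deterministic JSON serialization + SHA-256. Storing this digest in a
    smart contract makes the published results tamper-evident."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

if __name__ == "__main__":
    result = run_benchmark_round("gpt-3.5-turbo-0613")
    print(result["accuracy"], result_digest(result))
```

Comparing recorded answers and accuracy across rounds is what exposes drift; the on-chain digest guarantees only that a published round has not been altered after the fact.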
The project is split into three modules: