I've removed most of them simply by converting all documents to UTF-8 without a BOM and then checking whether the file sizes are similar. But obviously, if somebody has inserted an advertisement, the file size is different... Has anyone managed to do that anyway?

Once you have a fine-tuned model, Whisper is much harder to run, as far as I