Behind the scenes yt-dlp downloads the subs in .vtt format and then uses ffmpeg to convert them to .srt. Depending on your situation the original .vtt format might be fine.
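If the .vtt is fine for you, you can skip the conversion step entirely. A rough sketch (the flags are real yt-dlp options; the URL is just a placeholder):

```shell
# Grab only the English subtitles, keeping the original .vtt
# (no ffmpeg conversion involved):
yt-dlp --write-subs --sub-langs en --skip-download \
  "https://www.youtube.com/watch?v=VIDEO_ID"

# Or ask for the .srt conversion explicitly (this one needs ffmpeg):
yt-dlp --write-subs --sub-langs en --convert-subs srt --skip-download \
  "https://www.youtube.com/watch?v=VIDEO_ID"
```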
Or even better, yt-whisper, which uses OpenAI's Whisper speech-to-text. I guess it'd be better to first check whether the video has captions before Whispering, so maybe both your command and this one could be used together.
I am not a fan of this pattern - if I'm understanding correctly, you would have to part with all of yt-dlp's niceties like playlist/channel handling, quality selection, file naming, logging config, etc.
Why not just use the whisper cli on yt-dlp CLI's output for videos with bad or no subtitles?
Sure, you could do that too. yt-whisper uses yt-dlp underneath, so there might be a way to pass arguments through to the inner yt-dlp instance. If not, you can modify the source directly; it seems to be a simple wrapper. Or, as you said, use the Whisper CLI. All good options, I just mentioned this one since it's easier when I just want to download a video with subs.
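The "Whisper CLI on yt-dlp's output" route could look something like this (flags shown are real for both tools; the URL and filename are placeholders):

```shell
# 1. Extract only the audio with yt-dlp:
yt-dlp -x --audio-format mp3 -o audio.mp3 \
  "https://www.youtube.com/watch?v=VIDEO_ID"

# 2. Transcribe it with OpenAI's whisper CLI, writing an .srt file:
whisper audio.mp3 --model small --output_format srt
```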
I apologize for the question, but I'm not entirely clear on where "split_sentences" is. Is it a separate script? I've been looking for something with that sort of functionality for a while, very often for this very purpose: splitting transcripts.
Sorry for the late answer, but yes, it would have to be a separate script or command. It is purely fictional: I made it up because the joke made more sense with it, and people might have pointed out that my grep would have filtered out too much context, so I had to add it.
I'm sure there are many unix-y tools for this purpose, but I don't know of any. If you're looking for something that's installed everywhere, maybe a very big awk or sed regex with multiline wizardry could do the trick for most easy-to-parse Latin-script languages, and you'd just have to copy-paste it around. It probably becomes harder for regexes once you start working with right-to-left languages like Arabic, or languages with different punctuation, so it might not be i18n-friendly.
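For what it's worth, a bare-bones version of the fictional `split_sentences` can be a single sed substitution. This sketch assumes GNU sed (for `\n` in the replacement) and only handles easy Latin-script punctuation; it will happily mangle abbreviations like "Mr. Smith", and it makes no attempt at RTL or non-Latin punctuation:

```shell
# Hypothetical split_sentences: put a newline after ., !, or ?
# whenever it is followed by a space. Naive on purpose.
split_sentences() {
  sed 's/\([.!?]\) /\1\n/g'
}

echo "First sentence. Second one! A third?" | split_sentences
# → three lines: "First sentence." / "Second one!" / "A third?"
```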