Easy spoken corpora with YouTube

I don’t think it’s exactly a secret that I rather like corpora. In this post I shall show you how to create an easy spoken corpus using YouTube and a subtitle downloader. Use at your own risk: YouTube might disable this functionality at any time.

Find your videos.

Search YouTube. You know how to do this.

Download subtitles.

I used DownSub.com. It opens a pop-up ad the first time you paste the video address into the search box but is otherwise benign.

Download your subtitles. Repeat for as many videos as required. Yes, this is a pain in the bum, but it’s the best I can do.

Edit text.

Open all your subtitle files in the text editor of your choice and replace the nonsense/HTML codes with nothing (that is, delete them). Save the results as .txt files.
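If you have a lot of files, the find-and-replace can be scripted. Here is a minimal Python sketch, assuming your downloads are SRT-style subtitle files (numbered cues, timing lines with "-->", and the odd HTML tag). The function name clean_srt and the folder name "subtitles" are my own inventions for illustration, so adjust them to whatever you actually have.

```python
import html
import re
from pathlib import Path

# Matches SRT timing lines such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(
    r"^\d{1,2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{1,2}:\d{2}:\d{2}[,.]\d{3}"
)
# Matches HTML-style tags such as <i> or <font color="...">.
TAG = re.compile(r"<[^>]+>")


def clean_srt(raw: str) -> str:
    """Return only the spoken text from SRT-style subtitle content."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers, and timing lines.
        # Caveat: a spoken line consisting only of digits is dropped too.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        # Remove tags, then turn entities like &amp; back into characters.
        line = html.unescape(TAG.sub("", line)).strip()
        # Drop a line that exactly repeats the previous one.
        if line and (not kept or kept[-1] != line):
            kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    # Hypothetical folder name: point this at wherever your downloads are.
    for srt_path in Path("subtitles").glob("*.srt"):
        text = clean_srt(srt_path.read_text(encoding="utf-8"))
        srt_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```

Run it from the folder that contains "subtitles" and each .srt file gets a cleaned .txt twin next to it. Check a couple of the results by eye, since subtitle files vary and this only handles the common cases.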

Wow, a corpus!

Or a small one, depending on how much time you have. Tag the corpus if you wish, using TagAnt by Laurence Anthony. You can open the corpus in AntConc, also by him. Both are free downloads.

 
