Resource Pool – Speech Communication Lab

We aim to offer useful tools and datasets for Language and Speech Communication Science. This page is updated regularly with new resources in these areas.

Speech Databases
- Perceptimatic Dataset: The Perceptimatic links give access to several datasets of stimuli in French, English, Brazilian Portuguese, Turkish, Estonian, and German. The datasets, along with their files and resources, are available for download.
Human Perception Experimental Data
- Additional information and related materials for this resource will be included soon. Please check back later for updates.
Neural Datasets
- Additional information and related materials for this resource will be included soon. Please check back later for updates.
Psycholinguistic resources
English Corpora
- A corpus (plural: corpora) is a highly structured collection of texts that supports sophisticated searches for exploring how language varies, for example between genres, between dialects, and over time. Unlike general-purpose search engines such as Google, corpora give researchers, learners, and teachers extensive data on words, phrases, and grammatical structures that goes beyond what textbooks or dictionaries offer. English-Corpora.org’s corpora, used by over 85,000 users monthly, offer detailed “word sketches” for the top 60,000 English words. These sketches include definitions, genre-specific frequencies, synonyms, collocates, related topics, clusters, concordance lines, and links to external resources such as dictionary entries, pronunciation, images, videos, and translations into 100+ languages.
English-Corpora.org: introduction Video
L2-ARCTIC: a non-native English speech corpus
- L2-ARCTIC is a speech corpus designed for research in areas such as voice conversion, accent conversion, and mispronunciation detection. The corpus consists of recordings from 24 non-native English speakers whose first languages (L1s) are Hindi, Korean, Mandarin, Spanish, Arabic, and Vietnamese. For each L1, the corpus includes recordings from two male and two female speakers.
- Each speaker contributed approximately one hour of read speech, using CMU’s ARCTIC prompts. From these recordings, orthographic transcriptions and forced-aligned phonetic transcriptions were generated. Additionally, a set of 150 utterances per speaker was manually annotated to identify three types of mispronunciation errors: substitutions, deletions, and additions. This makes L2-ARCTIC a valuable resource not only for voice conversion and accent conversion research, but also for computer-assisted pronunciation training.
- The corpus was developed through the collaborative efforts of researchers at Texas A&M University and Iowa State University. Future updates may include additional speakers from other L1s, should they be deemed valuable for the research community.
- Overview of the Corpus: For each speaker, the L2-ARCTIC corpus contains the following data:
  - Speech Recordings: Over one hour of prompted recordings, consisting of phonetically-balanced short sentences (~1132 sentences in total).
  - Word-Level Transcriptions: Orthographic transcriptions along with forced-aligned word boundaries for each sentence.
  - Phoneme-Level Transcriptions: Forced-aligned phoneme transcriptions for each sentence.
  - Manual Annotations: A selected subset of utterances (~150) which includes 100 sentences produced by all speakers, plus 50 sentences featuring phonemes likely to be difficult for each speaker’s L1. These sentences are annotated with corrected word and phone boundaries, and errors such as phone substitutions, deletions, and additions are tagged.
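To make the three annotated error types concrete, here is a minimal Python sketch that labels aligned phone pairs as substitutions, deletions, or additions and counts them. The pair representation (expected phone, produced phone, with `None` for a missing phone), the function names, and the example data are all assumptions made for illustration; this is not the corpus's actual annotation file format or tooling.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical representation: (expected phone, produced phone).
# None means "no phone on this side of the alignment".
PhonePair = Tuple[Optional[str], Optional[str]]


def classify_error(expected: Optional[str], produced: Optional[str]) -> str:
    """Label one aligned phone pair with an error type."""
    if expected is None and produced is None:
        raise ValueError("at least one phone must be present")
    if expected is None:
        return "addition"       # the speaker inserted an extra phone
    if produced is None:
        return "deletion"       # the expected phone was not produced
    if expected != produced:
        return "substitution"   # a different phone was produced
    return "correct"


def tally_errors(pairs: Iterable[PhonePair]) -> Counter:
    """Count how often each label occurs in a sequence of aligned pairs."""
    return Counter(classify_error(e, p) for e, p in pairs)


# Invented example using ARPAbet-style labels (not taken from the corpus).
example = [("DH", "D"), ("IH", "IH"), ("S", "S"), (None, "AH")]
counts = tally_errors(example)
print(dict(counts))
```

A tally like this is the kind of per-speaker summary that mispronunciation detection or pronunciation training work might start from, once the real annotation files have been read into a similar structure.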
Phenxtoolkit.org: Research Domain – Speech, Language and Hearing
- The PhenX Toolkit is an online repository that houses crucial measures associated with complex diseases, phenotypic traits, and environmental factors. These measures undergo meticulous selection by expert working groups, ensuring a consensus-driven approach. The Toolkit provides Standard Measurement Protocols for various research domains.
- Under the Research Domain of Speech, Language, and Hearing, the scope includes:
  - Apraxia/Speech/Sound Disorder (articulation disorders)
  - Audiogram
  - Central Auditory Processing
  - Dysphagia
  - Dysarthria
  - Dyslexia/Reading Disorder
  - Family History (Family History of Speech and Language Impairment, Personal and Family History of Hearing Loss)
  - Late Language Emergence (Early Childhood Speech and Language Assessment)
  - Morphosyntactic/Syntactic Impairments
  - Noise-induced Hearing Loss
  - Nonsyndromic Hearing Loss
  - Otitis Media/Ear Infections
  - Pitch-perception Disorders
  - Presbycusis
  - Semantic Impairments
  - Stuttering/Cluttering
  - Tinnitus
  - Verbal Memory
  - Vertigo
  - Vocal Cord Function
  - Velopharyngeal Incompetence (VPI)
ConceptNet: An open, multilingual knowledge graph
- ConceptNet is a freely-available semantic network, designed to help computers understand the meanings of words that people use.
- ConceptNet originated from the crowdsourcing project Open Mind Common Sense, which was launched in 1999 at the MIT Media Lab. It has since grown to include knowledge from other crowdsourced resources, expert-created resources, and games with a purpose.
- ConceptNet is used to create word embeddings — representations of word meanings as vectors, similar to word2vec, GloVe, or fastText.
- These word embeddings are free, multilingual, aligned across languages, and designed to avoid representing harmful stereotypes. Their performance on word similarity, within and across languages, was shown to be state of the art at SemEval 2017.
- This work includes data from ConceptNet 5, which was compiled by the Commonsense Computing Initiative. ConceptNet 5 is freely available under the Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0) from https://conceptnet.io. The included data was created by contributors to Commonsense Computing projects, contributors to Wikimedia projects, Games with a Purpose, Princeton University’s WordNet, DBpedia, OpenCyc, and UMBEL.
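As a rough illustration of how word embeddings are compared, the Python sketch below computes cosine similarity between word vectors, a common way to score word similarity. The three-dimensional vectors are invented toy values, not real ConceptNet embeddings, and the function name is an assumption made for this example.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors made up for illustration; real embeddings have hundreds of dimensions.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat/dog: {sim_cat_dog:.3f}  cat/car: {sim_cat_car:.3f}")
```

With these toy values, "cat" scores closer to "dog" than to "car". With multilingual embeddings that are aligned across languages, the same comparison can in principle be made between words from different languages.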
The SpiCE Corpus:
- SpiCE is a phonological corpus.
- SpiCE is an open-access corpus of conversational bilingual Speech in Cantonese and English. SpiCE includes high-quality audio recordings of 30-minute interviews in each language with 34 early bilinguals, along with transcriptions and language background information. The corpus was first released in May 2021.
- A phonological corpus: A phonological corpus of spoken language contains audio recordings, linguistic annotations at the word and phone levels, and metadata. It should also be representative of the selected population, large enough, and collected for a purpose.
- The Cantonese-speaking community in Metro Vancouver is a unique bilingual community. Not only is Cantonese very widely spoken in the area, it has been for a long time, and by a heterogeneous group of people. Statistics Canada has some useful visualizations for getting a broad picture of the linguistic landscape—in particular: Proportion of mother tongue responses for various regions in Canada from the 2016 Census. Needless to say, there is a lot more that could be said here!
English Lexicon Project
- The English Lexicon Project, supported by the National Science Foundation, provides access to a comprehensive repository of lexical characteristics and behavioral data from visual lexical decision and naming studies covering 40,481 words and 40,481 nonwords. The data were collected at six universities, from over 815 subjects in the lexical decision experiment and 443 subjects in the naming experiment.