Speech Production


Laura Docio-Fernandez and Carmen García Mateo


Synonyms: Sound generation; Speech system

Speech production is the process of uttering articulated sounds or words, i.e., how humans generate meaningful speech. It is a complex feedback process in which hearing, perception, and information processing in the nervous system and the brain are also involved.

Speaking is in essence the by-product of a necessary bodily process, the expulsion from the lungs of air charged with carbon dioxide after it has fulfilled its function in respiration. Most of the time, one breathes out silently; but it is possible, by contracting and relaxing the vocal tract, to change the characteristics of the air expelled from the lungs.
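In speech engineering this picture is usually idealized as a source-filter model: a quasi-periodic glottal source (the expelled air, chopped into pulses by the vocal folds) is shaped by the resonances of the vocal tract. The short Python sketch below illustrates the idea; every numeric value (sample rate, pitch, formant frequencies and bandwidths) is an illustrative assumption, not a measurement.

    import numpy as np

    # Minimal source-filter sketch of the process described above: air
    # expelled from the lungs, pulsed at the glottis (the "source"), is
    # shaped by vocal-tract resonances (the "filter").
    fs = 16000                    # sampling rate (Hz), assumed
    f0 = 120                      # glottal pulse rate / voice pitch (Hz), assumed
    n = int(0.5 * fs)             # half a second of "speech"

    source = np.zeros(n)          # glottal pulse train
    source[:: fs // f0] = 1.0

    def resonator(x, freq, bw):
        """One formant: a second-order IIR resonance, unity gain at DC."""
        r = np.exp(-np.pi * bw / fs)
        b, c = 2 * r * np.cos(2 * np.pi * freq / fs), -r * r
        a = 1.0 - b - c
        y = np.zeros_like(x)
        for i in range(len(x)):
            y[i] = (a * x[i]
                    + b * (y[i - 1] if i > 0 else 0.0)
                    + c * (y[i - 2] if i > 1 else 0.0))
        return y

    # "Contracting and relaxing the vocal tract" = changing the filter:
    # F1/F2 near (700, 1220) Hz give an /a/-like vowel from this source.
    vowel_a = resonator(resonator(source, 700.0, 130.0), 1220.0, 70.0)

Swapping in different formant settings while keeping the same source is the computational analogue of changing the vocal-tract shape to produce different vowels.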

Introduction

Speech is one of the most natural forms of communication for human beings. Researchers in speech technology are working on systems that can understand speech and talk with human users.

Human-computer interaction is a discipline concerned with the design, evaluation, and implementation...



Author information

Laura Docio-Fernandez, Department of Signal Theory and Communications, University of Vigo, Vigo, Spain

Carmen García Mateo, Atlantic Research Center for Information and Communication Technologies, University of Vigo, Pontevedra, Spain

Editor information

Stan Z. Li, Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Anil K. Jain, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA


Copyright information

© 2015 Springer Science+Business Media New York

About this entry

Cite this entry:

Docio-Fernandez, L., García Mateo, C. (2015). Speech Production. In: Li, S.Z., Jain, A.K. (eds) Encyclopedia of Biometrics. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7488-4_199


DOI: https://doi.org/10.1007/978-1-4899-7488-4_199

Published: 03 July 2015

Publisher Name: Springer, Boston, MA

Print ISBN: 978-1-4899-7487-7

Online ISBN: 978-1-4899-7488-4



Speech Production
by Eryk Walczak
Last reviewed: 22 February 2018 | Last modified: 22 February 2018
DOI: 10.1093/obo/9780199772810-0217

In This Article: Speech Production

  • Introduction
  • Historical Studies
  • Animal Studies
  • Evolution and Development
  • Functional Magnetic Resonance and Positron Emission Tomography
  • Electroencephalography and Other Approaches
  • Theoretical Models
  • Speech Apparatus
  • Speech Disorders



Introduction

Speech production is one of the most complex human activities. It involves coordinating numerous muscles and complex cognitive processes. The area of speech production is related to Articulatory Phonetics, Acoustic Phonetics, and Speech Perception, all of which study various elements of language and form part of the broader field of Linguistics. Because of its interdisciplinary nature, the topic is usually studied on several levels: neurological, acoustic, motor, evolutionary, and developmental. Each level has its own literature, but most work on speech production touches on all of them. Much of the relevant literature is covered in the Speech Perception entry, on which this bibliography builds. This entry covers general speech production mechanisms and speech disorders. Speech production in second language learners and bilinguals has special features, which are described in the separate bibliography on Cross-Language Speech Perception and Production. Speech produces sounds, and sounds are the subject matter of Phonology.

As mentioned in the introduction, speech production tends to be described in relation to acoustics, speech perception, neuroscience, and linguistics. Because of this interdisciplinarity, few published textbooks focus exclusively on speech production; Guenther 2016 and Levelt 1993 are the exceptions. The former has a stronger focus on the neuroscientific underpinnings of speech. Auditory neuroscience is also extensively covered by Schnupp, et al. 2011 and in the extensive textbook Hickok and Small 2015. Rosen and Howell 2011 is a textbook focusing on the signal processing and acoustics that any speech scientist needs to understand. A historical approach to psycholinguistics that also covers speech research is Levelt 2013.

Guenther, F. H. 2016. Neural control of speech. Cambridge, MA: MIT.

This textbook provides an overview of the neural processes responsible for speech production. Large sections describe speech motor control, especially the DIVA model (co-developed by Guenther). It includes extensive coverage of behavioral and neuroimaging studies of speech as well as speech disorders, and ties them together with a unifying theoretical framework.

Hickok, G., and S. L. Small. 2015. Neurobiology of language. London: Academic Press.

This voluminous textbook edited by Hickok and Small covers a wide range of topics related to the neurobiology of language. It includes a section devoted to speaking, covering the neurobiology of speech production, motor-control perspectives, neuroimaging studies, and aphasia.

Levelt, W. J. M. 1993. Speaking: From intention to articulation. Cambridge, MA: MIT.

A seminal textbook, Speaking is worth reading particularly for its detailed explanation of the author's speech model, which is part of his broader language model. The book is slightly dated, as it was released in 1993, but chapters 8–12 are especially relevant to readers interested in phonetic plans, articulating, and self-monitoring.

Levelt, W. J. M. 2013. A history of psycholinguistics: The pre-Chomskyan era. Oxford: Oxford University Press.

Levelt published another important book detailing the development of psycholinguistics. As its title suggests, it focuses on the early history of the discipline, so readers interested in historical research on speech will find an abundance of speech-related work in this book. It covers a wide range of psycholinguistic specializations.

Rosen, S., and P. Howell. 2011. Signals and systems for speech and hearing. 2d ed. Bingley, UK: Emerald.

Rosen and Howell provide a low-level explanation of speech signals and systems. The book includes informative charts explaining the basic acoustic and signal processing concepts useful for understanding speech science.

Schnupp, J., I. Nelken, and A. King. 2011. Auditory neuroscience: Making sense of sound. Cambridge, MA: MIT.

A general introduction to speech concepts with a main focus on neuroscience. The textbook is linked to a website that provides demonstrations of the phenomena described.



The Oxford Handbook of Psycholinguistics

29 Speech Production

Carol A. Fowler, Haskins Laboratories and Department of Psychology, University of Connecticut, and Department of Linguistics, Yale University.

Published: 18 September 2012

Abstract

A theory of speech production provides an account of the means by which a planned sequence of language forms is implemented as vocal tract activity that gives rise to an audible, intelligible acoustic speech signal. Such an account must address several issues. Two central issues are considered in this article. One issue concerns the nature of language forms that ostensibly compose plans for utterances. Because of their role in making linguistic messages public, a straightforward idea is that language forms are themselves the public behaviors in which members of a language community engage when talking. By most accounts, however, the relation of phonological segments to actions of the vocal tract is not one of identity. Rather, phonological segments are mental categories with featural attributes. Another issue concerns what, at various levels of description, the talker aims to achieve. This article focuses on speech production, and considers language forms and plans for speaking, along with speakers' goals as acoustic targets or vocal tract gestures, the DIVA theory of speech production, the task dynamic model, coarticulation, and prosody.

Language forms provide the means by which language users can make an intended linguistic message available to other members of the language community. Necessarily, then, they have two distinct characteristics. On the one hand, they are linguistic entities, morphemes and phonological segments, that encode the talker's linguistic message. On the other hand, they either have physical properties themselves (e.g. Browman and Goldstein, 1986) or, by other accounts, they serve as an interface between the linguistic and physical domains of language use.

A theory of speech production provides an account of the means by which a planned sequence of language forms is implemented as vocal tract activity that gives rise to an audible, intelligible acoustic speech signal.¹ Such an account must address several issues. Two central issues are discussed here.

One issue concerns the nature of language forms that ostensibly compose plans for utterances. Because of their role in making linguistic messages public, a straightforward idea is that language forms are themselves the public behaviors in which members of a language community engage when talking. By most accounts, however, the relation of phonological segments to actions of the vocal tract is not one of identity. Rather, phonological segments are mental categories with featural attributes. We will consider reasons for this stance, relevant evidence, and an alternative theoretical perspective.

Another issue concerns what, at various levels of description, the talker aims to achieve (e.g. Levelt et al., 1999). In my discussion of this issue, I focus here on the lowest level of description—that is, on what talkers aim to make publicly available to listeners. A fundamental theoretical divide here concerns whether the aims are acoustic or articulatory. On the one hand, it is the acoustic signal that stimulates the listener's ears, and so one might expect talkers to aim for acoustic targets that point listeners toward the language forms that compose the talker's intended message. On the other hand, acoustic speech signals are produced by vocal tract actions. The speaker has to get the actions right to get the acoustic signal right.

Readers may wonder whether this is a “tempest in a teapot.” That is, why not suppose that talkers plan and control articulations that will get the signal right, so that in a sense both articulation and acoustics are controlled? Readers will see, however, that there are reasons why theorists typically choose one account or the other.

These issues are considered in turn in the following two sections.

29.1 Language forms and plans for speaking

By most accounts, as already noted, neither articulation nor the acoustic signal is presumed to implement phonological language forms transparently. Language forms are conceived of as abstract mental categories about which acoustic speech signals provide cues.

There are two quite different reasons for this point of view. One is that language forms are cognitive entities (e.g. Pierrehumbert, 1990). In particular, word forms are associated, presumably in the lexical memory of a language user, with word meanings. As such they constitute an important part of what a language user knows that permits him or her to produce and understand language. Moreover, word forms in the lexicons of languages exhibit systematic properties which can be captured by formal rules. There is some evidence that language users know these rules. For example, in English, voiceless stop consonants are aspirated in stressed syllable-initial position. That systematic property can be captured by a rule (Kenstowicz and Kisseberth, 1979).

Evidence that such a rule is part of a language user's competence is provided, for example, by foreign accents. When native English speakers produce words in a Romance language such as French, which has unaspirated stops where English has aspirated stops, they tend to aspirate the stops. Accordingly, the French word pas, [pa],² is pronounced [pʰa], as if the English speaker were applying the English rule to French words. A second source of evidence comes from spontaneous errors of speech production. Kenstowicz and Kisseberth (1979) report an error in which a speaker intended to produce tail spin, but instead said pail stin. In the intended utterance, /t/ in tail is aspirated; /p/ in spin is unaspirated. The authors report, however, that, in the error, appropriately for their new locations, /p/ was pronounced [pʰ] and /t/ was pronounced [t]. One account of this "accommodation" (but not the only one possible) is that the exchange of /t/ and /p/ occurred before the aspiration rule had been applied by the talker. When the aspiration rule was applied, /p/ was accommodated to its new context.³
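The interaction between the exchange error and the aspiration rule can be made concrete with a toy implementation. The sketch below assumes a deliberately simplified representation (syllables as flagged lists of letter-sized segments, an appended "h" as a stand-in for the aspiration diacritic); it illustrates the rule-ordering account, not how speakers actually encode the rule.

    # Toy version of the English aspiration rule: voiceless stops are
    # aspirated in stressed syllable-initial position. Representation is
    # a simplified assumption: (stressed?, segment list) per syllable.
    VOICELESS_STOPS = {"p", "t", "k"}

    def aspirate(syllables):
        """Aspirate a voiceless stop that begins a stressed syllable."""
        out = []
        for stressed, segments in syllables:
            segs = list(segments)
            if stressed and segs and segs[0] in VOICELESS_STOPS:
                segs[0] += "h"          # crude stand-in for [ʰ]
            out.append((stressed, segs))
        return out

    # Intended "tail spin": /t/ is syllable-initial (aspirated); /p/ sits
    # inside the /sp/ cluster (unaspirated).
    print(aspirate([(True, "tail"), (True, "spin")]))

    # The exchange applied BEFORE the rule gives "pail stin"; the rule
    # then aspirates /p/ and leaves /t/ plain, reproducing the reported
    # accommodation.
    print(aspirate([(True, "pail"), (True, "stin")]))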

A second reason to suppose that language forms exist only in the mind is coarticulation. Speakers temporally overlap the articulatory movements for successive consonants and vowels. This makes the movements associated with a given phonetic segment context-sensitive and lacking an obvious discrete segmental structure. Likewise, the acoustic signal which the movements produce is context-sensitive. Despite researchers' best efforts (e.g. Stevens and Blumstein, 1981), they have not uncovered invariant acoustic information for individual consonants and vowels. In addition, the acoustic signal, like the movements that produce it, lacks a phone-sized segmental structure.
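The context sensitivity that overlap creates can be sketched numerically. In the coproduction-style toy model below (made-up numbers, not any specific published model), each segment contributes a spatial target weighted by a time-limited activation curve; wherever activations overlap, the articulator's trajectory blends neighboring targets, so no discrete segment boundary survives in the movement.

    import numpy as np

    # Coarticulation as temporal overlap: each segment has a target
    # position and a bell-shaped activation in time; the articulator
    # follows the activation-weighted blend of targets. All values are
    # illustrative assumptions.
    t = np.linspace(0.0, 1.0, 200)

    def activation(center, width=0.15):
        return np.exp(-((t - center) / width) ** 2)

    targets = {"C": 1.0, "V": -0.5}                  # arbitrary positions
    acts = {"C": activation(0.35), "V": activation(0.55)}

    blend = sum(targets[s] * acts[s] for s in acts) / (sum(acts.values()) + 1e-9)
    # Where the activations overlap, "blend" reaches only part-way to
    # each target: the movement "for" C depends on the neighboring V.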

This evidence notwithstanding, there are reasons to resist the idea that language forms reside only in the minds of language users. They are, as noted, the means that languages provide to make linguistic messages public. Successful recognition of language forms would seem more secure were the forms themselves public things.

Browman and Goldstein (e.g. 1986; 1992) have proposed that phonological language forms are gestures achieved by vocal tract synergies that create and release constrictions. They are both the actions of the vocal tract (properly described) that occur during speech and at the same time units of linguistic contrast. ("Contrast" means that a change in a gesture or gestural parameter can change the identity of a word. For example, the word hot can become tot by addition of a tongue tip constriction gesture; tot can become sot by a change in the tongue tip's constriction degree.)

From this perspective, phonetic gestures are cognitive in nature. That is, they are components of a language user's linguistic competence, and, as noted, they serve as units of contrast in the language. However, cognitive entities need not be covert (see e.g. Ryle, 1949). They can be psychologically meaningful actions, in this case of a language user. As for coarticulation, although it creates context sensitivity in articulatory movements, it does not make gestures context-sensitive. For example, lip closure for /b/, /p/, and /m/ occurs despite coarticulatory encroachment from vowels that affects jaw and lip motion.
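The claim that gestures double as units of contrast can be made concrete as a data sketch. The attribute names and parameter values below are simplified illustrations (the coda /t/ gesture is omitted for brevity), not Browman and Goldstein's actual task-dynamic specifications.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Gesture:
        organ: str       # constricting organ, e.g. "tongue tip"
        location: str    # constriction location, e.g. "alveolar"
        degree: str      # constriction degree, e.g. "closure"

    # Hypothetical, simplified gestural composition of "hot".
    hot = frozenset({Gesture("glottis", "glottal", "wide"),
                     Gesture("tongue body", "pharyngeal", "narrow")})

    # hot -> tot: ADDING a tongue-tip closure gesture changes the word.
    tot = hot | {Gesture("tongue tip", "alveolar", "closure")}

    # tot -> sot: changing the tongue tip's constriction DEGREE
    # (closure -> critical, i.e. a fricative) changes the word again.
    sot = (tot - {Gesture("tongue tip", "alveolar", "closure")}) \
          | {Gesture("tongue tip", "alveolar", "critical")}

The same objects thus serve both as specifications of vocal-tract action and as the minimal units whose presence or parameter settings distinguish words.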

There is some skepticism about whether Browman and Goldstein's "articulatory phonology" as just described goes far enough beyond articulatory phonetics.⁴ This is in part because it does not yet provide an account of many of the phonological systematicities (e.g. vowel harmony in Hungarian, Turkish, and many other languages; but see Gafos and Benus, 2003) which exist across the lexicon of languages and that other theories of phonology capture by means of rules (e.g. Kenstowicz and Kisseberth, 1979) or constraints (Archangeli, 1997). However, the theory is well worth considering, because it is unique in proposing that language forms are public events.

Spontaneous errors of speech production have proved important sources of evidence about language planning units. These errors, produced by people who are capable of producing error-free tokens, appear to provide evidence both about the units of language that speakers plan to produce and about the domain over which they plan. Happily, the units which participate in errors have appeared to converge with units that linguistic analysis has identified as real units of the language. For example, words participate in errors as anticipations (e.g. sky is in the sky for intended sun is in the sky; this and other errors from Dell, 1986), perseverations (class will be about discussing the class for intended class will be about discussing the test), exchanges (writing a mother to my letter for writing a letter to my mother), and non-contextual substitutions (pass the salt for pass the pepper). Consonants and vowels participate in the same kinds of error. Syllables do so only rarely; however, they serve as frames that constrain how consonants and vowels participate in errors. Onset consonants interact only with onset consonants; vowels interact with vowels; and, albeit rarely, coda consonants interact with coda consonants. Interacting segments tend to be featurally similar to one another. Moreover, when segments move, they tend to move to contexts which are featurally similar to the contexts in which they were planned to occur. Segments are anticipated over shorter distances than words (Garrett, 1980), suggesting that the planning domains for words and phonological segments are different.

Historically, most error corpora were collected by individuals who transcribed the errors that they heard. As noted, the errors tended to converge with linguists' view of language forms as cognitive, not physical entities (e.g. Pierrehumbert, 1990). As researchers moved error collection into the laboratory, however, it became clear that errors occur that are inaudible. Moreover, these errors violate constraints on errors that collectors had identified.

One constraint was that errors are categorical in nature. If, in production of Bob flew by Bligh Bay, the /l/ of Bligh were perseverated into the onset of Bay, producing Blay, the /l/ would be a fully audible production. However, electromyographic evidence revealed to Mowrey and MacKay (1990) that errors are gradient. Some produce an audible token of /l/; others do not, yet show activity of a lingual muscle indicating the occurrence of a small lingual (tongue) gesture for /l/.

A second constraint is that errors result in phonologically well-formed utterances. Not only do vowels interact only with other vowels in errors, and onsets with onsets and codas with codas, but also sequences of consonants in onsets and codas tend to be permissible in the speaker's language. Or so investigators thought before articulatory data were collected in the laboratory. Pouplier (2003a; 2003b) used a midsagittal electromagnetometer to collect articulator movement data as participants produced repetitions of pairs of words such as cop—top or sop—shop. Like Mowrey and MacKay (1990), she found errorful articulations (for example, intrusive tongue tip movement toward a /t/ articulation during cop) in utterances that sounded error-free. In addition, however, she found that, characteristically, intrusions were not accompanied by reductions of the intended gesture. This meant that, in the foregoing example, constriction gestures for both /t/ and /k/ occurred in the onset of a syllable, a phonotactically impermissible cluster for her English speakers.

What do these findings imply for theories of speech production? For Pouplier and colleagues (Pouplier, 2003b; Goldstein et al., forthcoming), planning units are intended sequences of vocal-tract gestures that are coordinated in the manner of coupled oscillators. In the literature on limb movements, it has been found that two modes of coordination are stable. Limbs (or hands or fingers) may be oscillated in phase or 180 degrees out of phase (so that extension of one limb occurs when the other limb is flexing). In tasks in which, for example (Kelso, 1984; see also Yamanishi et al., 1980), hands are oscillated about the wrist at increasing rates, in-phase movements remain stable; however, out-of-phase movements become unstable. Participants attempting to maintain out-of-phase movements slip into phase. Pouplier and colleagues suggest that findings of intrusive tongue tip gestures in the onset of cop and of intrusive tongue body gestures in top constitute a similar shift from a less to a more stable oscillation mode. When top—cop is repeated, syllable onsets /t/ and /k/ each occur once for each pair of rime (/ap/) productions, giving a 1:2 coordination mode. When intrusive /t/ and /k/ gestures occur, the new coordination mode is 1:1; that is, the new onset is produced once for each one production of the syllable rime. A 1:1 coordination mode is more stable than a 1:2 mode.
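The limb findings cited here are commonly modeled with the Haken-Kelso-Bunz relative-phase equation, and a minimal numerical sketch shows the stability asymmetry. Parameter values below are arbitrary illustrations: with strong second-mode coupling (large b/a, the analogue of slow rates) both in-phase and anti-phase coordination survive, while with weak coupling (fast rates) anti-phase collapses into in-phase.

    import numpy as np

    def settle(phi0, b_over_a, a=1.0, dt=0.001, steps=20000):
        """Integrate dphi/dt = -a*sin(phi) - 2*b*sin(2*phi) (HKB form)."""
        b = b_over_a * a
        phi = phi0
        for _ in range(steps):
            phi += dt * (-a * np.sin(phi) - 2.0 * b * np.sin(2.0 * phi))
        return round((phi + np.pi) % (2 * np.pi) - np.pi, 3)  # wrap to (-pi, pi]

    # Slow rate (b/a = 1): in-phase (~0) and anti-phase (~pi) both stable.
    print(settle(0.1, 1.0), settle(np.pi - 0.1, 1.0))
    # Fast rate (b/a = 0.1): the anti-phase start slips into phase (~0),
    # the analogue of the less stable coordination pattern giving way to
    # the more stable one, as in the shift from 1:2 to 1:1 above.
    print(settle(0.1, 0.1), settle(np.pi - 0.1, 0.1))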

A question is what the findings of gradient, phonotactically impermissible errors imply about the interpretability of error analyses based on transcribed, rather than articulatory, corpora. Certainly these errors occur, and certainly they were missed in transcription corpora. However, does it mean that categorical consonant and vowel errors do not occur, and that planning units should be considered intended phonetic gestures (Pouplier) or even commands to muscles (Mowrey and MacKay), rather than the consonants and vowels of traditional phonetic analysis?

There are clearly categorical errors that occur at the level of whole words (recall writing a mother to my letter). It does not seem implausible, therefore, that categorical phonetic errors also occur. It may be appropriate (as in the model of Levelt et al., 1999) to imagine levels of speech planning, with consonants and vowels of traditional analyses serving as elements of plans at one level, giving way to planned gestures at another.

Findings that error corpora in some ways misrepresent the nature of spontaneous errors of speech production, however, have had the positive consequence that researchers have sought converging (or, as appropriate, diverging) evidence from experiments that elicit error-free speech. For example, Meyer (1991) found evidence for syllable constituents serving as "encoding" units in language production planning. Participants memorized sets of word pairs consisting of a prompt word produced by the experimenter and a response word produced as quickly as possible by the participant. Response words in a set were "homogeneous" if they shared one or more phonological segments; otherwise they were "heterogeneous." Meyer found faster responses to words in homogeneous than in heterogeneous sets if response words shared their initial consonant or initial syllable, but not if they shared the syllable rime (that is, the vowel and any following consonants). There was no further advantage when the CV of a CVC syllable was shared in homogeneous sets as compared to when just the initial C was shared, but responses to words sharing the whole first syllable did show an advantage over responses to words sharing only the initial consonant. These findings suggest, as errors do, that syllable constituents are among the planning units. They also suggest that encoding for production is a sequential "left-to-right" process.

Sevald et al. (1995) obtained evidence converging with the error data, suggesting that syllables serve as planning frames. They asked participants to repeat pairs of non-words (e.g. KIL KILPER or KIL KILPNER) in which the initial monosyllable either did or did not match the initial syllable of the disyllable. The task was to repeat the pair as many times as possible in four seconds. Mean syllable production time was shorter when the syllable structure matched. Remarkably, the advantage of matching syllable structure was no smaller when only syllable structure, but not syllable content, matched (e.g. KEM TILFER vs. KEM TILFNER). In the foregoing examples, it looks as if the advantage could be due to there being fewer phonetic segments to produce in the matching condition; however, there were other items in which the length advantage was reversed.

29.2 Speakers' goals as acoustic targets or vocal tract gestures

A next issue is how intended sequences of phonetic entities are planned to be implemented as actions or their consequences that are available to a listener. In principle, this issue is orthogonal to the one just considered about the nature of planned language forms. As just discussed, these forms are variously held to be covert, cognitive representations or public, albeit still cognitive, entities. Either view is compatible with proposals that, at the lowest level of description, talkers aim to achieve either acoustic or gestural targets. In the discussion below, therefore, the issue of whether language forms are covert or public in nature is set aside. It may be obvious, however, that, in fact, acoustic target theorists at least implicitly hold the former view and gesture theorists the latter.

Guenther et al. ( 1998 ) argue against gestural targets on several grounds and argue for acoustic targets. One ground for rejecting gestural targets, such as constriction location and degree, concerns the feedback information that speakers would need to implement the targets sufficiently accurately. To know whether or not a particular constriction has been achieved requires perceptual information. If, for example, an intended constriction is by the lips (as for /b/, /p/, or /m/), talkers can verify that the lips are closed from proprioceptive information for lip contact. However, Guenther et al. argue that, in particular for vowels, constrictions do not always involve contact by articulators, and therefore intended constrictions cannot be verified. In addition, they argue, to propose that talkers intend to achieve particular constrictions implies that talkers should not be able to compensate for experimental perturbations that prevent those constrictions from being achieved. However, some evidence suggests that they can. For example, Savariaux et al. ( 1995 ) had talkers produce vowels with a tube between their lips that prevented normal lip rounding for the vowel /u/. The acoustic effects of the lip tube could be compensated for by lowering the larynx (thereby enlarging the oral cavity by another means than rounding). Of the eleven participants, one compensated fully for the lip tube. Six others showed limited evidence of compensation.

A third argument for acoustic targets is provided by American English /r/. According to Guenther et al., /r/ is produced in very different ways by different speakers, or even by the same speaker in different contexts, yet the different means of producing /r/ are acoustically very similar. One account of the articulatory variability, then, is that it is tolerated so long as the different means of production yield inaudibly different acoustic signals, the talker's production aim. Finally, Guenther et al. argue that ostensible evidence for constriction targets—that, for example, invariant constriction gestures occur for /b/ and other segments—need not be seen as evidence uniquely favoring gestural targets. Their model "DIVA" (originally an acronym for a mapping of "directions in orosensory space" onto "velocities of articulators"; described below) learns to achieve acoustic-perceptual targets, but nonetheless shows constriction invariance. However, there is also evidence favoring the alternative idea that talkers' goals are articulatory, not acoustic. Moreover, the arguments of Guenther et al. favoring acoustic targets can be challenged.

Tremblay et al. ( 2003 ) applied mechanical perturbations to the jaw of talkers producing the word sequence see—at. The perturbation altered the motion path of the jaw, but had small and inaudible acoustic effects. Even though acoustic effects were inaudible, over repetitions, talkers compensated for the perturbations and showed after-effects when the perturbation was removed. Compensation also occurred in a silent speech condition, but not in a non-speech jaw movement condition. These results appear inconsistent with a hypothesis that speech targets are acoustic.

There is also a more natural speech example of preservation of inaudible articulations. In an investigation of an X-ray microbeam database, Browman and Goldstein ( 1991 ) found examples of utterances such as perfect memory in which transcription suggested deletion of the final /t/ of perfect. However, examination of the tongue tip gesture for the /t/ revealed its presence. Because of overlap from the bilabial gesture of /m/, however, acoustic consequences of the /t/ constriction gesture were absent or inaudible. As for the suggestion that constriction goals should be unverifiable by feedback when constricting articulators are not in contact with another structure, to my knowledge this is untested speculation.

As for the compensation found by Savariaux et al. (1995; see also Perkell et al., 1993), Guenther et al. do not remark that the compensation is markedly different from that associated with certain other perturbations in being, for most participants, either partial or absent. Compensations for a bite block (which prevents jaw movement) are immediate and nearly complete in production of vowels (e.g. Lindblom et al., 1979). Compensations for jaw and lip perturbations during speech (e.g. tugging the jaw down as it rises to close the lips for a /b/) are very short in latency, immediate, and nearly complete (e.g. Kelso et al., 1984). These different patterns of compensation are not distinct in the DIVA model; they are in speakers. The difference may be understood as relating to the extent to which laboratory perturbations mimic perturbations that occur naturally in speech production. When a speaker produces, say, /ba/ versus /bi/, coarticulation by the following low (/a/) or high (/i/) vowel will tug the jaw and lower lip down or up, and speakers have to compensate for that to get the lips shut for bilabial /b/. That routine compensation for coarticulation may underlie the fast and functional compensations that occur in the laboratory (Fowler and Saltzman, 1993). However, it is a rare perturbation outside the laboratory that prevents lip rounding. Accordingly, talkers may have no routines in place to compensate for the lip tube, and have to learn them. In a gestural theory, they have to learn to create a mirage—that is, an acoustic signal that mimics the consequences of lip rounding.

As for /r/, ironically, it has turned out to be a poster child for both acoustic and articulatory theorists. Delattre and Freeman ( 1968 ), whom Guenther et al. cite as showing considerable variability in American English articulation of /r/, in fact remark that in every variant they observed there were two constrictions, one by the back of the tongue in the pharyngeal region and one by the tongue tip against the hard palate. (Delattre and Freeman were only looking at the tongue, and so did not remark on a third shared constriction, rounding by the lips.) Accordingly, whether one sees variability or invariance in /r/ articulations may depend on the level of description of the vocal tract configuration deemed relevant to talkers and listeners. In Browman and Goldstein's articulatory phonology (e.g. 1986 ; 1995 ), the relevant level is that of constriction locations and degrees, and those are invariant across the /r/ variants.

Focus on constrictions permits an understanding of a source of dialect variation in American English /r/ that is not illuminated by a proposal that acoustic targets are talkers' aims. Among consonants involving more than one constriction—for example, the nasal consonants (constrictions by lips, tongue tip or tongue body, and by the velum), the liquids /l/ (tongue tip and body) and /r/ (tongue body, tip, and lips), and the approximant /w/ (tongue body and lips)—a generalization holds regarding the phasing of the constriction gestures. Prevocalically, the gestures are achieved nearly simultaneously; postvocalically, the gesture with the more open (vowel-like) constriction degree leads (see research by Sproat and Fujimura, 1993; Krakow, 1989; 1993; Gick, 1999). This is consistent with the general tendency in syllables for the more sonorant (roughly, more vowel-like) consonants to be positioned closest to the vowel. (For example, the ordering in English is /tr/ before the vowel as in tray, but /rt/ after the vowel as in art.) Goldstein (pers. comm., 15 Aug. 2005) points out that, in two dialects of American English, one spoken in Brooklyn and one in New Orleans, talkers produce postvocalic /r/ in such a way that, for example, bird sounds to listeners somewhat like boyd. This is understandable if talkers exaggerate the tendency for the open lip and tongue body constrictions to lead the tip constriction. Together, the lip and tongue body configurations create a vowel sound like /ɔ/ (in saw); by itself, the tip gesture is like /i/ (in see). Together, the set of gestures yields something resembling the diphthong /ɔi/ as in boy.

In short, there are arguments and there is evidence favoring both theoretical perspectives—that targets of speech production planning are acoustic or else are gestural. Deciding between the perspectives will require further research.

29.2.1 Theories of speech production

As noted, theories of speech production differ in their answer to the question of what talkers aim to achieve, and a fundamental difference is whether intended targets are acoustic or articulatory. Within acoustic theories, accounts can differ in the nature of acoustic targets; within articulatory theories, accounts can be that muscle lengths or muscle contractions are targets, that articulatory movements are targets, or that coordinated articulatory gestures are targets. I will review one acoustic and one articulatory account. I chose these accounts because they are the most fully developed theories within the acoustic and articulatory domains.

29.3 The DIVA theory of speech production

In this account (e.g. Guenther et al., 1998 ), targets of speaking are normalized acoustic signals reflecting resonances of the vocal tract (“formants”). The normalization transformations create formant values that are the same for men, women, and children even though acoustic reflections of formants are higher in frequency for women than for men and for young children than for women. Because formants characterize vowels and sonorant consonants but not (for example) stop or fricative consonants, the model is restricted to explanation of just those classes of phones.

Between approximately six and eight months of age, infants engage in vocal behavior called babbling, in which they produce what sounds like sequences of CV syllables. In this way, the young DIVA model learns a mapping from articulator positions to normalized acoustic signals. Over learning, this mapping is inverted so that acoustic-perceptual targets can underlie control of articulatory movements. In the model, the perceived acoustic signal has three degrees of freedom (one per normalized formant). In contrast, the articulatory system has the seven degrees of freedom of Maeda's (1990) articulatory model. This difference in degrees of freedom means that the inverted mapping is one-to-many. Accordingly, a constraint is required to make the mapping determinate. Guenther et al. use a "postural relaxation" constraint whereby the articulators remain as close as possible to the centers of their ranges of motion. This constraint underlies the model's tendency to show near-invariance of constrictions despite having acoustic-perceptual rather than articulatory targets.
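
The consequence of that constraint can be illustrated with a minimal numerical sketch. This is not Guenther et al.'s implementation: a fixed linear map stands in for the mapping learned during babbling (three normalized formants, seven articulatory degrees of freedom, echoing Maeda's model), and a pseudoinverse controller with a null-space pull toward a neutral posture plays the role of the postural relaxation constraint. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy forward map: 7 articulatory degrees of freedom -> 3 normalized formants.
# A fixed linear Jacobian stands in for the mapping learned during babbling.
J = rng.standard_normal((3, 7))

q = np.zeros(7)                        # articulator positions; 0 = range centers
q_rest = np.zeros(7)                   # "postural relaxation" attractor
x_target = np.array([0.5, -0.3, 0.8])  # desired formant pattern (arbitrary units)

J_pinv = np.linalg.pinv(J)
null_proj = np.eye(7) - J_pinv @ J     # projector onto the Jacobian's null space

for _ in range(200):
    x_err = x_target - J @ q
    # Task-space correction plus a null-space pull toward the neutral posture;
    # the second term cannot disturb the acoustic output (J @ null_proj = 0).
    q += 0.1 * (J_pinv @ x_err) + 0.05 * (null_proj @ (q_rest - q))

print("formant error:", np.linalg.norm(J @ q - x_target))  # effectively zero
print("posture norm :", np.linalg.norm(q))                 # stays near range centers
```

Because many articulator configurations produce the same formants, the null-space term selects among them, which is one way a model with purely acoustic targets can nonetheless end up producing near-invariant articulations.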

In addition to that characteristic, the model compensates for perturbations—without, however, distinguishing the perturbations that humans compensate for well from those they compensate for poorly.

29.4 The task dynamic model

Substantially influenced by the theorizing of Bernstein ( 1967 ), Turvey ( 1977 ) introduced a theory of action in which he proposed that the minimal meaningful units of action were produced by synergies or coordinative structures (Easton, 1972 ). These are transiently established coordinative relations among articulators—those of the vocal tract for speech—which achieve action goals. An example in speech is the organized relation among the jaw and the two lips that achieves bilabial constriction for English /b/, /p/, or /m/. That coordinative relation is not in place when speakers produce a constriction which does not include lip closure (e.g. Kelso et al., 1984 ). The coordinative relation underlies the ability of speakers to compensate for jaw or lip perturbations in the laboratory, and presumably to compensate for coarticulatory demands on articulators shared by temporally overlapping phones outside the laboratory.

Saltzman and colleagues (e.g. Saltzman and Kelso, 1987; Saltzman and Munhall, 1989; see also Turvey, 1990) proposed that synergies are usefully modeled as dynamical systems. Specifically, they suggested that speech gestures can be modeled as mass-spring systems with point attractor dynamics. Those systems, in turn, are characterized by equations that describe how the systems' states change over time. Each vocal tract gesture is defined in terms of "tract variables," which include lip protrusion (a constriction location) and lip aperture (a constriction degree). Appropriately parameterized, the tract variables achieve gestural goals. Each tract variable has associated articulators (e.g. the jaw and the two lips) that constitute the synergy that achieves that gestural goal. In one version of the theory, a word is specified by a "gestural score" (Browman and Goldstein, 1986) which provides parameters for the relevant tract variables and the interval of time over which they should be active. In a more recent version (Saltzman et al., 2000), gestural scores are replaced by a central "clock" that regulates the timing of gesture activation. The clock's average "tick" rate determines the average rate of speaking. As we will see later, local clock slowing can mark the edges of prosodic domains.

These systems show the equifinality characteristic of real speakers, which underlies their ability to compensate for perturbations. That is, although the parameters of the dynamical system for a gesture have context-independent values, gestural goals are achieved in a context-dependent manner, so that, for example, as in the research of Kelso et al. (1984), lip closure for /b/ is achieved by different contributions from the lips and jaw on perturbed and unperturbed trials. The model compensates for the perturbations that speakers handle without learning, but not for those, such as the lip tube of Savariaux et al., that speakers must learn to handle, if they handle them at all.
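
To make the dynamical characterization concrete, here is a toy sketch, not the task dynamic model itself. A lip-aperture tract variable y follows critically damped point-attractor dynamics (in effect, ÿ = k(y0 − y) − bẏ, with the attractor at the gestural target y0), and the task acceleration is shared between notional jaw and lip articulators; freezing the jaw mimics a mechanical perturbation. The articulator composition and all parameters are invented.

```python
def simulate(freeze_jaw=False, steps=400, dt=0.001):
    # Tract variable: lip aperture y = jaw + lip (a toy two-articulator synergy).
    jaw = lip = v_jaw = v_lip = 0.0
    y_target = 1.0               # gestural goal (e.g. lip closure for /b/)
    k, b = 400.0, 40.0           # stiffness and critical damping (b = 2*sqrt(k))
    for _ in range(steps):
        y, v = jaw + lip, v_jaw + v_lip
        a = k * (y_target - y) - b * v      # point-attractor dynamics on the task
        if freeze_jaw:                      # mechanical perturbation: jaw held fixed
            a_jaw, a_lip = 0.0, a
        else:                               # synergy shares the task acceleration
            a_jaw, a_lip = 0.5 * a, 0.5 * a
        v_jaw += a_jaw * dt; jaw += v_jaw * dt
        v_lip += a_lip * dt; lip += v_lip * dt
    return round(jaw + lip, 3), round(jaw, 3), round(lip, 3)

print(simulate(False))  # goal met (~1.0) with shared jaw and lip contributions
print(simulate(True))   # goal still met (~1.0): the lip alone compensates
```

Because the goal is defined on the task variable rather than on individual articulators, the perturbed and unperturbed runs reach the same aperture by different articulator contributions, which is the equifinality just described.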

29.5 Coarticulation

A hallmark of speech production is coarticulation. Speakers talk very quickly, and talking involves rapid sequencing of the particulate atoms (Studdert-Kennedy, 1998 ) which constitute language forms. Although the atoms are discrete, their articulation is not. Much research on speech production has been conducted with an aim to understand coarticulation. Coarticulation is characterized either as context-sensitivity of production of language forms or as temporally overlapping production. It occurs in both an anticipatory and a carryover direction. In the word stew , for example, lip rounding from the vowel /u/ begins near the beginning of the /s/. In use , it carries over during /s/.

Thirty years ago, there were two classes of accounts of coarticulation. In one point of view (e.g. Daniloff and Hammarberg, 1973) coarticulation was seen as "feature spreading." Consonants and vowels can be characterized by their featural attributes. For example, consonants can be described as being voiced or unvoiced, as having a particular place of articulation (e.g. bilabial for /b/, /p/, and /m/) and a particular manner of articulation (e.g. /b/ and /p/ are stops; /f/ is a fricative). Vowels are front, mid, or back; high, mid, or low; and rounded or unrounded. Many features which characterize consonants and vowels are contrastive, in that changing a feature value changes the identity of a consonant or vowel and the identity of a word that they, in part, compose. For example, changing the feature of a consonant from voiced to unvoiced can change a consonant from /b/ to /p/ and a word from bat to pat. However, some features are not contrastive. Adding rounding to a consonant does not change its identity in English; adding nasalization to a vowel in English likewise does not change its identity.

In feature spreading accounts of coarticulation, non-contrastive features were proposed to spread in an anticipatory direction to any phone unspecified for the feature (i.e. for which the feature was non-contrastive). Accordingly, lip rounding should spread through any consonant preceding a rounded vowel; nasalization should spread through any vowel preceding a nasal consonant. Carryover coarticulation was seen as inertial: articulators cannot stop on a dime. Accordingly, lip rounding might continue during a segment following a rounded vowel. There was some supportive evidence for the feature spreading view of anticipatory coarticulation (Daniloff and Moll, 1968).

However, there was also disconfirming evidence. One kind was a persistent finding (e.g. Benguerel and Cowan, 1974) that indications of coarticulation did not neatly begin at phonetic segment edges, as they should if a feature had spread from one phone to another. A second kind consisted of reports of "troughs" (e.g. Gay, 1978; Boyce, 1990): findings that, for example, during a consonant string between two rounded vowels, the lips would reduce their rounding and lip muscle activity would decrease, inconsistent with the idea that a rounding feature had spread to the consonants in the string.

A different general point of view was that coarticulation was "coproduction" (e.g. Fowler, 1977)—i.e. temporal overlap in the production of two or more phones. In this point of view, for example, rounding need not begin at the beginning of a consonant string preceding a rounded vowel, and a trough during a consonant string between two rounded vowels would be expected as the rounding gesture for the first vowel wound down before rounding for the second vowel began. Bell-Berti and Harris (1981) proposed a specific account of coproduction, known as "frame" theory, in which anticipatory coarticulation begins a fixed interval before the acoustically defined onset of a rounded vowel or nasal consonant.

For a while (Bladon and Al-Bamerni, 1982; Perkell and Chiang, 1986), there was the congenial suggestion that both theories might be right. Investigators sometimes found evidence of a rounding or nasalization gesture starting at the beginning of a consonant string (for rounding) or vowel string (for nasalization) preceding a rounded vowel or nasal consonant. Then, at an invariant interval before the rounded or nasal phone, there was a rapid increase in rounding or nasalization, as predicted by frame theory. However, that evidence was contaminated by a confound (Perkell and Matthies, 1992). Bell-Berti and colleagues (e.g. Boyce et al., 1990; Gelfer et al., 1989) pointed out that some consonants are themselves associated with lip rounding (e.g. /s/). Similarly, vowels are associated with lower positions of the velum compared to oral obstruents. Accordingly, to assess when anticipatory coarticulation of lip rounding or nasalization begins requires appropriate control utterances, to enable a distinction to be made between lip rounding or velum lowering due to coarticulation and that due to characteristics of phonetic segments in the coarticulatory domain. For lip rounding, for example, rounding during an utterance such as stew requires comparison with rounding during a control utterance such as stee, in which the rounded vowel is replaced by an unrounded vowel. Any lip rounding during the latter utterance indicates rounding associated with the consonant string, and needs to be subtracted from lip activity during stew. Likewise, velum movement during a CVnN sequence (that is, an oral consonant followed by n vowels and then a nasal consonant) needs to be compared to velum movement during a CVnC sequence. When those comparisons are made, evidence for feature spreading evaporates.
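
The subtraction logic can be illustrated with a toy computation. The trajectories below are invented sigmoids, not real data; the point is only that anticipatory rounding is dated from when the test utterance ("stew") diverges from its control ("stee") by some criterion.

```python
import numpy as np

t = np.linspace(-0.4, 0.1, 501)    # time (s) relative to acoustic onset of the vowel

# Invented lip-protrusion trajectories (mm): test "stew" (rounded /u/) vs control "stee".
stew = 1.5 / (1 + np.exp(-(t + 0.12) / 0.03))   # vowel-driven rounding, pre-vowel onset
stee = 0.4 / (1 + np.exp(-(t + 0.20) / 0.05))   # rounding intrinsic to the /st/ string

# Subtracting the control isolates rounding attributable to the upcoming vowel.
residual = stew - stee
onset = t[np.argmax(residual > 0.2)]            # first sample past a 0.2 mm criterion
print(f"anticipatory rounding onset: {onset * 1000:.0f} ms relative to vowel onset")
```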

Recently, two different coproduction theories have been distinguished (Lindblom et al., 2002). In the account proposed by Ohman (1966), vowels are produced continuously: in a VCV utterance, speakers produce a diphthongal movement from the first to the second vowel, and the consonant is superimposed on that diphthongal trajectory. In the alternative account (e.g. Fowler and Saltzman, 1993), gestures for consonants and vowels overlap temporally. Any vowel-to-vowel overlap is temporal overlap, not production of a diphthongal gesture.

Evidence favoring the view of Fowler and Saltzman is the same kind of evidence that disconfirmed feature spreading theory. As noted earlier, speakers show troughs in lip gestures in sequences of consonants that intervene between rounded vowels. They should not if vowels are produced as diphthongal tongue gestures, but they are expected to if vowels are produced as separate gestures that overlap temporally with consonantal gestures.

29.5.1 Coarticulation resistance

Coarticulation has been variously characterized as a source of distortion (e.g. Ohala, 1981 )—i.e. as a means by which articulation does not transparently implement essential phonological properties of consonants and vowels—or even as destructive of those properties (e.g. Hockett, 1955 ).

However, these characterizations overlook the finding of "coarticulation resistance"—an observation first made by Bladon and Al-Bamerni (1976), but developed largely by Recasens (e.g. 1984a; 1984b; 1985; 1987; see also Farnetani, 1990). This is the observation that phones resist coarticulatory overlap by neighbors to the extent that the neighbors would interfere with achievement of the phones' gestural goals. For example, Recasens (1984a) found decreasing vowel-to-vowel coarticulation in Catalan VCV sequences when the intervening consonant was one of the set: /j/ (a dorso-palatal approximant), /ɲ/ (an alveolopalatal nasal), /ʎ/ (an alveolopalatal lateral), /n/ (an alveolar nasal). Across the set, the consonants make decreasing use of the tongue body to achieve their place of articulation. The tongue body is a major articulator in the production of vowels. Accordingly, it is likely that the decrease in vowel-to-vowel coarticulation across the consonant series occurs to prevent the vowels from interfering with achievement of the consonants' constriction location and degree. Recasens (1984b) found increasing vowel-to-consonant coarticulation in the same consonant series.

Compatible data from English can be seen in Figure 29.1. Figure 29.1a shows tongue body height data from a speaker of American English producing each of six consonants in the context of six following vowels (Fowler, 2005). During closure of three of the consonants (/b/, /v/, and /g/), there is a substantial shift in tongue body height depending on the following vowel. During closure of the other three (/d/, /z/, and /ð/), there is considerably less. Figure 29.1b shows similar results for tongue body fronting. /b/, /v/, and, perhaps surprisingly, /g/ show less resistance to coarticulation for this speaker of American English than do /d/, /z/, and /ð/. The results for /b/ and /v/ most likely reflect the fact that they are labial consonants: they do not use the tongue, and so coproduction by vowels does not interfere with achievement of their gestural goals. The results for /g/, the fronting results at least, may reflect the fact that no stop in American English is close enough to /g/ in place of articulation to be confused with it were /g/'s place of articulation to shift due to coarticulation by the vowels.

Figure 29.1 Tongue body height (a) and fronting (b) during production of three high and three low coarticulation-resistant consonants produced in the context of six following stressed vowels. Measures were taken at mid consonant closure.
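
A simple way to quantify the resistance visible in Figure 29.1 is to measure how much a consonant's mid-closure tongue position varies across following vowels; smaller variation means greater resistance. The sketch below uses invented numbers purely to illustrate the computation.

```python
import numpy as np

# Invented mid-closure tongue-body heights (mm) for each consonant across six
# following vowels; real measurements would come from articulometry (cf. Fowler, 2005).
heights = {
    "b": [8.1, 9.5, 11.0, 12.4, 13.0, 14.2],   # labials: tongue free to track the vowel
    "v": [8.4, 9.8, 11.2, 12.1, 13.3, 14.0],
    "g": [9.0, 10.1, 11.5, 12.0, 13.1, 13.8],
    "d": [11.8, 12.0, 12.1, 12.3, 12.2, 12.4], # coronals: tongue held in place
    "z": [11.9, 12.0, 12.2, 12.1, 12.3, 12.2],
    "ð": [11.7, 11.9, 12.0, 12.2, 12.1, 12.3],
}

for c, h in heights.items():
    # Smaller spread across vowel contexts = greater coarticulation resistance.
    print(f"/{c}/: context-induced s.d. = {np.std(h):.2f} mm")
```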

29.5.2 Other factors affecting coarticulation

Frame theory (Bell-Berti and Harris, 1981) suggests a fixed extent of anticipatory coarticulation, modulated perhaps by speaking rate. However, the picture is more complicated. Browman and Goldstein (1988) reported that consonants are phased to a tautosyllabic vowel differently depending on whether they occur in the syllable onset or in the coda. Consonants in the onset of American English syllables are phased so that the gestural midpoint of the consonant sequence aligns with the vowel, regardless of how many consonants the onset contains. In contrast, in the coda, the first consonant is phased invariantly with respect to the vowel regardless of the number of consonants in the coda.

For multi-gesture consonants, such as /l/ (Sproat and Fujimura, 1993 ), /r/, /w/ (Gick, 1999 ), and the nasal consonants (Krakow, 1989 ), the gestures are phased differently in the onset and coda. Whereas they are nearly simultaneous in the onset, the more open (more vowel-like) gestures precede in the coda. This latter phasing appears to respect the “sonority hierarchy” such that more vowel-like phones are closest to the vowel.

29.6 Prosody

There is more to producing speech than sequencing consonants and vowels. Speech has prosodic properties including an intonation contour, various temporal properties, and variations in articulatory “strength.”

Theorists (see Shattuck-Hufnagel and Turk, 1996 for a review) identify hierarchical prosodic domains, each marked in some way phonologically. Domains include intonational phrases, which constitute the domain of complete intonational contours, intermediate phrases marked by a major (“nuclear”) pitch accent and a tone at the phrase boundary, prosodic words (lexical words or a content word followed by a function word as in “call up”), feet (a strong syllable followed by zero or one weak syllables), and syllables. Larger prosodic domains often, but not always, set off syntactic phrases or clauses.

Intonation contours are patterns of variation in fundamental frequency consisting of high and low pitch accents, or accents that combine high and low (or low and high) pitch excursions, and boundary tones at intonational and intermediate phrase boundaries. Pitch accents in the contours serve to accent information that the speaker wants to focus attention on, perhaps because it is new information in the utterance or because the speaker wants to contrast that information with other information. A whole intonation contour expresses some kind of meaning. For example, intonation contours can distinguish yes/no questions from statements (e.g. So you are staying home this weekend?). Other contours can express surprise, disbelief, or other attitudes.

Because intonation contours reflect variation in fundamental frequency (f0), their production involves laryngeal control. This laryngeal control is coarticulated with other uses of the larynx, for example, to implement voicing or devoicing, intrinsic f0 (higher f0 for higher vowels), and tonal accompaniments of obstruent devoicing (a high tone on a vowel following an unvoiced obstruent).

Prosody is marked by other indications of phrasing. Prosodic domains from intonational phrases to prosodic words tend to be marked by final lengthening, pausing, and initial and final “strengthening.” These effects generally increase in magnitude with the “strength” of the prosodic boundary (where “strength” increases with height of a phrase in the prosodic hierarchy). Final lengthening is an increase in the duration of articulatory gestures and their acoustic consequences before a phrase boundary. Strengthening is a quite local increase in the magnitude of gestures at phrase edges (e.g. Fougeron and Keating, 1997 ). Less coarticulation occurs across stronger phrase boundaries, and accented vowels resist vowel-to-vowel coarticulation (Cho, 2004 ).

These marks of prosodic structure serve to demarcate informational units in an utterance. However, we need to ask: why these marks? Final lengthening and pausing are, perhaps, intuitive. Physical systems cannot stop on a dime, and if the larger prosodic domains involve articulatory stoppings and restartings, then we should expect to see slowing to a stop and, sometimes, pausing before restarting. However, why strengthening? Byrd and Saltzman (2003) provide an account of final lengthening and pausing that may also provide some insight into at least some occurrences of strengthening. They have extended the task dynamic model, described earlier, to produce the timing variation that characterizes phrasing in prosody. They do so by slowing the rate of time flow of the model's central clock at phrase boundaries. Clock slowing gives rise to longer and less overlapped gestures at phrase edges. The magnitude of slowing reflects the strength of a phrase boundary. Byrd and Saltzman conceive of the slowing as a gesture (a "π gesture") that consists of an activation wave applied to any segmental gesture with which it overlaps temporally. π gestures span phrase boundaries, and therefore have effects at both edges of a phrase. Because one effect of clock slowing is less overlap of gestures, a consequence may be less truncation of gestures due to overlap, and so larger gestures.
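
The clock-slowing idea lends itself to a small simulation. In the sketch below (illustrative only; the rate profile and all numbers are invented), gestures unfold against internal clock time, and a transient dip in the clock's tick rate spanning a boundary stretches boundary-adjacent gestures in real time.

```python
import numpy as np

dt = 0.001
t = np.arange(0.0, 1.0, dt)        # real time (s); phrase boundary at 0.5 s

# "Pi gesture": a transient dip in the central clock's tick rate around the boundary.
boundary, sigma, depth = 0.5, 0.08, 0.6
rate = 1.0 - depth * np.exp(-0.5 * ((t - boundary) / sigma) ** 2)

clock = np.cumsum(rate) * dt       # internal (clock) time elapsed at each real instant

def real_duration(clock_start, clock_len):
    """Real-time duration of a gesture occupying clock_len units of clock time."""
    i0 = np.searchsorted(clock, clock_start)
    i1 = np.searchsorted(clock, clock_start + clock_len)
    return (i1 - i0) * dt

print("phrase-medial gesture :", real_duration(0.10, 0.10))  # ~0.10 s
print("boundary-adjacent one :", real_duration(0.38, 0.10))  # noticeably longer
```

Because the dip spans the boundary, the same slowing lengthens gestures on both sides of it, mirroring the bidirectional effects of π gestures described above.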

Acknowledgments

Preparation of the manuscript was supported by NICHD grant HD-01994 and NIDCD grant DC-03782 to Haskins Laboratories.

Archangeli, D. ( 1997 ) Optimality theory: an introduction to linguistics in the 1990s. In D. Archangeli and D. T. Langendoen (eds), Optimality Theory: An Overview , pp. 1–32. Blackwell, Malden, MA.

Bell-Berti, F., and Harris, K. S. ( 1981 ) A temporal model of speech production.   Phonetica , 38: 9–20.

Benguerel, A., and Cowan, H. ( 1974 ) Coarticulation of upper lip protrusion in French.   Phonetica , 30: 41–55.

Bernstein, N. ( 1967 ) The Coordination and Regulation of Movement . Pergamon, London.

Bladon, A., and Al-Bamerni, A. ( 1982 ) One-stage and two-stage temporal patterns of coarticulation.   Journal of the Acoustical Society of America , 72: S104.

Bladon, A., and Al-Bamerni, A. ( 1976 ) Coarticulation resistance in English /l/.   Journal of Phonetics , 4: 137–50.

Boyce, S. ( 1990 ) Coarticulatory organization for lip rounding in Turkish and in English.   Journal of the Acoustical Society of America , 88: 2584–95.

Boyce, S., Krakow, R., Bell-Berti, F., and Gelfer, C. ( 1990 ) Converging sources of evidence for dissecting articulatory movements into gestures.   Journal of Phonetics , 18: 173–88.

Browman, C., and Goldstein, L. ( 1986 ) Towards an articulatory phonology.   Phonology Yearbook , 3: 219–52.

Browman, C., and Goldstein, L. ( 1988 ) Some notes on syllable structure in articulatory phonology.   Phonetica , 45: 140–55.

Browman, C., and Goldstein, L. ( 1991 ) Tiers in articulatory phonology, with some implications for casual speech. In J. Kingston and M. Beckman (eds), Papers in Laboratory Phonology , vol. 1: Between the Grammar and the Physics of Speech , pp. 341–76. Cambridge University Press, Cambridge.

Browman, C., and Goldstein, L. ( 1992 ) Articulatory phonology: an overview.   Phonetica , 49: 155–80.

Browman, C., and Goldstein, L. ( 1995 ) Dynamics and articulatory phonology. In R. Port and T. van Gelder (eds), Mind as Motion: Explorations in the Dynamics of Cognition , pp. 175–93. MIT Press, Cambridge, MA.

Byrd, D., and Saltzman, E. ( 2003 ) The elastic phrase: modeling the dynamics of boundary-adjacent lengthening.   Journal of Phonetics , 31: 149–80.

Cho, T. ( 2004 ) Prosodically conditioned strengthening and vowel-to-vowel coarticulation in English.   Journal of Phonetics , 32: 141–76.

Daniloff, R., and Hammarberg, R. ( 1973 ) On defining coarticulation.   Journal of Phonetics , 1: 239–48.

Daniloff, R., and Moll, K. ( 1968 ) Coarticulation of lip rounding.   Journal of Speech and Hearing Research , 11: 707–21.

Delattre, P., and Freeman, D. ( 1968 ) A dialect study of American r's by x-ray motion picture.   Linguistics , 44: 29–68.

Dell, G. ( 1986 ) A spreading-activation theory of retrieval in speech production.   Psychological Review , 93: 283–321.

Easton, T. ( 1972 ) On the normal use of reflexes.   American Scientist , 60: 591–9.

Farnetani, E. ( 1990 ) V-C-V lingual coarticulation and its spatiotemporal domain. In W. J. Hardcastle and A.Marchal (eds), Speech Production and Speech Modeling , pp. 93–130. Kluwer, The Netherlands.

Fougeron, C., and Keating, P. ( 1997 ) Articulatory strengthening at edges of prosodic domains.   Journal of the Acoustical Society of America , 101: 3728–40.

Fowler, C. A. ( 1977 ) Timing Control in Speech Production . Indiana University Linguistics Club, Bloomington.

Fowler, C. A. ( 2005 ) Parsing coarticulated speech: effects of coarticulation resistance.   Journal of Phonetics , 33: 195–213.

Fowler, C. A., and Saltzman, E. ( 1993 ) Coordination and coarticulation in speech production.   Language and Speech , 36: 171–95.

Gafos, A., and Benus, S. (2003) On neutral vowels in Hungarian. Paper presented at the 15th International Congress of Phonetic Sciences, Barcelona.

Garrett, M. ( 1980 ) Levels of processing in speech production. In B. Butterworth (ed.), Language Production , vol. 1: Speech and Talk , pp. 177–220. Academic Press, London.

Gay, T. ( 1978 ) Articulatory units: segments or syllables? In A. Bell and J. B. Hooper (eds), Syllables and Segments , pp. 121–31. North-Holland, Amsterdam.

Gelfer, C., Bell-Berti, F., and Harris, K. ( 1989 ) Determining the extent of coarticulation: effects of experimental design.   Journal of the Acoustical Society of America , 86: 2443–5.

Gick, B. (1999) The articulatory basis of syllable structure: a study of English glides and liquids. Ph.D. dissertation, Yale University.

Goldstein, L., Pouplier, M., Chen, L., Saltzman, E., and Byrd, D. ( forthcoming ) Action units slip in speech production errors.   Cognition .

Guenther, F., Hampson, M., and Johnson, D. ( 1998 ) A theoretical investigation of reference frames for the planning of speech.   Psychological Review , 105: 611–33.

Hockett, C. ( 1955 ) A Manual of Phonetics . Indiana University Press, Bloomington.

Kelso, J. A. S. ( 1984 ) Phase transitions and critical behavior in human bimanual coordination.   American Journal of Physiology , 246: 1000–1004.

Kelso, J. A. S., Tuller, B., Vatikiotis-Bateson, E., and Fowler, C. A. ( 1984 ) Functionally-specific articulatory cooperation following jaw perturbation during speech: evidence for coordinative structures.   Journal of Experimental Psychology: Human Perception and Performance , 10: 812–32.

Kenstowicz, M., and Kisseberth, C. ( 1979 ) Generative Phonology . Academic Press, New York.

Krakow, R. (1989) The articulatory organization of syllables: a kinematic analysis of labial and velar gestures. Ph.D. dissertation, Yale University.

Krakow, R. ( 1993 ) Nonsegmental influences on velum movement patterns: syllables, segments, stress and speaking rate. In M. Huffman, and R. Krakow (eds), Phonetics and Phonology , vol. 5: Nasals, Nasalization and the Velum , pp. 87–116. Academic Press, New York.

Levelt, W., Roelofs, A., and Meyer, A. ( 1999 ) A theory of lexical access in speech production.   Behavioral and Brain Sciences , 22: 1–38.

Lindblom, B., Lubker, J., and Gay, T. ( 1979 ) Formant frequencies of some fixed mandible vowels and a model of speech motor programming by predictive simulation.   Journal of Phonetics , 7: 147–61.

Lindblom, B., Sussman, H., Modaressi, G., and Burlingame, E. ( 2002 ) The trough effect in speech production: implications for speech motor programming.   Phonetica , 59: 245–62.

Maeda, S. ( 1990 ) Compensatory articulation during speech: evidence from the analysis and synthesis of vocal tract shapes using an articulatory model. In W. Hardcastle and A. Marchal (eds), Speech Production and Speech Modeling , pp. 131–49. Kluwer Academic, Boston, MA.

Meyer, A. ( 1991 ) The time course of phonological encoding in language production: phonological encoding inside a syllable.   Journal of Memory and Language , 30: 69–89.

Mowrey, R. and MacKay, I. ( 1990 ) Phonological primitives: electromyographic speech error evidence.   Journal of the Acoustical Society of America , 88: 1299–1312.

Ohala, J. ( 1981 ) The listener as a source of sound change. In C. Masek, R. Hendrick, R. Miller, and M. Miller (eds), Papers from the Parasession on Language and Behavior , pp. 178–203. Chicago Linguistics Society, Chicago.

Ohman, S. ( 1966 ) Coarticulation in VCV utterances: spectrographic measurements.   Journal of the Acoustical Society of America , 39: 151–68.

Perkell, J., and Chiang, C. ( 1986 ) Preliminary support for a ‘hybrid model’ of anticipatory coarticulation. In Proceedings of the 12th International Congress of Acoustics , pp. A3–A6.

Perkell, J. and Matthies, M. ( 1992 ) Temporal measures of labial coarticulation for the vowel /u/.   Journal of the Acoustical Society of America , 91: 2911–25.

Perkell, J., Matthies, M., Svirsky, M., and Jordan, M. ( 1993 ) Trading relations between tongue-body raising and lip rounding in production of the vowel /u/: a pilot ‘motor equivalence’ study.   Journal of the Acoustical Society of America , 93: 2948–61.

Pierrehumbert, J. ( 1990 ) Phonological and phonetic representations.   Journal of Phonetics , 18: 375–94.

Pouplier, M. (2003a) The dynamics of error. Paper presented at the 15th International Congress of Phonetic Sciences, Barcelona.

Pouplier, M. (2003b) Units of phonological encoding: empirical evidence. Ph.D. dissertation, Yale University.

Recasens, D. ( 1984 a) Vowel-to-vowel coarticulation in Catalan VCV sequences.   Journal of the Acoustical Society of America , 76: 1624–35.

Recasens, D. ( 1984 b) V-to-C coarticulation in Catalan VCV sequences: an articulatory and acoustical study.   Journal of Phonetics , 12: 61–73.

Recasens, D. ( 1985 ) Coarticulatory patterns and degrees of coarticulation resistance in Catalan CV sequences.   Language and Speech , 28: 97–114.

Recasens, D. ( 1987 ) An acoustic analysis of V-to-C and V-to-V coarticulatory effects in Catalan and Spanish VCV sequences.   Journal of Phonetics , 15: 299–312.

Ryle, G. ( 1949 ) The Concept of Mind . Barnes & Noble, New York.

Saltzman, E., and Kelso, J. A. S. ( 1987 ) Skilled action: a task-dynamic approach.   Psychological Review , 94: 84–106.

Saltzman, E., Lofqvist, A., and Mitra, S. ( 2000 ) ‘Clocks’ and ‘glue’: global timing and intergestural cohesion. In M. B. Broe and J. Pierrehumbert (eds), Papers in Laboratory Phonology , vol. 5: Acquisition and the Lexicon , pp. 88–101. Cambridge University Press, Cambridge.

Saltzman, E., and Munhall, K. ( 1989 ) A dynamical approach to gestural patterning in speech production.   Ecological Psychology , 1: 333–82.

Savariaux, C., Perrier, P., and Orliaguet, J. P. ( 1995 ) Compensation strategies for the perturbation of the rounded vowel [u] using a lip tube: a study of the control space in speech production.   Journal of the Acoustical Society of America , 98: 2428–42.

Sevald, C. A., Dell, G., and Cole, J. ( 1995 ) Syllable structure in speech production: are syllables chunks or schemas?   Journal of Memory and Language , 34: 807–20.

Shattuck-Hufnagel, S., and Turk, A. E. ( 1996 ) A prosody tutorial for investigators of auditory sentence processing.   Journal of Psycholinguistic Research , 25: 193–247.

Sproat, R., and Fujimura, O. ( 1993 ) Allophonic variation in English /l/ and its implications for phonetic implementation.   Journal of Phonetics , 21: 291–311.

Stevens, K., and Blumstein, S. ( 1981 ) The search for invariant correlates of phonetic features. In P Eimas and J Miller (eds), Perspectives on the Study of Speech , pp. 1–38. Erlbaum, Hillsdale, NJ.

Studdert-Kennedy, M. ( 1998 ) The particulate origins of language generativity: from syllable to gesture. In J. Hurford, M. Studdert-Kennedy, and C. Knight (eds), Approaches to the Evolution of Language , pp. 202–21. Cambridge University Press, Cambridge.

Tremblay, S., Shiller, D., and Ostry, D. ( 2003 ) Somatosensory basis of speech production.   Nature , 423: 866–9.

Turvey, M. T. ( 1977 ) Preliminaries to a theory of action with reference to vision. In R. Shaw and J. Bransford (eds), Perceiving, Acting and Knowing: Toward an Ecological Psychology , pp. 211–66. Erlbaum, Hillsdale, NJ.

Turvey, M. T. ( 1990 ) Coordination.   American Psychologist , 45: 938–53.

Yamanishi, J., Kawato, M., and Suzuki, R. ( 1980 ) Two coupled oscillators as a model for the coordinated finger tapping by both hands.   Biological Cybernetics , 37: 219–25.

By this definition, I intend to contrast the more comprehensive theories of language production with theories of speech production. A theory of language production (e.g. Levelt et al., 1999 ) offers an account of planning for and implementation of meaningful utterances. A theory of speech production concerns itself only with planning for and implementation of language forms.

Slashes (e.g. /p/) indicate phonological segments; square brackets (e.g. [p]) signify phonetic segments. The difference is one of abstractness. For example, the phonological segment /p/ is said to occur in two varieties—the aspirated phonetic segment [pʰ] and the unaspirated [p].

An alternative account, which does not implicate rule use, is that pail stin reflects a single feature or gesture error. From a featural standpoint, place of articulation features of /p/ and /t/ exchange, stranding the aspiration feature.

See articles in the 1992 special issue of the journal Phonetica devoted to a critical analysis of articulatory phonology.


Single-neuronal elements of speech production in humans

Arjun R. Khanna, William Muñoz, Young Joon Kim, Yoav Kfir, Angelique C. Paulk, Mohsen Jamali, Jing Cai, Martina L. Mustroph, Irene Caprara, Richard Hardstone, Mackenna Mejdell, Domokos Meszéna, Abigail Zuckerman, Jeffrey Schweitzer, Sydney Cash & Ziv M. Williams

Nature volume 626, pages 603–610 (2024). Open access. Published: 31 January 2024.

25k Accesses · 4 Citations · 476 Altmetric

Subject: Extracellular recording

Humans are capable of generating extraordinarily diverse articulatory movement combinations to produce meaningful speech. This ability to orchestrate specific phonetic sequences, and their syllabification and inflection over subsecond timescales allows us to produce thousands of word sounds and is a core component of language 1 , 2 . The fundamental cellular units and constructs by which we plan and produce words during speech, however, remain largely unknown. Here, using acute ultrahigh-density Neuropixels recordings capable of sampling across the cortical column in humans, we discover neurons in the language-dominant prefrontal cortex that encoded detailed information about the phonetic arrangement and composition of planned words during the production of natural speech. These neurons represented the specific order and structure of articulatory events before utterance and reflected the segmentation of phonetic sequences into distinct syllables. They also accurately predicted the phonetic, syllabic and morphological components of upcoming words and showed a temporally ordered dynamic. Collectively, we show how these mixtures of cells are broadly organized along the cortical column and how their activity patterns transition from articulation planning to production. We also demonstrate how these cells reliably track the detailed composition of consonant and vowel sounds during perception and how they distinguish processes specifically related to speaking from those related to listening. Together, these findings reveal a remarkably structured organization and encoding cascade of phonetic representations by prefrontal neurons in humans and demonstrate a cellular process that can support the production of speech.


Humans can produce a remarkably wide array of word sounds to convey specific meanings. To produce fluent speech, linguistic analyses suggest a structured succession of processes involved in planning the arrangement and structure of phonemes in individual words 1 , 2 . These processes are thought to occur rapidly during natural speech and to recruit prefrontal regions in parts of the broader language network known to be involved in word planning 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 and sentence construction 13 , 14 , 15 , 16 and which widely connect with downstream areas that play a role in their motor production 17 , 18 , 19 . Cortical surface recordings have also demonstrated that phonetic features may be regionally organized 20 and that they can be decoded from local-field activities across posterior prefrontal and premotor areas 21 , 22 , 23 , suggesting an underlying cortical structure. Understanding the basic cellular elements by which we plan and produce words during speech, however, has remained a significant challenge.

Although previous studies in animal models 24 , 25 , 26 and more recent investigation in humans 27 , 28 have offered an important understanding of how cells in primary motor areas relate to vocalization movements and the production of sound sequences such as song, they do not reveal the neuronal process by which humans construct individual words and by which we produce natural speech 29 . Further, although linguistic theory based on behavioural observations has suggested tightly coupled sublexical processes necessary for the coordination of articulators during word planning 30 , how specific phonetic sequences, their syllabification or inflection are precisely coded for by individual neurons remains undefined. Finally, whereas previous studies have revealed a large regional overlap in areas involved in articulation planning and production 31 , 32 , 33 , 34 , 35 , little is known about whether and how these linguistic process may be uniquely represented at a cellular scale 36 , what their cortical organization may be or how mechanisms specifically related to speech production and perception may differ.

Single-neuronal recordings have the potential to begin revealing some of the basic functional building blocks by which humans plan and produce words during speech and study these processes at spatiotemporal scales that have largely remained inaccessible 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 . Here, we used an opportunity to combine recently developed ultrahigh-density microelectrode arrays for acute intraoperative neuronal recordings, speech tracking and modelling approaches to begin addressing these questions.

Neuronal recordings during natural speech

Single-neuronal recordings were obtained from the language-dominant (left) prefrontal cortex in participants undergoing planned intraoperative neurophysiology (Fig. 1a ; section on ‘Acute intraoperative single-neuronal recordings’). These recordings were obtained from the posterior middle frontal gyrus 10 , 46 , 47 , 48 , 49 , 50 in a region known to be broadly involved in word planning 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 and sentence construction 13 , 14 , 15 , 16 and to connect with neighbouring motor areas shown to play a role in articulation 17 , 18 , 19 and lexical processing 51 , 52 , 53 (Extended Data Fig. 1a ). This region was traversed during recordings as part of planned neurosurgical care and roughly ranged in distribution from alongside anterior area 55b to 8a, with sites varying by approximately 10 mm (s.d.) across subjects (Extended Data Fig. 1b ; section on ‘Anatomical localization of recordings’). Moreover, the participants undergoing recordings were awake and thus able to perform language-based tasks (section on ‘Study participants’), together providing an extraordinarily rare opportunity to study the action potential (AP) dynamics of neurons during the production of natural speech.

Figure 1. a, Left, single-neuronal recordings were confirmed to localize to the posterior middle frontal gyrus of language-dominant prefrontal cortex in a region known to be involved in word planning and production (Extended Data Fig. 1a,b); right, acute single-neuronal recordings were made using Neuropixels arrays (Extended Data Fig. 1c,d); bottom, speech production task and controls (Extended Data Fig. 2a). b, Example of phonetic groupings based on the planned places of articulation (Extended Data Table 1). c, A ten-dimensional feature space was constructed to provide a compositional representation of all phonemes per word. d, Peri-event time histograms were constructed by aligning the APs of each neuron to word onset at millisecond resolution. Data are presented as mean (line) values ± s.e.m. (shade). Inset, spike waveform morphology and scale bar (0.5 ms). e, Left, proportions of modulated neurons that selectively changed their activities to specific planned phonemes; right, tuning curve for a cell that was preferentially tuned to velar consonants. f, Average z-scored firing rates as a function of the Hamming distance between the preferred phonetic composition of the neuron (that producing the largest change in activity) and all other phonetic combinations. Here, a Hamming distance of 0 indicates that the words had the same phonetic compositions, whereas a Hamming distance of 1 indicates that they differed by a single phoneme. Data are presented as mean (line) values ± s.e.m. (shade). g, Decoding performance for planned phonemes. The orange points provide the sampled distribution for the classifier's ROC-AUC; n = 50 random test/train splits; P = 7.1 × 10⁻¹⁸, two-sided Mann–Whitney U-test. Data are presented as mean ± s.d.

To obtain acute recordings from individual cortical neurons and to reliably track their AP activities across the cortical column, we used ultrahigh-density, fully integrated linear silicon Neuropixels arrays that allowed for high-throughput recordings from single cortical units 54 , 55 . To further obtain stable recordings, we developed custom-made software that registered and motion-corrected the AP activity of each unit and kept track of their position across the cortical column (Fig. 1a , right) 56 . Only well-isolated single units, with low relative neighbour noise and stable waveform morphologies consistent with those of neocortical neurons, were used (Extended Data Fig. 1c,d ; section on ‘Acute intraoperative single-neuronal recordings’). Altogether, we obtained recordings from 272 putative neurons across five participants for an average of 54 ± 34 (s.d.) single units per participant (range 16–115 units).

Next, to study neuronal activities during the production of natural speech and to track their per word modulation, the participants performed a naturalistic speech production task that required them to articulate broadly varied words in a replicable manner (Extended Data Fig. 2a ) 57 . Here, the task required the participants to produce words that varied in phonetic, syllabic and morphosyntactic content and to provide them in a structured and reproducible format. It also required them to articulate the words independently of explicit phonetic cues (for example, from simply hearing and then repeating the same words) and to construct them de novo during natural speech. Extra controls were further used to evaluate for preceding word-related responses, sensory–perceptual effects and phonetic–acoustic properties as well as to evaluate the robustness and generalizability of neuronal activities (section on ‘Speech production task’).

Together, the participants produced 4,263 words for an average of 852.6 ± 273.5 (s.d.) words per participant (range 406–1,252 words). The words were transcribed using a semi-automated platform and aligned to AP activity at millisecond resolution (section on ‘Audio recordings and task synchronization’) 51 . All participants were English speakers and showed comparable word-production performances (Extended Data Fig. 2b ).

Representations of phonemes by neurons

To first examine the relation between single-neuronal activities and the specific speech organs involved 58 , 59 , we focused our initial analyses on the primary places of articulation 60 . The places of articulation describe the points where constrictions are made between an active and a passive articulator and are what largely give consonants their distinctive sounds. Thus, for example, whereas bilabial consonants (/p/ and /b/) involve the obstruction of airflow at the lips, velar consonants are articulated with the dorsum of the tongue placed against the soft palate (/k/ and /g/; Fig. 1b ). To further examine sounds produced without constriction, we also focused our initial analyses on vowels in relation to the relative height of the tongue (mid-low and high vowels). More phonetic groupings based on the manners of articulation (configuration and interaction of articulators) and primary cardinal vowels (combined positions of the tongue and lips) are described in Extended Data Table 1 .

Next, to provide a compositional phonetic representation of each word, we constructed a feature space on the basis of the constituent phonemes of each word (Fig. 1c , left). For instance, the words ‘like’ and ‘bike’ would be represented uniquely in vector space because they differ by a single phoneme (‘like’ contains alveolar /l/ whereas ‘bike’ contains bilabial /b/; Fig. 1c , right). The presence of a particular phoneme was therefore represented by a unitary value for its respective vector component, together yielding a vectoral representation of the constituent phonemes of each word (section on ‘Constructing a word feature space’). Generalized linear models (GLMs) were then used to quantify the degree to which variations in neuronal activity during planning could be explained by individual phonemes across all possible combinations of phonemes per word (section on ‘Single-neuronal analysis’).
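
As a rough sketch of such an analysis (the authors' actual code and model settings are not given here, so everything below, including the use of statsmodels and the synthetic data, is an assumption for illustration), each word can be encoded as a binary phoneme-category vector, and a Poisson GLM of pre-utterance spike counts can be compared against an intercept-only model with a likelihood-ratio test:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Synthetic stand-in for the feature space: one binary indicator per phoneme
# category (places and manners of articulation, vowel groupings, etc.).
n_words, n_feats = 800, 10
X = rng.integers(0, 2, size=(n_words, n_feats)).astype(float)

# Synthetic neuron tuned to feature 2 (say, velar consonants in the planned word).
beta_true = np.zeros(n_feats)
beta_true[2] = 0.8
y = rng.poisson(np.exp(np.log(5.0) + X @ beta_true))  # spikes in the -500..0 ms window

# Poisson GLM of pre-utterance spike counts on phonetic composition,
# compared with an intercept-only model via a likelihood-ratio test.
full = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
null = sm.GLM(y, np.ones((n_words, 1)), family=sm.families.Poisson()).fit()

lr_stat = 2 * (full.llf - null.llf)
print("LR statistic:", round(lr_stat, 1))
print("recovered coefficient for feature 2:", round(float(full.params[3]), 2))
```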

Overall, we find that the firing activities of many of the neurons (46.7%, n = 127 of 272 units) were explained by the constituent phonemes of the word before utterance (−500 to 0 ms; GLM likelihood ratio test, P < 0.01), meaning that their activity patterns were informative of the phonetic content of the word. Among these, the activities of 56 neurons (20.6% of the 272 units recorded) were further selectively tuned to the planned production of specific phonemes (two-sided Wald test for each GLM coefficient, P < 0.01, Bonferroni-corrected across all phoneme categories; Fig. 1d,e and Extended Data Figs. 2 and 3). Thus, for example, whereas certain neurons changed their firing rate when the upcoming words contained bilabial consonants (for example, /p/ or /b/), others changed their firing rate when they contained velar consonants. Of these neurons, most encoded information both about the planned places and manners of articulation (n = 37 or 66% overlap, two-sided hypergeometric test, P < 0.0001) or planned places of articulation and vowels (n = 27 or 48% overlap, two-sided hypergeometric test, P < 0.0001; Extended Data Fig. 4). Most also reflected the spectral properties of the articulated words on a phoneme-by-phoneme basis (64%, n = 36 of 56; two-sided hypergeometric test, P = 1.1 × 10⁻¹⁰; Extended Data Fig. 5a,b); together providing detailed information about the upcoming phonemes before utterance.

Because we had a complete representation of the upcoming phonemes for each word, we could also quantify the degree to which neuronal activities reflected their specific combinations. For example, we could ask whether the activities of certain neurons not only reflected planned words with velar consonants but also words that contained the specific combination of both velar and labial consonants. By aligning the activity of each neuron to its preferred phonetic composition (that is, the specific combination of phonemes to which the neuron most strongly responded) and by calculating the Hamming distance between this and all other possible phonetic compositions across words (Fig. 1c , right; section on 'Single-neuronal analysis'), we find that the relation between the vectoral distances across words and neuronal activity was significant (two-sided Spearman's ρ = −0.97, P = 5.14 × 10−7; Fig. 1f ). These neurons therefore seemed not only to encode specific planned phonemes but also their specific composition within upcoming words.

Finally, we asked whether the constituent phonemes of the word could be robustly decoded from the activity patterns of the neuronal population. Using multilabel decoders to classify the upcoming phonemes of words not used for model training (section on ‘Population modelling’), we find that the composition of phonemes could be predicted from neuronal activity with significant accuracy (receiver operating characteristic area under the curve; ROC-AUC = 0.75 ± 0.03 mean ± s.d. observed versus 0.48 ± 0.02 chance, P  < 0.001, two-sided Mann–Whitney U -test; Fig. 1g ). Similar findings were also made when examining the planned manners of articulation (AUC = 0.77 ± 0.03, P  < 0.001, two-sided Mann–Whitney U -test), primary cardinal vowels (AUC = 0.79 ± 0.04, P  < 0.001, two-sided Mann–Whitney U -test) and their spectral properties (AUC = 0.75 ± 0.03, P  < 0.001, two-sided Mann–Whitney U -test; Extended Data Fig. 5a , right). Taken together, these neurons therefore seemed to reliably predict the phonetic composition of the upcoming words before utterance.

Motoric and perceptual processes

Neurons that reflected the phonetic composition of the words during planning were largely distinct from those that reflected their composition during perception. It is possible, for instance, that similar response patterns could have been observed when simply hearing the words. Therefore, to test for this, we performed an extra 'perception' control in three of the participants whereby they listened to, rather than produced, the words ( n = 126 recorded units; section on 'Speech production task'). Here, we find that 29.3% ( n = 37) of the neurons showed phonetic selectivity during listening (Extended Data Fig. 6a ) and that their activities could be used to accurately predict the phonemes being heard (AUC = 0.70 ± 0.03 observed versus 0.48 ± 0.02 chance, P < 0.001, two-sided Mann–Whitney U -test; Extended Data Fig. 6b ). We also find, however, that these cells were largely distinct from those that showed phonetic selectivity during planning ( n = 10; 7.9% overlap) and that their activities were uninformative of the phonemic content of the words being planned (AUC = 0.48 ± 0.01, P = 0.99, two-sided Mann–Whitney U -test; Extended Data Fig. 6b ). Similar findings were also made when replaying the participants' own voices to them ('playback' control; 0% overlap in neurons); together suggesting that speaking and listening engaged largely distinct but complementary sets of cells in the neural population.

Given the above observations, we also examined whether the activities of the neurons could have been explained by the acoustic–phonetic properties of the preceding spoken words. For example, it is possible that the activities of the neurons may have partly reflected the phonetic composition of the previously articulated word or its motoric components. Thus, to test for this, we repeated our analyses but now excluded words in which the preceding articulated word contained the phoneme being decoded (section on 'Single-neuronal analysis') and find that decoding performance remained significant (AUC = 0.72 ± 0.1, P < 0.001, two-sided Mann–Whitney U -test). We also find that decoding performance remained significant when constricting the analysis window (−400 to 0 ms instead of −500 to 0 ms; AUC = 0.72 ± 0.1, P < 0.001, two-sided Mann–Whitney U -test) or shifting it closer to utterance (−300 to +200 ms; AUC = 0.76 ± 0.1, P < 0.001, two-sided Mann–Whitney U -test); indicating that these neurons coded for the phonetic composition of the upcoming words.

Syllabic and morphological features

To transform sets of consonants and vowels into words, the planned phonemes must also be arranged and segmented into distinct syllables 61 . For example, even though the words ‘casting’ and ‘stacking’ possess the same constituent phonemes, they are distinguished by their specific syllabic structure and order. Therefore, to examine whether neurons in the population may further reflect these sublexical features, we created an extra vector space based on the specific order and segmentation of phonemes (section on ‘Constructing a word feature space’). Here, focusing on the most common syllables to allow for tractable neuronal analysis (Extended Data Table 1 ), we find that the activities of 25.0% ( n  = 68 of 272) of the neurons reflected the presence of specific planned syllables (two-sided Wald test for each GLM coefficient, P  < 0.01, Bonferroni-corrected across all syllable categories; Fig. 2a,b ). Thus, whereas certain neurons may respond selectively to a velar-low-alveolar syllable, other neurons may respond selectively to an alveolar-low-velar syllable. Together, the neurons responded preferentially to specific syllables when tested across words (two-sided Spearman’s ρ  = −0.96, P  = 1.85 × 10 −6 ; Fig. 2c ) and accurately predicted their content (AUC = 0.67 ± 0.03 observed versus 0.50 ± 0.02 chance, P  < 0.001, two-sided Mann–Whitney U -test; Fig. 2d ); suggesting that these subsets of neurons encoded information about the syllables.

Fig. 2

a, Peri-event time histograms were constructed by aligning the APs of each neuron to word onset; examples are shown for two representative neurons that selectively changed their activity for specific planned syllables. Data are presented as mean (line) values ± s.e.m. (shade). Inset, spike waveform morphology and scale bar (0.5 ms). b, Scatter plots of D² values (the degree to which specific features explained neuronal response, n = 272 units) in relation to planned phonemes, syllables and morphemes. c, Average z-scored firing rates as a function of the Hamming distance between the neuron's preferred syllabic composition and all other compositions. Data are presented as mean (line) values ± s.e.m. (shade). d, Decoding performance for planned syllables. The orange points provide the sampled distribution for the classifier's ROC-AUC values ( n = 50 random test/train splits; P = 7.1 × 10−18, two-sided Mann–Whitney U -test). Data are presented as mean ± s.d. e, To evaluate the selectivity of neurons to specific syllables, their activities for words that contained the preferred syllable of each neuron (that is, the syllable to which they responded most strongly; green) were further compared to (i) words that contained one or more of the same individual phonemes but not necessarily the preferred syllable, (ii) words that contained different phonemes and syllables, (iii) words that contained the same phonemes but divided across different syllables and (iv) words that contained the same phonemes in a syllable but in a different order (grey). All comparisons with the preferred-syllable condition (green) were significant ( n = 113; P = 6.2 × 10−20, 8.8 × 10−20, 4.2 × 10−20 and 1.4 × 10−20, respectively; two-sided Wilcoxon signed-rank test). Data are presented as mean (dot) values ± s.e.m.

Next, to confirm that these neurons were selectively tuned to specific syllables, we compared their activities for words that contained the preferred syllable of each neuron (for example, /d-iy/) to words that simply contained their constituent phonemes (for example, /d/ or /iy/). If these neurons reflected individual phonemes irrespective of their specific order, then we would observe no difference in response. On the basis of these comparisons, however, we find that the responses of the neurons to their preferred syllables were significantly greater than their responses to the individual constituent phonemes ( z -score difference 0.92 ± 0.04; two-sided Wilcoxon signed-rank test, P < 0.0001; Fig. 2e ). We also tested words containing syllables with the same constituent phonemes but in a different order (for example, /g-ah-d/ versus /d-ah-g/) and again find that the neurons were preferentially tuned to specific syllables ( z -score difference 0.99 ± 0.06; two-sided Wilcoxon signed-rank test, P < 1.0 × 10−6; Fig. 2e ). Then, we examined words that contained the same arrangements of phonemes but in which the phonemes themselves belonged to different syllables (for example, /r-oh-b/ versus r-oh/b-; accounting for prosodic emphasis) and similarly find that the neurons were preferentially tuned to specific syllables ( z -score difference 1.01 ± 0.06; two-sided Wilcoxon signed-rank test, P < 0.0001; Fig. 2e ). Therefore, rather than simply reflecting the phonetic composition of the upcoming words, these subsets of neurons encoded their specific segmentation and order within individual syllables.

Finally, we asked whether certain neurons may code for the inclusion of morphemes. Unlike phonemes, bound morphemes such as ‘–ed’ in ‘directed’ or ‘re–’ in ‘retry’ are capable of carrying specific meanings and are thus thought to be subserved by distinct neural mechanisms 62 , 63 . Therefore, to test for this, we also parsed each word on the basis of whether it contained a suffix or prefix (controlling for word length) and find that the activities of 11.4% ( n  = 31 of 272) of the neurons selectively changed for words that contained morphemes compared to those that did not (two-sided Wald test for each GLM coefficient, P  < 0.01, Bonferroni-corrected across morpheme categories; Extended Data Fig. 5c ). Moreover, neural activity across the population could be used to reliably predict the inclusion of morphemes before utterance (AUC = 0.76 ± 0.05 observed versus 0.52 ± 0.01 for shuffled data, P  < 0.001, two-sided Mann–Whitney U -test; Extended Data Fig. 5c ), together suggesting that the neurons coded for this sublexical feature.

Spatial distribution of neurons

Neurons that encoded information about the sublexical components of the upcoming words were broadly distributed across the cortex and cortical column depth. By tracking the location of each neuron in relation to the Neuropixels arrays, we find a slightly higher proportion of neurons tuned to phonemes (one-sided χ² test, χ²(2) = 0.7 and 5.2, P > 0.05, for places and manners of articulation, respectively), syllables (χ²(2) = 3.6, P > 0.05) and morphemes (χ²(2) = 4.9, P > 0.05) at lower cortical depths, but this difference was non-significant, suggesting a broad distribution (Extended Data Fig. 7 ). We also find, however, that the proportion of neurons that showed selectivity for phonemes increased as recordings were acquired more posteriorly along the rostral–caudal axis of the cortex (one-sided χ² test, χ²(4) = 45.9 and 52.2, P < 0.01, for places and manners of articulation, respectively). Similar findings were made for syllables and morphemes (χ²(4) = 31.4 and 49.8, P < 0.01, respectively; Extended Data Fig. 7 ); together suggesting a gradation of cellular representations, with caudal areas showing progressively higher proportions of selective neurons.

Collectively, the activities of these cell ensembles provided richly detailed information about the phonetic, syllabic and morphological components of upcoming words. Of the neurons that showed selectivity to any sublexical feature, 51% ( n = 46 of 90 units) were significantly informative of more than one feature. Moreover, the selectivity of these neurons lay along a continuum and was closely correlated across features (two-sided test of Pearson's correlation in D² across all sublexical feature comparisons, r = 0.80, 0.51 and 0.37 for phonemes versus syllables, phonemes versus morphemes and syllables versus morphemes, respectively, all P < 0.001; Fig. 2b ), with most cells exhibiting a mixture of representations for specific phonetic, syllabic or morphological features (two-sided Wilcoxon signed-rank test, P < 0.0001). Figure 3a further illustrates this mixture of representations (Fig. 3a , left; t -distributed stochastic neighbour embedding (tSNE)) and their hierarchical structure (Fig. 3a , right; D² distribution), together revealing a detailed characterization of the phonetic, syllabic and morphological components of upcoming words at the level of the cell population.

Fig. 3

a, Left, response selectivity of neurons to specific word features (phonemes, syllables and morphemes) is visualized across the population using a tSNE procedure (that is, neurons with similar response characteristics are plotted in closer proximity). The hue of each point reflects the degree of selectivity to a particular sublexical feature, whereas the size of each point reflects the degree to which those features explained the neuronal response. Inset, the relative proportions of neurons showing selectivity and their overlap. Right, the D² metric (the degree to which specific features explained neuronal response) for each cell shown individually per feature. b, The relative degree to which the activities of the neurons were explained by the phonetic, syllabic and morphological features of the words (D² metric) and their hierarchical structure (agglomerative hierarchical clustering). c, Distribution of peak decoding performances for phonemes, syllables and morphemes aligned to word utterance onset. Significant differences in peak decoding timings across the sampled distributions are labelled in brackets above ( n = 50 random test/train splits; P = 0.024, 0.002 and 0.002; pairwise, two-sided permutation tests of differences in medians for phonemes versus syllables, syllables versus morphemes and phonemes versus morphemes, respectively; Methods ). Data are presented as median (dot) values ± bootstrapped standard error of the median.

Temporal organization of representations

Given the above observations, we examined the temporal dynamics of neuronal activity during the production of speech. By tracking peak decoding in the period leading up to utterance onset (peak AUC; 50 model testing/training splits) 64 , we find that these neural populations showed a consistent morphological–phonetic–syllabic dynamic in which decoding performance peaked first for morphemes, followed by phonemes and then syllables (Fig. 3b and Extended Data Fig. 8a,b ; section on 'Population modelling'). Overall, decoding performance peaked for the morphological properties of words at −405 ± 67 ms before utterance, followed by peak decoding for phonemes at −195 ± 16 ms and syllables at −70 ± 62 ms (s.e.m.; Fig. 3b ). This temporal dynamic was highly unlikely to have been observed by chance (two-sided Kruskal–Wallis test, H = 13.28, P < 0.01) and was largely distinct from that observed during listening (two-sided Kruskal–Wallis test, H = 14.75, P < 0.001; Extended Data Fig. 6c ). The activities of these neurons therefore seemed to follow a consistent, temporally ordered morphological–phonetic–syllabic dynamic before utterance.

The activities of these neurons also followed a temporally structured transition from articulation planning to production. When comparing their activities before utterance onset (−500 to 0 ms) to those after (0 to +500 ms), we find that neurons which encoded information about the upcoming phonemes during planning encoded similar information during production ( P < 0.001, Mann–Whitney U -test for phonemes and syllables; Fig. 4a ). Moreover, when using models that were originally trained on words before utterance onset to decode the properties of the articulated words during production (model-switch approach), we find that decoding accuracy for the phonetic, syllabic and morphological properties of the words all remained significant (AUC = 0.76 ± 0.02 versus 0.48 ± 0.03 chance, 0.65 ± 0.03 versus 0.51 ± 0.04 chance and 0.74 ± 0.06 versus 0.44 ± 0.07 chance for phonemes, syllables and morphemes, respectively; P < 0.001 for all, two-sided Mann–Whitney U -tests; Extended Data Fig. 8c ). Information about the sublexical features of words was therefore reliably represented by the neuronal population during both articulation planning and execution.

Fig. 4

a, Top, the D² value of neuronal activity (the degree to which specific features explained neuronal response, n = 272 units) during word planning (green) and production (orange), sorted across all population neurons. Middle, relationship between the explanatory power (D²) of neuronal activity ( n = 272 units) for phonemes (Spearman's ρ = 0.69), syllables (Spearman's ρ = 0.40) and morphemes (Spearman's ρ = 0.08) during planning and production ( P = 1.3 × 10−39, P = 6.6 × 10−12 and P = 0.18, respectively, two-sided test of Spearman rank-order correlation). Bottom, the D² metric for each cell during production per feature ( n = 272 units). b, Top left, schematic illustration of speech planning (blue plane) and production (red plane) subspaces as traversed by a neuron for different phonemes (yellow arrows; Extended Data Fig. 9 ). Top right, subspace misalignment quantified by an alignment index or Grassmannian chordal distance (red) compared to that expected from chance (grey), demonstrating that the subspaces occupied by the neural population ( n = 272 units) during planning and production were distinct. Bottom, projection of neural population activity ( n = 272 units) during word planning (blue) and production (red) onto the first three PCs for the planning (upper row) and production (lower row) subspaces.

Utilizing a dynamical systems approach to further allow for the unsupervised identification of functional subspaces (that is, wherein neural activity is embedded into a high-dimensional vector space; Fig. 4b , left; section on 'Dynamical system and subspace analysis') 31,34,65,66 , we find that the activities of the population were mostly low-dimensional, with more than 90% of the variance in neuronal activity being captured by its first four principal components (Fig. 4b , right). However, when tracking the dimensions in which the neural population evolved over time, we also find that the subspaces which defined neural activity during articulation planning and production were largely distinct. In particular, whereas the first five principal components captured 98.4% of the variance in the trajectory of the population during planning, they captured only 11.9% of the variance in the trajectory during articulation (two-sided permutation test, P < 0.0001; Fig. 4b , bottom and Extended Data Fig. 9 ). Together, these cell ensembles therefore seemed to occupy largely separate preparatory and motoric subspaces while also allowing information about the phonetic, syllabic and morphological contents of the words to be stably represented during the production of speech.

Using Neuropixels probes to obtain acute, fine-scaled recordings from single neurons in the language-dominant prefrontal cortex 3 , 4 , 5 , 6 —in a region proposed to be involved in word planning 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 and production 13 , 14 , 15 , 16 —we find a strikingly detailed organization of phonetic representations at a cellular level. In particular, we find that the activities of many of the neurons closely mirrored the way in which the word sounds were produced, meaning that they reflected how individual planned phonemes were generated through specific articulators 58 , 59 . Moreover, rather than simply representing phonemes independently of their order or structure, many of the neurons coded for their composition in the upcoming words. They also reliably predicted the arrangement and segmentation of phonemes into distinct syllables, together suggesting a process that could allow the structure and order of articulatory events to be encoded at a cellular level.

Collectively, this putative mechanism supports the existence of context-general representations of classes of speech sounds that speakers use to construct different word forms. In contrast, coding of sequences of phonemes as syllables may represent a context-specific representation of these speech sounds in a particular segmental context. This combination of context-general and context-specific representation of speech sound classes, in turn, is supportive of many speech production models which suggest that speakers hold abstract representations of discrete phonological units in a context-general way and that, as part of speech planning, these units are organized into prosodic structures that are context-specific 1 , 30 . Although the present study does not reveal whether these representations may be stored in and retrieved from a mental syllabary 1 or are constructed from abstract phonology ad hoc, it lays a groundwork from which to begin exploring these possibilities at a cellular scale. It also expands on previous observations in animal models such as marmosets 67 , 68 , singing mice 69 and canaries 70 on the syllabic structure and sequence of vocalization processes, providing us with some of the earliest lines of evidence for the neuronal coding of vocal-motor plans.

Another interesting finding from these studies is the diversity of phonetic feature representations and their organization across cortical depth. Although our recordings sampled locally from relatively small columnar populations, most phonetic features could be reliably decoded from their collective activities. Such findings suggest that the phonetic information necessary for constructing words may be fully represented in certain regions along the cortical column 10 , 46 , 47 , 48 , 49 , 50 . They also place these populations at a putative intersection for the shared coding of places and manners of articulation and demonstrate how these representations may be locally distributed. Such redundancy and accessibility of information in local cortical populations is consistent with that observed in animal models 31 , 32 , 33 , 34 , 35 and could serve to allow for the rapid orchestration of neuronal processes necessary for the real-time construction of words, especially during the production of natural speech. Our findings are also supportive of a putative 'mirror' system that could allow for the shared representation of phonetic features within the population when speaking and listening and for the real-time feedback of phonetic information by neurons during perception 23 , 71 .

A final notable observation from these studies is the temporal succession of neuronal encoding events. In particular, our findings are supportive of previous neurolinguistic theories suggesting closely coupled processes for coordinating the planned articulatory events that ultimately produce words. These models, for example, suggest that the morphology of a word is probably retrieved before its phonologic code, as the exact phonology depends on the morphemes in the word form 1 . They also suggest the later syllabification of planned phonemes, which would enable them to be sequentially arranged in a specific order (although different temporal orders have been suggested as well) 72 . Here, our findings provide tentative support for a structured sublexical coding succession that could allow for the discretization of such information during articulation. Our findings also suggest (through dynamical systems modelling) a mechanism that, consistent with previous observations on motor planning and execution 31 , 34 , 65 , 66 , could enable information to occupy distinct functional subspaces 34 , 73 and therefore allow for the rapid separation of neural processes necessary for the construction and articulation of words.

Taken together, these findings reveal a set of processes, and a framework, in the language-dominant prefrontal cortex by which to begin understanding how words may be constructed during natural speech at a single-neuronal level and by which to start defining their fine-scale spatial and temporal dynamics. Given their robust decoding performances (especially in the absence of natural language processing-based predictions), it is interesting to speculate whether such prefrontal recordings could also be used for synthetic speech prostheses or for the augmentation of other emerging approaches 21 , 22 , 74 used in brain–machine interfaces. It is important to note, however, that the production of words also involves more complex processes, including semantic retrieval, the arrangement of words in sentences and prosody, which were not tested here. Moreover, future experiments will be required to investigate eloquent areas, such as the ventral premotor and superior posterior temporal areas, that were not accessible with our present techniques. This study provides a prospective platform by which to begin addressing these questions using a combination of ultrahigh-density microelectrode recordings, naturalistic speech tracking and acute real-time intraoperative neurophysiology to study human language at cellular scale.

Study participants

All aspects of the study were carried out in strict accordance with, and were approved by, the Massachusetts General Brigham Institutional Review Board. Right-handed native English speakers undergoing awake microelectrode recording-guided deep brain stimulator implantation were screened for enrolment. Clinical consideration for surgery was made by a multidisciplinary team of neurosurgeons, neurologists and neuropsychologists. Operative planning was made independently by the surgical team and without consideration of study participation. Participants were enrolled only if: (1) the surgical plan was for awake microelectrode recording-guided placement, (2) they were at least 18 years of age, (3) they had intact language function with English fluency and (4) they were able to provide informed consent for study participation. Participation in the study was voluntary and all participants were informed that they were free to withdraw from the study at any time.

Acute intraoperative single-neuronal recordings

Single-neuronal prefrontal recordings using Neuropixels probes.

As part of deep brain stimulator implantation at our institution, participants are often awake and microelectrode recordings are used to optimize anatomical targeting of the deep brain structures 46 . During these cases, the electrodes often traverse part of the posterior language-dominant prefrontal cortex 3 , 4 , 5 , 6 , an area previously shown to be involved in word planning 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 and sentence construction 13 , 14 , 15 , 16 and which, according to imaging studies, broadly connects with premotor areas involved in articulation 51 , 52 , 53 and lexical processing 17 , 18 , 19 (Extended Data Fig. 1a,b ). All microelectrode entry points and placements were based purely on planned clinical targeting and were made independently of any study consideration.

Sterile Neuropixels probes (v.1.0-S, IMEC, ethylene oxide sterilized by BioSeal 54 ) together with a 3B2 IMEC headstage were attached to a cannula and a manipulator connected to a ROSA ONE Brain (Zimmer Biomet) robotic arm. The probes were inserted into the cortical ribbon under direct robot navigational guidance through the implanted burr hole (Fig. 1a ). The probes (width 70 µm; length 10 mm; thickness 100 µm) consisted of a total of 960 contact sites (384 preselected recording channels) laid out in a chequerboard pattern with approximately 25 µm centre-to-centre nearest-neighbour site spacing. The IMEC headstage was connected through a multiplexed cable to a PXIe acquisition module card (IMEC) installed in a PXIe chassis (PXIe-1071, National Instruments). Neuropixels recordings were performed using SpikeGLX (v.20201103 and v.20221012-phase30; http://billkarsh.github.io/SpikeGLX/ ) or OpenEphys (v.0.5.3.1 and v.0.6.0; https://open-ephys.org/ ) on a computer connected to the PXIe acquisition module, recording the action potential band (AP; band-pass filtered from 0.3 to 10 kHz, sampled at 30 kHz) and the local-field potential band (LFP; band-pass filtered from 0.5 to 500 Hz, sampled at 2,500 Hz). Once putative units were identified, the Neuropixels probe was briefly held in position to confirm signal stability (we did not screen putative neurons for speech responsiveness). Further description of this recording approach can be found in refs. 54 , 55 . After single-neural recordings from the cortex were completed, the Neuropixels probe was removed and subcortical neuronal recordings and deep brain stimulator placement proceeded as planned.

Single-unit isolation

Single-neuronal recordings were processed in two main steps. First, to track the activities of putative neurons at high spatiotemporal resolution and to account for intraoperative cortical motion, we used the Decentralized Registration of Electrophysiology Data software (DREDge; https://github.com/evarol/DREDge ) and an interpolation approach ( https://github.com/williamunoz/InterpolationAfterDREDge ). Briefly, and as previously described 54 , 55 , 56 , an automated protocol was used to track LFP voltages using a decentralized correlation technique that realigned the recording channels in relation to brain movements (Fig. 1a , right). Following this step, we interpolated the AP band continuous voltage data using the DREDge motion estimate to allow the activities of the putative neurons to be stably tracked over time. Next, single units were isolated from the motion-corrected interpolated signal using Kilosort (v.1.0; https://github.com/cortex-lab/KiloSort ) followed by Phy for cluster curation (v.2.0a1; https://github.com/cortex-lab/phy ; Extended Data Fig. 1c,d ). Units were selected on the basis of their waveform morphologies and separability in principal component space, their interspike interval profiles and the similarity of waveforms across contacts. Only well-isolated single units with mean firing rates ≥0.1 Hz were included. The recordings yielded 16–115 units per participant.

Audio recordings and task synchronization

For task synchronization, TTL and audio outputs were used to send synchronization triggers through the SMA input to the IMEC PXIe acquisition module card. To allow for additional synchronization, triggers were also recorded on a separate breakout analogue and digital input/output board (BNC2110, National Instruments) connected through a PXIe board (PXIe-6341 module, National Instruments).

Audio recordings were obtained at a 44 kHz sampling frequency using a TASCAM DR-40X four-channel, four-track portable audio recorder and USB interface with adjustable microphone. These recordings were sent to an analogue input of a NIDAQ board in the same PXIe acquisition module containing the IMEC PXIe board for high-fidelity temporal alignment with neuronal data. Synchronization of neuronal activity with behavioural events was performed through TTL triggers sent through a parallel port to both the IMEC PXIe board (the sync channel) and the analogue NIDAQ input, as well as through the parallel audio input into the analogue input channels on the NIDAQ board.

Audio recordings were annotated in semi-automated fashion (Audacity; v.2.3). Recorded audio for each word and sentence by the participants was analysed in Praat 75 and Audacity (v.2.3). Exact word and phoneme onsets and offsets were identified using the Montreal Forced Aligner (v.2.2; https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner ) 76 and confirmed with manual review of all annotated recordings. Together, these measures allowed for the millisecond-level alignment of neuronal activity with each produced word and phoneme.

Anatomical localization of recordings

Pre-operative high-resolution magnetic resonance imaging and postoperative head computerized tomography scans were coregistered using a combination of ROSA software (Zimmer Biomet; v.3.1.6.276), Mango (v.4.1; https://mangoviewer.com/download.html ) and FreeSurfer (v.7.4.1; https://surfer.nmr.mgh.harvard.edu/fswiki/DownloadAndInstall ) to reconstruct the cortical surface and identify the cortical location from which Neuropixels recordings were obtained 77 , 78 , 79 , 80 , 81 . This registration allowed localization of the surgical areas that underlay the cortical recording sites (Fig. 1a and Extended Data Fig. 1a ) 54 , 55 , 56 . The MNI transformation of these coordinates was then carried out to register the locations in MNI space using the FieldTrip toolbox (v.20230602; https://www.fieldtriptoolbox.org/ ; Extended Data Fig. 1b ) 82 .

For depth calculation, we estimated the pial boundary of the recordings according to the sharp change in signal between channels implanted in the brain parenchyma and those outside the brain. We then referenced the depth of each single unit (based on its maximum waveform amplitude channel) to this estimated pial boundary. All units were then classified by their relative depth from the pial boundary as superficial, middle or deep (Extended Data Fig. 7 ).

Speech production task

The participants performed a priming-based naturalistic speech production task 57 in which they were shown a scene on a screen depicting a scenario that had to be described in a specific order and format. For example, the participant might be given a scene of a boy and a girl playing with a balloon or a scene of a dog chasing a cat. Together, these scenes required the participants to produce words that varied in phonetic, syllabic and morphosyntactic content. The scenes were also highlighted in a way that required the words to be produced in a structured format. For example, a scene might be highlighted in a way that required the participant to produce the sentence "The mouse was being chased by the cat" or, alternatively, the sentence "The cat was chasing the mouse" (Extended Data Fig. 2a ). Because the sentences had to be constructed de novo, the participants produced the words without being given explicit phonetic cues (for example, from hearing and then repeating the word 'cat'). Taken together, this task therefore allowed neuronal activity to be examined as words (for example, 'cat'), rather than independent phonetic sounds (for example, /k/), were articulated, and as the words were produced during natural speech (for example, constructing the sentence "the dog chased the cat") rather than simply repeated (for example, hearing and then repeating the word 'cat').

Finally, to account for the potential contribution of sensory–perceptual responses, three of the participants also performed a ‘perception’ control in which they listened to words spoken to them. One of these participants further performed an auditory ‘playback’ control in which they listened to their own recorded voice. For this control, all words spoken by the participant were recorded using a high-fidelity microphone (Zoom ZUM-2 USM microphone) and then played back to them on a word-by-word level in randomized separate blocks.

Constructing a word feature space

To allow for single-neuronal analysis and to provide a compositional representation for each word, we grouped the constituent phonemes on the basis of the relative positions of the articulatory organs associated with their production 60 . For our primary analyses, we selected the places of articulation for consonants (for example, bilabial consonants) on the basis of established IPA categories defining the primary articulators involved in speech production. Consonant phonemes were grouped by their places of articulation into glottal, velar, palatal, postalveolar, alveolar, dental, labiodental and bilabial. Vowel phonemes were grouped on the basis of the relative height of the tongue, with high vowels being produced with the tongue in a relatively high position and mid-low (that is, mid+low) vowels being produced with it in a lower position. This grouping of phonemes is broadly referred to as 'places of articulation', together reflecting the main positions of the articulatory organs and their combinations used to produce the words 58 , 59 . Finally, to allow for comparison and to test generalizability, we examined the manners of articulation (stop, fricative, affricate, nasal, liquid and glide) for consonants, which describe the nature of airflow restriction by various parts of the mouth and tongue. For vowels, we also evaluated the primary cardinal vowels i, e, ɛ, a, α, ɔ, o and u, which are described, in combination, by the position of the tongue relative to the roof of the mouth, how far forward or back it lies and the relative positions of the lips 83 , 84 . A detailed summary of these phonetic groupings can be found in Extended Data Table 1 .

Phoneme feature space

To further evaluate the relationship between neuronal activity and the presence of specific constituent phonemes per word, the phonemes in each word were parsed according to their precise pronunciation provided by the English Lexicon Project (or the Longman Pronunciation Dictionary for American English where necessary) as described previously 85 . Thus, for example, the word ‘like’ (l-aɪ-k) would be parsed into a sequence of alveolar-mid-low-velar phonemes, whereas the word ‘bike’ (b-aɪ-k) would be parsed into a sequence of bilabial-mid-low-velar phonemes.

These constituent phonemes were then used to represent each word as a ten-dimensional vector in which the value in each position reflected the presence of each type of phoneme (Fig. 1c ). For example, the word 'like', containing a sequence of alveolar-mid-low-velar phonemes, was represented by the vector [0 0 0 1 0 0 1 0 0 1], with each entry representing the number of the respective type of phoneme in the word. Together, such vectors representing all words defined a phonetic 'vector space'. Analyses evaluating the precise arrangement of phonemes per word, as well as the goodness-of-fit and selectivity metrics used to evaluate single-neuronal responses to these phonemes and their specific combinations in words, are described further below.
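As an illustrative sketch of this construction (the phoneme-to-place lookup below is a hypothetical fragment rather than the study's full inventory in Extended Data Table 1), such count vectors could be built as follows:

```python
# Illustrative sketch of the word feature space; the phoneme-to-place
# mapping here is a partial, hypothetical lookup for the examples below.
import numpy as np

PLACES = ["glottal", "velar", "palatal", "postalveolar", "alveolar",
          "dental", "labiodental", "bilabial", "high", "mid-low"]

PHONEME_TO_PLACE = {
    "l": "alveolar", "b": "bilabial", "k": "velar", "aɪ": "mid-low",
}

def word_vector(phonemes):
    """Count the phonemes of each place-of-articulation class in a word."""
    vec = np.zeros(len(PLACES), dtype=int)
    for ph in phonemes:
        vec[PLACES.index(PHONEME_TO_PLACE[ph])] += 1
    return vec

print(word_vector(["l", "aɪ", "k"]))  # 'like': alveolar + mid-low + velar
print(word_vector(["b", "aɪ", "k"]))  # 'bike': bilabial + mid-low + velar
```

With this representation, 'like' and 'bike' differ in exactly one component, which is what allows the Hamming-distance analyses described below.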

Syllabic feature space

Next, to evaluate the relationship between neuronal activity and the specific arrangement of phonemes in syllables, we parsed the constituent syllables of each word using the American pronunciations provided in ref. 85 . Thus, for example, 'back' would be defined as a labial-low-velar sequence. To allow for neuronal analysis and to limit the combination of all possible syllables, we selected the ten most common syllable types. High and mid-low vowels were considered as syllables here only if they constituted syllables in themselves and were unbound from a consonant (for example, /ih/ in 'hesitate' or /ah-/ in 'adore'). Similar to the construction of the phoneme space, the syllables were then transformed into an n -dimensional binary vector in which the value in each dimension reflected the presence of a specific syllable. The value in each dimension of this n -dimensional syllabic feature space could therefore also be interpreted in relation to neuronal activity.

To account for the functional distinction between phonemes and morphemes 62 , 63 , we also parsed words into those that contained bound morphemes which were either prefixed (for example, ‘re–’) or suffixed (for example, ‘–ed’). Unlike phonemes, morphemes such as ‘–ed’ in ‘directed’ or ‘re–’ in ‘retry’ are the smallest linguistic units capable of carrying meaning and, therefore, accounting for their presence allowed their effect on neuronal responses to be further examined. To allow for neuronal analysis and to control for potential differences in neuronal activity due to word lengths, models also took into account the total number of phonemes per word.

Spectral features

To evaluate the time-varying spectral features of the articulated phonemes on a phoneme-by-phoneme basis, we identified the occurrence of each phoneme using the Montreal Forced Aligner (v.2.2; https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner ). For pitch, we calculated the spectral power in ten log-spaced frequency bins from 200 to 5,000 Hz for each phoneme per word. For amplitude, we took the root-mean-square of the recorded waveform of each phoneme.
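A minimal sketch of these two acoustic measures is given below, assuming a Welch periodogram as the spectral estimator (the study's exact estimator and windowing are not specified here):

```python
# Sketch: per-phoneme spectral power in ten log-spaced bins (200-5,000 Hz)
# and root-mean-square amplitude. The Welch settings are assumptions.
import numpy as np
from scipy.signal import welch

def phoneme_features(waveform, fs=44_000.0):
    freqs, psd = welch(waveform, fs=fs, nperseg=min(1024, len(waveform)))
    edges = np.logspace(np.log10(200), np.log10(5000), 11)  # 10 bins
    band_power = np.array([psd[(freqs >= lo) & (freqs < hi)].sum()
                           for lo, hi in zip(edges[:-1], edges[1:])])
    rms = np.sqrt(np.mean(np.square(waveform)))
    return band_power, rms

# Example with a synthetic 100 ms vowel-like segment.
t = np.arange(0, 0.1, 1 / 44_000.0)
segment = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)
power, amplitude = phoneme_features(segment)
print(power.shape, round(float(amplitude), 3))  # (10,) ...
```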

Single-neuronal analysis

Evaluating the selectivity of single-neuronal responses.

To investigate the relationship between single-neuronal activity and specific word features, we used a regression analysis to determine the degree to which variation in neural activity could be explained by phonetic, syllabic or morphologic properties of spoken words 86 , 87 , 88 , 89 . For all analyses, neuronal activity was considered in relation to word utterance onset ( t  = 0) and taken as the mean spike count in the analysis window of interest (that is, −500 to 0 ms from word onset for word planning and 0 to +500 ms for word production). To limit the potential effects of preceding words on neuronal activity, words with planning periods that overlapped temporally were excluded from regression and selectivity analyses. For each neuron, we constructed a GLM that modelled the spike count rate as the realization of a Poisson process whose rate varied as a function of the linguistic (for example, phonetic, syllabic and morphologic) or acoustic features (for example, spectral power and root-mean-square amplitude) of the planned words.

Models were fit using the Python (v.3.9.17) library statsmodels (v.0.13.5) by iterative least-squares minimization of the Poisson negative log-likelihood function 86 . To assess the goodness-of-fit of the models, we used both the Akaike information criterion ( \({\rm{AIC}}=2k-2{\rm{ln}}(L)\) where k is the number of estimated parameters and L is the maximized value of the likelihood function) and a generalization of the R² score for the exponential family of regression models that we refer to as D² whereby 87 :

$$ {D}^{2}=\frac{K({\bf{y}},{{\boldsymbol{\mu }}}_{{\rm{restricted}}})-K({\bf{y}},{{\boldsymbol{\mu }}}_{{\rm{full}}})}{K({\bf{y}},{{\boldsymbol{\mu }}}_{{\rm{restricted}}})} $$

where y is a vector of realized outcomes, μ is a vector of estimated means from a full (including all regressors) or restricted (without regressors of interest) model and \({K}({\bf{y}}\,,{\boldsymbol{\mu }})=2\bullet {\rm{llf}}({\bf{y}}\,;{\bf{y}})-2\bullet {\rm{llf}}({\boldsymbol{\mu }}\,;{\bf{y}})\) where \({\rm{llf}}({\boldsymbol{\mu }}\,;{\bf{y}})\) is the log-likelihood of the model and \({\rm{llf}}({\bf{y}}\,;{\bf{y}})\) is the log-likelihood of the saturated model. The D² value represents the proportion of reduction in uncertainty (measured by the Kullback–Leibler divergence) due to the inclusion of regressors. The statistical significance of model fit was evaluated using the likelihood ratio test compared with a model with all covariates except the regressors of interest (the task variables).
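As a hedged sketch of this per-neuron procedure (with simulated spike counts and an illustrative design matrix, not the study's code), the GLM fit, likelihood ratio test and D² can be computed with statsmodels as follows:

```python
# Sketch: fit full and restricted Poisson GLMs, then compute the
# likelihood ratio test and deviance-based D^2. Data are simulated.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n_words, n_features = 400, 10
X = rng.integers(0, 2, size=(n_words, n_features))      # phoneme vectors
spikes = rng.poisson(lam=np.exp(0.5 + 0.8 * X[:, 1]))   # spike counts

full = sm.GLM(spikes, sm.add_constant(X),
              family=sm.families.Poisson()).fit()
restricted = sm.GLM(spikes, np.ones((n_words, 1)),
                    family=sm.families.Poisson()).fit()

# Likelihood ratio test for overall model fit (alpha = 0.01 in the text).
lr = 2 * (full.llf - restricted.llf)
p_value = chi2.sf(lr, df=n_features)

# D^2: proportional reduction in deviance, where the model deviance
# equals K(y, mu) as defined above.
d2 = 1 - full.deviance / restricted.deviance
print(f"LR p = {p_value:.3g}, D2 = {d2:.3f}")
```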

We characterized a neuron as selectively 'tuned' to a given word feature if the GLM of neuronal firing rates as a function of task variables for that feature exhibited a statistically significant model fit (likelihood ratio test with α set at 0.01). For neurons meeting this criterion, we also examined the point estimates and confidence intervals for each coefficient in the model. A vector of these coefficients (or, in our feature space, a vector of the sign of these coefficients) indicates a word with the combination of constituent elements expected to produce a maximal neuronal response. The multidimensional feature spaces also allowed us to define metrics that quantified the phonemic, syllabic or morphologic similarity between words. Here, we calculated the Hamming distance between the vector describing each word u and the vector of the sign of regression coefficients that defines each neuron's maximal predicted response v, which is equal to the number of positions at which the corresponding values are different:

$$ {d}_{{\rm{H}}}({\bf{u}},{\bf{v}})=\mathop{\sum }\limits_{i=1}^{n}{\mathbb{1}}[{u}_{i}\ne {v}_{i}] $$

For each 'tuned' neuron, we compared the z-scored firing rate elicited by each word as a function of the Hamming distance between the word and the 'preferred word' of the neuron to examine the 'tuning' characteristics of these neurons (Figs. 1f and 2c ). A Hamming distance of zero would therefore indicate that the words have phonetically identical compositions. Finally, to examine the relationship between neuronal activity and the spectral features of each phoneme, we extracted the acoustic waveform for each phoneme and calculated the power in ten log-spaced spectral bands. We then constructed a 'spectral vector' representation for each word based on these ten values and fit a Poisson GLM of neuronal firing rates against these values. For amplitude analysis, we regressed neuronal firing rates against the root-mean-square amplitude of the waveform for each word.
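A compact sketch of this tuning analysis (inputs are hypothetical placeholders):

```python
# Sketch of the preferred-composition analysis; inputs are hypothetical.
import numpy as np
from scipy.stats import spearmanr

def hamming(u, v):
    """Number of vector positions at which u and v differ."""
    return int(np.sum(np.asarray(u) != np.asarray(v)))

def tuning_curve(word_vectors, z_rates, preferred):
    """Relate z-scored firing rates to distance from the preferred word."""
    dists = np.array([hamming(w, preferred) for w in word_vectors])
    rho, p = spearmanr(dists, z_rates)
    return dists, rho, p
```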

Controlling for interdependency between phonetic and syllabic features

Three additional word comparisons were used to examine the interdependency between phonetic and syllabic features. First, we compared firing rates for words containing specific syllables with words containing individual phonemes in that syllable but not the syllable itself (for example, simply /d/ in 'god' or 'dog'). Second, we examined words containing syllables with the same constituent phonemes but in a different order (for example, /g-ah-d/ for 'god' versus /d-ah-g/ for 'dog'). Thus, if neurons responded preferentially to specific syllables, then they should continue to respond to them preferentially even when comparing words that had the same arrangements of phonemes but in a different or reverse order. Third, we examined words containing the same sequence of phonemes but spanning a syllable boundary, such that the cluster of phonemes did not constitute a syllable (that is, the same phonemes within one syllable versus spanning across syllable boundaries).

Visualization of neuronal responses within the population

To allow for visualization of groupings of neurons with shared representational characteristics, we calculated the AIC and D² for the phoneme, syllable and morpheme models of each neuron and conducted a tSNE procedure, which transformed these data into two dimensions such that neurons with similar feature representations are spatially closer together than those with dissimilar representations 90 . We used the tSNE implementation in the scikit-learn Python module (v.1.3.0). In Fig. 3a left, a tSNE was fit on the AIC values for the phoneme, syllable and morpheme models of each neuron during the planning period with the following parameters: perplexity = 35, early exaggeration = 2 and Euclidean distance as the metric. In Fig. 3a right and Fig. 4a bottom, a different tSNE was fit on the D² values for all planning and production models using the following parameters: perplexity = 10, early exaggeration = 10 and a cosine distance metric. The resulting embeddings were mapped onto a grid of points according to a linear sum assignment algorithm between embeddings and grid points.
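A minimal sketch of this embedding step, using placeholder per-neuron model scores and the parameters quoted above:

```python
# Sketch: embed per-neuron model scores (e.g., AIC for phoneme, syllable
# and morpheme models) into two dimensions. Scores are placeholders.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
scores = rng.normal(size=(272, 3))   # 272 neurons x 3 model scores

embedding = TSNE(n_components=2, perplexity=35, early_exaggeration=2,
                 metric="euclidean", random_state=0).fit_transform(scores)
print(embedding.shape)  # (272, 2)
```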

Population modelling

Modelling population activity.

To quantify the degree to which the neural population coded information about the planned phonemes, syllables and morphemes, we modelled the activity of the entire pseudopopulation of recorded neurons. To match trials across the different participants, we first labelled each word according to whether it contained the feature of interest and then matched words across subjects based on the features that were shared. Using this procedure, no trials or neural data were duplicated or upsampled, ensuring strict separation between training and testing sets during classifier training and subsequent evaluation.

For decoding, words were randomly split into training (75%) and testing (25%) trials across 50 iterations. A support vector machine (SVM) as implemented in the scikit-learn Python package (v.1.3.0) 91 was used to construct a hyperplane in n-dimensional space that optimally separates samples of different word features by solving the following minimization problem:

$$ \mathop{\min }\limits_{w,b,{\boldsymbol{\zeta }}}\frac{1}{2}{w}^{T}w+C\mathop{\sum }\limits_{i=1}^{n}{\zeta }_{i} $$

subject to \({y}_{i}({w}^{T}\phi ({x}_{i})+b)\ge 1-{\zeta }_{i}\) and \({\zeta }_{i}\ge 0\) for all \(i\in \left\{1,\ldots ,n\right\}\) , where w is the normal vector defining the margin in feature space, C is the regularization strength, ζ i is the distance of each point from the margin, y i is the class label of each sample and ϕ ( x i ) is the image of each datapoint in transformed feature space. A radial basis function kernel with coefficient γ  = 1/272 was applied. The penalty term C was optimized for each classifier using a cross-validation procedure nested in the training set.

A separate classifier was trained for each dimension in a task space (for example, separate classifiers for bilabial, dental and alveolar consonants) and scores for each of these classifiers were averaged to calculate an overall decoding score for that feature type. Each decoder was trained to predict whether the upcoming word contained an instance of a specific phoneme, syllable or morpheme arrangement. For phonemes, we used nine of the ten phoneme groups (there were insufficient instances of palatal consonants to train a classifier; Extended Data Table 1 ). For syllables, we used ten syllables taken from the most common syllables across the study vocabulary (Extended Data Table 1 ). For morpheme analysis, a single classifier was trained to predict the presence or absence of any bound morpheme in the upcoming word.

Finally, to assess performance, we scored classifiers using the area under the curve of the receiver operating characteristic (AUC-ROC). With this scoring metric, a classifier that always guesses the most common class (that is, an uninformative classifier) results in a score of 0.5, whereas perfect classification results in a score of 1. The overall decoding score for a particular feature space was the mean score of the classifiers across the dimensions of that space. The entire procedure was repeated 50 times with random train/test splits. Summary statistics for these 50 iterations are presented in the main text.
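The sketch below illustrates one such per-feature decoder on simulated data; the RBF kernel with γ = 1/272, the 75/25 splits and the nested optimization of C follow the description above, whereas the specific C grid and the simulated rates and labels are assumptions:

```python
# Sketch of one per-feature decoder; rates/labels and the C grid are
# illustrative assumptions, not the study's data or exact settings.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
rates = rng.normal(size=(400, 272))     # trials (words) x neurons
labels = rng.integers(0, 2, size=400)   # word contains the feature?

aucs = []
for seed in range(50):                  # 50 random 75/25 splits
    Xtr, Xte, ytr, yte = train_test_split(rates, labels, test_size=0.25,
                                          random_state=seed)
    clf = GridSearchCV(SVC(kernel="rbf", gamma=1 / 272),
                       {"C": [0.1, 1, 10]}, cv=3)  # nested C optimization
    clf.fit(Xtr, ytr)
    aucs.append(roc_auc_score(yte, clf.decision_function(Xte)))
print(f"AUC-ROC = {np.mean(aucs):.2f} ± {np.std(aucs):.2f}")
```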

Model switching

Assessing decoder generalization across different experimental conditions provides a powerful method to evaluate the similarity of neuronal representations of information in different contexts 64 . To determine how neurons encoded the same word features but under different conditions, we trained SVM decoders using neuronal data during one condition (for example, word production) but tested the decoder using data from another (for example, no word production). Before decoder training or testing, trials were split into disjoint training and testing sets, from which the neuronal data were extracted in the epoch of interest. Thus, trials used to train the model were never used to test the model while testing either native decoder performance or decoder generalizability.

Modelling temporal dynamic

To further study the temporal dynamics of the neuronal responses, we trained decoders to predict the phonemes, syllables and morpheme arrangement of each word across successive time points before utterance 64 . For each neuron, we aligned all spikes to utterance onset, binned the spikes into 5 ms windows and convolved them with a Gaussian kernel with a standard deviation of 25 ms to generate an estimated instantaneous firing rate at each point in time during word planning. For each time point, we evaluated the performance of decoders of phonemes, syllables and morphemes trained on these data over 50 random splits of training and testing trials. The distribution of the times of peak decoding performance across the planning or perception period revealed the dynamics of information encoding by these neurons, from which we calculated the median peak decoding times for phonemes, syllables and morphemes.
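A minimal sketch of this instantaneous-rate estimate (the bin width and kernel follow the text; the helper name is illustrative):

```python
# Sketch: spikes binned at 5 ms and smoothed with a 25 ms-s.d. Gaussian
# to yield an instantaneous rate over the planning window.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def instantaneous_rate(spike_times_ms, t_start=-500, t_stop=0,
                       bin_ms=5, sigma_ms=25):
    edges = np.arange(t_start, t_stop + bin_ms, bin_ms)
    counts, _ = np.histogram(spike_times_ms, bins=edges)
    rate_hz = counts / (bin_ms / 1000.0)               # spikes per second
    return gaussian_filter1d(rate_hz, sigma=sigma_ms / bin_ms)
```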

Dynamical system and subspace analysis

To study the dimensionality of neuronal activity and to evaluate the functional subspaces occupied by the neuronal population, we used a dynamical systems approach that quantified the time-dependent changes in neural activity patterns 31 . For the dynamical system analysis, activity for all words was averaged for each neuron to produce a single peri-event time projection (aligned to word onset), which allowed all neurons to be analysed together as a pseudopopulation. First, we calculated the instantaneous firing rates of the neurons that showed selectivity to any word feature (phonemes, syllables or morpheme arrangement), binned into 5 ms windows and convolved with a Gaussian filter with a standard deviation of 50 ms. We used equal 500 ms windows, set at −500 to 0 ms before utterance onset for the planning phase and 0 to 500 ms following utterance onset for the production phase, to allow for comparison. These data were then standardized to zero mean and unit variance. Finally, the neural data were concatenated into a T × N matrix of sampled instantaneous firing rates for each of the N neurons at every time T.

Together, these matrices represented the evolution of the system in N-dimensional space over time. A principal component analysis revealed a small set of five principal components (PCs) embedded in the full N-dimensional space that captured most of the variance in the data for each epoch (Fig. 4b ). Projection of the data into this space yields a T × 5 matrix representing the evolution of the system in five-dimensional space over time. The columns of the N × 5 matrix of principal components form an orthonormal basis for the five-dimensional subspace occupied by the system during each epoch.
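As a sketch, the per-epoch basis and trajectory could be obtained as follows (standardization and the five-component basis follow the text; the function name is illustrative):

```python
# Sketch: standardize the T x N rate matrix for an epoch and keep the
# leading principal components as an orthonormal basis for its subspace.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def epoch_subspace(rates_txn, n_components=5):
    z = StandardScaler().fit_transform(rates_txn)  # zero mean, unit var
    pca = PCA(n_components=n_components).fit(z)
    basis = pca.components_.T                      # N x 5 orthonormal basis
    trajectory = z @ basis                         # T x 5 projection
    return basis, trajectory, pca.explained_variance_ratio_
```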

Next, to quantify the relationship between these subspaces during planning and production, we took two approaches. First, we calculated the alignment index from ref. 66 :

$$ A=\frac{{\rm{Tr}}({D}_{{\rm{A}}}^{T}{C}_{{\rm{B}}}{D}_{{\rm{A}}})}{\mathop{\sum }\limits_{i=1}^{d}{\sigma }_{{\rm{B}}}(i)} $$

where D A is the matrix defined by the orthonormal basis of subspace A, C B is the covariance of the neuronal data as it evolves in space B, \({\sigma }_{{\rm{B}}}(i)\) is the i th singular value of the covariance matrix C B, d is the dimensionality of the subspace (five here) and Tr(∙) is the matrix trace. The alignment index A ranges from 0 to 1 and quantifies the fraction of variance in space B recovered when the data are projected into space A. Higher values indicate that variance in the data is adequately captured by either subspace.
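A sketch of this computation under the stated definitions (basis_a holds the orthonormal basis D_A as columns; data_b is the standardized T × N activity for epoch B):

```python
# Sketch: fraction of epoch B's variance captured in epoch A's subspace,
# normalized by the variance captured by B's own top-d components.
import numpy as np

def alignment_index(basis_a, data_b):
    """basis_a: N x d orthonormal basis; data_b: T x N (standardized)."""
    cov_b = np.cov(data_b, rowvar=False)
    captured = np.trace(basis_a.T @ cov_b @ basis_a)
    top_d = np.sort(np.linalg.eigvalsh(cov_b))[::-1][: basis_a.shape[1]]
    return float(captured / top_d.sum())
```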

As discussed in ref. 66 , subspace misalignment in the form of a low alignment index A can arise by chance when considering high-dimensional neuronal data, because two randomly selected sets of dimensions in a high-dimensional space may simply fail to align. Therefore, to further explore the degree to which our subspace misalignment was attributable to chance, we used a Monte Carlo analysis to generate random subspaces from data with the same covariance structure as the true (observed) data:

$$ V={\rm{orth}}(U{S}^{1/2}v) $$

where V is a random subspace, U and S are the eigenvectors and eigenvalues of the covariance matrix of the observed data across all epochs being compared, v is a matrix of white noise and orth(∙) orthogonalizes the matrix. The alignment index A of the subspaces defined by the resulting basis vectors V was recalculated 1,000 times to generate a distribution of alignment index values A attributable to chance alone (compare Fig. 4b ).

Finally, we calculated the projection error between each pair of subspaces on the basis of the relationships between the three orthonormal bases (rather than a projection of the data into each of these subspaces). The set of all (linear) subspaces of dimension k < n embedded in an n-dimensional vector space V forms a manifold known as the Grassmannian, endowed with several metrics which can be used to quantify distances between two subspaces on the manifold. Thus, the subspaces (defined by the columns of an N × N′ matrix, where N′ is the number of selected principal components; five in our case) explored by the system during planning and production are points on the Grassmannian manifold of the full N-neuron dimensional vector space. Here, we used the Grassmannian chordal distance 92 :

$$ d(A,B)=\frac{1}{\sqrt{2}}{\parallel A{A}^{T}-B{B}^{T}\parallel }_{F} $$

where A and B are matrices whose columns are the orthonormal bases for their respective subspaces and \({\parallel \cdot \parallel }_{F}\) is the Frobenius norm. By normalizing this distance by the Frobenius norm of subspace A , we scale the distance metric from 0 to 1, where 0 indicates a subspace identical to A (that is, completely overlapping) and increasing values indicate greater misalignment from A . Random sampling of subspaces under the null hypothesis was repeated using the same procedure outlined above.
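A compact sketch of this normalized distance (A and B hold orthonormal basis columns; the normalization by the Frobenius norm of A follows the description above):

```python
# Sketch of the normalized Grassmannian chordal distance; A and B are
# N x k matrices with orthonormal columns spanning each subspace.
import numpy as np

def chordal_distance(A, B):
    d = np.linalg.norm(A @ A.T - B @ B.T, "fro") / np.sqrt(2)
    return float(d / np.linalg.norm(A, "fro"))  # scaled to [0, 1]
```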

Participant demographics

Across the participants, there was no statistically significant difference in word length based on sex (three-way analysis of variance, F (1,4257) = 1.78, P  = 0.18) or underlying diagnosis (essential tremor versus Parkinson’s disease; F (1,4257) = 0.45, P  = 0.50). Among participants with Parkinson’s disease, there was a significant difference based on disease severity (both ON and OFF scores), with more advanced disease (higher scores) correlating with longer word lengths ( F (1,3295) = 145.8, P  = 7.1 × 10 −33 for ON score and F (1,3295) = 1,006.0, P  = 6.7 × 10 −193 for OFF score) and longer interword intervals ( F (1,3291) = 14.9, P  = 1.1 × 10 −4 for ON score and F (1,3291) = 31.8, P  = 1.9 × 10 −8 for OFF score). When neuronal activity was modelled in relation to these interword intervals (bottom versus top quartile), decoding performance was slightly higher for longer compared to shorter delays (0.76 ± 0.01 versus 0.68 ± 0.01, P  < 0.001, two-sided Mann–Whitney U -test).
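
As a rough sketch of how such analyses can be run with statsmodels and SciPy (the data frame df and the arrays auc_long and auc_short are hypothetical, and the model formula only approximates the three-way ANOVA described above):

```python
import statsmodels.api as sm
from scipy.stats import mannwhitneyu
from statsmodels.formula.api import ols

# df: pandas DataFrame with one row per produced word (hypothetical columns).
model = ols('word_length ~ C(sex) + C(diagnosis) + updrs_off', data=df).fit()
print(sm.stats.anova_lm(model, typ=2))     # F and P values per factor

# Decoding performance for long versus short interword intervals (assumed arrays).
stat, p = mannwhitneyu(auc_long, auc_short, alternative='two-sided')
```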

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

All the primary data supporting the main findings of this study are available online at https://doi.org/10.6084/m9.figshare.24720501 .  Source data are provided with this paper.

Code availability

All code necessary for reproducing the main findings of this study is available online at https://doi.org/10.6084/m9.figshare.24720501 .

Levelt, W. J. M., Roelofs, A. & Meyer, A. S. A Theory of Lexical Access in Speech Production Vol. 22 (Cambridge Univ. Press, 1999).

Kazanina, N., Bowers, J. S. & Idsardi, W. Phonemes: lexical access and beyond. Psychon. Bull. Rev. 25 , 560–585 (2018).


Bohland, J. W. & Guenther, F. H. An fMRI investigation of syllable sequence production. NeuroImage 32 , 821–841 (2006).

Basilakos, A., Smith, K. G., Fillmore, P., Fridriksson, J. & Fedorenko, E. Functional characterization of the human speech articulation network. Cereb. Cortex 28 , 1816–1830 (2017).


Tourville, J. A., Nieto-Castañón, A., Heyne, M. & Guenther, F. H. Functional parcellation of the speech production cortex. J. Speech Lang. Hear. Res. 62 , 3055–3070 (2019).


Lee, D. K. et al. Neural encoding and production of functional morphemes in the posterior temporal lobe. Nat. Commun. 9 , 1877 (2018).


Glanz, O., Hader, M., Schulze-Bonhage, A., Auer, P. & Ball, T. A study of word complexity under conditions of non-experimental, natural overt speech production using ECoG. Front. Hum. Neurosci. 15 , 711886 (2021).

Yellapantula, S., Forseth, K., Tandon, N. & Aazhang, B. NetDI: methodology elucidating the role of power and dynamical brain network features that underpin word production. eNeuro 8 , ENEURO.0177-20.2020 (2020).

Hoffman, P. Reductions in prefrontal activation predict off-topic utterances during speech production. Nat. Commun. 10 , 515 (2019).


Glasser, M. F. et al. A multi-modal parcellation of human cerebral cortex. Nature 536 , 171–178 (2016).

Chang, E. F. et al. Pure apraxia of speech after resection based in the posterior middle frontal gyrus. Neurosurgery 87 , E383–E389 (2020).

Hazem, S. R. et al. Middle frontal gyrus and area 55b: perioperative mapping and language outcomes. Front. Neurol. 12 , 646075 (2021).

Fedorenko, E. et al. Neural correlate of the construction of sentence meaning. Proc. Natl Acad. Sci. USA 113 , E6256–E6262 (2016).


Nelson, M. J. et al. Neurophysiological dynamics of phrase-structure building during sentence processing. Proc. Natl Acad. Sci. USA 114 , E3669–E3678 (2017).

Walenski, M., Europa, E., Caplan, D. & Thompson, C. K. Neural networks for sentence comprehension and production: an ALE-based meta-analysis of neuroimaging studies. Hum. Brain Mapp. 40 , 2275–2304 (2019).

Elin, K. et al. A new functional magnetic resonance imaging localizer for preoperative language mapping using a sentence completion task: validity, choice of baseline condition and test–retest reliability. Front. Hum. Neurosci. 16 , 791577 (2022).

Duffau, H. et al. The role of dominant premotor cortex in language: a study using intraoperative functional mapping in awake patients. Neuroimage 20 , 1903–1914 (2003).

Ikeda, S. et al. Neural decoding of single vowels during covert articulation using electrocorticography. Front. Hum. Neurosci. 8 , 125 (2014).

Ghosh, S. S., Tourville, J. A. & Guenther, F. H. A neuroimaging study of premotor lateralization and cerebellar involvement in the production of phonemes and syllables. J. Speech Lang. Hear. Res. 51 , 1183–1202 (2008).

Bouchard, K. E., Mesgarani, N., Johnson, K. & Chang, E. F. Functional organization of human sensorimotor cortex for speech articulation. Nature 495 , 327–332 (2013).

Anumanchipalli, G. K., Chartier, J. & Chang, E. F. Speech synthesis from neural decoding of spoken sentences. Nature 568 , 493–498 (2019).

Moses, D. A. et al. Neuroprosthesis for decoding speech in a paralyzed person with anarthria. N. Engl. J. Med. 385 , 217–227 (2021).

Wang, R. et al. Distributed feedforward and feedback cortical processing supports human speech production. Proc. Natl Acad. Sci. USA 120 , e2300255120 (2023).

Coudé, G. et al. Neurons controlling voluntary vocalization in the Macaque ventral premotor cortex. PLoS ONE 6 , e26822 (2011).

Hahnloser, R. H. R., Kozhevnikov, A. A. & Fee, M. S. An ultra-sparse code underlies the generation of neural sequences in a songbird. Nature 419 , 65–70 (2002).

Aronov, D., Andalman, A. S. & Fee, M. S. A specialized forebrain circuit for vocal babbling in the juvenile songbird. Science 320 , 630–634 (2008).


Stavisky, S. D. et al. Neural ensemble dynamics in dorsal motor cortex during speech in people with paralysis. eLife 8 , e46015 (2019).

Tankus, A., Fried, I. & Shoham, S. Structured neuronal encoding and decoding of human speech features. Nat. Commun. 3 , 1015 (2012).


Basilakos, A., Smith, K. G., Fillmore, P., Fridriksson, J. & Fedorenko, E. Functional characterization of the human speech articulation network. Cereb. Cortex 28 , 1816–1830 (2018).

Keating, P. & Shattuck-Hufnagel, S. A prosodic view of word form encoding for speech production. UCLA Work. Pap. Phon. 101 , 112–156 (1989).


Vyas, S., Golub, M. D., Sussillo, D. & Shenoy, K. V. Computation through neural population dynamics. Ann. Rev. Neurosci. 43 , 249–275 (2020).


Churchland, M. M., Cunningham, J. P., Kaufman, M. T., Ryu, S. I. & Shenoy, K. V. Cortical preparatory activity: representation of movement or first cog in a dynamical machine? Neuron 68 , 387–400 (2010).

Shenoy, K. V., Sahani, M. & Churchland, M. M. Cortical control of arm movements: a dynamical systems perspective. Ann. Rev. Neurosci. 36 , 337–359 (2013).

Kaufman, M. T., Churchland, M. M., Ryu, S. I. & Shenoy, K. V. Cortical activity in the null space: permitting preparation without movement. Nat. Neurosci. 17 , 440–448 (2014).

Mante, V., Sussillo, D., Shenoy, K. V. & Newsome, W. T. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503 , 78–84 (2013).

Vitevitch, M. S. & Luce, P. A. Phonological neighborhood effects in spoken word perception and production. Ann. Rev. Linguist. 2 , 75–94 (2016).

Jamali, M. et al. Dorsolateral prefrontal neurons mediate subjective decisions and their variation in humans. Nat. Neurosci. 22 , 1010–1020 (2019).

Mian, M. K. et al. Encoding of rules by neurons in the human dorsolateral prefrontal cortex. Cereb. Cortex 24 , 807–816 (2014).

Patel, S. R. et al. Studying task-related activity of individual neurons in the human brain. Nat. Protoc. 8 , 949–957 (2013).

Sheth, S. A. et al. Human dorsal anterior cingulate cortex neurons mediate ongoing behavioural adaptation. Nature 488 , 218–221 (2012).

Williams, Z. M., Bush, G., Rauch, S. L., Cosgrove, G. R. & Eskandar, E. N. Human anterior cingulate neurons and the integration of monetary reward with motor responses. Nat. Neurosci. 7 , 1370–1375 (2004).

Jang, A. I., Wittig, J. H. Jr., Inati, S. K. & Zaghloul, K. A. Human cortical neurons in the anterior temporal lobe reinstate spiking activity during verbal memory retrieval. Curr. Biol. 27 , 1700–1705 (2017).

Ponce, C. R. et al. Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences. Cell 177 , 999–1009 (2019).

Yoshor, D., Ghose, G. M., Bosking, W. H., Sun, P. & Maunsell, J. H. Spatial attention does not strongly modulate neuronal responses in early human visual cortex. J. Neurosci. 27 , 13205–13209 (2007).

Jamali, M. et al. Single-neuronal predictions of others’ beliefs in humans. Nature 591 , 610–614 (2021).

Hickok, G. & Poeppel, D. Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language. Cognition 92 , 67–99 (2004).

Poologaindran, A., Lowe, S. R. & Sughrue, M. E. The cortical organization of language: distilling human connectome insights for supratentorial neurosurgery. J. Neurosurg. 134 , 1959–1966 (2020).

Genon, S. et al. The heterogeneity of the left dorsal premotor cortex evidenced by multimodal connectivity-based parcellation and functional characterization. Neuroimage 170 , 400–411 (2018).

Milton, C. K. et al. Parcellation-based anatomic model of the semantic network. Brain Behav. 11 , e02065 (2021).

Sun, H. et al. Functional segregation in the left premotor cortex in language processing: evidence from fMRI. J. Integr. Neurosci. 12 , 221–233 (2013).

Peeva, M. G. et al. Distinct representations of phonemes, syllables and supra-syllabic sequences in the speech production network. Neuroimage 50 , 626–638 (2010).

Paulk, A. C. et al. Large-scale neural recordings with single neuron resolution using Neuropixels probes in human cortex. Nat. Neurosci. 25 , 252–263 (2022).

Coughlin, B. et al. Modified Neuropixels probes for recording human neurophysiology in the operating room. Nat. Protoc. 18 , 2927–2953 (2023).

Windolf, C. et al. Robust online multiband drift estimation in electrophysiology data. In Proc. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1–5 (IEEE, Rhodes Island, 2023).

Mehri, A. & Jalaie, S. A systematic review on methods of evaluate sentence production deficits in agrammatic aphasia patients: validity and reliability issues. J. Res. Med. Sci. 19 , 885–898 (2014).


Abbott, L. F. & Sejnowski, T. J. Neural Codes and Distributed Representations: Foundations of Neural Computation (MIT, 1999).

Green, D. M. & Swets, J. A. Signal Detection Theory and Psychophysics (Wiley, 1966).

International Phonetic Association. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet (Cambridge Univ. Press, 1999).

Indefrey, P. & Levelt, W. J. M. in The New Cognitive Neurosciences 2nd edn (ed. Gazzaniga, M. S.) 845–865 (MIT, 2000).

Slobin, D. I. Thinking for speaking. In Proc. 13th Annual Meeting of the Berkeley Linguistics Society (eds Aske, J. et al.) 435–445 (Berkeley Linguistics Society, 1987).

Pillon, A. Morpheme units in speech production: evidence from laboratory-induced verbal slips. Lang. Cogn. Proc. 13 , 465–498 (1998).


King, J. R. & Dehaene, S. Characterizing the dynamics of mental representations: the temporal generalization method. Trends Cogn. Sci. 18 , 203–210 (2014).

Machens, C. K., Romo, R. & Brody, C. D. Functional, but not anatomical, separation of “what” and “when” in prefrontal cortex. J. Neurosci. 30 , 350–360 (2010).

Elsayed, G. F., Lara, A. H., Kaufman, M. T., Churchland, M. M. & Cunningham, J. P. Reorganization between preparatory and movement population responses in motor cortex. Nat. Commun. 7 , 13239 (2016).

Roy, S., Zhao, L. & Wang, X. Distinct neural activities in premotor cortex during natural vocal behaviors in a New World primate, the Common Marmoset ( Callithrix jacchus ). J. Neurosci. 36 , 12168–12179 (2016).

Eliades, S. J. & Miller, C. T. Marmoset vocal communication: behavior and neurobiology. Dev. Neurobiol. 77 , 286–299 (2017).

Okobi, D. E. Jr, Banerjee, A., Matheson, A. M. M., Phelps, S. M. & Long, M. A. Motor cortical control of vocal interaction in neotropical singing mice. Science 363 , 983–988 (2019).

Cohen, Y. et al. Hidden neural states underlie canary song syntax. Nature 582 , 539–544 (2020).

Hickok, G. Computational neuroanatomy of speech production. Nat. Rev. Neurosci. 13 , 135–145 (2012).

Sahin, N. T., Pinker, S., Cash, S. S., Schomer, D. & Halgren, E. Sequential processing of lexical, grammatical and phonological information within Broca’s area. Science 326 , 445–449 (2009).

Russo, A. A. et al. Neural trajectories in the supplementary motor area and motor cortex exhibit distinct geometries, compatible with different classes of computation. Neuron 107 , 745–758 (2020).

Willett, F. R. et al. A high-performance speech neuroprosthesis. Nature 620 , 1031–1036 (2023).

Boersma, P. & Weenink, D. Praat: Doing Phonetics by Computer (2020); www.fon.hum.uva.nl/praat/ .

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M. & Sonderegger, M. Montreal Forced Aligner: trainable text-speech alignment using Kaldi. In Proc. Annual Conference of the International Speech Communication Association 498–502 (ISCA, 2017).

Lancaster, J. L. et al. Automated regional behavioral analysis for human brain images. Front. Neuroinform. 6 , 23 (2012).

Lancaster, J. L. et al. Automated analysis of fundamental features of brain structures. Neuroinformatics 9 , 371–380 (2011).

Fischl, B. & Dale, A. M. Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proc. Natl Acad. Sci. USA 97 , 11050–11055 (2000).

Fischl, B., Liu, A. & Dale, A. M. Automated manifold surgery: constructing geometrically accurate and topologically correct models of the human cerebral cortex. IEEE Trans. Med. Imaging 20 , 70–80 (2001).

Reuter, M., Schmansky, N. J., Rosas, H. D. & Fischl, B. Within-subject template estimation for unbiased longitudinal image analysis. Neuroimage 61 , 1402–1418 (2012).

Oostenveld, R., Fries, P., Maris, E. & Schoffelen, J. M. FieldTrip: open source software for advanced analysis of MEG, EEG and invasive electrophysiological data. Comput. Intell. Neurosci. 2011 , 156869 (2011).

Noiray, A., Iskarous, K., Bolanos, L. & Whalen, D. Tongue–jaw synergy in vowel height production: evidence from American English. In 8th International Seminar on Speech Production (eds Sock, R. et al.) 81–84 (ISSP, 2008).

Flege, J. E., Fletcher, S. G., McCutcheon, M. J. & Smith, S. C. The physiological specification of American English vowels. Lang. Speech 29 , 361–388 (1986).

Wells, J. Longman Pronunciation Dictionary (Pearson, 2008).

Seabold, S. & Perktold, J. Statsmodels: econometric and statistical modeling with Python. In Proc. 9th Python in Science Conference (eds van der Walt, S. & Millman, J.) 92–96 (SCIPY, 2010).

Cameron, A. C. & Windmeijer, F. A. G. An R -squared measure of goodness of fit for some common nonlinear regression models. J. Econometr. 77 , 329–342 (1997).


Hamilton, L. S. & Huth, A. G. The revolution will not be controlled: natural stimuli in speech neuroscience. Lang. Cogn. Neurosci. 35 , 573–582 (2020).

Hamilton, L. S., Oganian, Y., Hall, J. & Chang, E. F. Parallel and distributed encoding of speech across human auditory cortex. Cell 184 , 4626–4639 (2021).

Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9 , 2579–2605 (2008).

Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).


Ye, K. & Lim, L.-H. Schubert varieties and distances between subspaces of different dimensions. SIAM J. Matrix Anal. Appl. 37 , 1176–1197 (2016).


Acknowledgements

We thank all the participants for their generosity and willingness to take part in the research. We also thank A. Turk and S. Hufnagel for their insightful comments and suggestions as well as D. J. Kellar, Y. Chou, A. Zhang, A. O’Donnell and B. Mash for their assistance and contributions to the intraoperative setup and feedback. Finally, we thank B. Coughlin, E. Trautmann, C. Windolf, E. Varol, D. Soper, S. Stavisky and K. Shenoy for their assistance in developing the data processing pipeline. A.R.K. and W.M. are supported by the NIH Neuroscience Resident Research Program R25NS065743, M.J. is supported by CIHR and Foundations of Human Behavior Initiative, A.C.P. is supported by UG3NS123723, Tiny Blue Dot Foundation and P50MH119467. J.C. is supported by American Association of University Women, S.S.C. is supported by R44MH125700 and Tiny Blue Dot Foundation and Z.M.W. is supported by R01DC019653 and U01NS121616.

Author information

These authors contributed equally: Arjun R. Khanna, William Muñoz, Young Joon Kim

These authors jointly supervised this work: Sydney Cash, Ziv M. Williams

Authors and Affiliations

Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

Arjun R. Khanna, William Muñoz, Yoav Kfir, Mohsen Jamali, Jing Cai, Martina L. Mustroph, Irene Caprara, Mackenna Mejdell, Jeffrey Schweitzer & Ziv M. Williams

Harvard Medical School, Boston, MA, USA

Young Joon Kim & Abigail Zuckerman

Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

Angelique C. Paulk, Richard Hardstone, Domokos Meszéna & Sydney Cash

Harvard-MIT Division of Health Sciences and Technology, Boston, MA, USA

Ziv M. Williams

Harvard Medical School, Program in Neuroscience, Boston, MA, USA


Contributions

A.R.K. and Y.J.K. performed the analyses. Z.M.W., J.S. and W.M. performed the intraoperative neuronal recordings. W.M., Y.J.K., A.C.P., R.H. and D.M. performed the data processing and neuronal alignments. W.M. performed the spike sorting. A.C.P. and W.M. reconstructed the recording locations. A.R.K., W.M., Y.J.K., Y.K., A.C.P., M.J., J.C., M.L.M., I.C. and D.M. performed the experiments. Y.K. and M.J. implemented the task. M.M. and A.Z. transcribed the speech signals. A.C.P., S.C. and Z.M.W. devised the intraoperative Neuropixels recording approach. A.R.K., W.M., Y.J.K., A.C.P., M.J., J.S. and S.C. edited the manuscript and Z.M.W. conceived and designed the study, wrote the manuscript and directed and supervised all aspects of the research.

Corresponding author

Correspondence to Ziv M. Williams .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Eyiyemisi Damisah, Yves Boubenec and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Single-unit isolations from the human prefrontal cortex using Neuropixels recordings.

a . Individual recording sites on a standardized 3D brain model (FreeSurfer), on side ( top ), zoomed-in oblique ( inset ) and top ( bottom ) views. Recordings lay across the posterior middle frontal gyrus of the language-dominant prefrontal cortex and roughly ranged in distribution from alongside anterior area 55b to 8a. b . Recording coordinates for the five participants are given in MNI space. c . Left , representative example of raw, motion-corrected action potential traces recorded across neighbouring channels over time. Right , an example of overlayed spike waveform morphologies and their distribution across neighbouring channels recorded from a Neuropixels array. d . Isolation metrics for the recorded population (n = 272 units) together with an example of spikes from four concomitantly recorded units (labelled red, blue, cyan and yellow) in principal component space.

Extended Data Fig. 2 Naturalistic speech production task performance and phonetic selectivity across neurons and participants.

a . A priming-based speech production task that provided participants with pictorial representations of naturalistic events that had to be verbally described in a specific order. The task trial example is given here for illustrative purposes (created with BioRender.com). b . Mean word production times across participants and the standard deviation of the mean. The blue bars and dots represent performances for the five participants from whom recordings were acquired (n = 964, 1252, 406, 836, 805 words, respectively). The grey bar and dots represent a healthy control (n = 1534 words). c . Percentage of modulated neurons that responded selectively to specific planned phonemes across participants. All participants possessed neurons that responded to various phonetic features (one-sided χ 2  = 10.7, 6.9, 7.4, 0.5 and 1.3, p = 0.22, 0.44, 0.49, 0.97, 0.86, for participants 1–5, respectively).

Extended Data Fig. 3 Examples of single-neuronal activities and their temporal dynamics.

a . Peri-event time histograms were constructed by aligning the action potentials of each neuron to word onset. Data are presented as mean (line) values ± standard error of the mean (shade). Examples of three representative neurons that selectively changed their activity to specific planned phonemes. Inset , spike waveform morphology and scale bar (0.5 ms). b . Peri-event time histogram and action potential raster for the same neurons above but now aligned to the onset of the articulated phonemes themselves. Data are presented as mean (line) values ± standard error of the mean (shade). c . Sankey diagram displaying the proportions of neurons (n = 56) that displayed a change in activity polarity (increases in orange and decreases in purple) from planning to production.

Extended Data Fig. 4 Generalizability of explanatory power across phonetic groupings for consonants and vowels.

a . Scatter plots of the model explanatory power (D 2 ) for different phonetic groupings across the cell population (n = 272 units). Phonetic groupings were based on the planned (i) places of articulation of consonants and/or vowels, (ii) manners of articulation of consonants and (iii) primary cardinal vowels (Extended Data Table 1 ). Model D 2 explanatory power values across all phonetic groupings were significantly correlated (from top left to bottom right, p = 1.6×10 −146 , p = 2.8×10 −70 , p = 6.1×10 −54 , p = 1.4×10 −57 , p = 2.3×10 −43 and p = 5.9×10 −43 , two-sided tests of Spearman rank-order correlations). Spearman’s ρ values are 0.96, 0.83, 0.77, respectively, for the left to right top panels and 0.78, 0.71, 0.71, respectively, for the left to right bottom panels (dashed regression lines). Among phoneme-selective neurons, the planned places of articulation provided the highest explanatory power (two-sided Wilcoxon signed-rank test of model D 2 values, W = 716, p = 7.9×10 −16 ) and the best model fits (two-sided Wilcoxon signed-rank test of AIC, W = 2255, p = 1.3×10 −5 ) compared to manners of articulation. They also provided the highest explanatory power (two-sided Wilcoxon signed-rank test of model D 2 values, W = 846, p = 9.7×10 −15 ) and fits (two-sided Wilcoxon signed-rank test of AIC, W = 2088, p = 2.0×10 −6 ) compared to vowels. b . Multidimensional scaling (MDS) representation of all neurons across phonetic groupings. Neurons with similar response characteristics are plotted closer together. The hue of each point reflects the degree of selectivity to specific phonetic features. Here, the colour scale for places of articulation is provided in red, manners of articulation in green and vowels in blue. The size of each point reflects the magnitude of the maximum explanatory power in relation to each cell’s phonetic selectivity (maximum D 2 for places of articulation of consonants and/or vowels, manners of articulation of consonants and primary cardinal vowels).

Extended Data Fig. 5 Explanatory power for the acoustic–phonetic properties of phonemes and neuronal tuning to morphemes.

a . Left , scatter plot of the D 2 explanatory power of neurons for planned phonemes and their observed spectral frequencies during articulation (n = 272 units; Spearman’s ρ = 0.75, p = 9.3×10 −50 , two-sided test of Spearman rank-order correlation). Right , decoding performances for the spectral frequency of phonemes (n = 50 random test/train splits; p = 7.1×10 −18 , two-sided Mann–Whitney U-test). Data are presented as mean values ± standard error of the mean. b . Venn diagrams of neurons that were modulated by phonemes during planning and those that were modulated by the spectral frequency (left) and amplitude (right) of the phonemes during articulation. c . Left , peri-event time histogram and raster for a representative neuron exhibiting selectivity to words that contained bound morphemes (for example, –ing , –ed ) compared to words that did not. Data are presented as mean (line) values ± standard error of the mean (shade). Inset , spike waveform morphology and scale bar (0.5 ms). Right , decoding performance distribution for morphemes (n = 50 random test/train splits; p = 1.0×10 −17 , two-sided Mann–Whitney U-test). Data are presented as mean values ± standard deviation.

Extended Data Fig. 6 Phonetic representations of words during speech perception and the comparison of speaking to listening.

a . Left , Venn diagrams of neurons that selectively changed their activity to specific phonemes during word planning (−500:0 ms from word utterance onset) and perception (0:500 ms from word utterance onset). Right , average z-scored firing rate for selective neurons during word planning (black) and perception (grey) as a function of the Hamming distance. Here, the Hamming distance was based on the neurons’ preferred phonetic compositions during production and compared for the same neurons during perception. Data are presented as mean (line) values ± standard error of the mean (shade). b . Left , classifier decoding performances for selective neurons during word planning. The points provide the sampled distribution for the classifier’s ROC-AUC values (black) compared to random chance (grey; n = 50 random test/train splits; p = 7.1×10 −18 , two-sided Mann–Whitney U-test). Middle , decoding performance for selective neurons during perception (n = 50 random test/train splits; p = 7.1×10 −18 , two-sided Mann–Whitney U-test). Right , word planning-perception model-switch decoding performances for selective neurons. Here, models were trained on neural data for specific phonemes during planning and then used to decode those same phonemes during perception (n = 50 random test/train splits; p > 0.05, two-sided Mann–Whitney U-test; Methods ). The boundaries and midline of the boxplots represent the 25 th and 75 th percentiles and the median, respectively. c . Peak decoding performance for phonemes, syllables and morphemes as a function of time from perceived word onset. Peak decoding for morphemes was observed significantly later than for phonemes and syllables during perception (n = 50 random test/train splits; two-sided Kruskal–Wallis, H = 14.8, p = 0.00062). Data are presented here as median (dot) values ± bootstrapped standard error of the median.

Extended Data Fig. 7 Spatial distribution of representations based on cortical location and depth.

a . Relationship between recording location along the rostral–caudal axis of the prefrontal cortex and the proportion of neurons that displayed selectivity to specific phonemes, syllables and morphemes. Neurons that displayed selectivity were more likely to be found posteriorly (one-sided χ 2 test, p = 2.6×10 −9 , 3.0×10 −11 , 2.5×10 −6 , 3.9×10 −10 , for places of articulation, manners of articulation, syllables and morphemes, respectively). b . Relationship between recording depth along the cortical column and the proportion of neurons that displayed selectivity to specific phonemes, syllables and morphemes. Neurons that displayed selectivity were broadly distributed along the cortical column (one-sided χ 2 test, p > 0.05). Here, S indicates superficial, M middle and D deep.

Extended Data Fig. 8 Receiver operating characteristic curves across planned phonetic representations and decoding model-switching performances for word planning and production.

a . ROC-AUC curves for neurons across different phonemes, grouped by place of articulation, during planning (there were insufficient palatal consonants to allow for classification, so they are not displayed here). b . Average (solid line) and shuffled (dotted line) data across all phonemes. Data are presented as mean (line) values ± standard error of the mean (shade). c . Planning-production model-switch decoding performance sample distribution (n = 50 random test/train splits) for all selective neurons. Here, models were trained on neuronal data recorded during planning and then used to decode those same phonemes ( left ), syllables ( middle ) or morphemes ( right ) on neuronal data recorded during production. Slightly lower decoding performances were noted for syllables and morphemes when comparing word planning to production (p = 0.020 for syllable comparison and p = 0.032 for morpheme comparison, two-sided Mann–Whitney U-test). Data are presented as mean values ± standard deviation.

Extended Data Fig. 9 Example of phonetic representations in planning and production subspaces.

Modelled depiction of the neuronal population trajectory (bootstrap resampled) across averaged trials with (green) and without (grey) mid-low phonemes, projected into a plane within the “planning” subspace (y-axis) and a plane within the “production” subspace (z-axis). Projection planes within planning and production subspaces were chosen to enable visualization of trajectory divergence. Zero indicates word onset on the x-axis. Separation between the population trajectory during trials with and without mid-low phonemes is apparent in the planning subspace (y-axis) independently of the projection subspace (z-axis) because these subspaces are orthogonal. The orange plane indicates a hypothetical decision boundary learned by a classifier to separate neuronal activities between mid-low and non-mid-low trials. Because the classifier decision boundary is not constrained to lie within a particular subspace, classifier performance may therefore generalize across planning and production epochs, despite the near-orthogonality of these respective subspaces.

Supplementary information

Reporting Summary

Source Data Figs. 1–4

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Khanna, A.R., Muñoz, W., Kim, Y.J. et al. Single-neuronal elements of speech production in humans. Nature 626 , 603–610 (2024). https://doi.org/10.1038/s41586-023-06982-w

Download citation

Received : 22 June 2023

Accepted : 14 December 2023

Published : 31 January 2024

Issue Date : 15 February 2024

DOI : https://doi.org/10.1038/s41586-023-06982-w




Creativity and Production in Language Learning

As you read the anecdote below, try to figure out all the ways that students can receive language input and produce language during the project.

High beginning-level English language learners in Mr. Lin’s class are developing television and print ads for new products that they designed during a unit on advertising. Each team of three students has created and scanned a drawing of their product, and they have also developed a life-sized model for possible use in their TV commercial. Most of the students are now writing their ads. Mr. Lin watches as some of them are looking at  commercials on the Internet to get ideas; other students seem to be debating the wording of their ads. By assigning each group member a role, Mr. Lin has made sure that each student is responsible for an important piece of the project. Because he also has a rule that students must ask three other students a question before they bring it to him, he sees a lot of intergroup interaction. When the students have completed the scripts for the TV and print versions of their ads, they will try them out on another group, who will suggest  changes and other ideas before they go into production. Students will use the ESL program’s new digital camera to film their TV ad and then edit it with either iMovie for Mac or OpenShot Video Editor for PC. They will create their print ad using Microsoft Word software. The final versions of both ads will be posted to the Web, along with an explanation of the assignment and a reflection on the different processes and ideas behind the two types of ads. Students will then have the opportunity to obtain feedback from their classmates and from outside experts.

► Overview of Creativity and Production in Language Learning

Creativity and production are related to many of the standards and foundations for effective language learning. For example, although most researchers agree that input is important for language learning (see, e.g., seminal articles by Long, 1989, 1996; Pica, 1994), others have explored the important role of output , or language production, in language acquisition (Holliday, 1999; Quinn, 2018; Swain, 1995). Language production is important because it allows students to test their hypotheses about how language works and encourages students to use their preferred learning styles to gain additional input in the target language.

Social interaction not only enables valuable language input, but it also enables valuable language production. It allows learners to understand when others find their language incomprehensible and gives them an opportunity to explore various ways of making themselves understood. Feedback from others can also help them notice the discrete grammatical items that they need to focus on to improve their language. Language and content output can take many forms (e.g., speech, graphics, text), and it can range from essays to multimedia presentations.

Producing language assists the language learning process in many ways, but production does not in and of itself promote learning. For example, production can include relatively meaningless activities such as reciting answers to uncontextualized grammar drills. Creativity implies something more—doing something original, adapting, or changing. In this sense, a sentence or a presentation in language classrooms can be creative. To be creative, students need opportunities for intentional cognition; appropriate support, scaffolding, and feedback; and control over language aspects that they will use in their production. Working with others (see chapter 4) often facilitates creativity.


9.1 Evidence for Speech Production

Dinesh Ramoo

The evidence used by psycholinguists to understand speech production is varied and interesting. It includes speech errors , reaction-time experiments, neuroimaging, computational modelling, and the analysis of patients with language disorders. Until recently, the most prominent source of evidence for understanding how we speak came from speech errors. These are spontaneous mistakes we sometimes make in casual speech. Ordinary speech is far from perfect, and we often notice how we slip up. These slips of the tongue can be transcribed and analyzed for broad patterns. The most common method is to collect a large corpus of speech errors by recording all the errors one comes across in daily life.

Perhaps the most famous example of this type of analysis is what are termed ‘ Freudian slips .’ Freud (1901/1975) proposed that slips of the tongue were a way to understand repressed thoughts. According to his theories about the subconscious, certain thoughts may be too uncomfortable to be processed by the conscious mind and can be repressed. However, sometimes these unconscious thoughts may surface in dreams and slips of the tongue. Even before Freud, Meringer and Mayer (1895) analysed slips of the tongue (although not in terms of psychoanalysis).

Speech errors can be categorized into a number of subsets in terms of the linguistic units or mechanisms involved. Linguistic units involved in speech errors could be phonemes, syllables, morphemes, words or phrases. The mechanisms of the errors can involve the deletion, substitution, insertion, or blending of these units in some way. Fromkin (1971; 1973) argued that the fact that these errors involve some definable linguistic unit established their mental existence at some level in speech production. We will consider these in more detail in discussing the various stages of speech production.

Speech error: An error in the production of speech.

Freudian slip: An unintentional speech error hypothesized by Sigmund Freud as indicating subconscious feelings.

9.1 Evidence for Speech Production Copyright © 2021 by Dinesh Ramoo is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


What is Creative Production?

Welcome to the exciting world of creative production! Whether you're just starting out or an experienced professional, creative production is a rewarding field that offers endless opportunities to create and innovate. From planning out stunning videos and digital campaigns to designing intricate print materials and websites, creative professionals play a vital role in helping people communicate their message and reach their target audience.

No matter if you're interested in working with a creative agency or leading an in-house team, there are endless opportunities to make an impact and shape the future.

Let's get started!

Defining Creative Production

In its simplest definition, creative production is everything that brings a creative concept or idea to fruition. At its most complex, one could include everything from developing a marketing strategy and creating marketing materials to producing promotions and delivering a plethora of other related forms of content.

That being said, the production processes for creative projects can vary widely depending on the specific goals and needs of the client. It may involve a number of different steps, such as research and planning, concept development and promotions, design and execution, and testing and refinement.

Creative production requires a wide range of skills, generally including creative problem-solving, design, writing, and production management. It's important for creative professionals to be able to work well as part of a team and to be able to adapt to any changing needs.

And given the similarity of their names, creative production and creative operations are oftentimes mistakenly swapped.

Professional Creative Titles

Is there a difference between Creative Operations and Creative Production?

Yes, creative production and creative operations are two distinct areas of focus within the creative industry. While they may overlap in some ways, they have distinct characteristics and goals. Read our in-depth piece on what creative operations is, here .

Creative operations refer to the overall management and organization of the creative teams and processes. This may also include things like project management , resource allocation, quality control, and process improvement. Creative production on the other hand refers to the process of bringing a creative concept or idea to fruition. This involves a wide range of activities, from research and planning to design and execution.

Think of creative operations as the "how" of creating and creative production as the act of creating.

Creativity is an obvious and essential element of both. In creative operations, it is focused on finding ways to improve and optimize the overall creative process, while in creative production, it's focused on the development and execution of specific projects.

Remember, the creative industry is constantly evolving and the roles of creative production and creative operations are likely to change and adapt over time. However, the distinction between these two areas remains an important one, and understanding the difference can help creative professionals navigate the ever-changing landscape of the creative industry.

In-House vs. External

In-House vs. External Creative Production

Creative agencies and in-house teams both play important roles in the creation of marketing materials, advertising campaigns, and other forms of content. However, there are some key differences between the two approaches.

In-house production refers to the process of creating marketing materials, digital advertising campaigns, and other forms of content within an organization. This may involve working with an in-house team of designers, writers, and videographers, or it may involve outsourcing certain tasks to freelancers. In-house creative production can be more cost-effective, as the organization does not need to pay for external services. However, in-house teams may not have the same level of specialization and expertise as external creative production agencies.

External production refers to the process of creating content using an external creative agency or other partners. External creative production agencies are typically staffed by professionals with varied backgrounds, and they may be able to offer a higher level of specialization and expertise than in-house teams. External production can also be more expensive than in-house production, as organizations need to pay for the services of the external agency.

Flexibility and scalability are important considerations when deciding which way to go. Creative agencies are typically able to adapt their processes to the changing needs of their employer and quickly scale up or down as needed. In-house creative production teams, on the other hand, may be more limited in their ability to adapt to changing circumstances and may have more difficulty scaling up or down as needed.

Costs are a significant factor to consider when comparing agencies and in-house teams. Agencies often charge fees for their services, which can be a significant expense. In-house creative production teams, on the other hand, may be more cost-effective in the long run.

Ultimately, the decision between using an agency or an in-house team will depend on the specific project and goals of an organization. Both approaches have their own unique strengths and drawbacks. As with most things, it's important to carefully consider the pros and cons of each option before making a decision.

Research, Planning, Design, Execution

Jobs and Careers in Creative Production

The following positions are a few of the professional titles you'll find throughout creative production. All of them are responsible in some way for overseeing and managing the production efforts of an organization. Don't forget that the creative production umbrella is huge and can include titles across writing, video, design, project management, strategy, and any other positions that help complete the ask.

Here's a list of creative production positions in no particular order: 

  • Creative Production [Head of, Director, Manager, Coordinator, Assistant, Intern]
  • Creative Producer
  • Art Director
  • And the list goes on

10 Guidelines

How to Future-Proof your Creative Production Team

The industry is constantly evolving but by following these 10 simple guidelines, and evolving with the industry, your business can hopefully weather most storms.

Plan ahead: Proper planning can make the process much smoother and more efficient. By anticipating potential challenges and developing contingency plans, you can avoid last-minute crises and keep your company on track.

Set clear goals and objectives: It's important to have a clear understanding of what you want to produce. This will help you stay focused and make better decisions about how to allocate resources and time.

Collaborate and communicate: Let's be honest, collaboration and communication are critical to the success of most creative projects. By working closely with your team and stakeholders, you can ensure that everyone is on the same page (or video) and that everyone's needs are being met.

Embrace technology: This goes back to continuously evolving. There is a wide range of tools and technologies available (or coming soon) to support the creative production process. By using these effectively, you can streamline your workflows and increase your productivity. Don't be afraid to try new things.

Be flexible and adaptable: At the employee level and the company level. It's important to be able to adapt to changing circumstances. By being flexible and open to new ideas, you can stay ahead of the curve and ensure that your efforts take you in the right direction.

Foster a culture of what-ifs: Innovation is critical to the success of any creative production. Sometimes the biggest investment you can make isn't money; it's time spent with your team asking (or getting them to ask): what if we did x, y, z?

Hire the right people: Speaking of your team. The quality of your output is heavily dependent on the people and leadership you have on your team. By hiring talented and skilled professionals, you can increase your chances of success. Period.

Manage your budget effectively: By managing your costs effectively, you can ensure that you have the resources you need to develop exactly what you need to lead a successful team.

Process improvement: Does this mean video? Web programs? Design? Project management capabilities? Yes. This is an ongoing process and it's important to continually look for ways to improve and optimize the process. This may lean a little more into creative operations, but nonetheless, it's important to keep an eye on it in any creative endeavor.

Industry trends and best practices: Staying up-to-date on trends and best practices is important for any creative production professional. This may involve keeping track of new technologies, techniques, and approaches, and incorporating these into your work as appropriate. Keep an eye out for events to take part in.

Wrapping Up

That's a Wrap

Creative production is a complex and multifaceted field that requires a wide range of different skills and expertise. From developing marketing strategies to creating marketing materials and advertising campaigns, creative professionals play a vital role in helping organizations communicate their message and reach their target audience. Whether working with clients as part of an agency or leading an in-house team, these professionals must be able to create high-quality content that meets the needs of their clients and helps to achieve the desired goals.

Looking at the future of creative production, it feels like things are only going to speed up. More clients, looking for more content, faster than ever before. If you haven't already, start finding ways to do just that; it's time. This is the year of automation, video, and quicker workflows. Faster, high-quality creative production is needed now more than ever. How will you make it happen?

Thank you for reading and we'd love to hear your thoughts on creative production. Feel free to reach out on LinkedIn ! 


Neurocognition of Language/Speech Comprehension and Speech Production


  • 1 Introduction
  • 2 Speech Comprehension
    • 2.1 Phoneme perception
    • 2.2 Word identification
  • 3 Speech Production
    • 3.1 Speech errors
    • 3.2 Conceptual planning
    • 3.3 Lexicalisation
    • 3.4 Grammatical planning
    • 3.5 Articulation
  • 5 Further readings
  • 6 References

Introduction

Because language is defined as a system of symbolic representations of meaning, the term is not restricted to a particular means of communication: it applies to speech as well as to several other forms such as writing, sign language and, for example, logic-based computer languages. Nevertheless, the core of our everyday understanding of language is speech. It is the form in which human language evolved, and even today only about 200 of the estimated 6,000 to 7,000 spoken languages also exist in a written form. This chapter deals with the cognitive effort people expend each time they are engaged in a conversation: speech production and comprehension are two complementary processes that mediate between the meaning of what is said (believed to be represented independently of language) and the acoustic signal that speakers exchange. Both directions of transformation include many steps of processing the different units of speech, such as phonological features, phonemes, syllables, words, phrases and sentences. Put simply, those steps are performed in top-down order in speech production and in bottom-up order in speech comprehension. In spite of scientific consensus about the overall structure of speech comprehension and production, there are many competing models. The chapter will present some of these models along with evidence from experimental psychology. The comprehension section of this chapter starts with an account of how sound waves turn into phonemes, the smallest meaning-distinguishing units, and goes on to address processing at the word level. Sentence comprehension – which follows word processing in the hierarchy of language – has not yet been addressed in this chapter and the Wiki-community is requested to add a section dealing with this issue. In the section treating speech production, the reader is introduced to the planning involved in transforming a message into a sequence of words, the problem of lexicalization (i.e., finding the right word) and the final steps up to the motor generation of the required sounds.

Speech Comprehension

There is no doubt that speech comprehension is a highly developed ability in humans because, despite the high complexity of the speech signal, it happens almost automatically and notably fast: we can understand speech at a rate of 20 phonemes per second, while in a sequence of non-speech sounds the order of sounds can only be distinguished if they are presented at a rate slower than 1.5 sounds per second (Clark & Clark, 1977). This is a first hint that there must be mechanisms that utilise the additional information speech contains, compared with other sounds, in order to facilitate processing.

Phoneme perception

The starting point of our way to comprehend an utterance is the acoustic sequence reaching our auditory system. As a first step, we have to separate the speech signal from other auditory input. This can be done because speech is continuous, which background noise normally is not, and because throughout our life our auditory system learns to utilise acoustic properties like frequency to assign sounds to a possible source (Cutler & Clifton, 1999). Next, we have to identify as segments the individual sounds that form the sequential speech signal, so that we can relate them to meaning. This early part of speech comprehension is also referred to as decoding . The smallest unit that makes a difference to meaning is the phoneme , a distinguishable single sound that in most cases corresponds to a particular letter of the alphabet. However, letters can represent more than one phoneme, as “u” does in “hut” and “put”. Linguists define phonemes by their particular way of articulation, that is, the parts of the articulatory tract and the movements involved in producing them. For example, the English phoneme /k/ is defined as a “velar voiceless stop consonant” (Harley, 2008). Phonemes that share certain articulatory and therefore phonological features appear more similar and are more often confused (like /k/ and /g/, which sound more alike than /k/ and /d/ or /e/).

Decoding: Translating the acoustic speech signal into a linguistic representation.

Phoneme: The smallest linguistic unit that has an impact on meaning and at the same time the smallest phonological unit perceived as a distinct sound. Often linked to a particular letter of the alphabet.

Categorical perception of phonemes

Although we perceive speech sounds as phonemes, it is wrong to assume that the acoustic and the phonemic level of language are identical; rather, through the acquisition of our first language, phonemes shape our perception. Phonemes are categories that comprise sounds with varying acoustic properties, and they differ between languages. Maybe the best-known example is that Japanese native speakers initially have difficulty telling apart the European phonemes /r/ and /l/, which constitute one phoneme in Japanese. Likewise, Hindi speakers distinguish between two sounds which Europeans both perceive as /k/. The learned pattern of distinction is applied strictly, causing categorical perception of phonemes: people usually perceive a sound as one phoneme or another, not as something in between. This effect was demonstrated when artificial syllables that varied acoustically along a continuum between two phonemes were presented to participants. At a certain point there was a “boundary” between the ranges in which either one of the two phonemes was perceived (Liberman, Harris, Hoffman & Griffith, 1957). Nevertheless, it was found that if requested to do so, we are able to detect minimal differences between sounds that belong to the same phoneme category (Pisoni & Tash, 1974). Accordingly, it seems that categorical perception is not a necessity of an early level of processing but a habit employed to simplify comprehension. There is still controversy about how categorical perception actually happens. It seems unlikely that perceived sounds are compared with a “perfect exemplar” of the respective phoneme, because a large number of “perfect exemplars” of each phoneme would be necessary to cover speakers of different age, gender, dialect and so on (Harley, 2008).

Coarticulation

Theories that assume phoneme perception can rely on phonological properties alone, however, are confronted with two basic problems: the invariance problem and the segmentation problem. The invariance problem is that the same phoneme can sound different depending on the syllable context in which it occurs. This is due to co-articulation: while one sound is articulated, the vocal tract is already adapting to the position required for the next sound. It has therefore been argued that syllables are more invariant than phonemes. This may partly account for the finding that participants take longer to respond to a single phoneme than to a syllable (Savin & Bever, 1970), which led the authors to conclude that syllables are processed first. Whether this is the case or not, listeners clearly make use of the information each phoneme contains about the surrounding ones: experimentally produced mismatches in co-articulatory information (that is, phonemes pronounced in a way that did not fit the phoneme that came next) led to longer reaction times for phoneme recognition (Martin & Bunnell, 1981). The invariance problem therefore points to syllables as units that mediate between the meaning-bearing role of phonemes and their varying acoustic form.

Segmentation

The segmentation problem refers to the necessary step of dividing the continuous speech signal into its component parts (i.e., phonemes, syllables, and words). Segmentation cannot be achieved by the signal’s physical features alone, because sounds slur together and cannot easily be separated, not only within words but also across words (Harley, 2008). In the spectrographic display of an acoustic speech signal, we can hardly tell the boundaries between phonemes, syllables, or words (Figure 1). Perhaps the best starting point for segmentation the speech signal offers is its prosody, especially its rhythm. Rhythm differs between languages, inducing different segmentation strategies: French has a very regular rhythm with syllables that do not contract or expand much, which allows for syllable-based segmentation. English, in contrast, strongly distinguishes between stressed and unstressed syllables, which can be expanded or contracted to fit the rhythm. Rhythmic segmentation in English is therefore stress-based, yielding units called phonological words that consist of one stressed syllable and its associated unstressed syllables, and that do not necessarily correspond to lexical words (for example, “select us” is one phonological word; Harley, 2008). The importance of stress structure for speech recognition was shown when participants had to detect a certain string within nonsense sound sequences: their reactions were faster when the target lay within one phonological word than when it crossed a phonological-word boundary (Cutler & Norris, 1988). In the same task, the use of syllable knowledge was also demonstrated, because strings that corresponded to a normal syllable were detected faster than those that were shorter or longer (Mehler, Dommergues, Frauenfelder & Segui, 1981). Another aim of segmentation seems to be to assign all of the perceived sounds to words: detecting words embedded in non-words was more difficult (it happened more slowly) when the remaining sounds did not have a word-like phonological structure (like detecting “egg” in “fegg”, where only “f” remains, as opposed to “maffegg”; Norris, McQueen, Cutler & Butterfield, 1997).

Segmentation: Dividing the continuous speech signal into its underlying linguistic units like phonemes, syllables and words.

Prosody: Tonal and rhythmic features of speech such as stress pattern and intonation. Prosody may contain meaning in spoken language; it is absent in written language.

Phonological word: A unit of prosody that contains one stressed syllable. It may consist of one lexical word, but can be longer or shorter.

Lexical word: A word as a unit of meaning; corresponds to the everyday understanding of “word”.
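
The stress-based strategy described above can be sketched in a few lines of Python. The sketch assumes the input is already syllabified and marked for stress (both strong simplifications) and applies only the bare metrical heuristic, opening a new phonological word at every stressed syllable; it is not a full segmentation model.

```python
def stress_segment(syllables):
    """Group (syllable, is_stressed) pairs into phonological words.

    Heuristic (after Cutler & Norris, 1988): hypothesise a word boundary
    before each stressed syllable; unstressed syllables attach to the
    preceding stressed one.
    """
    words, current = [], []
    for syll, stressed in syllables:
        if stressed and current:  # a strong syllable opens a new unit
            words.append(current)
            current = []
        current.append(syll)
    if current:
        words.append(current)
    return words

# "give me a beer": one stressed syllable per phonological word.
stream = [("give", True), ("me", False), ("a", False), ("beer", True)]
print(stress_segment(stream))  # [['give', 'me', 'a'], ['beer']]
```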

Top-down feedback

The processes just described operate on higher-level units, but there is evidence for (and still strong controversy about) top-down feedback from those units onto phoneme identification. One line of evidence is the so-called lexical identification shift, which occurs in designs that examine categorical perception with sounds varying on a continuum between two phonemes. If these are presented in a word context, participants shift their judgement in favour of the phoneme that creates a meaningful word (for example towards /k/ rather than /g/ before the string *iss, since “kiss” is a word but “giss” is not; Ganong, 1980). Phoneme restoration can be observed if a phoneme of a word spoken within a sentence is cut out and replaced with a cough or a tone. Participants typically do not report that a phoneme was missing; they perceive the phoneme expected in the word even if they are told it has been replaced. If the same word with a cough replacing a phoneme is inserted into different sentences, each giving contextual support for a different word (for example “The *eel was on the orange.” vs “The *eel was on the axle.”), participants report having perceived the phoneme needed for the contextually expected word (Warren & Warren, 1970). Note that the words carrying the relevant content appear after the phoneme to be restored. It has therefore been questioned whether it is actually phoneme perception, or some later stage of processing, that is responsible for the restoration effect. It seems that while there is genuinely perceptual phoneme restoration (participants cannot distinguish between words with restored and actually heard phonemes), the context effects must be attributed to a stage of processing after retrieval of word meaning (Samuel, 1981). The dual-code theory of phoneme identification (Foss & Blank, 1980) states that there are two different sources of information we can use to identify individual phonemes: the prelexical code, which is computed from the acoustic information, and the postlexical code, which derives from the processing of higher-level units like words and creates top-down feedback. Outcomes of different study designs are interpreted as resulting from the use of one or the other source. The recognition of a certain phoneme in a non-word is as fast as in a word, suggesting use of the prelexical code, while a phoneme presented in a sentence is recognised faster if it is part of a word expected from sentence context than if it is part of an unexpected word, which can be seen as a result of postlexical processing. However, evidence for use of the postlexical code is too limited to support the idea that it is a generally applied strategy.

Prelexical code: The way speech is encoded before the words are identified. It is based purely on phonological information.

Postlexical code: The way speech is encoded after the words are identified. It contains semantic and syntactic information.

Word identification

The identification of words can be seen as a turning point in speech comprehension, because it is at the word level that the semantic and syntactic information needed to decipher the meaning of the utterance is represented. In the terminology introduced above, this word-level information is the postlexical code. Here the symbolic character of language comes into play: contrary to the prelexical code, the postlexical code is not derived from the acoustic features of the speech signal but from the listener’s mental representation of words (including meaning, grammatical properties, etc.). Most models propose a mental dictionary, the lexicon. The point at which a phonological string is successfully mapped onto an entry in the lexicon (and the postlexical code becomes available) is called lexical access (Harley, 2008; Cutler & Clifton, 1999). It is controversial how much word identification overlaps with processing on other levels of the speech comprehension hierarchy. Phoneme identification and word recognition could take place at the same time; at least, research has shown that phoneme identification does not have to be completed before word recognition can begin (Marslen-Wilson & Warren, 1994). Concerning the role of context in word identification, theories can be located between two extreme positions. The autonomous position states that context can only interact with the postlexical code but does not influence word identification as such; in particular, there should be no feedback from later levels of processing (the phrase or sentence level) to earlier stages. Exactly this structural context, however, is used for word recognition according to the interactive view. Out of the many models of word identification, two are introduced in some detail here: the cohort model and TRACE.

Lexicon: A mental dictionary from which the meaning and syntactic properties of each word are recalled once the word is identified.

The cohort model

The cohort model (original version: Marslen-Wilson & Welsh, 1978; later version: Marslen-Wilson, 1990) proposes three phases of word recognition. The access stage starts once we hear the beginning of a word and draw from our lexicon a number of candidates that could match it, the so-called cohort; the beginning of a word is therefore especially important for understanding. The initial cohort is formed in a bottom-up manner, unaffected by context. As more of the word is heard, the selection stage follows, in which the activation levels of candidates that no longer fit decay until the single best-fitting word is selected. Not only phonological evidence but also syntactic and semantic context is used in this selection process, especially in its late phase. These first two phases of word recognition are prelexical. The recognition point of the word can, but often does not, coincide with its uniqueness point, the point at which the initial sequence is unique to one word. If context information can be used to dismiss candidates, the recognition point may occur before the uniqueness point; if there is no helpful context and the acoustic signal is unclear, it may occur after it. With the recognition point, the third, postlexical phase begins: the integration stage, in which the semantic and syntactic properties of the chosen word are used, for example to integrate it into the representation of the sentence. Consistent with the model, experimentally introduced mispronunciations are more often ignored by participants who have to repeat a phrase if they appear in the last part of a word and if there is strong contextual information, while they cause confusion if they appear at the beginning and the context is ambiguous (Marslen-Wilson & Welsh, 1978).
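
The bottom-up core of the access and selection stages can be sketched in a few lines of Python. The toy lexicon below is invented, and the sketch deliberately ignores activation decay, context effects and graded activation: candidates are simply filtered by the growing phoneme prefix until the uniqueness point is reached.

```python
def cohort_trace(word, lexicon):
    """Show how the cohort of lexical candidates shrinks as segments arrive.

    Returns the list of (prefix, cohort) states; the uniqueness point is
    the first prefix at which only one candidate survives.
    """
    states = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        cohort = [w for w in lexicon if w.startswith(prefix)]
        states.append((prefix, cohort))
        if len(cohort) == 1:
            break
    return states

lexicon = ["trespass", "tress", "trestle", "trek", "tree"]
for prefix, cohort in cohort_trace("trestle", lexicon):
    print(f"{prefix!r:12} cohort: {cohort}")
```

Running this shows the cohort narrowing from five candidates to one at the prefix "trest", the uniqueness point of "trestle" in this toy lexicon.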

The TRACE model

TRACE (McClelland & Elman, 1986) is a connectionist computer model of speech recognition: it consists of many connected processing units operating on different levels, representing phonological features, phonemes and words. Activation spreads bidirectionally between units, allowing for both bottom-up and top-down processing. Between units of the same processing level there are inhibitory connections that make those units compete with each other and simulate phenomena such as the categorical perception of phonemes. At the word level, too, there is evidence for competition (that is, mutual inhibition) between candidates: words embedded in non-word strings take longer to detect if the non-word part resembles some other existing word than if it does not. This effect was shown to co-occur with, and to be independent of, the influence of stress-based segmentation discussed earlier (McQueen, Norris & Cutler, 1994). TRACE is good at simulating some features of human speech perception, especially context effects, while in other respects it differs from human perception, such as its tolerance for mistakes: words varied in phonemic detail (such as “smob” derived from “smog”) are identified as the related words by TRACE, while to humans they appear as non-words (Harley, 2008). Other researchers have criticised the model’s heavy reliance on top-down feedback, since a version of TRACE without top-down feedback simulated speech perception as well as the original version (Cutler & Clifton, 1999).
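
The competition dynamics TRACE relies on can be caricatured with a single layer of mutually inhibiting units. The sketch below is not TRACE itself (which has feature, phoneme and word layers with bidirectional links); with invented parameter values, it only shows how lateral inhibition amplifies a small difference in bottom-up support into a winner-take-all outcome.

```python
def compete(support, inhibition=1.2, decay=0.1, steps=15):
    """Lateral-inhibition dynamics over one layer of units: each unit adds
    its bottom-up support and is suppressed in proportion to the total
    activation of its competitors. Activations are clamped to [0, 1]."""
    act = [0.0] * len(support)
    for _ in range(steps):
        total = sum(act)
        act = [
            min(1.0, max(0.0, a + s - inhibition * (total - a) - decay * a))
            for a, s in zip(act, support)
        ]
    return act

# Two word candidates over ambiguous input slightly favouring the first:
# the small advantage is amplified into a categorical decision.
print(compete([0.55, 0.45]))  # converges to roughly [1.0, 0.0]
```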

Speech Production

The act of speaking involves processing steps similar to those of listening, but performed in reverse order, from sentence meaning to phonological features. Speaking can also be seen as bringing ideas into a linear form (a sentence is a one-dimensional sequence of words). According to Levelt (1989), speakers deal with three main issues. The first is conceptualisation: determining what to say and selecting relevant information to construct a preverbal message. The second is formulation: casting this preverbal message in a linguistic form, including the selection of individual words, syntactic planning and the encoding of the words as sounds. The third is execution: implementing the linguistic representation on the articulatory motor system. Figure 2 gives an overview of Levelt’s model; its particular features are addressed in the following sections. There is evidence that speech production is an incremental process, meaning that planning and articulation take place at the same time and early processing steps “run ahead” of later ones in the verbal sequence being prepared. For example, if a short sentence with two nouns has to be formed to describe a picture, an auditory distractor delays the onset of speaking if it is semantically related to either of the two nouns, or if it is phonologically related to the first noun, but not if it is phonologically related to the second (Meyer, 1996). This supports the notion that before speaking starts all nouns of the sentence are prepared semantically, but only the first is already encoded phonetically. Planning also seems to take place in periodically recurring phases: pauses were found to occur after every five to eight words in normal conversations, and periods of fluent speech alternate with more dysfluent periods, each associated with different patterns of gestures and eye contact. This has been interpreted as ‘cognitive cycles’ that structure speaking (Henderson, Goldman-Eisler & Skarbeck, 1966). There is much less literature on speech production than on comprehension; most research has focused on collecting speech errors (and asking speakers what the intended utterance was) in order to find out how we organise speech production. Experimental studies, using for example picture-naming tasks, are a rather new field (Harley, 2008). Speech errors are therefore discussed below before the steps of conceptual and syntactic planning, lexicalisation and articulation are treated.

Incremental processing: Sequential processing steps operate simultaneously, such that while material that has already undergone step 1 passes through step 2, new material enters step 1.
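
This incrementality can be pictured with lazy generators in Python: the second stage starts consuming output from the first before the first has finished all its material. This is only a scheduling metaphor (the stage names are invented), not a model of the underlying psychology.

```python
def plan(ideas):
    """Stage 1: turn each idea into a preverbal message, one at a time."""
    for idea in ideas:
        print(f"planning:     {idea}")
        yield f"<message:{idea}>"

def articulate(messages):
    """Stage 2: articulate each message as soon as it becomes available."""
    for msg in messages:
        print(f"articulating: {msg}")

# Interleaved output: the first message is articulated while later ideas
# are still unplanned -- early steps 'run ahead' of later ones.
articulate(plan(["greeting", "request", "farewell"]))
```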

Speech errors

All kinds of linguistic units (i.e., phonological features, phonemes, syllables, morphemes, words, phrases, sentences) can be subject to speech errors, which happen in everyday life as well as in laboratory tasks. Those errors involve different mechanisms such as the blend, substitution, exchange, addition or deletion of linguistic units (Harley, 2008). To give a more concrete idea of this classification, here are some (self-created) examples: a blend (“slickery” from “slick” and “slippery”), a substitution (“pass the pepper” when “pass the salt” was intended), an exchange (“heft lemisphere” for “left hemisphere”), an addition (“the blue bloat” for “the blue boat”) and a deletion (“the bue boat” for “the blue boat”).

The finding that speech errors involve particular linguistic units has been interpreted as evidence that those units are not merely descriptive categories of linguists but units of actual cognitive processing (Fromkin, 1971). Research has shown that errors do not happen at random: if people are confronted with materials that make mistakes likely (for example, if they are asked to read out quickly texts that include tongue twisters), errors that form lexically correct words happen more often than errors that do not. Errors that would form taboo words are less likely than other possible errors. Nevertheless, materials that make it possible to accidentally form a taboo word cause elevated galvanic skin responses, as though speakers internally monitor those possible errors (Motley, Camden & Baars, 1982).

Garrett's model of speech production

A general model of speech production based on speech error analysis was proposed by Garrett (1975, 1992). His basic assumption is that processing is serial and that different stages of processing do not interact with each other. Phrase planning takes place in two steps: at the functional level, at which the content and the main syntactic roles like subject and object are determined, and at the positional level, which includes determining the final word order and the phonological specification of the words used. Content words (nouns, verbs and adjectives) are selected at the first level, function words (like determiners and prepositions) only at the second. Phonological specification of content word stems therefore takes place before the phonological specification of function words or grammatical forms (like plural or past-tense forms of verbs). According to the theory, word exchanges occur at the first level and are therefore influenced by semantic relations, but much less by the distance between the words in the completed sentence. By contrast, sound exchanges, as products of phonological encoding, occur at a later stage at which the word order is already determined, which makes them constrained by distance. In accordance with the theory, sounds typically exchange across short distances, whereas words can exchange across the whole phrase. Garrett’s theory also predicts that elements should only exchange if they belong to the same processing level. This is supported by the robust finding that content words and function words almost never exchange with each other (Harley, 2008). Other speech errors are more difficult to explain with Garrett’s model. Word blends, like “quizzle” from “quiz” and “puzzle”, make it seem probable that two words have been drawn from the lexicon at the same time, contrary to Garrett’s idea that language production is a serial rather than a parallel process. Even more problematically, word blends and even blends of whole phrases seem to be facilitated by phonological similarity: intruding and intended content merge more often at points where they share phonemes or syllables than would be expected by chance. This should not be the case if planning at the functional level and phonological processing were indeed separate stages with no interaction between them (Harley, 1984).

Conceptual planning

It has already been mentioned that speaking involves linearising ideas: even if what we want to say involves concepts related to each other in complex ways (for example, like a network), we have to address them one by one. This is a main object of conceptual preparation, a step which, according to Levelt (1999), takes place even before the ideas are transferred into words, yielding a preverbal message. Macroplanning is the part of conceptual preparation that can be described as management of the topic: the speaker has to ensure the listener can keep track when attention is led from one item to the next. When people go through a set of items in a conversation, they normally pick items directly related to the previous one; if this is not possible, they go back to a central item that they can relate to the next one, or they start with a simple item and advance to more difficult ones. The ideas we express in sentences normally include relations between referents. To get those relations into the linear form required in speech, we have to assign the referents to grammatical roles like subject and object, which in most languages are related to certain positions in the sentence. This is called microplanning. Often it is possible to express the same relation through various syntactic constructions reflecting different perspectives, and we have to choose one before we can begin to speak. For example, if a cat and a dog sit next to each other, we could say “The cat sits to the right of the dog” as well as “The dog sits to the left of the cat” (Levelt, 1999). It has been proposed that the overall structure of a sentence (like active versus passive, or an adverbial at the beginning or at the end) is determined somewhat independently of the content, perhaps with the help of a “syntactic module”. Evidence comes from syntactic priming, which occurs, for example, when participants have to describe a picture after reading an unrelated sentence: they choose a syntactic structure resembling the previously read sentence more often than expected by chance. Other aspects like the choice of words and their grammatical forms do not interact with this priming (Bock, 1986).

Conceptual preparation: Transferring relations between concepts into a sequence of syntactic relations.

Referent: The person, object or concept a word refers to.

Lexicalisation

The concepts selected during conceptual planning have to be turned into words with defined grammatical and phonological features so that we can construct the sentence that is finally encoded phonologically for articulation. This “word selection” is called lexicalisation, and Levelt (1999) postulates that it is a two-step process: first, a semantically and syntactically specified representation of the word, the so-called lemma, is retrieved, which does not contain phonological information; then the lemma is linked to its phonological form, the lexeme. The tip-of-the-tongue state may be an everyday example of successful lemma selection with disrupted phonological processing: the phonological form of a word cannot be found, even though its meaning and even some grammatical or phonological details are known to the speaker. The model of separate semantic and phonological processing in lexicalisation is supported by evidence from picture-naming tasks with distractors: the time window in which auditory stimuli that were phonologically related to the item had to be presented to slow down naming differed from the time window in which semantically related stimuli interfered with naming (note that both could, presented in other time windows, also speed processing). According to these findings, it takes roughly 150 ms to process the picture and activate a concept, about 125 ms to select the lemma, and another 250 ms for phonological processing (Levelt et al., 1991). Other researchers argue that there is overlap between these phases, allowing for processing in cascade: information from semantic processing can be used for phonological processing even before lemma selection is completed. Peterson and Savoy (1998) found mediated priming in picture-naming tasks: the presentation of a phonological relative of a semantic relative of the target word (like “soda”, related to the target “couch” via “sofa”) at a certain moment in time facilitated processing. Another finding in favour of processing in cascade is that word substitution errors in which the inserted word is related both semantically and phonologically to the intended one (like “catalogue” for “calendar”) occur above chance level (Harley, 2008). The controversy goes even further, questioning the very existence of lemmas: as an alternative, Caramazza (1997) proposes a lexical-semantic, a syntactic and a phonological network between which information is exchanged during lexicalisation.

Lexicalisation: Word selection in speech production.

Lemma: A representation of the meaning and syntactic properties of a word that does not contain its phonological features.

Lexeme: The representation of a word’s phonological form.
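
The two-step view can be made concrete with a toy lookup in Python. The concept names, feature sets and transcription below are all invented; the point is only the separation of a phonology-free lemma stage from a later lexeme stage, with the tip-of-the-tongue state falling out as a lemma without a retrievable lexeme.

```python
LEMMAS = {  # concept -> semantic/syntactic specification (no phonology)
    "FELINE_PET": {"lemma": "cat", "category": "noun", "countable": True},
}
LEXEMES = {  # lemma -> phonological form
    "cat": "/kaet/",
}

def lexicalise(concept):
    lemma = LEMMAS[concept]                 # step 1: lemma selection
    form = LEXEMES.get(lemma["lemma"])      # step 2: phonological encoding
    if form is None:
        # Lemma found but no lexeme: the tip-of-the-tongue configuration.
        return lemma, "...?"
    return lemma, form

print(lexicalise("FELINE_PET"))
```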

Grammatical planning

For each word, grammatical features become accessible through lemma selection (or through the activation of the relevant elements in the syntactic network, in Caramazza’s model), constraining how the word can be integrated into the sentence. Each word can be conceptualised as a node in a syntactic network, and to complete the sentence structure a pathway connecting all those nodes has to be found. Idiomatic expressions are a special case, as they come with very strong constraints; it is therefore presumed that they are stored as separate entries in our mental lexicon, in addition to the entries for the single words they consist of (Levelt, 1999). In many languages, the morphological form of a word also has to be determined before it can be integrated into the sentence, taking into account its syntactic relations and additional information the word carries (like tense and number). Morphological transformations can be implemented by adding affixes to the word stem (like “speculat-ed” or “plant-s”) or by changing the word stem itself (like “swim-swam” or “mouse-mice”). The amount and complexity of morphological transformation in English is moderate compared to languages like German, Russian or Arabic, while other languages, like Chinese, have hardly any morphological transformations at all.

Morphology: The capability of words to take on different grammatical forms with distinct phonological forms.
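
A toy inflection routine in Python illustrates the two implementation routes just described: regular forms built by affixation and irregular forms retrieved as stored stem changes. The word lists are purely illustrative; real morphological encoding is of course far richer (e.g., spelling adjustments like “speculate” + “-ed”).

```python
# Irregular forms are stored whole; regular forms are derived by affixation.
IRREGULAR_PAST = {"swim": "swam", "go": "went"}
IRREGULAR_PLURAL = {"mouse": "mice", "foot": "feet"}

def past_tense(verb):
    return IRREGULAR_PAST.get(verb, verb + "ed")   # stem change or affix

def plural(noun):
    return IRREGULAR_PLURAL.get(noun, noun + "s")

print(past_tense("plant"), past_tense("swim"))  # planted swam
print(plural("cat"), plural("mouse"))           # cats mice
```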

Articulation

When the phonological information about the words in their appropriate morphological forms is available and the word order has been determined, articulation can begin. Keep in mind that these processes are incremental, so the sentence need not be prepared as a whole before its beginning is articulated. The task is to produce the required speech sounds in the right order with the right prosody, and there are different models of how this is achieved. Scan-copier models (Shattuck-Hufnagel, 1979) constitute a classic approach: a framework of syllable structure and stress pattern is prepared, phonemes are inserted into this framework by a ‘copier’ module, and the progress is continuously checked. Speech errors like phoneme exchange, phoneme deletion or perseveration can be explained by failures at particular points of the copying and checking processes. According to the competitive queuing model (Hartley & Houghton, 1996), which adopts the idea of a framework and a copier, the phonemes to be inserted form a queue, and the order of insertion is controlled by activating and inhibiting connections between them and particular units that mark the beginning and the end of a word. Thus, the phoneme with the strongest connection to the beginning unit is inserted in the first position.
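
The core mechanism of competitive queuing, an activation gradient plus inhibition of already-produced items, is easy to sketch in Python. The sketch below assumes, purely for illustration, that connection strengths to the word-initial marker fall off linearly with serial position; with added noise, adjacent items can swap, which resembles a transposition error.

```python
import random

def competitive_queue(phonemes, noise=0.0, rng=random.Random(0)):
    """Produce phonemes in the order of their activation gradient.

    The gradient encodes serial order; each produced item is inhibited so
    the next most active one can win. With noise, order errors appear.
    """
    activation = [1.0 - 0.1 * i + rng.gauss(0, noise)
                  for i in range(len(phonemes))]
    order = []
    for _ in phonemes:
        i = max(range(len(phonemes)), key=lambda j: activation[j])
        order.append(phonemes[i])
        activation[i] = float("-inf")  # inhibit the produced phoneme
    return order

print(competitive_queue(list("plan")))             # ['p', 'l', 'a', 'n']
print(competitive_queue(list("plan"), noise=0.4))  # may show a transposition
```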

The role of syllables for articulation

WEAVER++ (Levelt, 2001) is a two-step model which assumes that upon lexeme identification, the sequence of phonemes representing the whole word becomes available simultaneously. This is supported by the finding that in naming tasks, auditorily presented distractors that prime parts of the target word speed naming regardless of the position of the primed part within the target word (Meyer & Schriefers, 1991). As the next step, syllables, which are not part of the lexical representation, are formed sequentially. Because of co-articulation, syllables are required as input for the articulatory process. The formation of syllables is believed to be facilitated by a store of frequent syllables, the syllabary. Even in languages with a high number of different syllables, like English (more than 12,000), a much smaller number accounts for most syllables in a given utterance. Those syllables (and, in languages with only a few hundred different syllables, like Chinese or Japanese, perhaps all syllables) form highly automatised motor sequences that could, according to Rizzolatti and Gentilucci (1988), be stored in the supplementary motor area. A finding in favour of the existence of a syllabary is that pseudo-words (constructed from normal Dutch syllables) with high-frequency syllables were processed faster than pseudo-words with low-frequency syllables in an associative learning task (Cholin, Levelt & Schiller, 2006). Syllable formation can also depend on prosody. In stress-assigning languages like English, phonological words are formed by associating unstressed syllables with neighbouring stressed syllables. These phonological words seem to be prepared before speaking begins, as the onset of speaking takes longer for sentences containing more phonological words. In articulation, syllables bind together only within, not across, phonological words. For example, in the sentence “Give me a beer, if the beer is cold”, the ‘r’ of “beer” is bound to the following ‘i’ only in the second part of the sentence (“bee-ris cold”), as the comma marks a boundary between phonological words (Harley, 2008). The example also shows that syllables are not determined by lexical words: a phoneme that belongs to one syllable when its lexical word stands alone can end up in a syllable belonging to another lexical word.

Syllabary: A “dictionary” of frequent syllables that is used for syllable preparation in speech production.
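
The “bee-ris” effect and the syllabary lookup can be combined into one small Python sketch. The boundary flag, the vowel test over spelling, and the syllabary contents are crude, invented stand-ins for real phonological machinery, but they show both claims: binding applies only within a phonological word, and stored syllables come with ready-made motor programmes.

```python
VOWELS = set("aeiou")
SYLLABARY = {"bee", "ris", "cold"}  # illustrative high-frequency entries

def resyllabify(w1, w2, same_phonological_word):
    """Move a word-final consonant to the onset of a following vowel-initial
    word, but only inside one phonological word."""
    if same_phonological_word and w1[-1] not in VOWELS and w2[0] in VOWELS:
        return [w1[:-1], w1[-1] + w2]
    return [w1, w2]

print(resyllabify("beer", "is", True))   # ['bee', 'ris']
print(resyllabify("beer", "if", False))  # ['beer', 'if']; the comma blocks binding

# Each resulting syllable is then looked up in the syllabary: stored
# syllables have a ready-made motor programme, others must be assembled.
for syll in resyllabify("beer", "is", True) + ["cold"]:
    source = "syllabary" if syll in SYLLABARY else "assembled"
    print(syll, "->", source)
```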

Acoustic speech parameters

During articulation we manipulate not only the phonemic properties of the sounds we produce, but also parameters like volume, pitch and speed. Those parameters depend on the overall prosody of the utterance and on the position of a given syllable within it. While prosody can be regulated directly in order to convey meaning independent of the words used (different stress can make the same sentence sound like a statement or like a question), some acoustic parameters give hints about the speaker’s emotional state. Key is the variation of pitch within a phrase; it is influenced by the relevance of the phrase to the speaker and by the speaker’s emotional involvement. Register is the basic pitch; it is influenced by the speaker’s current self-esteem, use of the lower chest register indicating higher self-esteem than use of the head register (Levelt, 1999).

Monitoring of speech production

According to the standard model of speech production (Levelt, 1999), monitoring takes place throughout all phases of speech production. Levelt assumes that for monitoring syntactic arrangement we use the same ‘parsing’ mechanisms we employ to analyse the syntax of a heard sentence. Although speech production and speech comprehension involve different brain areas (there is activation in temporal auditory regions during listening and in motor areas during speaking; see the chapter about the biological basis of language), monitoring one’s own speech also seems to involve the temporal areas engaged when listening to other people. A ‘perceptual loop’ for phonetic monitoring has therefore been proposed (Levelt, 1999), although it is not yet clear whether this loop processes the auditory signal we produce or some earlier phonetic representation, a kind of ‘inner’ speech.

Summary

Speech comprehension starts with the identification of the speech signal against the auditory background and its transformation into an abstract representation, also called decoding. Speech sounds are perceived as phonemes, the smallest units that distinguish meaning. Phoneme perception is influenced not only by acoustic features but also by word and sentence context. To analyse its meaning, the continuous speech signal has to be segmented, which is done with the help of the rhythmic pattern of speech. In the subsequent step of word identification, the prelexical code, which contains only the phonological information of a word, is complemented by the postlexical code, that is, the semantic and syntactic properties of the word. A mental dictionary, the lexicon, is proposed, from which candidates for the heard word are singled out. With the integration of the postlexical codes of the single words, the meaning of the sentence can be deciphered. The endpoint of speech comprehension, the conceptual message, is the starting point of speech production. Ideas have to be organised into a linear form, as speech is a one-dimensional sequence, and expressed in syntactic relations. Words for the selected concepts have to be chosen, a process called lexicalisation, which mirrors word identification in reverse: here, the semantic and syntactic representation of the word (the lemma) is selected first and then linked to the phonological representation (the lexeme). The syntactic properties of the single words can be seen as constraints on their integration into the sentence, so a syntactic structure has to be constructed that meets all constraints. The morphological forms of the words also have to be specified before the sentence can be encoded phonologically for articulation. To plan articulation, syllables are constructed from the lexical words in tune with the phonological words that result from the sentence’s stress pattern. Throughout, speech production is an incremental process: articulation and the various stages of preparation for the following phrases take place simultaneously.

Further readings

Cutler, A. & Clifton, C. (1999). Comprehending spoken language: a blueprint of the listener. In: C. M. Brown & P. Hagoort (1999). The Neurocognition of Language. Oxford: Oxford University Press.

Levelt, W. J. M. (1999). Producing spoken language: a blueprint of the speaker. In: C. M. Brown & P. Hagoort (1999). The Neurocognition of Language. Oxford: Oxford University Press.

Fromkin, V. A. (1971). The non-anomalous nature of anomalous utterances. Language, 51, 696-719.

References

Bock, J. K. (1986). Syntactic persistence in language production. Cognitive Psychology, 18, 355-387.

Caramazza, A. (1997). How many levels of processing are there in lexical access? Cognitive Neuropsychology, 14, 177-208.

Cholin, J., Levelt, W. J. M. & Schiller, N. O. (2006). Effects of syllable frequency in speech production. Cognition, 99, 205-235.

Clark, H. H. & Clark, E. V. (1977). Psychology and language: An introduction to psycholinguistics. New York: Harcourt Brace Jovanovich.

Cutler, A. & Norris, D. G. (1988). The role of strong syllables in segmentation for lexical access. Journal of Experimental Psychology: Human Perception and Performance, 14, 113-121.

Foss, D. J. & Blank, M. A. (1980). Identifying the speech codes. Cognitive Psychology, 12, 1-31.

Fromkin, V. A. (1971). The non-anomalous nature of anomalous utterances. Language, 51, 696-719.

Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance, 6, 110-125.

Garrett, M. F. (1975). The analysis of sentence production. In: G. Bower. The psychology of learning and motivation (Vol. 9, pp. 133-177). New York: Academic Press.

Garrett, M. F. (1992). Disorders of lexical selection. Cognition, 42, 143-180.

Harley, T. A. (1984). A critique of top-down independent levels models of speech production: Evidence from non-plan-internal speech production. Cognitive Science, 8, 191-219.

Harley, T. A. (2008). The Psychology of Language: From Data to Theory. Third Edition. Hove: Psychology Press.

Hartley, T. & Houghton, G. (1996). A linguistically constrained model of short-term memory for non-words. Journal of Memory and Language, 35, 1-31.

Henderson, A., Goldman-Eisler, F. & Skarbeck, A. (1966). Sequential temporal patterns in speech. Language and Speech, 8, 236-242.

Liberman, A. M., Harris, K. S., Hoffman, H. S. & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 53, 358-368.

Levelt, W. J. M. (1989). Speaking: From intention to articulation. Cambridge, MA: MIT Press.

Levelt, W. J. M. (2001). Spoken word production: A theory of lexical access. Proceedings of the National Academy of Sciences, 98, 13464-13471.

Levelt, W. J. M., Schriefers, H., Vorberg, D., Meyer, A. S., Pechmann, T. & Havinga, J. (1991). The time course of lexical access in speech production: A study of picture naming. Psychological Review, 98, 122-142.

Marslen-Wilson, W. D. (1990). Activation, competition, and frequency in lexical access. In: G. T. M. Altmann (1990). Cognitive models of speech processing. Cambridge, MA: MIT Press.

Marslen-Wilson, W. D. & Warren, P. (1994). Levels of perceptual representation and process in lexical access: Words, phonemes and features. Psychological Review, 101, 653-675.

Marslen-Wilson, W. D. & Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10, 29-63.

Martin, J. G. & Bunnell, H. T. (1981). Perception of anticipatory coarticulation effects. Journal of the Acoustical Society of America, 69, 559-567.

McClelland, J. L. & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.

McQueen, J. M., Norris, D. G. & Cutler, A. (1994). Competition in spoken word recognition: Spotting words in other words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 621-638.

Mehler, J., Dommergues, J.-Y., Frauenfelder, U. H. & Segui, J. (1981). The syllable’s role in speech segmentation. Journal of Verbal Learning and Verbal Behavior, 20, 298-305.

Meyer, A. S. (1996). Lexical access in phrase and sentence production: Results from picture-word interference experiments. Journal of Memory and Language, 35, 477-496.

Meyer, A. S. & Schriefers, H. (1991). Phonological facilitation in picture-word interference experiments: Effects of stimulus onset asynchrony and types of interfering stimuli. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 1146-1160.

Motley, M. T., Camden, C. T. & Baars, B. J. (1982). Covert formulation and editing of anomalies in speech production: Evidence from experimentally elicited slips of the tongue. Journal of Verbal Learning and Verbal Behavior, 21, 578-594.

Norris, D. G., McQueen, J. M., Cutler, A. & Butterfield, S. (1997). The possible-word constraint in the segmentation of continuous speech. Cognitive Psychology, 34, 191-243.

Peterson, R. R. & Savoy, P. (1998). Lexical selection and phonological encoding during language production: Evidence for cascaded processing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 539-557.

Pisoni, D. B. & Tash, J. (1974). Reaction times to comparisons within and across phonetic categories. Perception and Psychophysics, 15, 285-290.

Rizzolatti, G. & Gentilucci, M. (1988). Motor and visual-motor functions of the premotor cortex. In: P. Rakic & W. Singer. Neurobiology of neocortex. Chichester: Wiley.

Samuel, A. G. (1981). Phonemic restoration: Insights from a new methodology. Journal of Experimental Psychology: General, 110, 474-494.

Savin, H. B. & Bever, T. G. (1970). The non-perceptual reality of the phoneme. Journal of Verbal Learning and Verbal Behavior, 9, 295-302.

Shattuck-Hufnagel, S. (1979). Speech errors as evidence for a serial ordering mechanism in speech production. In: W. E. Cooper & E. C. T. Walker. Sentence processing: Psycholinguistic studies presented to Merrill Garrett (pp. 295-342). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Warren, R. M. & Warren, R. P. (1970). Auditory illusions and confusions. Scientific American, 223, 30-36.



9.3 Speech Production Models

The Dell Model

Speech error analysis forms the basis of the model developed by Dell (1986, 1988). Dell’s spreading activation model (see Figure 9.3) has features informed by the observation that speech errors respect syllable-position constraints: when segmental speech errors occur, they usually involve exchanges between onsets, between peaks or between codas, but rarely between different syllable positions. Dell (1986) states that word-forms are represented in a lexical network composed of nodes that represent morphemes, segments and features. These nodes are connected by weighted bidirectional links.

Figure 9.3: A depiction of Dell’s spreading activation model, composed of nodes representing the morphemes, segments and features in a lexical network.

As seen in Figure 9.3, when a morpheme node is activated, activation spreads through the lexical network, with each node transmitting a proportion of its activation to its direct neighbour(s). The morpheme is mapped onto the associated segments with the highest levels of activation. The selected segments are encoded for particular syllable positions and can then be slotted into a syllable frame. This means that the /p/ phoneme encoded for syllable onset is stored separately from the /p/ phoneme encoded for syllable coda position. This also accounts for the phonetic level: instead of two separate levels for segments (a phonological and a phonetic level), there is only one segmental level, in which the onset /p/ is stored with its characteristic aspiration as [pʰ] and the coda /p/ is stored in its unaspirated form [p]. Although this means that segments need to be stored twice, for onset and coda positions, it simplifies the syllabification process, as segments automatically slot into their respective positions. Dell’s model ensures the preservation of syllable constraints in that onset phonemes can only fit into onset slots in the syllable template (the same being true for peaks and codas). The model also has an implicit competition between phonemes that belong to the same syllable position, which explains tongue-twisters such as the following:

  • “She sells sea shells by the seashore” ʃiː sɛlz siːʃɛlz baɪ ðiː siːʃɔː
  • “Betty Botter bought a bit of butter” bɛtiː bɒtə bɔːt ə bɪt ɒv bʌtə
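To make the competition concrete, here is a minimal sketch of spreading activation over a tiny lexical network with position-coded segment nodes, in the spirit of Dell (1986). The words, edge weights, spreading rate, and update rule are all illustrative assumptions rather than parameters of the published model.

```python
# Toy spreading-activation network with position-coded segments (assumed
# values throughout; an illustration, not Dell's actual simulation).

# Bidirectional weighted edges: morpheme nodes <-> position-coded segment nodes.
EDGES = {
    "morph:pat": [("seg:p_onset", 1.0), ("seg:a_peak", 1.0), ("seg:t_coda", 1.0)],
    "morph:tap": [("seg:t_onset", 1.0), ("seg:a_peak", 1.0), ("seg:p_coda", 1.0)],
}

def spread(activation, edges, rate=0.5, steps=3):
    """Each step, every node passes rate * weight * its activation to its neighbours."""
    adj = {}
    for src, targets in edges.items():
        for tgt, w in targets:
            adj.setdefault(src, []).append((tgt, w))  # edges are bidirectional,
            adj.setdefault(tgt, []).append((src, w))  # so record both directions
    for _ in range(steps):
        delta = {}
        for node, act in activation.items():
            for tgt, w in adj.get(node, []):
                delta[tgt] = delta.get(tgt, 0.0) + rate * w * act
        for node, d in delta.items():
            activation[node] = activation.get(node, 0.0) + d
    return activation

# Activate the target "pat"; a competitor "tap" is partially active, so the
# onset slot sees competition between the /p/ and /t/ coded for onset position.
acts = spread({"morph:pat": 1.0, "morph:tap": 0.3}, EDGES)
onsets = {node: a for node, a in acts.items() if node.endswith("_onset")}
print(max(onsets, key=onsets.get))  # seg:p_onset wins; errors arise when it doesn't
```

Because only nodes coded for onset position compete for the onset slot, an error in this sketch yields tap for pat (a same-position exchange) rather than an illegal structure such as a coda consonant appearing in onset position.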

In these examples, speakers are assumed to make errors because of competition between segments that share the same syllable position. As seen in Figure 9.3, Dell (1988) proposes a word-shape header node that contains the CV specifications of the word-form. This node activates the segment nodes one after the other. This is supported by the serial effects seen in implicit priming studies (Meyer, 1990, 1991) as well as by findings on the influence of phonological similarity on semantic substitution errors (Dell & Reich, 1981). For example, the model assumes that semantic errors (errors based on shared meaning) arise in lemma nodes. The word cat shares more segments with a word such as mat (/æ/ as nucleus and /t/ as coda) than with sap (only the nucleus /æ/), so when cat is the target, feedback gives the lemma node of mat a higher activation level than the one for sap, creating the opportunity for a substitution error. In addition, feedback from morpheme nodes leads to a bias towards producing word rather than nonword errors. The model also takes into account the effect of speech rate on error probability (Dell, 1986) and the frequency distribution of anticipation, perseveration, and transposition errors (Nooteboom, 1969). It accounts for differences between these error types through an in-built bias for anticipation: activation spreads through time, so upcoming words receive activation (at a lower level than the current target). Speech rate also influences errors, because at higher speech rates nodes may not have enough time to reach a specified level of activation, leading to more errors.

While the Dell model has a lot of support for its architecture, there have been criticisms. The main evidence used for the model, speech errors, has itself been questioned as a useful source of evidence for informing speech production models (Cutler, 1981). For instance, the listener might misinterpret the units involved in an error and may have a bias towards locating errors at the beginning of words (accounting for the large number of word-onset errors). Evidence for the CV header node is also limited: segment insertions usually create clusters only when the target word already contains a cluster, and CV similarities are not found for peaks.

The model also has an issue with storage and retrieval, as segments need to be stored for each syllable position. For example, the /l/ in English needs to be stored as [l] for syllable onset, [ɫ] for coda, and [l̩] when it appears as a syllabic consonant in the peak (as in bottle). However, while this may seem redundant and inefficient, recent calculations of storage costs based on information theory by Ramoo and Olson (2021) suggest that the Dell model may actually be more storage efficient than previously thought. They suggest that one of the main inefficiencies of the model arises during syllabification across word and morpheme boundaries. During the production of connected speech or polymorphemic words, segments from one morpheme or word move to another (Chomsky & Halle, 1968; Selkirk, 1984; Levelt, 1989). For example, when we say “walk away” /wɔk.ə.weɪ/, we produce [wɔ.kə.weɪ], where the /k/ moves from the coda of one syllable to the onset of the next. Because the Dell model codes segments for syllable position, it may not be possible for such segments to move from coda to onset position during resyllabification (the sketch below illustrates the problem). These and other limitations have led researchers such as Levelt (1989) and his colleagues (Meyer, 1992; Roelofs, 2000) to propose a new model based on reaction time experiments.
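To see why position coding blocks this movement, consider a small sketch in which the lexical entry for walk is an assumed simplification of a Dell-style representation, not the model’s actual data structure:

```python
# Assumed toy representation of a Dell-style entry: each segment token is
# bound to a syllable position, so the /k/ of "walk" exists only as a coda.
WALK = [("w", "onset"), ("ɔ", "peak"), ("k", "coda")]

def onset_candidates(segments):
    """Collect segment tokens that could fill an onset slot in a syllable frame."""
    return [seg for seg, pos in segments if pos == "onset"]

# The coda-coded /k/ cannot fill the onset slot needed for [wɔ.kə.weɪ]:
print(onset_candidates(WALK))  # ['w'] — no onset /k/ token exists for "walk"
```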

The Levelt, Roelofs, and Meyer (LRM) Model

The Levelt, Roelofs, and Meyer or LRM model is one of the most popular models of speech production in psycholinguistics. It is also one of the most comprehensive, in that it takes into account all stages from conceptualization to articulation (Levelt et al., 1999). The model is based on reaction time data from naming experiments and is a top-down model in which information flows from more abstract levels to more concrete stages. Word-form Encoding by Activation and VERification (WEAVER) is the computational implementation of the LRM model, developed by Roelofs (1992, 1996, 1997a, 1997b, 1998, 1999). It is a spreading activation model inspired by Dell’s (1986) ideas about word-form encoding. It accounts for the syllable frequency effect and ambiguous syllable priming data (although the computational implementation has been more successful in capturing syllable frequency effects than priming effects).

Figure 9.4 An illustration of the Levelt, Roelofs, and Meyer model, showing the lexical, lemma, and lexeme levels within the upper, “lexicon” portion of the diagram, with the syllabary and articulatory buffer below under “post-lexical”.

As we can see in Figure 9.4, the lemma node is connected to segment nodes. These connections are specified for serial position, and the segments are not coded for syllable position. Indeed, the only syllabic information stored in this model is a set of syllable templates that indicate the stress pattern of each word (which syllables are stressed and which are not). These templates are used during speech production to syllabify the segments according to the principle of onset maximization: all segments that can legally form a syllable onset in the language are placed in the onset, and the leftover segments go into the coda. Syllabifying at the moment of production in this way accounts for resyllabification (which is a problem for the Dell model); a minimal syllabifier of this kind is sketched below. The model also has a mental syllabary, which is hypothesized to contain the articulatory programs used to plan articulation.
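As a concrete illustration, the following is a minimal onset-maximization syllabifier of the kind the LRM model assumes operates at production time. The legal-onset inventory is a small assumed subset of English, and segments are represented as simple strings:

```python
# Minimal onset-maximization syllabifier (illustrative; the onset inventory
# below is an assumed subset of English, not a complete phonotactic grammar).
LEGAL_ONSETS = {"", "w", "k", "p", "r", "pr", "t", "tr", "s", "st", "str"}
VOWELS = {"ɔ", "ə", "eɪ", "ɪ", "æ"}

def syllabify(segments):
    """Assign intervocalic consonants to the following onset when legal."""
    peaks = [i for i, s in enumerate(segments) if s in VOWELS]
    syllables, start = [], 0
    for n, peak in enumerate(peaks):
        if n + 1 < len(peaks):
            cluster = segments[peak + 1 : peaks[n + 1]]
            # Onset maximization: give the next syllable the longest legal onset.
            split = len(cluster)
            for k in range(len(cluster) + 1):
                if "".join(cluster[k:]) in LEGAL_ONSETS:
                    split = k
                    break
            end = peak + 1 + split
        else:
            end = len(segments)  # final syllable keeps the remaining segments
        syllables.append("".join(segments[start:end]))
        start = end
    return syllables

# "walk away": underlying /wɔk + əweɪ/ resyllabifies to [wɔ.kə.weɪ].
print(syllabify(["w", "ɔ", "k", "ə", "w", "eɪ"]))  # ['wɔ', 'kə', 'weɪ']
```

Because syllable structure is computed here rather than stored, the /k/ of walk lands in the onset of the following syllable with no special machinery, which is exactly the case that troubles position-coded lexicons.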

The model is interesting in that syllabification only becomes relevant at the time of production. Phonemes are defined within the lexicon with regard to their serial position in the word or lemma. This allows for resyllabification across morpheme and word boundaries without any difficulty. Roelofs and Meyer (1998) investigated whether syllable structures are stored in the mental frame. They employed an implicit priming paradigm in which participants produced one word out of a set of words in rapid succession. The word sets were either homogeneous (all words had the same onsets) or heterogeneous. They found that priming depended on the targets having the same number of syllables and the same stress pattern, but not the same syllable structure. This led them to conclude that syllable structure is not a stored component of speech production but is computed during speaking (Cholin et al., 2004). Costa and Sebastian-Gallés (1998) employed a picture-word interference paradigm to investigate this further. They asked participants to name a picture while a word was presented 150 ms later. They found that participants were faster to name a picture when it shared its syllable structure with the word. These results challenge the view that syllable structure is absent as an abstract encoding within the lexicon. The Lexicon with Syllable Structure (LEWISS) model, described next, takes up this challenge to the LRM model’s assumptions.

The Lexicon with Syllable Structure (LEWISS) Model

Proposed by Romani et al. (2011), the Lexicon with Syllable Structure (LEWISS) model explores the possibility of stored syllable structure in phonological encoding. As seen in Figure 9.5, the organisation of segments in this model is based on a syllable structure framework (similar to proposals by Selkirk, 1982; Cairns & Feinstein, 1982). However, unlike in the Dell model, the segments are not coded for syllable position. The syllable structural hierarchy is composed of syllable constituent nodes (onset, peak, and coda), with the connecting edges weighted differently according to their relative positions. The peak (the most important part of a syllable) has a very strongly weighted connection compared to onsets and codas, and within onsets and codas, core positions are more strongly weighted than satellite positions. This weighting is based on the positional variation found in speech errors: onsets and codas are more vulnerable to errors than vowels (peaks), and within onsets and codas, satellite positions are more vulnerable than core positions. For example, in a word like print, the /r/ and /n/ in onset and coda satellite positions are more likely to be the subject of errors than the /p/ and /t/ in core positions (see the toy rendering below). The main evidence for the LEWISS model comes from the speech errors of aphasic patients (Romani et al., 2011). It was observed that not only did these patients produce errors that weighted syllable positions differently, they also preserved the syllable structure of their targets even when making speech errors.

Figure 9.5 A diagram of the Lexicon with Syllable Structure model, illustrating how the organization of segments can be based on syllable structure.
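The weighting can be pictured with a toy rendering of print (/prɪnt/). The numeric weights below are invented for illustration; LEWISS itself only commits to the ordering, with peaks bound most strongly, then core positions, then satellites:

```python
# Toy LEWISS-style structure for "print" (/prɪnt/); the weights are invented,
# and only the ordering (peak > core > satellite) reflects the model's claim.
PRINT = [
    ("p", "onset core",      0.9),
    ("r", "onset satellite", 0.5),
    ("ɪ", "peak",            1.0),
    ("n", "coda satellite",  0.5),
    ("t", "coda core",       0.9),
]

# Weaker binding = more vulnerable: rank positions from most to least error-prone.
for seg, position, weight in sorted(PRINT, key=lambda entry: entry[2]):
    print(f"/{seg}/ ({position}): binding weight {weight}")
# /r/ and /n/ (the satellites) print first, matching the error pattern above.
```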

In terms of syllabification, the LEWISS model syllabifies at morpheme and word edges instead of having to syllabify the entire utterance each time it is produced. The evidence from speech errors supports the idea of syllable position constraints. While Romani et al. (2011) presented data from Italian, speech error analysis in Spanish also supports this view (Garcia-Albea et al., 1989). The Spanish evidence is also interesting in that the errors are mostly word-medial rather than word-initial, as is the case in English (Shattuck-Hufnagel, 1987, 1992). Stemberger (1990) hypothesised that structural frames for CV structure encoding may be compatible with the phonological systems proposed by Clements and Keyser (1983) as well as Goldsmith (1990). This was supported by speech errors from German and Swedish (Stemberger, 1984), although such patterns were not observed in English. Costa and Sebastian-Gallés (1998) found that primed picture-naming was facilitated by primes that shared CV structure with the targets, and Sevald, Dell, and Cole (1995) found similar effects in repeated pronunciation tasks in English. Romani et al. (2011) brought these ideas to the fore with their analysis of speech errors made by Italian aphasic and apraxic patients, who performed repetition, reading, and picture-naming tasks. Both groups of patients produced errors that targeted vulnerable syllable positions such as onset and coda satellites, consistent with previous findings (Den Ouden, 2002). A large proportion of their errors also preserved syllable structure, as noted in earlier work (Wilshire, 2002). Romani and Calabrese (1996) had likewise found that Italian patients replaced geminates with heterosyllabic rather than homosyllabic clusters: for example, /ʤi.raf.fa/ became /ʤi.rar.fa/ rather than /ʤi.ra.fra/, preserving the original syllable structure of the target. While the Dell model, with its segments coded for syllable position, can also explain such errors, it cannot account for errors in which segments move from one syllable position to another. More recent computational work by Ramoo and Olson (2021) found that resyllabification rates in English and Hindi, as well as storage costs predicted by information theory, do not rule out LEWISS on grounds of storage and computational cost (a toy version of this kind of cost comparison is sketched below).
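As a rough illustration of how an information-theoretic storage comparison can be set up (a toy calculation under assumed inventory sizes, not Ramoo and Olson’s actual analysis):

```python
import math

# Toy storage-cost comparison (assumed inventory sizes, for illustration only).
SEGMENTS = 40   # assumed phoneme inventory size
POSITIONS = 3   # onset / peak / coda

# Dell-style token: must identify both the segment and its syllable position.
bits_position_coded = math.log2(SEGMENTS * POSITIONS)

# LRM/LEWISS-style token: identifies the segment only; structure is carried
# by a separate frame that can be shared across lexical entries.
bits_plain = math.log2(SEGMENTS)

print(f"bits per position-coded segment token: {bits_position_coded:.2f}")
print(f"bits per plain segment token:          {bits_plain:.2f}")
```

Whether the extra bits per position-coded token outweigh the cost of storing separate structural frames and running syllabification online is exactly the kind of trade-off such analyses weigh.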

Language Production Models

Dell Model

  • This is the non-verbal concept of the object that is elicited when we see a picture, read the word or hear it.
  • An abstract conceptual form of a word that has been mentally selected for utterance.
  • The meaningful unit (or units) of the lemma attached to specific segments.
  • Syllable nodes are created using the syllable template.
  • Segment nodes are specified for syllable position. So, [p onset] will be a separate segment from [p coda].
  • This node indicates that the word is singular.
  • This node specifies the CV structure and order of the word.
  • A syllable template is used in the syllabification process to indicate which segments can go where.
  • The segment category nodes are specified for syllable position. So, they only activate segments that are for onset, peak or coda syllable positions. Activation will be higher for the appropriate segment.

LRM Model

  • Segment nodes are connected to the morpheme node via connections specified for serial position.
  • The morpheme is connected to a syllable template that indicates how many syllables are contained within the phonological word. It also indicates which syllables are stressed and unstressed.
  • Post-lexical syllabification uses the syllable template to syllabify the phonemes. This is also where phonological rules can be implemented. For example, in English, unvoiced stops will be aspirated in the output.
  • Syllabified representations are used to access a Mental Syllabary of articulatory motor programs.
  • The final output.

LEWISS Model

creative speech production meaning

  • The syllable structure nodes indicate the word’s syllable structure. They also specify syllable stress or tone. In addition, the connections are weighted: core and peak positions are strongly weighted compared to satellite positions.
  • Segment nodes are connected to the morpheme node. They are also connected to a syllable structure that keeps them in place.
  • Post-lexical syllabification syllabifies the phonemes at morpheme and word boundaries. This is also where phonological rules can be implemented. For example, in English, unvoiced stops will be aspirated in the output.

The online version of this chapter includes interactive versions of these models.

Media Attributions

  • Figure 9.3 The Dell Model by Dinesh Ramoo, the author, is licensed under a  CC BY 4.0 licence .
  • Figure 9.4 The LRM Model by Dinesh Ramoo, the author, is licensed under a  CC BY 4.0 licence .
  • Figure 9.5 The LEWISS Model by Dinesh Ramoo, the author, is licensed under a CC BY 4.0 licence.

Syllabification: The process of putting individual segments into syllables based on language-specific rules.

Resyllabification: The process by which segments that belong to one syllable move to another syllable during morphological changes and connected speech.

Syllable structure: The structure of the syllable in terms of onset, peak (or nucleus), and coda.

Psychology of Language Copyright © 2021 by Dinesh Ramoo is licensed under a Creative Commons Attribution 4.0 International License , except where otherwise noted.
