The Least Understood Language Families: A Survey of Underdocumented Groupings and What Their Better Understanding Would Require

Abstract

The world’s language families are unevenly served by scholarship. A handful — Indo-European above all, then Sino-Tibetan, Austronesian, Afro-Asiatic, Uralic, Dravidian, and a few others — have been the focus of sustained comparative work for a century or more, with reconstructed proto-languages, comprehensive cognate sets, well-developed subgrouping debates, and large-scale academic infrastructure. A second tier has received serious but more limited attention. Beyond these tiers lies a substantial body of language families and proposed groupings whose internal structure, external relationships, and sometimes even basic membership remain genuinely unclear despite decades of fieldwork. This paper surveys the families in this third category, considers why they have remained underdocumented, and outlines what their better understanding would require in terms of resources, research programs, and institutional architecture. The argument is not that resolution is around the corner — for most of these families, definitive answers are not currently available — but that the current pace of work is slower than the pace of language loss, and that targeted investment could produce substantial improvement on a generation-scale timeline.

I. What “Least Understood” Means

Before surveying the families themselves, the working sense of “least understood” deserves clarification. A language family can be poorly understood in several distinct senses, and the kind of work needed to improve understanding depends on which sense applies.

A family may be poorly understood at the level of individual languages, meaning that the member languages themselves lack reference grammars, dictionaries, and text corpora adequate to support comparative work. The bottleneck is documentation. The remedy is fieldwork, often urgent because the speaker populations are aging.

A family may be poorly understood at the level of internal subgrouping, meaning that individual languages are reasonably documented but the relationships among them remain contested. The bottleneck is comparative analysis. The remedy is sustained reconstruction work, often using existing materials more thoroughly than has yet been done.

A family may be poorly understood at the level of external relationships, meaning that internal structure is reasonably clear but proposals to link the family to other families remain unresolved. The bottleneck is methodology and time depth. The remedy is partly the development of better methods and partly the recognition that some questions may not be resolvable on current evidence.

A family may be poorly understood at the level of historical context, meaning that the linguistic structure is reasonably mapped but the historical processes that produced the current distribution — migrations, contact relationships, substrate effects — are unclear. The bottleneck is the integration of linguistic evidence with archaeology, genetics, and historical reconstruction.

Most of the families surveyed below are poorly understood in more than one of these senses simultaneously, which compounds the difficulty.

II. The Papuan Sphere: Multiple Families, Pervasive Uncertainty

The non-Austronesian languages of New Guinea and adjacent regions constitute the largest concentration of comparatively underdocumented and underanalyzed languages on Earth. The label “Papuan” is itself negative — these are the languages of the region that are not Austronesian — and covers something between twenty and forty proposed families plus a residue of isolates. Even the count is contested.

The Trans-New Guinea family, in its various formulations, claims to group several hundred languages across the highlands and adjacent regions, supported by a recurrent pronominal pattern and a smaller set of putative lexical cognates. Whether this constitutes a single genetic family, several related families, or a contact-induced typological grouping has been debated for half a century without resolution. The Sepik, Torricelli, Lower Sepik-Ramu, Sko, Border, Senagi, Left May, East Bird’s Head, West Papuan, and various other proposed families are individually better defined but remain incompletely worked out at the level of internal reconstruction.

The bottlenecks are several. Documentation of individual languages remains uneven, with hundreds of languages represented in the comparative literature only by short word lists collected once. The time depth of the families is great — likely tens of thousands of years for any genuine deep groupings — placing the relationships at or beyond the limits of standard comparative methodology. Contact effects are pervasive across genetic boundaries, generating false signal for relatedness and obscuring true signal. The institutional infrastructure for sustained work is thin, with a small number of academic departments worldwide producing trained Papuanists and even fewer producing speakers of the languages who are also trained linguists.

Improvement would require, at minimum, sustained fieldwork on the languages still inadequately documented, comparative reconstruction within the better-defined families to produce proper proto-language statements, and explicit testing of the deeper hypotheses against null models that account for chance resemblance and areal convergence. The realistic timeline is multi-generational. The realistic outcome is partial resolution: stronger statements about which groupings are securely established, which are plausible but not provable on current evidence, and which should be abandoned.
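What testing against a null model for chance resemblance involves can be sketched with a simple permutation test. The sketch below is illustrative only: the word lists are invented, the similarity measure (matching initial segments) is deliberately crude, and the function names are hypothetical. It addresses chance resemblance only; it does nothing to separate inherited material from borrowing or areal convergence, which would need separate modelling.

```python
import random

def initial_match_count(words_a, words_b):
    """Count meaning slots where both words begin with the same segment.

    Assumes the two lists are aligned by meaning and contain no empty forms.
    """
    return sum(1 for a, b in zip(words_a, words_b) if a[0] == b[0])

def permutation_p_value(words_a, words_b, n_shuffles=10_000, seed=0):
    """Estimate how often a random meaning alignment matches as well as the real one."""
    rng = random.Random(seed)
    observed = initial_match_count(words_a, words_b)
    shuffled = list(words_b)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        # Shuffling breaks the meaning alignment but keeps each language's sounds.
        rng.shuffle(shuffled)
        if initial_match_count(words_a, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_good + 1) / (n_shuffles + 1)

# Invented forms for eight aligned meanings; not data from any real language.
LIST_A = ["pa", "ti", "ku", "ma", "na", "sa", "la", "wa"]
LIST_B = ["pe", "to", "ki", "mu", "ni", "so", "ra", "yo"]

observed, p_value = permutation_p_value(LIST_A, LIST_B)
print(f"observed matches: {observed}, permutation p-value: {p_value:.4f}")
```

A small p-value says only that the observed matches are unlikely under random alignment of meanings. With real data, the result depends heavily on the similarity measure and on how the word lists were selected.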

III. The Amazonian Sphere: Comparable Diversity, Comparable Underdocumentation

The Amazon basin and adjacent regions of South America contain approximately three hundred languages distributed across more than fifty proposed families and isolates. The major groupings — Arawakan (or Maipurean), Tupian, Cariban, Macro-Jê, Pano-Tacanan, Tucanoan, Witotoan, Chibchan along the western edge, and several smaller families — are individually established with varying degrees of confidence. Beyond these, a substantial number of smaller families and isolates resist confident grouping.

The deeper question — whether some or all of these families share more remote common ancestors — has been pursued in proposals like Greenberg’s Amerind, which is rejected by most specialists, and in more cautious proposals linking specific pairs or small groups of families. None of the deeper proposals has achieved consensus.

The documentation situation has improved substantially in recent decades through sustained work by Brazilian, Colombian, Peruvian, and other Latin American linguists, often in partnership with North American and European researchers, and increasingly led by indigenous researchers from the communities themselves. Reference grammars have been produced for many languages that lacked them a generation ago. Yet many languages remain incompletely documented, and many of the smaller families have not received the kind of sustained comparative attention that has been given to the larger groupings.

The bottlenecks are partly logistical — fieldwork in the Amazon basin is genuinely difficult and expensive — and partly institutional. The major comparative work has been distributed across academic communities in several countries that do not always coordinate, with the result that proposals about the same languages sometimes proceed in parallel without engagement. Funding for sustained comparative work, as opposed to documentation of individual languages, has been limited.

Improvement would require continued fieldwork on the languages still inadequately documented, comprehensive comparative reconstruction within the established families, and explicit testing of proposed deeper relationships using methods adequate to the time depths involved. Indigenous-led research programs, which have grown substantially in countries like Brazil, are likely to produce the most durable progress, both because they sustain longer engagement than externally funded fieldwork can and because they integrate community knowledge in ways that outsider research rarely matches.

IV. The Australian Sphere: One Likely Family, Pervasive Difficulty

The Aboriginal languages of Australia — approximately two hundred and fifty languages at the time of European contact, of which perhaps a hundred and twenty have substantial documentation and considerably fewer remain in active intergenerational transmission — present a different kind of difficulty. The Pama-Nyungan family, covering most of the continent, is reasonably well established. The non-Pama-Nyungan languages of the north, comprising perhaps fifteen smaller families, are individually established but their relationships to each other and to Pama-Nyungan are unclear.

The deeper question of whether all Australian languages descend from a single ancestor — sometimes called Proto-Australian — has been pursued seriously and remains unresolved. The pronominal evidence is suggestive but not conclusive, and the time depth at which any common ancestor would have to be placed exceeds the conventional reach of the comparative method.

The Australian situation differs from the Papuan and Amazonian cases in one important respect: documentation is comparatively better, owing to sustained work over the twentieth century by R.M.W. Dixon, Ken Hale, Geoffrey O’Grady, Stephen Wurm, and others, and to substantial Australian government and university support. The comparative materials are substantial. What remains incomplete is the integration of these materials into reconstruction at sufficient depth.

The bottlenecks are partly methodological. The age of the human occupation of Australia — likely sixty thousand years — places linguistic reconstruction at depths that may simply not be tractable on current methods. Areal contact has been extensive over millennia. The languages exhibit typological features (small phonemic inventories, complex morphology, extensive paradigmatic suppletion) that complicate standard comparative procedures.

Improvement would require continued work on the non-Pama-Nyungan languages where speakers remain, sustained comparative work on the relationship between Pama-Nyungan and non-Pama-Nyungan, and methodological development around the question of whether and how reconstruction at extreme time depth is possible. The Australian institutional infrastructure is among the better-developed in the relevant regions, which is an advantage; the bottleneck is the inherent difficulty of the questions rather than the absence of resources.

V. The Khoisan Question: Probably Not One Family

The “Khoisan” languages of southern Africa — the click languages spoken by populations that include the Khoekhoe, the various San groups, and others — were once treated as a single family. Current scholarship treats them as at least three distinct families (Tuu, Khoe-Kwadi, and Kx’a) plus the isolates Sandawe and Hadza in East Africa. Whether any of these are genetically related to each other beyond chance remains contested.

The comparative work has been conducted by a small number of specialists over several decades. Documentation of individual languages has improved substantially through the work of Bonny Sands, Tom Güldemann, Anthony Traill, and others, but coverage remains uneven and several languages have very few remaining speakers. The shared typological feature — clicks — that originally motivated the unified Khoisan hypothesis is now understood to be areal rather than genetic, occurring across families through long contact relationships.

The bottlenecks include the small number of specialists worldwide, the urgency of the documentation situation for several of the languages with few remaining speakers, and the great time depth at which any common ancestor of the various proposed families would have to be placed. Sandawe and Hadza are particularly puzzling: both are click-using isolates spoken in Tanzania, far from the southern African families, and proposals connecting them to each other or to other families have not been substantiated.

Improvement would require sustained documentation while speakers remain, comprehensive comparative work within the established families, and explicit testing of deeper relationships. The total speaker population for several of the languages is now small enough that documentation is approaching a hard deadline. The number of trained specialists is small enough that scaling the work would require deliberate investment in training.

VI. The Nilo-Saharan Question: A Family or a Hypothesis?

Nilo-Saharan, as proposed by Joseph Greenberg in the 1960s, groups roughly two hundred languages across a band of north-central Africa from Mali to Tanzania. The grouping has been controversial since its proposal. Some specialists treat it as a valid family with internal subgrouping that needs further work; others treat it as a residual category that gathers together languages that are not Afro-Asiatic, Niger-Congo, or Khoisan, without strong evidence for positive relatedness.

The internal structure includes proposed groupings like Eastern Sudanic (which is itself debated), Central Sudanic, Saharan, Songhai, Maban, Fur, Kunama, Berta, and others. Some of these are well-established; others are themselves contested. The family as a whole has not received the kind of comprehensive comparative reconstruction that would establish it on the same footing as Indo-European or Austronesian, and whether such a reconstruction is possible remains an open question.

The bottlenecks include the political instability of much of the relevant region (parts of Sudan, Chad, the Central African Republic, and the Democratic Republic of the Congo have been difficult for sustained fieldwork), the small number of specialists, and the absence of the kind of large institutional commitment that has supported comparable work on Bantu within Niger-Congo.

Improvement would require sustained fieldwork in regions where it is currently difficult to conduct, comprehensive comparative work within the established sub-families, and a serious decision about whether the deeper Nilo-Saharan hypothesis can be tested on current methods or whether the question should be deferred until better methods or more material become available. The current situation, in which the family is sometimes treated as established and sometimes as speculative without clear adjudication, is unsatisfactory.

VII. The Dené-Yeniseian Hypothesis and Beyond

The Dené-Yeniseian hypothesis, developed primarily by Edward Vajda over the past two decades, proposes that the Yeniseian languages of Siberia (now reduced to Ket and a few moribund varieties) and the Na-Dené languages of North America (Athabaskan, Eyak, and Tlingit) descend from a common ancestor. The hypothesis has gained substantial acceptance among specialists, supported by morphological correspondences and a body of cognates that has grown over the years of work.

If accepted, Dené-Yeniseian would be one of the few demonstrated trans-Beringian linguistic relationships. Its further extensions — proposed connections to the Burushaski isolate of northern Pakistan, to Sino-Tibetan, or to other Old World families under the broader “Dené-Caucasian” hypothesis — remain speculative and have not achieved consensus.

The bottlenecks are different from those facing the Papuan or Amazonian cases. Documentation of the languages is reasonably good, particularly on the North American side. The bottleneck is comparative work at extreme time depth, the verification and extension of Vajda’s proposals, and the careful evaluation of further proposed relationships. The work is necessarily slow because each proposed cognate requires evaluation against multiple alternative explanations (chance, borrowing, areal convergence) and because the time depth involved makes the signal weak relative to the noise.

Improvement would require sustained comparative work by specialists in both Yeniseian and Na-Dené, methodological development around the evaluation of cognate proposals at extreme time depth, and continued documentation of the remaining Yeniseian varieties before they are lost. The institutional infrastructure is small but stable; the question is whether sustained attention over decades will produce further consolidation of the hypothesis and whether it can be extended responsibly to other proposed connections.

VIII. The Caucasian Families: Three Established, Relationships Disputed

The languages of the Caucasus mountain region fall into three established families: Northwest Caucasian (Abkhaz-Adyghean), Northeast Caucasian (Nakh-Daghestanian), and South Caucasian or Kartvelian (Georgian, Mingrelian, Laz, Svan). Each is well established internally. Whether any two or all three are genetically related to each other remains contested.

The North Caucasian hypothesis, linking Northwest and Northeast Caucasian, has been pursued seriously by Sergei Starostin and others, with a proposed reconstruction of Proto-North-Caucasian. Specialists are divided on whether the evidence supports the hypothesis. The further connection of Caucasian families to Hurro-Urartian (the extinct family comprising Hurrian, the language of the Mitanni kingdom, and Urartian, the language of the kingdom of Urartu) is sometimes proposed but is even more speculative. The “Dené-Caucasian” macro-family that would link these to Sino-Tibetan, Yeniseian, Burushaski, and others is treated as speculation by most specialists.

The bottlenecks are several. The Northeast Caucasian family alone contains some thirty-plus languages, many with small speaker populations and limited documentation despite recent improvements. Reconstruction at the depth required for the North Caucasian hypothesis is genuinely difficult. The political situation in much of the region has constrained fieldwork over various periods.

Improvement would require continued documentation of the less-described Northeast Caucasian languages, sustained comparative work within and across the families, and methodologically careful evaluation of the proposed deeper relationships. The Russian and Georgian linguistic traditions have produced much of the existing comparative work; sustaining and extending this work, ideally with broader international engagement, is the realistic path forward.

IX. The Indo-Pacific Question: Largely Abandoned but Not Resolved

Joseph Greenberg’s Indo-Pacific hypothesis, proposing a vast family linking Papuan languages, Andamanese languages, and Tasmanian languages, has been largely abandoned by specialists. The Papuan side has fragmented into the multiple families discussed above. The languages of the Andaman Islands fall into two small groups whose relationship to each other is itself unproven: Great Andamanese, now nearly extinct, and Ongan (Onge and Jarawa). Sentinelese is essentially undocumented and remains unclassified. The Tasmanian languages are extinct, and their documentation is so poor that meaningful classification may not be possible.

The lingering question is whether the Andamanese languages have any demonstrable external relatives. Proposals linking them to Papuan languages, to Austroasiatic, or to other regional families have not been substantiated. The languages are typologically distinctive and their isolation, both linguistic and geographic, is striking.

The bottlenecks are documentation of the surviving Andamanese languages, which has improved through recent work but remains incomplete, and the inherent difficulty of testing isolation hypotheses. An isolated language can never be definitively shown to have no relatives; the absence of demonstrable relationships is always provisional. What can be done is to document the language well enough that any future proposed relationships can be evaluated on adequate data.

X. The Macro-Family Hypotheses: Largely Speculative

Several proposed macro-families would, if accepted, dramatically reduce the number of established language families globally. Nostratic (linking Indo-European, Uralic, Altaic, Kartvelian, Dravidian, and Afro-Asiatic in various formulations), Eurasiatic (a related but distinct proposal by Greenberg), Dené-Caucasian (linking North Caucasian, Sino-Tibetan, Yeniseian, Na-Dené, and Burushaski), and Amerind (linking most American languages) have been proposed at various times. None has achieved general acceptance. Some specialists treat them as productive working hypotheses; others treat them as methodologically unsound.

The status of these proposals is best understood as a question of method. The standard comparative method produces reliable results at time depths up to roughly six to eight thousand years. The macro-families would require demonstrating relationships at fifteen thousand years or more. Whether this is possible at all on current methods is contested. Some scholars (the “long-rangers”) argue that careful work can establish relationships at these depths; others argue that the signal at such depths is below the noise floor of the available methods and that apparent positive results reflect chance and methodological artifacts rather than genuine relationships.
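The signal-to-noise argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a constant-rate retention model of the kind used in classical glottochronology, which is itself widely criticized. Both the retention rate of 0.86 per millennium and the five percent chance baseline are illustrative assumptions, not figures drawn from this survey. Under that model, two languages that diverged t millennia ago are expected to share a fraction r^(2t) of an original basic vocabulary list.

```python
# Assumed, illustrative parameters; neither is an established constant.
RETENTION_PER_MILLENNIUM = 0.86   # share of basic vocabulary one lineage keeps per 1,000 years
CHANCE_BASELINE = 0.05            # hypothetical rate of accidental look-alikes

def expected_shared_fraction(millennia, retention=RETENTION_PER_MILLENNIUM):
    """Expected share of inherited items still common to two diverged lineages.

    Each lineage loses vocabulary independently, hence the exponent 2 * millennia.
    """
    return retention ** (2 * millennia)

for depth in (6, 8, 15):
    share = expected_shared_fraction(depth)
    status = "above" if share > CHANCE_BASELINE else "below"
    print(f"{depth:>2} thousand years: {share:.3f} shared ({status} the assumed chance baseline)")
```

Under these assumptions the expected shared fraction is roughly 16 percent at six thousand years and about 1 percent at fifteen thousand, which is one way of seeing why critics argue that the signal at macro-family depths falls below the noise. Different assumed rates shift the numbers but not the exponential shape of the decay.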

The realistic assessment is that the macro-family hypotheses are likely to remain unresolved on current methods. Their better evaluation would require methodological development that is itself uncertain. Resources directed toward macro-family work compete with resources that could be directed toward better-grounded comparative work on established families, and the comparative cost-benefit favors the latter for the foreseeable future.

XI. The Smaller Underdocumented Families

Beyond the major cases discussed above, a number of smaller language families remain underdocumented or have received too little comparative analysis despite their accessibility.

The Hmong-Mien family of southern China and Southeast Asia has reasonably good documentation of major languages but limited comparative reconstruction, and its external relationships (proposals link it to Austronesian, Tai-Kadai, or Sino-Tibetan) remain unsettled. The Tai-Kadai (Kra-Dai) family is reasonably well established but its external relationships, particularly the proposed connection to Austronesian, remain contested. The Austroasiatic family is established but its internal subgrouping has been the subject of recurring debate.

The Mande languages of West Africa are an established family within Niger-Congo (or possibly outside it; the position is debated), with reasonable documentation but uneven comparative work. The Songhai languages of the West African Sahel are sometimes treated as Nilo-Saharan and sometimes as an independent family. The Kordofanian languages of Sudan are sometimes grouped with Niger-Congo and sometimes treated separately.

The Chumashan languages of California, the Yuki-Wappo grouping, the Salishan and Wakashan families of the Pacific Northwest, the Algic family (which includes Algonquian), the Iroquoian family, and several others are individually established but their relationships to each other and to other North American families remain partly open. The Penutian and Hokan groupings, proposed in the early twentieth century to organize many California and adjacent languages, are now widely regarded as unproven in their broad forms, but the residual question of which California languages are related to which others has not been fully resolved.

These cases collectively represent a category of work where progress is feasible on current methods, where speaker populations and documentation are adequate to support comparative work, and where the bottleneck is sustained scholarly attention rather than fundamental methodological limitation.

XII. The Resource Question

The pattern across these cases is that the bottlenecks are partly intrinsic and partly resource-driven. Some questions — the macro-families, the deepest Australian and Papuan groupings — may not be resolvable on current methods regardless of investment. Others — the better-defined Papuan and Amazonian families, the Caucasian relationships, the Khoisan situation — could plausibly be advanced substantially with sustained investment in fieldwork and comparative work.

Several patterns emerge across the cases. Documentation funding has improved over the past several decades through programs like the Endangered Languages Documentation Programme, the National Science Foundation’s documentation initiatives, and various national programs in countries with significant linguistic diversity, but remains structurally inadequate to the scale of the need. Comparative reconstruction work is funded much less generously than documentation, and many established families lack comprehensive proto-language reconstructions because no one has been resourced to produce them. Training of specialists is concentrated in a small number of institutions and produces a number of trained scholars per year that is small relative to the work to be done. Indigenous-led research has expanded substantially but remains under-resourced relative to its potential, and infrastructure for community-based comparative work — as opposed to community-based documentation — is particularly thin.

The total global resources devoted to work on the language families discussed in this paper are difficult to estimate but are probably under fifty million dollars per year, distributed across documentation, comparative work, archival infrastructure, and training. This is a small sum relative to what would be required for comprehensive coverage on a generation timeline.

XIII. What Investment Could Achieve

A useful exercise is to consider what realistic investment could accomplish on a twenty-year timeline.

For the Papuan situation, sustained funding for fieldwork on the remaining undocumented languages, combined with a coordinated comparative initiative across the established families, could plausibly resolve internal subgrouping for most of the better-defined families and produce a clearer assessment of which deeper groupings are securely established. Total cost estimate would be in the low hundreds of millions of dollars over the period, distributed across many institutions and projects.

For the Amazonian situation, continued investment in indigenous-led documentation programs combined with comparative work on the established families could produce comparable results. The Brazilian institutional infrastructure is well-developed; expansion to comparable investment in Peru, Colombia, Venezuela, Ecuador, and Bolivia would address current gaps.

For the Australian situation, documentation is largely complete to the extent that declining speaker populations allow. The remaining work is primarily comparative and could be supported by sustained funding for a small number of comparative projects over the period.

For the Khoisan situation, urgent documentation funding for the languages with declining speaker populations, combined with comparative work by the small number of specialists, could advance the situation substantially. The total cost is small in absolute terms but the timeline is short.

For the Caucasian situation, continued documentation of the less-described Northeast Caucasian languages and sustained comparative work could advance the internal questions. The deeper relationships may not be resolvable on current methods regardless of investment.

For Nilo-Saharan, the political conditions of much of the region constrain what is possible regardless of funding. Where fieldwork is possible, sustained investment could produce comparative reconstructions of the established sub-families.

These figures and timelines are rough, but they establish that the work to be done is finite and that resources at the scale of, for instance, a single mid-sized scientific instrument or a single major archaeological project would substantially advance the global situation. The reason this investment has not been made is not that it would be impossible or even unaffordable; it is that the institutional structures that fund science do not currently treat comparative linguistics as a priority commensurate with the questions involved.

XIV. Institutional Prerequisites

Beyond direct funding for fieldwork and comparative work, several institutional prerequisites would substantially affect what investment could achieve.

Sustained training pipelines are essential. The number of specialists currently being trained in each of the relevant areas is small. Expanding training, particularly for researchers from the regions where the languages are spoken, would multiply the long-term capacity for the work. The current pattern, in which training is concentrated in a small number of institutions in wealthy countries, produces a workforce that is too small and that often has weaker connections to the relevant communities than would be ideal.

Integrated databases are essential. The pattern in which each comparative project produces its own dataset, uses it, and leaves it in a form that subsequent projects cannot readily use is wasteful. Investment in shared infrastructure — Glottolog, the Cross-Linguistic Data Formats initiative, Lexibank, language-family-specific databases — has begun to address this but remains underfunded relative to its potential.
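The practical benefit of a shared format can be illustrated with a minimal sketch. It assumes two projects that each publish a long-format word list with one row per form. The column names are loosely modelled on CLDF-style word list tables, but the names, the language identifiers, and the forms used here are all hypothetical. Because both projects use the same columns, a later project can merge them without any project-specific conversion code.

```python
import csv
import io
from collections import defaultdict

# Two invented datasets in the same long format; none of this is real data.
PROJECT_ONE = """Language_ID,Parameter_ID,Form
lang_a,water,wata
lang_a,fire,piri
lang_b,water,wara
"""

PROJECT_TWO = """Language_ID,Parameter_ID,Form
lang_b,fire,firi
lang_c,water,oto
lang_c,fire,api
"""

def load_forms(text):
    """Parse one project's table into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def forms_by_concept(*datasets):
    """Merge any number of datasets into {concept: {language: form}}."""
    table = defaultdict(dict)
    for rows in datasets:
        for row in rows:
            table[row["Parameter_ID"]][row["Language_ID"]] = row["Form"]
    return {concept: dict(forms) for concept, forms in table.items()}

merged = forms_by_concept(load_forms(PROJECT_ONE), load_forms(PROJECT_TWO))
for concept, forms in sorted(merged.items()):
    print(concept, forms)
```

A real shared dataset would also carry source references, phonemic segmentation, and cognate judgments. The point of the sketch is only that agreement on a table layout is what makes reuse cheap.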

Regional coordination is essential. Work on the Papuan, Amazonian, African, and other regional concentrations of underdocumented languages would benefit from coordination structures that the current institutional landscape does not provide. National-level coordination exists in some cases (Australia, Brazil) but international coordination across regions where the same comparative questions cross national boundaries is largely absent.

Indigenous research infrastructure is essential. The shift toward indigenous-led research is real and welcome, but the supporting infrastructure — university programs designed for indigenous students, scholarships and grants targeted at indigenous researchers, archival infrastructure that serves community needs as well as academic needs — remains thin. Investment here would multiply the effective capacity of the field.

XV. The Time Frame

The time frame for the work discussed in this paper is constrained by the timeline of language loss. Many of the languages whose better documentation would advance the comparative questions have small or aging speaker populations. Documentation that does not happen in the next twenty to thirty years, in many cases, will not happen at all. Comparative work that depends on that documentation has the same effective deadline.

This is a different time frame from that which constrains, for instance, work on classical Indo-European or Sino-Tibetan, where the documentary situation is stable and the pace of work can be set by scholarly considerations alone. For the families surveyed in this paper, the pace is set externally by demographic and social processes that the field does not control.

The implication is that resource allocation decisions made now have outsize effects on what will be possible thirty years from now. A doubling of investment in Papuan documentation over the coming decade would produce results that no amount of investment in subsequent decades could replicate, because the speakers needed for the work will not be available later. The same is true in varying degrees for the Khoisan, Australian, and Amazonian situations.

XVI. Conclusion

The least understood language families share several features. They are typically located in regions that are difficult or expensive for sustained fieldwork. They are typically associated with small populations that have limited political voice in the institutions that fund linguistic work. They are typically of greater time depth than the well-studied families, placing comparative questions near or beyond the limits of current methods. They are typically served by small numbers of specialists rather than large institutional communities.

These features are correlated with each other and with patterns of underinvestment that reflect the broader political and economic geography of academic research. The resource allocation that would address them is not in itself extraordinary; it is comparable in scale to investments routinely made in other scientific domains. The reasons it has not been made are institutional rather than substantive.

The realistic outcome of sustained investment over a generation would not be the resolution of every contested question. Some questions, particularly at extreme time depths, may not be resolvable on current methods. What sustained investment could produce is a substantial advance on the current state of knowledge: better documentation of languages whose loss is otherwise inevitable, comprehensive comparative reconstructions for the better-defined families, clearer assessments of which deeper groupings are securely established and which should be treated as unresolved, and a global picture of human linguistic history that is materially closer to comprehensive than the current picture is.

The work is not glamorous and it is not fast. It is also not impossible, and the gap between what is currently being accomplished and what could be accomplished with deliberate investment is large enough that the underinvestment is itself a choice with consequences. Languages that are not documented in the coming decades will not be documented later. Comparative questions that are not pursued by the small number of specialists currently working on them will not be answered by their successors, because their successors will not be trained in adequate numbers without sustained institutional commitment. The choice between continued underinvestment and a serious response is being made now by default. A more deliberate choice, in the direction of serious response, is achievable and would change what humanity knows about its own linguistic history within a generation.
