A Language Differentiation System for the Iroquoian Language Family
Author: jakemns
Course project

| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-jake-gabreila |
|---|---|
| Demo URL (optional) | |
| Team name | Jake and Gabriela |
Project description
For this project, we created a tool to distinguish between languages in the Iroquoian family. We chose this project to better understand techniques for language differentiation in general. We chose the Iroquoian family because it is a low-resource language family with little published work. One of the largest challenges was finding sufficient text examples for each language and preparing that text for use in our tool.
Summary of individual contributions
| Team member | Role/contributions |
|---|---|
| Gabriela Quinones | Created the database and added the text files to it. Created the user interface for text input. |
| Jake Mains | Searched for texts for the database. Wrote the code that differentiates the languages. |
Results
Using the text files that we found and processed, we built a working identification system. The system returned the correct language prediction when given a short string of text, about the length of one sentence. We included six Iroquoian languages, with English as a baseline. Predictions are based on the script the text is written in and on common words found in each language. Along with the prediction, the system returns a short description of the identified language.
Error analysis
This system works well on the data it was trained on, but it may have limitations on held-out data. It relies on the target language being written in a particular script. Cherokee, for example, is trained on a specific script and is unlikely to be identified if the text is written in another one. In addition, the texts for some languages use a more generic script, so we used common words to identify them, for instance "tsha'" in Onondaga. If the entered text does not contain one of these words, the system may fail to identify the language.
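The two signals described above, script and common words, can be sketched roughly as follows. This is a minimal illustration and not the code from the team repository: the function names, the `KEYWORDS` table, and the 0.5 threshold are assumptions made for the example. Only the Onondaga word "tsha'" comes from this write-up, and the Cherokee ranges are the Unicode Cherokee and Cherokee Supplement blocks.

```python
# Unicode ranges for the Cherokee syllabary (main block and supplement).
CHEROKEE_RANGES = ((0x13A0, 0x13FF), (0xAB70, 0xABBF))

# Hypothetical keyword table. Only "tsha'" for Onondaga is taken from the
# write-up; a real table would list common words for every language.
KEYWORDS = {"Onondaga": {"tsha'"}}


def cherokee_ratio(text):
    """Return the fraction of letters in `text` that are Cherokee syllabary."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters
        if any(lo <= ord(ch) <= hi for lo, hi in CHEROKEE_RANGES)
    )
    return hits / len(letters)


def tokenize(text):
    """Lowercase, normalize curly apostrophes, and strip edge punctuation."""
    text = text.lower().replace("\u2019", "'")
    return [tok.strip('.,;:!?"()[]') for tok in text.split()]


def identify(text, script_threshold=0.5):
    """Guess a language from script first, then from known common words."""
    if cherokee_ratio(text) >= script_threshold:
        return "Cherokee"
    tokens = set(tokenize(text))
    for language, words in KEYWORDS.items():
        if tokens & words:
            return language
    return None  # no script or keyword evidence


print(identify("ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ"))
print(identify("Tsalagi Gawonihisdi"))
```

The second call passes a Latin-script transliteration of the same Cherokee phrase. It has no Cherokee syllabary characters and no keyword match, so the sketch returns `None`, which is the failure mode described in this section.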
Reproducibility
Results are reproducible using the files available in the team repository.
Future improvements
Our main limitation for this project was a lack of text from each language in this family. Future improvements could include sourcing more texts for more accurate language prediction. They could also include a deeper language analysis that allows for proper identification regardless of the script, or when the given text is missing a key word. Such an analysis could use other metrics, like average word length or consonant patterns unique to the language. Lastly, building a user interface for more practical use would be a major improvement.
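As a rough idea of what the extra metrics might look like, the sketch below computes average word length and counts consonant clusters for Latin-script text. It is not part of the current system. The function names are hypothetical, and the sketch assumes that the vowels are a, e, i, o, u and that every other ASCII letter counts as a consonant, which will not hold for every orthography in the family.

```python
import re
from collections import Counter


def average_word_length(text):
    """Mean length of whitespace-separated words (0.0 for empty input)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def consonant_clusters(text, min_len=2):
    """Count runs of at least `min_len` consecutive ASCII consonants.

    Assumption: a, e, i, o, u are the only vowels; all other ASCII letters
    (including y and w) are treated as consonants.
    """
    pattern = r"[b-df-hj-np-tv-z]{%d,}" % min_len
    return Counter(re.findall(pattern, text.lower()))


sample = "tsha' street"
print(average_word_length(sample))
print(consonant_clusters(sample))
```

Profiles like these could be computed per language from the collected texts and compared against the profile of an input string, so that a prediction is still possible when no key word is present.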