[gquinones] course project
Author: gquinones
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-jake-gabriela.git |
|---|---|
| Demo URL (optional) | |
| Team name | Jake and Gabriela |
Project description
Project Approach - Jake Mains and Gabriela Quiñones
- Outline of Goals
- Identify an indigenous language family of the Americas with sufficient documentation for the project
- Train a system/model that can identify the language of a given text. This will require a sample text of each language: ideally multiple pages of text, but at least a few hundred words.
- After training, the model will identify the language of a given text. The text will need to be a minimum length, potentially 10 words/2-3 sentences.
- Aim for a minimum of 3 languages to differentiate. Ideally, include 5+ languages to expand the model's usability.
- If possible, outline an approach for training more languages in the future
- Training approach
- Train model by identifying key statistics of sample text for each language.
- Average word length
- Unique characters
- Unique vocabulary
- General word order (would first require POS tagging)
- Create an algorithm that combines these factors and outputs a guess with a percentage
- i.e. “Language A, 85% likely”, “Language C, 34% likely”
- Include a message if the percentage is too low, under 50%?
- i.e. “Unable to make a reliable estimate. Try again with a longer text.”
- Potential language families:
- Iroquoian (NE U.S.)
- Cherokee
- Timeline
- Week 0 (11/16: Approach Due)
- Identify language family
- Identify potential computing approaches
- Submit project approach
- Week 0.5 (11/19)
- Gather sample texts, clean them, add to database for testing
- Create initial code for testing and training
- Identify more feasible approaches based on initial coding
- Week 1 (11/23)
- Refine testing and identify major issues to be addressed
- Seek feedback for additional testing approaches if necessary
- Build up database with additional texts if necessary
- Code rough output messages based on results
- Week 1.5 (11/26)
- Work out bugs in model to improve accuracy
- Establish minimum text length that would be feasible for identification
- Week 2 (11/30)
- Continue coding
- Identify additional languages to train if model is working well at this point
- Less work completed in this timeframe due to Thanksgiving Break
- Week 2.5 (12/3)
- Begin training of additional languages, if possible
- Work on write up of project summary, individual contributions, etc.
- Begin analyzing results, limitations, future improvements, etc.
- Week 3 (12/7) PROJECT DUE!
- Make final touches on model
- Organize code and resources, create a README.md file, and add in-line comments for easy understanding on the grader's side
- Finish write-up and discussion
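The statistics-based scoring described in the training approach above could be sketched roughly as follows. All function names, weights, and sample strings here are illustrative assumptions, not the project's actual code:

```python
# Sketch: per-language statistics (average word length, unique characters,
# unique vocabulary) combined into a rough percentage guess.
# Weights and thresholds are arbitrary placeholders.

def profile(text):
    """Compute the key statistics listed above for one text."""
    words = text.split()
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "chars": set(text) - {" "},
        "vocab": {w.lower() for w in words},
    }

def score(sample, prof):
    """Combine character overlap, vocabulary overlap, and word-length
    similarity into a 0-100 confidence value."""
    s = profile(sample)
    char_overlap = len(s["chars"] & prof["chars"]) / max(len(s["chars"]), 1)
    vocab_overlap = len(s["vocab"] & prof["vocab"]) / max(len(s["vocab"]), 1)
    len_sim = 1 - min(abs(s["avg_word_len"] - prof["avg_word_len"])
                      / prof["avg_word_len"], 1)
    return round(100 * (0.4 * char_overlap + 0.4 * vocab_overlap + 0.2 * len_sim))

# Hypothetical training sample for "Language A" (made-up words).
profiles = {"Language A": profile("onekwe tsha' kanien ohwentsia kanien")}
pct = score("tsha' kanien onekwe", profiles["Language A"])
if pct >= 50:
    print(f"Language A, {pct}% likely")
else:
    print("Unable to make a reliable estimate. Try again with a longer text.")
```

A real version would compute a score against every trained language and report the ranked guesses with their percentages, as in the output format sketched above.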
Summary of individual contributions
| Team member | Role/contributions |
|---|---|
| Jake Mains | Created the code to identify the language of a sample text, searched for sample texts of the languages, and wrote short biographies of the selected languages. |
| Gabriela Quiñones | Created a database to store the languages, sample texts, and information about the languages; created test files to verify that everything was progressing correctly; helped search for sample texts for the languages. |
Results
Using the text files that we found and processed, we were able to create a working identification system. This system returned the correct language prediction when given a short string of text, roughly the length of one sentence. We included 6 Iroquoian languages as well as English as a baseline. Predictions were based on the language's script as well as common words found in the language. Along with the language prediction, a short biography of the identified language is also returned.
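The two-stage prediction described above (script first, then common words) could look something like the sketch below. The marker words and the two-stage structure are illustrative assumptions; only the Cherokee syllabary Unicode range (U+13A0–U+13FF) is a fixed fact:

```python
# Illustrative two-stage language guess: script check, then common-word markers.

def predict(text):
    # Stage 1: script check. The Cherokee syllabary occupies U+13A0-U+13FF.
    if any('\u13a0' <= ch <= '\u13ff' for ch in text):
        return "Cherokee"
    # Stage 2: common-word markers for languages written in a generic script.
    # These marker words are examples, not the project's actual tables
    # (except "tsha'", mentioned in the error analysis for Onondaga).
    markers = {"tsha'": "Onondaga", "the": "English"}
    for word in text.lower().split():
        if word in markers:
            return markers[word]
    return None  # no reliable match

print(predict("ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ"))  # Cherokee syllabary text
print(predict("tsha' niyoht"))     # contains the Onondaga marker word
```

The error analysis below follows directly from this design: a text in an unexpected script, or one missing every marker word, falls through to the no-match case.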
Error analysis
This system works well with the data it is trained on, but may have limitations on held-out data. It relies on the target language being written in a particular script for proper identification. Cherokee in particular is trained on a specific script and will likely not be identified if it appears in another script. In addition, some languages were trained on a more generic script, so we used common words in the language to identify them, for instance "tsha'" in Onondaga. If the entered text does not contain one of these words, the language may not be identified.
Reproducibility
Results are reproducible through the files accessible in the team repository, linked at the top of this page.
Future improvements
Our main limitation for this project was a lack of text for each language in this family. Future improvements could involve sourcing more texts for more accurate language prediction. This could also include a deeper language analysis that allows for proper identification regardless of the script, or when the given text is missing a key word. Such an analysis could use other metrics, like average word length or consonant patterns unique to a language. Lastly, building a user interface for more practical use would be a major improvement.
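One of the script-independent metrics suggested above, consonant patterns, could be prototyped as a cluster-frequency comparison. This is a hedged sketch under our own assumptions, not a tested improvement; the sample strings are made-up:

```python
# Sketch: compare consonant-cluster frequency profiles instead of exact
# marker words, as a script-independent similarity cue.
import re
from collections import Counter

def consonant_clusters(text):
    """Count consonant sequences of length 2+ (letters that are not
    a/e/i/o/u), e.g. "ts" or "ht", as a rough phonotactic fingerprint."""
    return Counter(re.findall(r"[^aeiou\s\W\d]{2,}", text.lower()))

def similarity(a, b):
    """Cosine-style overlap between two cluster frequency profiles (0 to 1)."""
    shared = set(a) & set(b)
    num = sum(a[c] * b[c] for c in shared)
    den = (sum(v * v for v in a.values()) ** 0.5) * \
          (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

# Made-up reference text vs. a made-up held-out snippet.
ref = consonant_clusters("kanien'keha tsi niyohtonhak")
sim = similarity(ref, consonant_clusters("tsi kanien niyoht"))
```

A classifier could then fall back on this similarity score whenever neither the script check nor the common-word lookup produces a confident match.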