[gquinones] course project
Author: gquinones
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-jake-gabriela.git |
|---|---|
| Demo URL (optional) | |
| Team name | Jake and Gabriela |
Project description
Project Approach - Jake Mains and Gabriela Quiñones
- Outline of Goals
- Identify an indigenous language family of the Americas with sufficient documentation for the project
- Train a system/model that can identify the language of a given text. This will require a sample text of each language: ideally multiple pages of text, but at least a few hundred words.
- After training, the model will identify the language of a given text. The text will need to be a minimum length, potentially 10 words/2-3 sentences.
- Aim for a minimum of 3 languages to differentiate. Ideally, include 5+ languages to expand the model's usability.
- If possible, outline an approach for training more languages in the future
- Training approach
- Train model by identifying key statistics of sample text for each language.
- Average word length
- Unique characters
- Unique vocabulary
- General word order (would first require POS tagging)
- Create an algorithm that combines these factors and outputs a guess with a percentage
- i.e. “Language A, 85% likely”, “Language C, 34% likely”
- Include a message if the percentage is too low, under 50%?
- i.e. “Unable to make a reliable estimate. Try again with a longer text.”
- Potential language families:
- Iroquoian (NE U.S.)
- Cherokee
- Timeline
- Week 0 (11/16: Approach Due)
- Identify language family
- Identify potential computing approaches
- Submit project approach
- Week 0.5 (11/19)
- Gather sample texts, clean them, add to database for testing
- Create initial code for testing and training
- Identify more feasible approaches based on initial coding
- Week 1 (11/23)
- Refine testing and identify major issues to be addressed
- Seek feedback for additional testing approaches if necessary
- Build up database with additional texts if necessary
- Code rough output messages based on results
- Week 1.5 (11/26)
- Work out bugs in model to improve accuracy
- Establish minimum text length that would be feasible for identification
- Week 2 (11/30)
- Continue coding
- Identify additional languages to train if model is working well at this point
- Less work completed in this timeframe due to Thanksgiving Break
- Week 2.5 (12/3)
- Begin training of additional languages, if possible
- Work on write up of project summary, individual contributions, etc.
- Begin analyzing results, limitations, future improvements, etc.
- Week 3 (12/7) PROJECT DUE!
- Make final touches on model
- Organize code and resources, create a README.md file, and add in-line comments for easy understanding on the grader's side
- Finish write-up and discussion
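The statistics-based scoring described in the training approach above could be sketched roughly as follows. All function names, weights, and sample strings here are illustrative assumptions, not the project's actual code:

```python
# Sketch: per-language statistics (average word length, unique characters,
# unique vocabulary) combined into a rough percentage guess.
# Weights and thresholds are arbitrary placeholders.

def profile(text):
    """Compute the key statistics listed above for one text."""
    words = text.split()
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "chars": set(text) - {" "},
        "vocab": {w.lower() for w in words},
    }

def score(sample, prof):
    """Combine character overlap, vocabulary overlap, and word-length
    similarity into a 0-100 confidence value."""
    s = profile(sample)
    char_overlap = len(s["chars"] & prof["chars"]) / max(len(s["chars"]), 1)
    vocab_overlap = len(s["vocab"] & prof["vocab"]) / max(len(s["vocab"]), 1)
    len_sim = 1 - min(abs(s["avg_word_len"] - prof["avg_word_len"])
                      / prof["avg_word_len"], 1)
    return round(100 * (0.4 * char_overlap + 0.4 * vocab_overlap + 0.2 * len_sim))

# Hypothetical training sample for "Language A" (made-up words).
profiles = {"Language A": profile("onekwe tsha' kanien ohwentsia kanien")}
pct = score("tsha' kanien onekwe", profiles["Language A"])
if pct >= 50:
    print(f"Language A, {pct}% likely")
else:
    print("Unable to make a reliable estimate. Try again with a longer text.")
```

A real version would compute a score against every trained language and report the ranked guesses with their percentages, as in the output format sketched above.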
Summary of individual contributions
| Team member | Role/contributions |
|---|---|
| Jake Mains | Created the code to identify the language of a sample text, searched for sample texts of the languages, and wrote short biographies of the selected languages. |
| Gabriela Quiñones | Created a database to store the languages, sample texts, and information about the languages; created test files to verify that everything was progressing correctly; helped search for sample texts for the languages. |
Results
Using the text files that we found and processed, we were able to create a working identification system. This system returned the correct language prediction when given a short string of text, roughly the length of one sentence. We included 6 Iroquoian languages as well as English as a baseline. Predictions were based on the language's script as well as common words found in the language. Along with the language prediction, a short biography of the identified language is also returned.
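The two-stage prediction described above (script first, then common words) could look something like the sketch below. The marker words and the two-stage structure are illustrative assumptions; only the Cherokee syllabary Unicode range (U+13A0–U+13FF) is a fixed fact:

```python
# Illustrative two-stage language guess: script check, then common-word markers.

def predict(text):
    # Stage 1: script check. The Cherokee syllabary occupies U+13A0-U+13FF.
    if any('\u13a0' <= ch <= '\u13ff' for ch in text):
        return "Cherokee"
    # Stage 2: common-word markers for languages written in a generic script.
    # These marker words are examples, not the project's actual tables
    # (except "tsha'", mentioned in the error analysis for Onondaga).
    markers = {"tsha'": "Onondaga", "the": "English"}
    for word in text.lower().split():
        if word in markers:
            return markers[word]
    return None  # no reliable match

print(predict("ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ"))  # Cherokee syllabary text
print(predict("tsha' niyoht"))     # contains the Onondaga marker word
```

The error analysis below follows directly from this design: a text in an unexpected script, or one missing every marker word, falls through to the no-match case.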
Error analysis
This system works well with the data it is trained on, but may have limitations on held-out data. It relies on the target language being written in a particular script for proper identification. Cherokee in particular is trained on a specific script and will likely not be identified if it appears in another script. In addition, some languages were trained on a more generic script, so we used common words in the language to identify them, for instance "tsha'" in Onondaga. If the entered text does not contain one of these words, the language may not be identified.
Reproducibility
Results are reproducible through the files accessible in the team repository, linked at the top of this page.
Future improvements
Our main limitation for this project was a lack of text for each language in this family. Future improvements could involve sourcing more texts for more accurate language prediction. This could also include a deeper language analysis that allows for proper identification regardless of the script, or when the given text is missing a key word. Such an analysis could use other metrics, like average word length or consonant patterns unique to a language. Lastly, building a user interface for more practical use would be a major improvement.
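One of the script-independent metrics suggested above, consonant patterns, could be prototyped as a cluster-frequency comparison. This is a hedged sketch under our own assumptions, not a tested improvement; the sample strings are made-up:

```python
# Sketch: compare consonant-cluster frequency profiles instead of exact
# marker words, as a script-independent similarity cue.
import re
from collections import Counter

def consonant_clusters(text):
    """Count consonant sequences of length 2+ (letters that are not
    a/e/i/o/u), e.g. "ts" or "ht", as a rough phonotactic fingerprint."""
    return Counter(re.findall(r"[^aeiou\s\W\d]{2,}", text.lower()))

def similarity(a, b):
    """Cosine-style overlap between two cluster frequency profiles (0 to 1)."""
    shared = set(a) & set(b)
    num = sum(a[c] * b[c] for c in shared)
    den = (sum(v * v for v in a.values()) ** 0.5) * \
          (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

# Made-up reference text vs. a made-up held-out snippet.
ref = consonant_clusters("kanien'keha tsi niyohtonhak")
sim = similarity(ref, consonant_clusters("tsi kanien niyoht"))
```

A classifier could then fall back on this similarity score whenever neither the script check nor the common-word lookup produces a confident match.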