The end goal of this project is to help people learn (Spanish) words from TV series subtitles using Quizlet (or another flash card project):
1- Use GitHub
2- Documentation (how to use)
3- Use open-source/freeware/shared scripts or programs when possible
4- This project will have many phases:
phase 1- Python script (create a word list and translate it) [+ create a dictionary per language / per season = sum of all word lists, for future applications / statistics]
phase 2- website: subtitle search, check whether it already exists, and create a Quizlet
phase 3- apply the script to many subtitle languages / TV series with many episodes
phase 4- graphs on the website (statistics of word usage per single subtitle file and across multiple subtitle files)
phase 5- integrate learning progress from Quizlet (or other) to create new lists of words per subtitle file
Today: Python script (Phase 1)
1- .srt subtitle file >>> remove the time stamps and create 2 files: a list of subtitle text + a list of "description information", for example
...
3- More advanced: count as the same word: masculine/feminine forms, also "la + noun" / "el + noun" = 1 word, various verb tenses = 1 word
3- create a list + translation
1- translate the words, ordered by their count per subtitle
2- more advanced:
+ add the letter "V" (verb) and add the infinitive form of the verb
+ add the letter "N" (noun), ... >>> for nouns, add "el" or "la"
+ maybe others . . . ????
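The basic Phase 1 steps above (strip the time stamps, separate the "description information", count the words, order by count) could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the function names `split_srt`, `count_words`, and `ranked_rows` are hypothetical, "description information" is assumed here to mean bracketed or parenthesised cues such as "[MÚSICA]", and the translation step is left as a plain lookup table because no translation service has been chosen yet. Grouping of gender, articles, and verb tenses is not attempted.

```python
import re
from collections import Counter

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# ASSUMPTION: "description information" = text in [brackets] or (parentheses).
DESCRIPTION = re.compile(r"\[[^\]]*\]|\([^)]*\)")
# A word is a run of Unicode letters (keeps accented characters like "é").
WORD = re.compile(r"[^\W\d_]+")


def split_srt(text):
    """Return (dialogue_lines, description_cues) from the text of an .srt file."""
    dialogue, descriptions = [], []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, cue counters and timing lines.
        # Limitation: a spoken line made only of digits is also skipped.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        descriptions.extend(DESCRIPTION.findall(line))
        spoken = DESCRIPTION.sub(" ", line).strip()
        if spoken:
            dialogue.append(spoken)
    return dialogue, descriptions


def count_words(lines):
    """Count lower-cased words across all dialogue lines."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in WORD.findall(line))
    return counts


def ranked_rows(counts, translations=None):
    """Return (word, count, translation) rows, most frequent first.

    `translations` is a hypothetical word -> translation mapping; words
    without an entry get an empty translation.
    """
    translations = translations or {}
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, n, translations.get(word, "")) for word, n in ordered]
```

The rows returned by `ranked_rows` can be written to a .csv file with the standard `csv` module. When reading a real subtitle file, opening it with `encoding="utf-8-sig"` also strips a leading byte-order mark, which would otherwise keep the first cue counter from being recognised.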
2- Documentation (how to use)
Usage will look something like this:
wordlistandwordtranslate.py <TV_serie_S0xE0x.srt> -h --help
Options:
-o <serieS01E01.csv>
-d <description_file.csv>
-e translate expressions as one word: "Qué tal" and not "Qué" "tal"
--dbo <other output format, database format, to be defined in the future>
--aato <add article to nouns in the original language: + el, la>
--aivto <add infinitive verb form in the original language, format "<SINGLE SPACE> INFINITIVE_VERB_FORM">
--gsv <group same verb>
--dwi <display word_info: N for nouns and V for verbs>
--oon output only the nouns
--oov output only the verbs
--ootr output only the rest
--ol <original_language: esp >
--tl <target_language: eng>
--st <second language: romanization or transliteration: Pinyin, ...>
--ra <remove articles: el, la, una, ...>
--bstat Basic statistics: only a summary of word counts
--fstat Full statistics report: everything from the options below
--swstat Single-word statistics
--swnastat Single-word statistics without articles ???? is it needed ????
--swonstat Single-word statistics, nouns only
--swovstat Single-word statistics, verbs only
--swotrstat Single-word statistics, the rest only
--swiestat Single-word statistics including expressions (like "Qué tal")
--mfl <Multi-file option: list of files>
--mfd <Multi-file directory>
--mfso Multi-file, single output file .csv
--mfmo Multi-file, multiple output files .csv
--mstat <Multi-file stats>
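As an illustration of how a few of the options listed above might be wired up, here is a minimal sketch using Python's standard `argparse` module. It covers only a subset of the flags. The `dest` names, the help texts, the defaults taken from the `esp` / `eng` examples, and the choice to make `--oon`, `--oov`, and `--ootr` mutually exclusive are all assumptions, not decisions the project has made. `-h` / `--help` is provided by `argparse` itself.

```python
import argparse


def build_parser():
    """Build a command-line parser for a subset of the planned options."""
    parser = argparse.ArgumentParser(
        prog="wordlistandwordtranslate.py",
        description="Create a translated word list from an .srt subtitle file.",
    )
    parser.add_argument("srt_file", help="input subtitle file (.srt)")
    parser.add_argument("-o", dest="output", metavar="CSV",
                        help="word list output file")
    parser.add_argument("-d", dest="description", metavar="CSV",
                        help="description information output file")
    parser.add_argument("-e", dest="expressions", action="store_true",
                        help="translate expressions as one word")
    # ASSUMPTION: defaults follow the examples given in the option list.
    parser.add_argument("--ol", default="esp", help="original language")
    parser.add_argument("--tl", default="eng", help="target language")
    # ASSUMPTION: the three "output only" filters cannot be combined.
    only = parser.add_mutually_exclusive_group()
    only.add_argument("--oon", action="store_true", help="output only the nouns")
    only.add_argument("--oov", action="store_true", help="output only the verbs")
    only.add_argument("--ootr", action="store_true", help="output only the rest")
    parser.add_argument("--bstat", action="store_true",
                        help="basic statistics: summary of word counts")
    return parser
```

Calling `build_parser().parse_args()` in the script's entry point reads the real command line; passing an explicit list of strings instead makes the parser easy to test.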
Future:
1- create a website to search / download subtitle files (srt)
...
3- compare lists of files >>> define the delta/diff >> create only the new words
...
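The delta/diff idea from the list above (keep only the words that did not appear in earlier files) could look something like the following. This is a rough sketch: `new_words` is a hypothetical helper name, and comparing words case-insensitively is an assumption.

```python
def new_words(known, candidate):
    """Return the words of `candidate` that are not in `known`.

    Comparison is case-insensitive (an assumption); the order of
    `candidate` is kept and repeated words are returned only once.
    """
    known_keys = {word.lower() for word in known}
    seen = set()
    delta = []
    for word in candidate:
        key = word.lower()
        if key not in known_keys and key not in seen:
            seen.add(key)
            delta.append(word)
    return delta
```

Here `known` would be the combined word list of the files already processed, and `candidate` the word list of the new subtitle file.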