- Language Researcher, Expert Center, AWATERA, a language service provider (June 2023 – present)
The Expert Center is responsible for research, new technologies, and other non-routine tasks. My tasks include the following:
- Large language models (LLMs), such as OpenAI GPT and similar models:
- Using LLMs for text translation and editing, translation quality assessment, and glossary compilation
- Prompt engineering
- Fine-tuning
- LLM study and comparative analysis (LangChain, Llama, SeamlessM4T, Mistral, etc.)
- Fine-tuning of transformer models (BERT, RoBERTa)
- Machine translation quality assessment. Comparison of various MT quality metrics (hLEPOR, COMET)
- Natural language processing
- Tokenization. Tokenizer comparison and selection for different tasks (NLTK, spaCy, tiktoken, BPE, WordPiece)
- Named entity recognition (NER)
- Morphological segmentation
- Statistics
- Experiment design
- Data preparation for blind testing
- Sample size estimation with numerical modeling (bootstrap)
- Non-parametric statistical significance testing (bootstrap) of mean values, correlation coefficients, machine translation quality metrics, etc.
- Speech recognition
- Processing data in industry-specific formats (TMX, XLIFF)
- Pet projects
- Previous experience: translator and interpreter
Skills
- Python: Pandas, NumPy, Matplotlib, SciPy, StatsModels, Jupyter Notebook, JupyterLab, etc.
- Machine learning:
- Large language models:
- Prompt engineering
- Fine-tuning
- Deep learning (PyTorch):
- Recurrent neural networks (GRU, LSTM)
- Transformers
- Beam search
- Classical ML (libraries: sklearn, xgboost):
- Regression: linear, nonlinear, regularization (lasso, ridge)
- Classification: logistic regression, KNN, SVM
- Clustering
- Tree-based methods: random forest, boosting
- Cross-validation, bootstrap
- Hyperparameter optimization (Optuna)
- Feature transformation (PCA, SVD)
- Ensembles, pipelines
- Natural language processing (NLTK, spaCy):
- Machine translation
- Machine translation quality assessment
- Tokenization
- Named entity recognition (NER)
- Morphology
- Statistics:
- Hypothesis testing
- ANOVA
- Multiple testing
- Git
- SQL
Background
Higher education: Faculty of Physics, Moscow State University. Degree: diploma of higher education (equivalent to a Master's degree in Physics).
Data Science reskilling course, Tomsk State University.
Achievements:
- I was the first of ~160 students to complete the training
- Faster learners presented their projects to other students to help them progress through the course. I presented four projects out of 16
- As a top student, I also checked other students' graduation projects (normally done by staff)
Other training:
April–May 2023. Data Engineer course at Sapiens Academy (ELT/ETL, DWH, Greenplum, Airflow, ClickHouse, Superset).
May 2023. Recommender Systems in Practice bootcamp. Higher School of Economics/Magnit.
December 2022. Uplift Modeling course at Open Data Science.
Stepic courses. All courses were completed with distinction, ranking among the top 1 to 6% of students. Certificates.
- Python – Basics and Application
- Programming in Python
- Data Analysis in R
- Basic Programming in R
- Basics of Statistics, parts 1, 2 and 3
- Intro to Data Science and Machine Learning
- Interactive SQL Simulator
- Intro to Linux