
Data Scientist. Denis V. Kazakov

E-mail: dvkazakov @ gmail.com (remove spaces on both sides of @)
Phone/WhatsApp: +7-916-909-7864
Telegram: @denis_v_kazakov
Skype: denis.v.kazakov
GitHub

Work Experience

Language Researcher, Expert Center, AWATERA, a language service provider (June 2023 – present)

Large language models (LLM), such as OpenAI GPT and its analogs:

Using LLMs for text translation and editing, translation quality assessment, and glossary compilation
Prompt engineering
Fine-tuning
LLM study and comparative analysis (LangChain, Llama, SeamlessM4T, Mistral, etc.)

Fine-tuning of transformer models (BERT, RoBERTa).

Machine translation quality assessment. Comparison of various MT quality metrics (hLEPOR, COMET)

Natural language processing

Tokenization. Tokenizer comparison and selection for different tasks (NLTK, spaCy, tiktoken, BPE, WordPiece)
Named entity recognition (NER)
Morphological segmentation

Statistics

Experiment design
Data preparation for blind testing
Sample size estimation with numerical modeling (bootstrap)
Non-parametric statistical significance testing (bootstrap) of mean values, correlation coefficients, machine translation quality metrics, etc.

Speech recognition

Processing data in industry-specific formats (TMX, XLIFF).

Pet projects

Previous experience: translator and interpreter

Skills

Python: pandas, NumPy, Matplotlib, SciPy, statsmodels, Jupyter Notebook, JupyterLab, etc.

Machine learning:

Large language models:

Prompt engineering
Fine-tuning

Deep learning (PyTorch):

Recurrent neural networks (GRU, LSTM)
Transformers
Beam search

Classical ML (libraries: sklearn, xgboost):

Regression: linear, nonlinear, regularization (lasso, ridge)
Classification: logistic regression, KNN, SVM
Clustering
Tree-based methods: random forest, boosting

Cross-validation, bootstrap.
Hyperparameter optimization (Optuna)
Feature transformation (PCA, SVD)
Ensembles, pipelines

Natural language processing (NLTK, spaCy):

Machine translation
Machine translation quality assessment
Tokenization
Named entity recognition (NER)
Morphology

Statistics:

Hypothesis testing
ANOVA
Multiple testing

Git
SQL

Background

Higher education: Faculty of Physics, Moscow State University. Degree: diploma of higher education (equivalent to a Master's degree in Physics).

Data Science reskilling course. Tomsk State University.

Achievements:

I was the first of ~160 students to complete the training

Students who progressed faster presented their projects to other students to help them advance through the course. I presented four projects out of 16

As a top student, I also checked other students' graduation projects (normally done by staff)

Other training:

April–May 2023. Data Engineer course at Sapiens Academy (ELT/ETL, DWH, Greenplum, Airflow, ClickHouse, Superset).

May 2023. Recommender Systems in Practice bootcamp. Higher School of Economics/Magnit.

December 2022. Uplift Modeling course at Open Data Science.

Stepic courses. All courses were completed with distinction, ranking in the top 1 to 6% of students. Certificates.

Python – Basics and Application
Programming in Python
Data Analysis in R
Basic Programming in R
Basics of Statistics, parts 1, 2 and 3
Intro to Data Science and Machine Learning
Interactive SQL Simulator
Intro to Linux