Data Scientist

Interests: Data mining, machine learning, NLP

Technical Skills

Data processing

Python data stack (numpy, scipy, matplotlib, pandas, ...)
NLTK, scikit-learn
TensorFlow

Languages

Python / C++ (proficient)
Java, Shell Script, Erlang, Ruby, … (strong knowledge)

Databases

MySQL, MongoDB, PostgreSQL, ...
Cassandra / Bigtable
RabbitMQ, Redis
Elasticsearch

Professional Experience

Yubo

2017-2021

Yubo is a social app to meet new friends around the world and talk to them via a live video stream.

Projects:

Define, develop, and deploy a system to automatically flag profile pictures and video streams that do not respect the platform rules (TensorFlow).
Improve live streaming quality by exploring various approaches (WebRTC, network stack)
Prototype, develop and deploy a new version of the swipes (10k req/s on an Elasticsearch cluster)
Crunch all data (billions of rows) to improve usage of the application
Model revenue given the usage to help fundraising
Plan, execute and write a report about CIR projects

Management:

Define two new positions, prepare technical tests, and participate in two recruitments not related to my position
Create, recruit, and manage a team of 4 data scientists

Zento

2016-2017

Zento was a company dedicated to giving brand insights at the Point of Sweat. Running brands sell shoes without knowing much about their customers, and even less about the people who wear them during sport events. Our goal was to take pictures at the finish line, analyse them, and distribute reports such as market share analyses, performance reports, and trends.

Project:

Scan pictures from mass sport events to detect & count shoes (TensorFlow)
Report as a market share analysis
Link shoe brand & model with bib number
Segment market share by times, age, gender, ...
Report with a webapp (Django, Angular, ES)

C-Radar

2012-2016

4 years: Company data crunching

C-Radar is a company focusing on identifying new opportunities through B2B predictive marketing. It automatically links traditional administrative company data to company websites. From these websites, customers can add semantic value to their existing lists of prospects, and many structured fields are extracted so that those lists can be sorted along any given axis. Finally, using machine learning, one can predict new prospects from the current customers of a B2B company.

Project:

Get, normalize, analyze, and output data for client companies.
Python, MongoDB, RabbitMQ, Cassandra, PostgreSQL, Elasticsearch, …
{Web, Data, Text}-Mining, Web-Scraping, …
Machine learning (sklearn): classification, regression, clustering.
Distributed System, Scalability, Batch Processing, …
Millions of pages crawled every day
Billions of company data items processed every day
Also sysops, devops (Salt), pre-sales phase (proposal, specification), ...

Management:

Supervise two interns

Plizy

2011-2012

1 year: Video recommendation from heterogeneous sources

Plizy is an application for discovering, enjoying and sharing videos. An efficient discovery system requires a good understanding of both users and videos. Video metadata is extracted by dedicated scrapers. The challenging part is understanding users, which relies on data gathered from users watching videos on the platform as well as data from other sources such as Facebook. Facebook data gives insight into users' interests and their relationships to other users, which helps a user discover new videos based on what they like. Data from our own platform is best suited to understanding what kind of videos a user actually watches, and thus to recommending new videos based on what was previously watched.

Project:

Python, MySQL, Cassandra, MongoDB, Redis, Hadoop, Elasticsearch, …
Clustering, Data Mining, Machine Learning, …
Distributed System, Scalability, High Performance, Large Data Sets, Real Time Processing, …
Create various systems to recommend videos (similar videos, user-based and item-based recommendation, …).
10 million Facebook profiles and 400 million Facebook likes scraped, using the Real Time API from Facebook
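
As an illustration of the item-based recommendation mentioned above, here is a minimal Python sketch. It is not the Plizy implementation: the function names (`item_similarities`, `recommend`), the input layout and the choice of cosine similarity over co-viewing sets are all assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(watched):
    """Cosine similarity between videos, based on who watched them.

    `watched` maps a user id to the set of video ids that user watched
    (hypothetical input layout, chosen for this sketch).
    """
    # Invert the mapping: video id -> set of users who watched it.
    viewers = defaultdict(set)
    for user, videos in watched.items():
        for video in videos:
            viewers[video].add(user)

    sims = {}
    items = sorted(viewers)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            common = len(viewers[a] & viewers[b])
            if common:
                score = common / sqrt(len(viewers[a]) * len(viewers[b]))
                sims[(a, b)] = sims[(b, a)] = score
    return sims


def recommend(user, watched, sims, k=3):
    """Return up to k unseen videos, ranked by similarity to watched ones."""
    seen = watched[user]
    scores = defaultdict(float)
    for (a, b), score in sims.items():
        if a in seen and b not in seen:
            scores[b] += score
    # Highest score first; ties broken by video id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [video for video, _ in ranked[:k]]
```

A user-based variant would compare users by the videos they share instead of comparing videos by their viewers; the scoring step stays the same.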

Management:

Recruit and manage another data scientist

Twenga

2008-2011

3 years: Automatic data extraction from heterogeneous web pages

Twenga is a shopping comparison site. In order to compare similar offers, it is essential to efficiently extract the product reference as well as the product price. One also needs to extract the picture and the category to present the results to the end user. Using structural and semantic analysis, we were able to present a range of candidates to an operator, who would then choose the correct ones within a matter of seconds.

Project:

C / C++, Shell Script, Python, MySQL, …
Clustering, Data Mining, NLP, …
Distributed System, Scalability, High Performance, Large Data Sets, …
Develop a tool to extract designation, price, description, category and picture of products from retail web sites.
200,000 web sites crawled, 300 million products extracted.

Google

2008

6-month internship: Finding Text Orientation, Script and Language with Tesseract

To process documents correctly, an OCR engine must use language-specific files on a correctly oriented image (0°, 90°, 180° or 270°). The objective of the internship was to develop a tool that detects the orientation, script and language of an image. The tool had to be extensible, i.e. one can add any script or language to the training data and the accuracy must remain near 100%. The work was accomplished using some of the high-level components of Tesseract, an open-source OCR engine developed by Google, together with clustering and energy-minimization techniques.

Project:

C / C++, 6 months, Google
Build some tools to find orientation, script, and language used in an image. These tools use Tesseract, an open-source OCR engine, to extract connected components and classify them.

Other Projects (during my studies)

Homomorphisms Library
- C++, Generic Programming, Meta-programming, 2k lines, 6 months, LRDE.
- Designed from scratch, this library makes it easier to define operations on Decision Diagrams by providing high-level structures and algorithms over Decision Diagrams.
Generic Decision Diagram Library
- C++, Generic Programming, Meta-programming, 3k lines, 1 year, LRDE.
- Designed from scratch, this library makes it easier to define new Decision Diagram types; it also allows different Decision Diagram types to be mixed.
Distributed Data Decision Diagrams Framework
- Erlang, 2k lines, 6 months, LRDE.
- Written from scratch to study Decision Diagrams and the distribution problem. Nodes of a graph are represented by lightweight Erlang processes to increase parallelism.

Education

EPITA (School of Computer Science and Advanced Technologies).

2005-2008

French private university. Master of Science Degree in Computer Science and Engineering. Specialization in Scientific Computing and Image Processing.
EPITA, 94270 Le Kremlin-Bicêtre, France.

LRDE (EPITA Research and Development Laboratory).

2005-2008

Student Researcher: a dozen students (in the first decile of their class) take part in research projects alongside their studies, supervised by researchers. This involves:

scientific presentations;
bibliography work;
technical report writing;
collaboration with "professional" researchers.

I worked on Decision Diagram distribution for 6 months and on the design of a Generic Decision Diagram Library.

Undergraduate studies in computer science

2003-2005

EPITA (French Mathématiques Supérieures, Mathématiques Spéciales).

High school diploma

2003

French Baccalauréat S, major in Mathematics, with honors.

Bibliography

polyDD: Towards a Framework Generalizing Decision Diagrams

Decision Diagrams are now widely used in model checking as extremely compact representations of state spaces. Many Decision Diagram categories have been developed over the past twenty years based on the same principles. Each one targets a specific domain with its own characteristics. Moreover, each one provides its own definition. It prevents sharing concepts and techniques between these structures. This paper aims to propose a basis for a common Framework for Decision Diagrams. It should help users of this technology to define new Decision Diagram categories thanks to a simple specification mechanism called Controller. This enables the building of efficient Decision Diagrams dedicated to a given problem.

http://www.computer.org/portal/web/csdl/doi/10.1109/ACSD.2010.17

Decision Diagrams and Homomorphisms

Decision diagrams are structures used to represent large data sets. Data common to several elements of the set is shared, which makes the representation very compact in memory. Various types of Decision Diagrams exist, each with its own implementation. In this report, the Decision Diagram Library is presented. This library generalizes the concept of Decision Diagram in order to implement every possible type of Decision Diagram. Because algorithms are hard to define on Decision Diagrams, this report also presents the work done to add high-level dynamic structures on top of Decision Diagrams, along with algorithms frequently used on these structures.
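
To make the idea of sharing common data concrete, here is a minimal Python sketch of a hash-consed decision diagram that stores a set of equal-length tuples. It is only an illustration and not the library described in the report: the class name `SharedDD`, its methods and the tuple encoding are assumptions made for this example.

```python
from collections import defaultdict


class SharedDD:
    """Toy decision diagram over a non-empty set of equal-length tuples."""

    ONE = -1  # terminal node: "the tuple ends here and is accepted"

    def __init__(self):
        self.unique = {}  # (level, frozen arcs) -> node id (unique table)
        self.nodes = []   # node id -> (level, {value: child id})

    def node(self, level, arcs):
        # Hash-consing: identical sub-diagrams get the same node id,
        # so common suffixes are stored only once.
        key = (level, frozenset(arcs.items()))
        if key not in self.unique:
            self.unique[key] = len(self.nodes)
            self.nodes.append((level, dict(arcs)))
        return self.unique[key]

    def build(self, tuples, level=0):
        # Assumes a non-empty set of tuples that all have the same length.
        tuples = set(tuples)
        if tuples == {()}:
            return self.ONE
        # Group the remaining suffixes by the value of the current variable.
        groups = defaultdict(set)
        for t in tuples:
            groups[t[0]].add(t[1:])
        arcs = {value: self.build(rest, level + 1)
                for value, rest in groups.items()}
        return self.node(level, arcs)

    def count(self, node):
        # Number of tuples encoded below `node`.
        if node == self.ONE:
            return 1
        _, arcs = self.nodes[node]
        return sum(self.count(child) for child in arcs.values())
```

With the four tuples `(0, 0, 1)`, `(0, 1, 1)`, `(1, 0, 1)` and `(1, 1, 1)`, every variable level collapses to a single shared node, so the diagram holds three nodes instead of one path per tuple.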

Data Decision Diagrams Distribution.

Decision diagrams are structures used in several domains where memory usage is critical. Data Decision Diagrams (DDD) are a kind of decision diagram used, for example, in model checking. However, the solution they bring to the memory problem is not always sufficient. One way to overcome the memory limit is to distribute memory. Some implementations exist for BDDs (Binary Decision Diagrams), but they are neither really efficient nor maintained. In this report, new distribution algorithms for decision diagrams are presented, based on DDD properties. An Erlang implementation of a distributed DDD package is explained; then some results about distribution are given and discussed, based on this implementation.

Samuel Charron

Miscellaneous

Languages

Hobbies