Big Data Engineer

Interests: Data mining, machine learning, large data structures, data distribution

Objective: A challenging R&D job in big data.

Skills

Languages

Projects

Education

EPITA (School of Computer Science and Advanced Technologies).

2005-2008

French private university. Master of Science Degree in Computer Science and Engineering. Specialization in Scientific Computing and Image Processing.
EPITA, 94270 Le Kremlin-Bicêtre, France.

LRDE (EPITA Research and Development Laboratory).

2005-2008

Student Researcher: about a dozen students (in the top decile of their class) take part in research projects alongside their studies, supervised by researchers. This implies:

I worked on Decision Diagrams Distribution for 6 months and on the design of a Generic Decision Diagrams Library.

Undergraduate studies in computer science

2003-2005

EPITA (French Mathématiques Supérieures, Mathématiques Spéciales).

High school diploma

2003

(French Baccalauréat S), major in Mathematics, with honors.

Professional Experience

Data Publica

2012-…

Until now: Data crunching

Data Publica is a company focused on developing datasets. It creates datasets by identifying sources, extracting data from these sources, transforming unstructured data into structured data, and finally delivering the data to the end customer.

Plizy

2011-2012

1 year: Video recommendation from heterogeneous sources

Plizy is an application created to discover, enjoy and share videos. An efficient discovery system requires a good understanding of both users and videos. Video metadata is extracted through dedicated scrapers. The challenging part is understanding users, which relies on data gathered from users watching videos on the platform as well as data from other sources such as Facebook. Data acquired from Facebook gives insight into users' interests and their relationships to other users, which helps a user discover new videos based on what they like. Data from our platform is best suited to understanding what kind of videos a user really watches, and thus to recommending new videos based on what was previously watched.

Twenga

2008-2011

3 years: Automatic data extraction from heterogeneous web pages

Twenga is a shopping comparison site. To compare similar offers, it is essential to efficiently extract each product's reference and price. The picture and category must also be extracted to present the results to the end user. Using structural and semantic analysis, we were able to present a set of candidate values to an operator, who would then choose the correct ones within seconds.

Google

2008

6-month internship: Finding Text Orientation, Script and Language with Tesseract

To process documents correctly, an OCR engine must use language-specific files on a correctly oriented image (0 / 90 / 180 / 270 degrees). The objective of the internship was to develop a tool that detects orientation, script and language in an image. The tool had to be extensible, i.e. any script or language can be added to the training data while accuracy remains near 100%. The work was accomplished using some of the high-level components of Tesseract, an open-source OCR engine developed by Google, together with clustering and energy-minimization techniques.

Bouygues Telecom (3rd mobile phone operator in France)

2006

6-month internship: Information System Visualization Software.

Bouygues Telecom needed a tool to represent its large Information System. This Information System is stored in a CMDB and modeled by an M1 meta-model. The objective of the internship was to develop a platform to represent the Information System, with two constraints: genericity with respect to models and meta-models, and extensibility, achieved through the Eclipse Rich Client Platform and its plug-in system.

Bibliography

polyDD: Towards a Framework Generalizing Decision Diagrams

Decision Diagrams are now widely used in model checking as extremely compact representations of state spaces. Many Decision Diagram categories have been developed over the past twenty years based on the same principles. Each one targets a specific domain with its own characteristics. Moreover, each one provides its own definition, which prevents sharing concepts and techniques between these structures. This paper aims to propose a basis for a common Framework for Decision Diagrams. It should help users of this technology to define new Decision Diagram categories thanks to a simple specification mechanism called Controller. This enables the building of efficient Decision Diagrams dedicated to a given problem.

http://www.computer.org/portal/web/csdl/doi/10.1109/ACSD.2010.17

Decision Diagrams and Homomorphisms

Decision diagrams are structures used to represent large data sets. Data common to the elements of the set is shared, which makes the representation very compact in memory. Various types of Decision Diagrams exist, each with its own implementation. In this report, the Decision Diagram Library is presented. This library generalizes the concept of Decision Diagram in order to implement every possible type of Decision Diagram. Because algorithms are hard to define on Decision Diagrams, this report also presents the work done to add high-level dynamic structures on top of Decision Diagrams, along with algorithms frequently used on these structures.

http://lrde.epita.fr/~charron/index.php?id=seminar08

Data Decision Diagrams Distribution

Decision diagrams are structures used in several domains where memory usage is critical. Data Decision Diagrams (DDD) are a kind of decision diagram used, for example, in model checking. However, the solution they bring to the memory problem is not always sufficient. One way to overcome the memory limit is to distribute memory. Some implementations exist for BDD (Binary Decision Diagrams), but they are neither really efficient nor maintained. In this report, new distribution algorithms for decision diagrams are presented, based on DDD properties. An Erlang implementation of a distributed DDD package is explained; then some results about distribution are given and discussed, based on this implementation.

http://lrde.epita.fr/~charron/index.php?id=seminar06

Miscellaneous

Languages

Hobbies