-
API design for machine learning software: experiences from the scikit-learn project
Authors:
Lars Buitinck,
Gilles Louppe,
Mathieu Blondel,
Fabian Pedregosa,
Andreas Mueller,
Olivier Grisel,
Vlad Niculae,
Peter Prettenhofer,
Alexandre Gramfort,
Jaques Grobler,
Robert Layton,
Jake Vanderplas,
Arnaud Joly,
Brian Holt,
Gaël Varoquaux
Abstract:
Scikit-learn is an increasingly popular machine learning library. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. In this paper, we present and discuss our design choices for the application programming interface (API) of the project. In particular, we describe the simple and elegant interface shared by all learning and processing units in the library and then discuss its advantages in terms of composition and reusability. The paper also comments on implementation details specific to the Python ecosystem and analyzes obstacles faced by users and developers of the library.
Submitted 1 September, 2013;
originally announced September 2013.
-
Scikit-learn: Machine Learning in Python
Authors:
Fabian Pedregosa,
Gaël Varoquaux,
Alexandre Gramfort,
Vincent Michel,
Bertrand Thirion,
Olivier Grisel,
Mathieu Blondel,
Andreas Müller,
Joel Nothman,
Gilles Louppe,
Peter Prettenhofer,
Ron Weiss,
Vincent Dubourg,
Jake Vanderplas,
Alexandre Passos,
David Cournapeau,
Matthieu Brucher,
Matthieu Perrot,
Édouard Duchesnay
Abstract:
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.org.
Submitted 5 June, 2018; v1 submitted 2 January, 2012;
originally announced January 2012.
-
Cross-Lingual Adaptation using Structural Correspondence Learning
Authors:
Peter Prettenhofer,
Benno Stein
Abstract:
Cross-lingual adaptation, a special case of domain adaptation, refers to the transfer of classification knowledge between two languages. In this article we describe an extension of Structural Correspondence Learning (SCL), a recently proposed algorithm for domain adaptation, for cross-lingual adaptation. The proposed method uses unlabeled documents from both languages, along with a word translation oracle, to induce cross-lingual feature correspondences. From these correspondences a cross-lingual representation is created that enables the transfer of classification knowledge from the source to the target language. The main advantages of this approach over other approaches are its resource efficiency and task specificity.
We conduct experiments in the area of cross-language topic and sentiment classification involving English as source language and German, French, and Japanese as target languages. The results show a significant improvement of the proposed method over a machine translation baseline, reducing the relative error due to cross-lingual adaptation by an average of 30% (topic classification) and 59% (sentiment classification). We further report on empirical analyses that reveal insights into the use of unlabeled data, the sensitivity with respect to important hyperparameters, and the nature of the induced cross-lingual correspondences.
Submitted 25 August, 2010; v1 submitted 4 August, 2010;
originally announced August 2010.