Home

Linalgo is a Python module to help Machine Learning team create and curate datasets for Natural Language Processing. It tries to follow the W3C Web Annotation Data Model and to provides a powerful system to add metadata to most commonly used text and image formats: TXT, PDF, HTML, etc.

Installation

You can install linalgo using pip:

pip install linalgo

For other options, see Installation.

Getting started

Examples

Examples will be available in the examples/ folder.

Tutorials

Getting task data

from linalgo.client import LinalgoClient

client_id = '<YOUR_CLIENT_ID_HERE>'
client_secret = '<YOUR_SECRET_HERE>'
linalgo_client = LinalgoClient(client_id, client_secret)

Training a binary classifier

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipelin

task_id = 1
tasks = linalgo_client.get_task(task_id)

label = 4
docs, labels = task.transform(target='binary',  label=label)

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=42)

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)

Indices and tables