GSOC 2014 Tatoeba Proposal: A Complete Python Rewrite of Tatoeba and Revamping of Its Architecture

1. Abstract

1.1 Summary

The current codebase is hard to maintain and work with, and it has not seen any major improvements in the past year or two. Rewriting it in a higher-level language and framework will make it more maintainable, cut down on development time, and attract more developers. A move towards a graph database, or the use of graph algorithms on top of a relational database, should also reduce server load and improve page response time. Finally, an API will reduce the complexity of interacting with the website and allow an ecosystem of external tools and programs to grow around Tatoeba. This project aims to achieve all of the above and to demonstrate the power of this design through a full-featured, cross-platform JavaScript interface that should run on major desktop platforms, a number of mobile platforms, and major browsers.

1.2 Goals

  • Replicate the current functionality of the website:
    • CRUD operations on sentences, comments, and wall posts
    • logging of all operations on sentences
    • user profiles, status, and permissions
    • browsing of sentences, tags, and users
    • messaging
    • searching for sentences
    • integration of transliteration
    • internationalization of the entire website
  • Expand the website's functionality:
    • a corrections framework
    • a corrections page with latest corrections that have been accepted, rejected, forced, or are overdue
    • uploading of audio
    • advanced queries
    • fetching translations up to the nth depth
    • a user proficiency web of trust
    • a user status web of trust
    • speed blocking/deletion of malicious users
    • a full fledged forum that can also be viewed as a wall
    • requesting words and translations
    • sentence subscription system
    • a user following system
    • a whole new customisable user page that acts as a notification system for the latest actions on subscribed sentences, the latest actions from followed users, etc.
    • rss and atom feeds for most pages that are essentially list views
  • Enhance the database schema by using graph algorithms and/or integrating a graph database
  • Fully cache database queries and templates using memcached, and integrate an outside caching system such as varnish for server generated pages
  • Build a set of python functions to manipulate the database and perform all functionalities that constitute an inner api, and use it in all view code and in future modules that extend the website
  • Build a RESTful api on top of the python api and django orm through tastypie
  • Rewrite the UI to be crossplatform, dynamic, client-side, API compliant code.
  • Write a battery of tests that cover all the codebase
  • Provide administrative scripts for tasks such as importing/exporting db fixtures, importing/exporting csvs, adding new languages, extracting/updating mo/po files, search indexing, deployment on a development or production machine using vagrant/ansible
  • Provide help bot scripts that clean up corrections, among other things
  • Provide an interface to all the admin and help bot scripts, from which they can be run manually or scheduled and tweaked as cron jobs

1.3 Advantages

  • A more maintainable and easily expandable codebase
  • Continuous integration in the coding process thanks to the test suite
  • Better engagement of the users thanks to the social networking features
  • Ease of administration and deployment
  • Faster response time
  • Less server load
  • Better queries and thus better linking and translation experience
  • Easier audio contribution process which will hopefully attract more users to add audio
  • Setting the basis for native mobile apps and other third party apps to easily integrate tatoeba thanks to the RESTful api
  • Setting the basis for more modules and scripts that extend the website thanks to the python api

2. Design

2.1 Philosophy

The codebase needs to be:

  • modular
  • pluggable through a python api
  • built on available apps wherever possible
  • pep8 compliant
  • self documenting
  • tested and documented

2.2 Specification

Pytoeba core module

models:

  • Sentence
    • sentence_id: The id of the sentence as in the current tatoeba data, kept for compatibility.
    • p2p_id: A sha1 hash of a string concatenating the user, timestamp, and text of the sentence.
    • lang: The iso code of the language of the sentence, which will be limited to a set of choices from a tuple of supported language codes. 'und' will be set for an unknown language.
    • text: The actual sentence stored as a utf-8 string. An upper limit is set at 500 characters.
    • hash: 40 char field storing a hex encoding of a sha1 hash of the text field for fast detection of duplicates.
    • simhash: integer field, stores a uint64 of a charikar hash of the text field. Used for fast retrieval of potential near-duplicates.
    • added_by: A foreign key to django.contrib.auth.models.User object representing the user that added the sentence.
    • added_on: Python datetime object automatically added by django upon creation
    • modified_on: Python datetime object automatically added by django upon updating the object
    • owner: A foreign key to django.contrib.auth.models.User object representing the user that currently owns the sentence.
    • tag: A many to many relationship to the Tag model through the TagMeta model
    • link: A many to many relationship to self through the Link model
    • editable: A boolean field that indicates if the sentence is locked from editing or not, for use by moderators only in cases of controversy or possibly in the future to lock well trusted sentences.
    • correctible: A boolean field that indicates if a sentence has a pending correction
    • length: Integer field. Calculated automatically when the object is saved.
  • Link:
    • side1: Foreign key to the Sentence model. Represents one side of a translation link.
    • side2: Foreign key to the Sentence model. Represents the other side of a translation link.
    • level: Integer field representing the depth of the translation link
  • Correction:
    • sentence: Foreign key to the Sentence model. Stores a Sentence object that a given Correction object is a proposed correction of.
    • text: The proposed correction stored as a UTF-8 string. Maximum length of 500 characters.
    • added_by: A foreign key to django.contrib.auth.models.User object representing the user that added the correction.
    • added_on: Python datetime object automatically added by django upon creation
    • modified_on: Python datetime object automatically added by django upon updating the object.
    • reason: Text field that holds a description of why the correction is necessary.
  • Audio:
    • sentence: Foreign key to the Sentence model.
    • file: Django file field. Holds a reference to the location of the file on disk for easy retrieval. Also django creates an upload widget for this by default.
    • added_by: A foreign key to django.contrib.auth.models.User object representing the user that added the audio file.
    • added_on: Python datetime object automatically added by django upon creation
    • bitrate: Integer field. Should be automatically detected and filled out using custom code.
    • accent: Text field filled in by the user to indicate their region or accent. Could possibly be moved to the UserProfile model.
  • Log:

    • sentence: foreign key to the Sentence model.
    • info: text field with more info about the change, highly variable depending on the action.
    • type: 2 or 3 letter code for the type of operation performed limited to a set of choices mapping to the following list:
      • sentence_edited
      • link_added
      • link_removed
      • comment_added
      • comment_removed
      • comment_edited
      • tag_added
      • tag_removed
      • user_subscribed
      • user_unsubscribed
      • correction_added
      • correction_removed
      • correction_rejected
      • correction_accepted
      • correction_forced
      • correction_edited
      • sentence_locked
      • sentence_unlocked
      • sentence_owner_changed
      • sentence_lang_changed
      • audio_added
      • audio_modified
      • audio_removed
  • Tag:

    • text: Text field containing the tag. 100 characters maximum.
    • lang: ISO code of the language the tag is in as a 3 letter string limited to the list of supported languages.
    • added_by: A foreign key to django.contrib.auth.models.User object representing the user that added the tag.
    • added_on: Python datetime object automatically added by django upon creation.
    • category: A text field for grouping tags.
    • note: A text field that holds any additional info about the tag
  • TagTranslation:
    • tag: Foreign key to the Tag model.
    • lang: ISO code of the language the tag is in as a 3 letter string limited to the list of supported languages.
    • text: Text field containing the translated tag in the target language.
    • added_by: A foreign key to django.contrib.auth.models.User object representing the user that added the translation.
    • added_on: Python datetime object automatically added by django upon creation.
    • note_tl: Text field containing a translation of the note.
  • TagMeta:
    • sentence: Foreign key to the Sentence model.
    • tag: Foreign key to the tag model.
    • added_by: A foreign key to django.contrib.auth.models.User object representing the user that tagged this specific sentence with the given tag.
    • added_on: Python datetime object automatically added by django upon creation.
    • reason: Text field for why this tag is necessary. (optional)
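
The three hash fields on the Sentence model (hash, p2p_id, and simhash) can be sketched with the standard library alone. This is a minimal illustration rather than the final implementation: the '|' separator in p2p_id, the use of character trigrams as simhash features, and all function names are assumptions that the specification above does not fix.

```python
import hashlib


def text_hash(text):
    """Hex SHA-1 of the UTF-8 text: 40 characters, for exact-duplicate lookup."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def p2p_id(username, timestamp, text):
    """SHA-1 over user, timestamp, and text (the '|' separator is an assumption)."""
    raw = "|".join((username, timestamp, text))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def simhash64(text, shingle=3):
    """Charikar-style 64-bit fingerprint built from character shingles."""
    # Feature choice (character trigrams) is an assumption, not from the proposal.
    count = max(1, len(text) - shingle + 1)
    features = [text[i:i + shingle] for i in range(count)]
    weights = [0] * 64
    for feature in features:
        digest = hashlib.sha1(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for bit in range(64):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two sentences whose fingerprints differ in only a few bits are candidates for a closer near-duplicate check. Note that Django's BigIntegerField is a signed 64-bit column, so an unsigned fingerprint would have to be shifted or reinterpreted before it is stored.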

views:

  • Main page view: GET /home/

    • Displays:
      • Latest 10 sentences
      • Latest 10 comments
      • Shortened versions of the latest 5 wall posts
      • 1 random sentence with translations to the 2nd depth
    • Template also includes:
      • a navbar with links to other views organized under appropriate categories
      • login/register links to the user views
      • links to paginated 'latest' views
      • links to detail views of listed items
  • Latest Sentences view: GET /latest/sentences/?pg=(\d+)

    • Displays:
      • Latest sentences added paginated by 15
    • Template extras:
      • links to detail view for each item
  • My page view: GET /myhome/
    • Displays:
      • Latest 10 log entries for sentences in the user's subbed sentences list (nested query using the __in= filter, log level in this view is also prefiltered using some variable from the settings)
      • Latest 10 comments on subbed sentences.
      • Latest 10 corrections added on subbed sentences.
      • Latest 10 corrections overdue on subbed sentences.
      • Latest 10 sentences added from the user's list of followed users.
      • Latest 10 comments added by users from the user's list of followed users.
    • Template extras:
      • detail links for each item
      • links to paginated list of relevant 'latest' views
  • Sentence detail view: GET /sentence/(\d+)/
    • Displays:
      • Sentence object (text, id, adoption status(owner), lock status, subscribe status)
      • set of all translations linked to this sentence (defaults to 2nd depth)
      • set of all tags linked to this sentence
      • set of all corrections linked to this sentence
      • set of all log entries linked to this sentence (possibly paginated?)
      • set of all comments linked to this sentence
    • Template extras:
      • input box for depth
      • link to SentenceLink view for translations of lvl>1
      • link to SentenceUnlink view for translations of lvl=1
      • link to SentenceTranslate view
      • link to SentenceAdopt view
      • link to SentenceEdit view
      • link to SentenceDelete view
      • link to TagAdd view
      • links to TagRemove view (for each tag)
      • link to CorrectionAdd view
      • links to CorrectionDelete/Reject/Edit view for each correction
      • link to CommentAdd view (django.contrib.comments)
      • links to CommentEdit/delete view
      • link to next/previous sentence detail views by id
      • input box for going to a specific sentence detail view
  • Sentence link view: POST /sentence/link/ data: id1 & id2
    • login protected
    • calls utils.sentence_link(id1, id2)
  • Sentence unlink view: POST /sentence/unlink/ data: id1 & id2
    • login protected
    • calls utils.sentence_unlink(id1, id2)
  • Sentence translate view:
  • Sentence edit view: POST /sentence/edit/ data: id & text
    • login protected
    • calls utils.is_owner(request.user)
    • calls utils.sentence_edit(id, text)
  • Sentence add view: POST /sentence/add/ data: text
    • login protected
    • calls utils.sentence_add(text, request.user)
  • Sentence delete view: POST /sentence/delete/ data: id
    • login protected
    • calls utils.is_owner(request.user)
    • calls utils.is_moderator(request.user)
    • calls utils.sentence_delete(id)
  • Sentence adopt view: POST /sentence/adopt/ data: id
    • login protected
    • calls utils.has_owner(id)
    • calls utils.sentence_assign_owner(request.user)
  • Sentence browse view:
    • Displays:
      • Paginated sentence querysets with the possibility to add filters by:
        • lang
        • translation lang by depth
        • exclude translation lang by depth
        • sent length
        • users
        • date range
        • tags
  • Main tag view:
    • Displays:
      • Paginated tag querysets filtered by:
        • alphabetical order
        • date
        • user
        • sentence count
    • Template extras:
      • links to create tags etc
  • Tag create view: POST /tag/create/ data: text
    • login protected
    • calls utils.tag_is_created_by(request.user)/utils.is_moderator(request.user)
    • calls utils.tag_create(text)
  • Tag edit view: POST /tag/edit/ data: id & text
    • login protected
    • calls utils.is_moderator(request.user)
    • calls utils.tag_edit(id, text)
  • Tag delete view: POST /tag/delete data: id
    • login protected
    • calls utils.is_moderator(request.user)
    • calls utils.tag_delete(id)
  • Tag add view: POST /tag/add data: tag_id & sent_id
    • login protected
    • calls utils.tag_add(tag_id, sent_id)
  • Tag remove view: POST /tag/remove data: tag_id & sent_id
    • login protected
    • calls utils.tag_remove(tag_id, sent_id)
  • Main corrections view:
    • Displays:
      • Latest 10 corrections
      • Latest 10 overdue corrections
      • Latest 10 accepted/rejected/applied corrections
    • Template extras:
      • links to views with paginated querysets of corrections
  • Correction add view: POST /correction/add data: sent_id & text
    • login protected
    • calls utils.correction_add(sent_id, text)
  • Correction edit view: POST /correction/edit data: corr_id & text
    • login protected
    • calls utils.is_owner(request.user)/ utils.owns_correction(request.user)
    • calls utils.correction_edit(corr_id, text)
  • Correction delete view: POST /correction/delete data: corr_id
    • login protected
    • calls utils.owns_correction(request.user)
    • calls utils.correction_remove(corr_id)
  • Correction reject view: POST /correction/reject data: corr_id
    • login protected
    • calls utils.is_owner(request.user)
    • calls utils.correction_remove(corr_id)
  • Correction accept view: POST /correction/accept data: corr_id
    • login protected
    • calls utils.is_owner(request.user)
    • calls utils.correction_apply(corr_id, sent_id)
  • Correction force view: POST /correction/force data: corr_id
    • login protected
    • calls utils.is_moderator(request.user)
    • calls utils.correction_apply(corr_id, sent_id)
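
The sentence detail view above shows translations up to a chosen depth (2 by default). One way to compute those depths is a breadth-first walk over the Link pairs, sketched below in plain Python. The function name and the representation of links as (side1, side2) tuples are assumptions for illustration; a real implementation would query the Link model instead.

```python
from collections import deque


def translations_by_depth(links, start, max_depth=2):
    """Map each sentence reachable from `start` to its link depth (1 = direct)."""
    # Build an undirected adjacency map from (side1, side2) pairs.
    neighbours = {}
    for side1, side2 in links:
        neighbours.setdefault(side1, set()).add(side2)
        neighbours.setdefault(side2, set()).add(side1)

    depths = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if depths[current] == max_depth:
            continue  # do not expand beyond the requested depth
        for other in neighbours.get(current, ()):
            if other not in depths:
                depths[other] = depths[current] + 1
                queue.append(other)
    del depths[start]  # the sentence is not a translation of itself
    return depths


links = [(1, 2), (2, 3), (3, 4)]
second_depth = translations_by_depth(links, 1)  # {2: 1, 3: 2}
```

Breadth-first order guarantees that each sentence is recorded at its shortest distance from the starting sentence, which matches the meaning of the level field on the Link model.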

Tatousers module

models:

  • UserMeta:
    • user: Foreign key to django.contrib.auth.models.User .
    • following: ManytoMany field to django.contrib.auth.models.User
    • subbed_to: ManytoMany field to the Sentence model.
    • status: Text field that allows a 2 letter code from a list of choices for 'user', 'trusted_user', 'moderator', 'admin'.
    • langs: ManytoMany field to UserLang.
    • joined_on: Python datetime object that's automatically added
    • country: Text field from a list of country choices.
    • birthday: Python datetime field, entered by the user.
  • UserLang:
    • lang: Text field from a list of supported ISO codes.
  • LangMeta:
    • user_meta: Foreign key to User
    • user_lang: Foreign key to UserLang
    • proficiency: Text field from a list of codes for choices 'beginner', 'intermediate', 'fluent', 'native'
    • votes: Integer field for holding votes on user proficiency from other users.

views:

  • User browse view: GET
    • Displays:
      • Filterable paginated set of users, by:
        • join date
        • status
        • latest activity
        • langs
  • User settings view: (TBD)
  • User meta info view: (TBD)
  • The rest of the user related views and messaging will be integrated from userena and django-social-auth

Sphinx module [pytoeba.sphinx]

Implements the search function that uses sphinx's api python bindings.

Python API module [pytoeba.utils]

Wraps the common database operations used in views in a higher-level abstraction, implemented as standalone functions and mirrored on the database manager and queryset manager (where it makes sense).
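
As a rough sketch of what one such standalone function could look like, the example below edits a sentence after checking the lock and the owner, then records a log entry. Plain dictionaries and a list stand in for the Sentence and Log models, and the extra arguments, the PermissionDenied exception, and the contents of the info field are all assumptions made so that the example runs without Django.

```python
class PermissionDenied(Exception):
    """Raised when a user may not perform the requested operation."""


def is_owner(sentence, user):
    """True if `user` currently owns the sentence."""
    return sentence["owner"] == user


def sentence_edit(sentences, log, sent_id, text, user):
    """Edit a sentence if it is unlocked and owned by `user`, then log the change."""
    sentence = sentences[sent_id]
    if not sentence["editable"]:
        raise PermissionDenied("sentence is locked")
    if not is_owner(sentence, user):
        raise PermissionDenied("only the owner can edit")
    if len(text) > 500:
        raise ValueError("text exceeds 500 characters")
    old_text = sentence["text"]
    sentence["text"] = text
    sentence["length"] = len(text)  # recalculated on save, as in the model
    # Storing the previous text in `info` is an assumption for illustration.
    log.append({"sentence": sent_id, "type": "sentence_edited", "info": old_text})
    return sentence
```

Keeping the permission checks and the logging inside the function means that views, scripts, and the RESTful API can all call it without duplicating those rules.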

RESTful API module [pytoeba.api]

Wraps all models and utils functions in tastypie resource classes.

Angular UI module

  • Home: This view mimics tatoeba's home page. It shows the latest lists of objects in dedicated boxes that scroll infinitely, as you can see in image 1. There's a nav bar with a notifications box that contains what the myhome view would usually contain; when clicked, it expands to the full dimensions of the page to display an infinitely scrolling list. Clicking on any of the container boxes also expands it to display a bigger infinitely scrolling list, see image 2.
  • Corrections: This mimics the Home view but uses querysets of corrections, with a similar layout of expandable, infinitely scrollable boxes for the latest querysets.
  • Sentences: This would preferably use a modified version of ng-grid that's fully sortable by header. Above it there's a filtering bar with dropdown boxes and date pickers (as provided by angular-ui through select2.js) that regenerates the querysets based on the parameters described in the Sentence browse view, see image 3.
  • Tags: This would mimic the Sentences view above but with tag querysets.
  • Users: This would mimic the Sentences view but with UserMeta querysets.

Scripts module

  • import.py: uses a recursive graph algorithm to build a translation graph from a given csv and save it to the database.
  • export.py: queries the database for sentences and outputs a flat csv for sentences, tags, etc...
  • db_dump.py: dumps json database fixtures and specific sql dumps.
  • db_load.py: loads a json fixture or a specific sql dump.
  • add_lang.py: adds an entry in the list of iso langs in pytoeba.models.LANGS
  • extract_po.py: runs the gettext extractor
  • update_translations.sh: sends po files to launchpad and commits them to main repo
  • auto_correct.py: forces all overdue corrections. Run by a cron job.
  • auto_ban.py: detects malicious users with a set of criteria and bans them.
  • notify.py: periodically sends all e-mail notifications, etc...
  • setup.py: installs pytoeba on a linux system
  • profile.py: benchmarks for pytoeba.
  • deploy: vagrant script that sets up nginx, uwsgi, and pytoeba
  • generate_sphinx_config.py: generates sphinx config.
  • index.sh: tells sphinx to index the db
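
The graph-building step of import.py can be sketched as follows. The input format (tab-separated pairs of sentence ids, one link per row) and the function names are assumptions. The sketch also swaps the recursion mentioned above for an explicit stack, since Python's default recursion limit is easy to hit on large translation graphs.

```python
import csv
import io


def read_links(handle):
    """Parse tab-separated (sentence_id, translation_id) rows into an adjacency map."""
    graph = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        a, b = int(row[0]), int(row[1])
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def translation_groups(graph):
    """Split the graph into groups of mutually reachable sentences."""
    seen = set()
    groups = []
    for node in sorted(graph):
        if node in seen:
            continue
        group, stack = set(), [node]
        while stack:  # explicit stack instead of recursion
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(graph[current] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical sample data standing in for a real links file.
sample = io.StringIO("1\t2\n2\t1\n2\t3\n7\t8\n")
groups = translation_groups(read_links(sample))
```

Each group can then be saved in one pass, with Link rows and their levels derived from the distances inside the group.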

Tatorequest module

This will integrate this app into the pytoeba architecture.

Tatowall module

Builds on djangobb to provide a view that would closely mimic the current tatoeba wall as well as the full fledged forum views that it has by default.

Orientdb django backend module (for sometime outside gsoc)

3. Schedule

A full 12 weeks will be dedicated to the project. All work will be done in a public github repo, and a weekly report on the tasks will be posted on the wall. Deployment will happen at the beginning of the third week, and one full day of each week thereafter will be dedicated to fixing bugs.

3.1 Week 1

Day 1-2

  • implement pytoeba models

Day 3

  • test and play with the models and refine them

Day 4-5

  • implement all views relating to sentences with functionality code being implemented in the python api

Day 6

  • test and play with the views and refine them

Day 7

  • take the day off

3.2 Week 2

Day 1-2

  • implement views relating to tags and corrections

Day 3

  • test and refine

Day 4-5

  • integrate userena
  • integrate django-social-auth
  • implement settings panel and user related views

Day 6

  • test and refine

Day 7

  • take the day off

3.3 Week 3

Day 1

  • integrate djangobb and implement tatowall views and test
  • deploy pytoeba and ask for users to use it and report bugs

Day 2-5

  • integrate sphinx
  • write generate_sphinx_config.py
  • write index.sh

Day 6

  • fix first batch of bugs

Day 7

  • take the day off

3.4 Week 4

Day 1-2

  • implement sentence related methods in the api

Day 3-4

  • implement tag and correction related methods

Day 5-6

  • implement user and wall related methods
  • implement oauth

Day 7

  • fix bugs

3.5 Week 5

Day 1-2

  • implement the rest of the scripts module

Day 3-4

  • implement vagrant and ansible deployment

Day 5-6

  • implement admin interface

Day 7

  • fix bugs

3.6 Week 6

  • Initial setup of angular-ui, ng-grid, bower, grunt, karma, jasmine for the project structure.

3.7 Week 7-9

  • Implement the Home view
  • Tweak the implementation for Corrections

3.8 Week 10-11

  • Implement the Sentence grid view
  • Tweak implementation for Tags and Users

3.9 Week 12

  • implement django tests
  • implement angular tests

4. About me

4.1 Involvement with Programming

I studied programming for 3 years in high school, mainly C, C++, and Java, and passed the AP Computer Science exam with a score of 5. I picked up Python quite recently (this year) and started making progress again.

4.2 Projects

My projects include an irc bot for learning japanese and interacting with tatoeba's data offline, a japanese romanization server also used by tatoeba, and a minimal django CMS. You can access them at my github account.

4.3 Involvement with Tatoeba

I've been involved with the tatoeba project for a bit over three years now as I love languages. I run jfaptor on the irc channel that interacts with tatoeba sentences. I helped a bit with technical problems on the server, helped migrate the codebase and tickets to github, and helped make the development VM. I also contributed about 7k sentences to tatoeba in 3 languages.

4.4 Linux skills

I've been using linux for the past 2 years as well and have run my own vps with BIND, nginx, ssh, and other services. I also deployed my own django blog using nginx, pypy, and uwsgi.