
Kosh Sprint Dec 21 to Jan 22

tarunima edited this page Dec 14, 2021 · 17 revisions

Goal

We wish to host the following datasets on Kosh and open them up to the public:

  1. Kazemi et al
  2. FearSpeech Dataset
  3. Factcheck Articles Media Dataset
  4. Reis et al.

Given the large size of some of the datasets, we would also like this data to be searchable via text, image and video queries.

Current Status

Kosh currently has email-based signup and authentication using JWT. Anyone can sign up, log in to Kosh and see the hosted datasets. The datasets were added via API calls to Kosh, but this API currently has no access control, making it unsafe for public access. Adding a dataset also relies heavily on a Tattle admin, which is a bottleneck when Tattle team members or trusted partners want to add more datasets.

On the search side, we have made progress on optimising the memory requirements of our search engine, Feluda. It supports indexing text, images and videos (of size < 20 MB). The server has tested API endpoints to index and search text, images and video. Work remains to integrate it with a queue (RabbitMQ) and to profile its memory use and concurrency.

We have scoped the remaining tasks into the following features, which can be worked on independently to make incremental progress towards the goal:

  • Secure Public Access
  • Upload Media
  • Index Media
  • Explore Datasets

Any discussions on tweaking and adding to the scope can be found here

Features

Secure Public Access

Domains : backend engineering, API design, frontend engineering, security, database management(sql)

Representative User Stories

  • As an admin, I want to create users with the role author or viewer
  • As an admin, I want to be able to delete or block users
  • As Tattle, I want to be sure that unauthorized access to the data is not possible
  • As Tattle, I want to be sure that a user can't add, edit or delete Media in a dataset that is not associated with them
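
The last two stories above amount to a per-dataset authorization check. The sketch below shows one way such a check could look. It assumes three roles (admin, author, viewer), assumes that admins may modify any dataset, and uses a hypothetical `member_ids` field to record which users are associated with a dataset. None of these names come from the Kosh codebase.

```python
from dataclasses import dataclass, field

# Role names taken from the user stories; the constants themselves are hypothetical.
ADMIN, AUTHOR, VIEWER = "admin", "author", "viewer"


@dataclass
class User:
    id: int
    role: str
    blocked: bool = False


@dataclass
class Dataset:
    id: int
    # IDs of users associated with this dataset (assumed data model).
    member_ids: set = field(default_factory=set)


def can_view(user: User, dataset: Dataset) -> bool:
    """Any non-blocked user with a known role may read a hosted dataset."""
    return not user.blocked and user.role in {ADMIN, AUTHOR, VIEWER}


def can_modify(user: User, dataset: Dataset) -> bool:
    """Allow add/edit/delete only for admins (assumption) or associated authors."""
    if user.blocked:
        return False
    if user.role == ADMIN:
        return True
    return user.role == AUTHOR and user.id in dataset.member_ids
```

Keeping the rule in one function means every write endpoint can call the same check before touching a dataset.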

Upload Media

Domains : backend engineering, API design, frontend engineering, security, database management(sql)

Representative User Stories

  • As an author, I want to manage my bot's access to Kosh
  • As an author, I want to write a script that I can run periodically to upload the data I have scraped to Kosh
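
An author's upload script might look roughly like the sketch below. The base URL, the route and the use of a bearer token are all assumptions made for illustration; this page does not specify the upload API. The request is built separately from sending so the script can be checked without a network connection.

```python
import json
import urllib.request

# Hypothetical base URL: the real Kosh API location is not given on this page.
KOSH_API = "https://kosh.example.org/api"


def build_upload_request(dataset_id: str, post: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated request that adds one post to a dataset."""
    body = json.dumps(post).encode("utf-8")
    return urllib.request.Request(
        url=f"{KOSH_API}/datasets/{dataset_id}/posts",  # assumed route
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed: the bot's token is sent as a bearer token
            "Content-Type": "application/json",
        },
    )


def upload_all(dataset_id: str, posts: list, token: str, send=urllib.request.urlopen) -> int:
    """Send one request per scraped post and return how many were sent."""
    sent = 0
    for post in posts:
        response = send(build_upload_request(dataset_id, post, token))
        response.close()
        sent += 1
    return sent
```

A script like this could be run periodically from a scheduler such as cron, with the bot's token read from the environment rather than hard-coded.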

Relevant Links

Index Media

Domains : scripting(python, javascript), backend engineering(JS), frontend engineering(ReactJS), database management(mongo, sql), devops(Kubernetes, Github Actions), API integration

Representative User Stories

  • As an admin, I want to check if all the data added to Kosh has been indexed into Feluda
  • As an admin, I want to retry failed index jobs
  • As an admin, I want to prevent certain posts from getting indexed in the future
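
The three stories above share one piece of bookkeeping: a status per post, plus a list of posts to exclude. The sketch below is an in-memory stand-in for that bookkeeping. The class, its method names and the status values are hypothetical and do not describe how Kosh or Feluda store index jobs.

```python
from dataclasses import dataclass, field

# Hypothetical status values for an index job.
PENDING, INDEXED, FAILED = "pending", "indexed", "failed"


@dataclass
class IndexTracker:
    """In-memory stand-in for a table of index jobs keyed by post ID."""
    status: dict = field(default_factory=dict)
    blocklist: set = field(default_factory=set)

    def enqueue(self, post_id: str) -> bool:
        """Register a post for indexing unless it is on the blocklist."""
        if post_id in self.blocklist:
            return False
        self.status[post_id] = PENDING
        return True

    def record(self, post_id: str, ok: bool) -> None:
        """Store the outcome reported for an index job."""
        self.status[post_id] = INDEXED if ok else FAILED

    def unindexed(self, all_post_ids) -> list:
        """Posts that are neither indexed nor deliberately excluded."""
        return [p for p in all_post_ids
                if p not in self.blocklist and self.status.get(p) != INDEXED]

    def retry_failed(self) -> list:
        """Move failed jobs back to pending and return their IDs."""
        failed = [p for p, s in self.status.items() if s == FAILED]
        for p in failed:
            self.status[p] = PENDING
        return failed
```

In a deployed system the same state would more likely live in a database table, with `retry_failed` re-publishing the jobs to the queue.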

Relevant Links

Explore Datasets

Domains : ml engineering, devops(kubernetes), frontend engineering, api integration

Representative User Stories

  • As an author or viewer, I want to see the datasets hosted on Kosh
  • As an author or viewer, I want to be able to search for text Media by the snippets of text I remember it containing
  • As an author or viewer, I want to be able to upload an image to Kosh and see if it's present in Kosh
  • As an author or viewer, I want to upload a video and see duplicate or similar videos on Kosh
  • As an author or viewer, I want to read what metadata a dataset has and query on it, e.g. author.name="akaash"
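
The metadata story can be illustrated with dotted-path lookups over nested records. This is a minimal sketch, assuming each post's metadata is a nested dictionary; the function names are hypothetical and the real query interface is not defined on this page.

```python
def get_path(record: dict, path: str):
    """Follow a dotted path such as "author.name" through nested dicts; None if missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def list_fields(record: dict, prefix: str = "") -> list:
    """List every dotted path in a record that leads to a non-dict value."""
    paths = []
    for key, value in record.items():
        full = f"{prefix}{key}"
        if isinstance(value, dict):
            paths.extend(list_fields(value, full + "."))
        else:
            paths.append(full)
    return paths


def query(records: list, path: str, expected) -> list:
    """Return the records whose value at the dotted path equals the expected value."""
    return [r for r in records if get_path(r, path) == expected]
```

With this shape, `list_fields` answers "what metadata does this dataset have" and `query(posts, "author.name", "akaash")` answers the example in the story.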

Relevant Links

  • Evaluating Best practices for deploying kosh on kubernetes (discussion)

Onboarding

Semantics

The various entities that you will deal with while working on Kosh have been named and defined here. Familiarising yourself with them will ensure that we all can discuss Kosh requirements and be on the same page.

Prerequisites

There are certain opinionated libraries and frameworks that we use heavily across our software stack that I think you’d benefit from reading up on. Some familiarity with these will help you ramp up on the code and also write your own code.

Web App     Rest API     Search Server     DevOps
Gatsby      Express      Flask             Kubernetes
Grommet     Sequelize    Transformers      Github Actions
                         ElasticSearch
                         RabbitMQ

Please reach out to [email protected] or post to #tattle_tech on our Slack