Kosh Project Plan

Kosh Project Update

November 20, 2021

What is Kosh?

The word Kosh means repository or fund in Hindi. Tattle Kosh is a repository of content that circulates on social media in India. A more accurate word for our conception of it is an archive. Why is that useful, you may ask?

Think of the last time you received a message from 'that relative' on a WhatsApp group who has a message for every festival, every news event, every good morning. Once a month or so, you may decide you have the time to respond to that relative's message about worms in face masks. But how would you do that? Messages on WhatsApp don't have an ID and their history isn't easy to track. It would be good to know how old that message is, whether it was photoshopped or whether it has been fact checked. It also helps to know if that warning about scary strangers in your neighborhood was circulating in a neighboring country just last week.

We can wax eloquent about the importance of archives in slowing and interrupting the fast flows of social media. But we'll leave that for another wiki. Suffice it to say that archives allow for personal annotations and collective sense-making that can help all of us grapple with the information disorder we see around us.

Archiving content from Indian social media and chat apps is a daunting task, but we have to get started somewhere. Tattle has been at it for about 18 months and this is what we have so far:

Scrapers for eight+ IFCN certified fact checking sites in India. These scrapers pull text, images and videos of stories written by these sites.
Scraper to retrieve data from WhatsApp exports. You can see a report based on WhatsApp data collection here.
A Telegram bot that people can send content to if they wish to push it to the archive.
A list of databases from Indian social media.
An archive UI with some access control.
A search API for cross-language and image search.

Where Do We Want to Get?

We want to index all the data that Tattle has collected or tracked so far, and will continue to collect, into a searchable archive.
We also want to build on the datasets that we already have.

Here's our full list of tasks for Kosh release and maintenance:

1. Write and maintain scrapers for various data sources.
2. Maintain and improve the Core Search Engine:
   2.1 The Core Search Engine is written in Python. We need to test search features and deploy the search engine.
   2.2 Elasticsearch
   2.3 RabbitMQ
3. Periodically index newly scraped data:
   3.1 Index data from 3 data sources into the search engine.
4. Development and maintenance of Kosh UI:
   4.1 Add search-related widgets to Kosh UI.

For the archive to be immediately usable, we need to tackle 2.1, 3 and 4. Here is the scoped task description for each of the three in more detail.

2.1:
3.1 : We need to index the existing fact checking sites database, and the three open access datasets from Indian social media into the search API.
4 :
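
As a rough illustration of the indexing task in 3.1, here is a minimal sketch of how scraped fact-check stories could be turned into bulk-indexing actions for Elasticsearch. It is not Tattle's actual indexing code: the function name `to_bulk_actions`, the index name, and the record fields (`url`, `headline`, `source`, `media_urls`) are all assumptions made for the example.

```python
import hashlib


def to_bulk_actions(records, index_name):
    """Yield one Elasticsearch bulk action per scraped record.

    The record fields used here (url, headline, source, media_urls)
    are hypothetical; the real scraper output may look different.
    """
    for record in records:
        # Derive a stable ID from the story URL so that re-indexing the
        # same story overwrites the earlier document instead of adding
        # a duplicate.
        doc_id = hashlib.sha1(record["url"].encode("utf-8")).hexdigest()
        yield {
            "_index": index_name,
            "_id": doc_id,
            "_source": {
                "url": record["url"],
                "headline": record.get("headline", ""),
                "source": record.get("source", ""),
                "media_urls": record.get("media_urls", []),
            },
        }
```

Actions in this shape can be passed to `elasticsearch.helpers.bulk(client, actions)` from the official Python client. Using a URL-derived ID is one possible design choice that keeps periodic re-indexing idempotent.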

What do we need?

At the moment the entire integration is being handled by one full stack developer who also has other projects at hand. The final release has a UI component and an ML engineering component. Some of the integration challenges are unexpected (for us), so it has been a continual learning process demanding frequent context switching, which is time intensive. Given the right skill set and adequate time, the archive can be released in two weeks. We're looking for the following:

A developer to help us take working search engine software to production. This will involve deploying Elasticsearch, RabbitMQ and the search engine onto our Kubernetes cluster. A developer with Python, ML and DevOps experience is preferred.
A Full Stack Developer to build an admin dashboard and features for the archive. We think that the long-term sustainability of the project will depend on us being able to train non-developers to monitor and maintain the archive. Being able to debug and fix data collection or indexing issues via a UI, as opposed to a CLI or scripting, is key. Fluency with the JS stack (Node.js, React) is preferred.

Why are we doing this?

Personally, so that we can more easily respond to the WhatsApp Uncles and Aunties in our lives. But more broadly, too much of our understanding of how truths, falsehoods and everything in between circulate on social media is based on Twitter. Twitter has fewer than 50M MAU (monthly active users) in India. At least five other platforms (some made in India) have over 100M MAU. It also doesn't help to have data in CSV files that only a handful of researchers see. Making it searchable and more easily accessible can enable more reporting and writing on social media patterns in India. We have succeeded to an extent in this goal with the fact checking sites data. See these news stories in the New Statesman and the BBC based on just the fact checking sites data. Data that is easily viewable and navigable helps.

How Will the Code be Licensed?

All of Tattle's code is licensed under GPL-3.

Contributing

This is the more elaborate contributing guide. Releasing Kosh, however, is a priority. So if this project excites you and you have the bandwidth to contribute, please also email [email protected].

About Tattle

Tattle is a civic tech project that builds tools and datasets for understanding and responding to misinformation in India. You can read more about the project on the website here: https://tattle.co.in/
