
Kosh Project Plan

tarunima edited this page Nov 21, 2021 · 15 revisions

Kosh Project Update November 20, 2021

What is Kosh?

The word Kosh means repository or fund in Hindi. Tattle Kosh is a repository of content that circulates on social media in India. A more accurate word for our conception of it is an archive. Why is that useful, you may ask?

Think of the last time you received a message from that relative on a WhatsApp group who has a message for every festival, every news event, every good morning. Once a month, you may decide you have the time to respond to that relative's message. But how would you do that? Messages on WhatsApp don't have an ID and their history isn't easy to track. It would be good to know how old that message is, whether it was photoshopped or whether it has been fact checked. It would help to know whether that warning about scary strangers in your neighborhood was also circulating in a neighboring country just last week.

We can wax eloquent about the importance of archives in slowing and interrupting the fast flows of social media. But we'll leave that for another wiki. Suffice it to say that archives allow for personal annotations and collective sense-making that can help all of us grapple with the information madness we see around us.

Archiving content from Indian social media and chat apps is a daunting task, but we have to get started somewhere. Tattle has been at it for about 18 months and this is what we have so far:

  • Scrapers for eight+ IFCN-certified fact-checking sites in India. These scrapers pull the text, images and videos of stories published by these sites.
  • A scraper to retrieve data from WhatsApp exports.
  • A Telegram bot that people can send content to if they wish to push it to the archive.
  • A list of databases from Indian social media.
  • A UI with some access control.
  • A search API for cross-language and image search.
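As an illustration of what the WhatsApp export scraper in the list above has to deal with, here is a minimal sketch of parsing an exported chat. It is not Tattle's actual scraper. The line layout (`date, time - sender: text`), the `Message` class and the `parse_export` function are all assumptions made for this example; real exports vary by locale and app version.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed layout of one exported line: "21/11/2021, 09:15 - Sender: text".
# Real exports differ by locale and app version, so this pattern is illustrative.
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}) - ([^:]+): (.*)$")


@dataclass
class Message:
    date: str
    time: str
    sender: str
    text: str


def parse_export(raw: str) -> List[Message]:
    """Split an exported chat into messages.

    Lines that do not start with a timestamp prefix are treated as
    continuations and folded into the previous message.
    """
    messages: List[Message] = []
    for line in raw.splitlines():
        match = LINE_RE.match(line)
        if match:
            messages.append(Message(*match.groups()))
        elif messages:
            messages[-1].text += "\n" + line
    return messages
```

Note that this sketch folds any line it cannot match, including system notices without a sender, into the preceding message. A production scraper would need to handle those separately.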

Where Do We Want to Get?

  • We want to index all the data that Tattle has collected so far, and continues to collect, into a searchable archive.
  • We also want to build on the datasets that we already have.

Here's our full list of tasks for Kosh release and maintenance:

  1. Write and maintain scrapers for various data sources.
  2. Maintain and improve the Core Search Engine:
    1. The Core Search Engine is written in Python. We need to test the search features and deploy the search engine.
    2. Elastic Search
    3. RabbitMQ
  3. Periodically index newly scraped data:
    1. Index data from 3 datasources into the search engine
  4. Development and maintenance of Kosh UI:
    1. Add search related widgets to Kosh UI

For the archive to be immediately usable, we need to tackle 2.1, 3 and 4. Here are the scoped task descriptions for the three in more detail.

2.1:

3.1: We need to index the existing fact checking sites database, and three open access datasets from Indian social media, into the search API.

4:
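To make task 3.1 a little more concrete, here is a minimal sketch of turning dataset records into the newline-delimited body that Elasticsearch's `_bulk` endpoint accepts (one action line followed by one source line per document). The function `to_bulk_ndjson`, the index name `kosh-factcheck` and the record fields are hypothetical, chosen only for illustration. Kosh's real indexing goes through its own search API, which this sketch does not model.

```python
import json
from typing import Dict, Iterable


def to_bulk_ndjson(docs: Iterable[Dict], index: str, id_field: str) -> str:
    """Serialize documents into an Elasticsearch bulk request body.

    Each document becomes two lines: an "index" action naming the target
    index and document id, then the document source itself.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_id": str(doc[id_field])}}
        lines.append(json.dumps(action))
        # ensure_ascii=False keeps Hindi and other non-Latin text readable.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical records; field names are assumptions for this example.
records = [
    {"id": 1, "headline": "वायरल दावा", "lang": "hi"},
    {"id": 2, "headline": "Old photo shared as recent", "lang": "en"},
]
body = to_bulk_ndjson(records, index="kosh-factcheck", id_field="id")
```

Using a stable id from the source dataset means re-running the indexer overwrites existing documents instead of duplicating them, which matters for a job that runs periodically.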

What do we need?

At the moment the entire integration is being handled by one full stack developer who also has other projects at hand. The final release has a UI component and an ML engineering component. Some of the integration challenges were unexpected (for us), so it has been a continual learning process that demands frequent, time-intensive context switching. Given the right skillset and adequate time, the archive can be released in two weeks. We're looking for a data engineer to help out with:

//list tasks

Why are we doing this?

Personally, so that we can more easily respond to the WhatsApp Uncles and Aunties in our lives. But more broadly, too much of our understanding of how truths, falsehoods and everything in between circulate on social media is based on Twitter. Twitter has fewer than 50 million monthly active users (MAU) in India. At least five other platforms (some Made in India) have over 100 million MAU. It doesn't help to have data in CSV files that only a handful of researchers see. Making it searchable and more easily accessible can enable more reporting and writing on social media patterns in India. We have succeeded to an extent in this goal with the fact checking sites data. See these news stories in the New Statesman and the BBC. But we want to scale that impact.

How Will the Code be Licensed?

All of Tattle's code is licensed under GPL-3.

Contributing

This is the more elaborate contributing guide. Releasing Kosh, however, is a priority. So if this project excites you and you have the bandwidth to contribute, please also email [email protected]

About Tattle:

Tattle is a civic tech project that builds tools and datasets for understanding and responding to misinformation in India. You can read more about the project on the website: https://tattle.co.in/
