This repository has been archived by the owner on Jun 20, 2019. It is now read-only.

Pipeline Architecture #10

Closed
Henni opened this issue Mar 7, 2017 · 11 comments

@Henni
Contributor

Henni commented Mar 7, 2017

Idea:
Build our application resembling a pipeline.
This would look as follows:

   get sentences from database
-> run relationship extraction
-> [classify relationships]
-> calculate page rank and reputability
-> store result in database
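
The steps above can be sketched as a chain of functions that each take and return a list of records. This is only an illustration of the pipeline shape, not the project's code: every function name, the record fields, the in-memory `db` list standing in for the database, and the toy three-word "extraction" rule are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Stage = Callable[[List[Record]], List[Record]]


def run_pipeline(records: List[Record], stages: List[Stage]) -> List[Record]:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        records = stage(records)
    return records


def get_sentences() -> List[Record]:
    # Stand-in for "get sentences from database" (hypothetical sample row).
    return [{"url": "https://example.org/haydn", "text": "Haydn taught Beethoven"}]


def extract_relationships(records: List[Record]) -> List[Record]:
    # Toy placeholder: treats a three-word sentence as subject / verb / object.
    out = []
    for r in records:
        words = r["text"].split()
        if len(words) == 3:
            out.append({**r, "subject": words[0], "verb": words[1], "object": words[2]})
    return out


def classify_relationships(records: List[Record]) -> List[Record]:
    # Optional step: here the relation type is simply the lowercased verb.
    return [{**r, "relation_type": r["verb"].lower()} for r in records]


def score_reputability(records: List[Record]) -> List[Record]:
    # Placeholder score; a real step would look at the source page.
    return [{**r, "score": 1.0} for r in records]


def store_results(records: List[Record], sink: List[Record]) -> List[Record]:
    # Stand-in for "store result in database": append to a list.
    sink.extend(records)
    return records


db: List[Record] = []
run_pipeline(
    get_sentences(),
    [
        extract_relationships,
        classify_relationships,
        score_reputability,
        lambda records: store_results(records, db),
    ],
)
```

Because every stage has the same signature, the optional classification step can be dropped from the list without touching the others.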

Notes:

  • Sentences have to be stored in the database; otherwise we need an additional step to extract them from the given URLs.
  • Relationship classification could already be done by the relationship extraction algorithm.
  • Page rank and reputability idea: factor in how much we trust a page. For example, Facebook will probably return worse results than Wikipedia.
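
One possible reading of the trust idea in the last note is a per-domain weight multiplied into whatever confidence the extraction step reports. The sketch below assumes exactly that; the trust values, the default, and the function names are invented for illustration and were not agreed on in this thread.

```python
from urllib.parse import urlparse

# Made-up trust values purely for illustration.
DOMAIN_TRUST = {"wikipedia.org": 0.9, "facebook.com": 0.4}
DEFAULT_TRUST = 0.5  # assumed fallback for unknown domains


def trust_for(url: str) -> float:
    """Return the trust weight of the domain (or a parent domain) of url."""
    host = urlparse(url).hostname or ""
    for domain, trust in DOMAIN_TRUST.items():
        if host == domain or host.endswith("." + domain):
            return trust
    return DEFAULT_TRUST


def weighted_score(extraction_confidence: float, url: str) -> float:
    """Scale an extraction confidence by how much the source is trusted."""
    return extraction_confidence * trust_for(url)
```

With these example values, the same relationship found on a Wikipedia page would end up with a higher score than one found on a Facebook page.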
@Henni
Contributor Author

Henni commented Mar 7, 2017

@MusicConnectionMachine/group-3 if you agree with this approach, I would persist it in the wiki and create separate issues for each step.

@krishenk
Contributor

krishenk commented Mar 7, 2017

@Henni looks good to me. One question: are we going to extract the sentences, or will that be done by @MusicConnectionMachine/group-1? Also, regarding page rank, @vviro mentioned something about that in issue #5. Please have a look.

@Henni
Contributor Author

Henni commented Mar 7, 2017

@krishenk regarding group1 see MusicConnectionMachine/UnstructuredData#40

Regarding page rank: I completely agree with @vviro's comment #5 (comment)
This is also why I added the term reputability (also see MusicConnectionMachine/RelationshipsG4#9 (comment))
We should clear up the terms page rank and reputability at the meeting tomorrow.

@vviro

vviro commented Mar 7, 2017

@Henni is it already clear what the page rank and reputability will be based on? Is the idea here to extract the URLs from the HTML and use them as links? Is the code for doing this (going from a set of HTML documents to their page rank) already available or easy to implement, and is it clear how to run it on this dataset? (Maybe this is the wrong issue for this question and there is a better place...) I just wonder whether the relationship extraction step will require more attention than would be possible if reputability is to be addressed as well. A word of caution here...
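
For reference, the "extract the URLs from the HTML and use them as links" part can be sketched with the Python standard library alone. This is a minimal sketch under the assumption that only `<a href>` targets count as links; `LinkCollector` and `outgoing_links` are names made up for the example, and nothing here says how it would scale to the actual dataset.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute targets of all <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def outgoing_links(base_url: str, html: str) -> List[str]:
    """Return the link targets found in one HTML document, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Running `outgoing_links` over every stored page would give the edge list of a link graph, which is the input a page rank computation needs.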

@Henni
Contributor Author

Henni commented Mar 7, 2017

@vviro Let me come back to this tomorrow. Our team will meet tomorrow morning and this is a topic I will bring up.

@kordianbruck
Contributor

About that page rank: I'm just gonna leave these links here for you to further scout out

Mining PageRank on a larger scale is against Google's ToS

@RBirkeland
Contributor

It seems Google does not provide their PageRank API anymore. Depending on the number of pages, we might have to implement it ourselves.
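
To give a feel for what implementing it ourselves would involve, here is a minimal power-iteration sketch of the textbook PageRank formulation over a small in-memory link graph. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are assumptions of this sketch, not decisions made for the project.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Compute PageRank for a graph given as {page: [pages it links to]}."""
    # Include pages that are only ever link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph, page `c` is linked from both other pages and ends up with the highest rank, while `b`, which nobody links to, ends up with the lowest.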

@Henni
Contributor Author

Henni commented Mar 8, 2017

In my opinion page rank (in whatever way) should be a topic we will handle in the future.
Our next step should be to get the relation extraction going. This should already give some kind of quality indication which might already suffice.

@kordianbruck
Contributor

kordianbruck commented Mar 9, 2017

SEOstats (that ugly PHP script - @sacdallago right?) offers other APIs in addition to the PageRank API. That's why it's in there ;)

@kordianbruck
Contributor

@Henni progress? done? needs work?

@Henni
Contributor Author

Henni commented Mar 28, 2017

Let's count this one as done.
The architecture itself is an ongoing process, but the decisions described in here seem to be fine with everyone.

@Henni Henni closed this as completed Mar 28, 2017

5 participants