Expand component #1

Open
joel-bernstein opened this issue Jan 10, 2014 · 15 comments
@joel-bernstein
Contributor

This issue introduces a new search component called the Expand component. The Expand component implements group expansion for a single page of results collapsed by the CollapsingQParserPlugin.

I'll be working this ticket initially in my fork of the Heliosearch project in a branch called "expand".

https://github.com/joelbernstein2013/heliosearch

@ghost ghost assigned joel-bernstein Jan 10, 2014
@krantiparisa
Member

+1 for this component.

How costly is it in terms of memory along with grouping? Is it scalable with 100 groups, where each group has 3 sub groups and each sub group has 100 docs?

And can this run on top of an index with 10M docs?

@joel-bernstein
Contributor Author

The expand component works with a single page of collapsed results. So if your page has 100 groups, with 3 sub groups, with 100 docs each, the component will have to work with 30,000 documents.

Not an overwhelming number but not a small number.

The 10 million document set will be collapsed by the CollapsingQParserPlugin. How many distinct top level groups are in the index? It sounds like there might be around 33,333 distinct top level groups if each top level group has 300 docs in it. The CollapsingQParserPlugin will eat that for lunch, very little memory used.
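The two figures in this reply are plain arithmetic, sketched below in Python. The function names are illustrative only and are not part of Solr; the 300 docs per top level group is the assumption stated in the reply above.

```python
def docs_per_page(groups, subgroups_per_group, docs_per_subgroup):
    """Documents the Expand component has to handle for one page."""
    return groups * subgroups_per_group * docs_per_subgroup


def estimated_top_level_groups(total_docs, docs_per_group):
    """Rough count of distinct top level groups, assuming equal-sized groups."""
    return total_docs // docs_per_group


# 100 groups per page, 3 sub groups each, 100 docs per sub group
page_docs = docs_per_page(100, 3, 100)

# 10 million docs, 3 * 100 = 300 docs per top level group (assumed)
groups = estimated_top_level_groups(10_000_000, 3 * 100)

print(page_docs, groups)
```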

@joel-bernstein
Contributor Author

Kranti,

I'll be putting the initial implementation up later today or over the weekend. It doesn't cover sub-grouping yet. So if you want to work on that, that would be excellent. We can collaborate on how to add this to the code.

Joel

@krantiparisa
Member

How many distinct top level groups are in the index?

  • there could be 300,000 unique top level groups (entity ids) overall in the index (size: 5GB)
  • but considering the filters, queries - the available unique top level groups for a given request could be max. 20,000
  • out of the 20,000 top level groups, the page max could be 100.
  • each group could have 3 sub groups and each sub group might have 100 max docs

Can you help me roughly estimate the memory size and response time?
Does this have any possible cache hits to get faster responses?

@krantiparisa
Member

Sure, I can work with you on this. You might need to answer my stupid questions at times :)

@joel-bernstein
Contributor Author

The CollapsingQParserPlugin creates arrays based on the total number of unique values in the field. A rough estimate for 300,000 unique terms in the field would be 3-5 MB of transient memory per query.

The expanding of groups I haven't measured yet. With such a large page, part of the issue will be retrieving the stored values for all those documents. This can be very expensive.
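As a back-of-the-envelope reading of the collapse estimate above, the sketch below works out how many bytes per unique term the quoted 3-5 MB range implies. It assumes decimal megabytes and says nothing about the actual array layout inside the CollapsingQParserPlugin; the names are illustrative.

```python
UNIQUE_TERMS = 300_000
MB = 1_000_000  # assumption: decimal megabytes


def bytes_per_term(total_mb, unique_terms=UNIQUE_TERMS):
    """Transient bytes per unique term implied by a total memory figure."""
    return total_mb * MB / unique_terms


low = bytes_per_term(3)
high = bytes_per_term(5)

print(f"{low:.1f} to {high:.1f} bytes per unique term")
```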

@krantiparisa
Member

If we just need doc ids at the docList level, the result would look like this:

group1=>1234567 (the value of the group field)
  subgroup1=>catalog1 (the value of the sub group field)
    docList=> list of doc ids
  subgroup2=>catalog2 (the value of the sub group field)
    docList=> list of doc ids
group2=>6764237 (the value of the group field)
  subgroup1=>catalog1 (the value of the sub group field)
    docList=> list of doc ids
  subgroup2=>catalog2 (the value of the sub group field)
    docList=> list of doc ids

If we get TopGroups like the above, then the metadata can be based on what fields the user wants. I am trying to find out the memory and response times for the above structure from the API call.
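The structure described above can be pictured as a two-level mapping from group value to sub group value to a list of doc ids. This is only a sketch: the doc ids are made-up placeholders, and `TopGroups` here is a plain Python type alias, not the Lucene class of the same name.

```python
from typing import Dict, List

# group value -> sub group value -> doc ids (hypothetical shape)
TopGroups = Dict[str, Dict[str, List[int]]]

top_groups: TopGroups = {
    "1234567": {"catalog1": [101, 102], "catalog2": [103]},  # placeholder doc ids
    "6764237": {"catalog1": [201], "catalog2": [202, 203]},  # placeholder doc ids
}


def count_doc_ids(groups: TopGroups) -> int:
    """Total number of doc ids held across all groups and sub groups."""
    return sum(len(ids) for subs in groups.values() for ids in subs.values())


total = count_doc_ids(top_groups)
print(total)
```

Holding only doc ids keeps the structure small; stored fields would be fetched afterwards for just the fields the user asks for.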

@krantiparisa
Member

Joel,

Is it possible to share the ExpandComponent on Saturday (11 Jan)? I can spend a good amount of time on Sunday and try to get the sub groups working. I also want to run a few performance tests using traditional grouping and the new collapsing+expanding implementation on the use cases I described above.

@joel-bernstein
Contributor Author

Just committed initial implementation of the ExpandComponent at my heliosearch clone in the expand branch:

https://github.com/joelbernstein2013/heliosearch/tree/expand

Initial patch compiles but has not been tested yet.

@VadimKirilchuk

I think it's worth pointing to the commit itself:
joel-bernstein@c6db5bc


@krantiparisa
Member

Joel,

I deployed your branch code and started Solr with a pre-populated index having 5M+ documents.

Sample Query:

http://localhost:8983/solr/collection1/select?q=relatedAllIds:8118784557012618112 AND showingType:linear&wt=xml&fq={!collapse field=programId min=windowStart}&fl=programId,windowStart&expand=true&expand.field=showingId&expand.limit=5&expand.rows=1&start=0&rows=2&sort=windowStart asc
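The query above contains spaces, braces, and other characters that need URL-encoding when sent from a script rather than pasted into a browser. A minimal sketch using Python's standard library, with the parameter values copied from the sample query:

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/collection1/select"

# Parameters copied from the sample query; order is preserved by the list.
params = [
    ("q", "relatedAllIds:8118784557012618112 AND showingType:linear"),
    ("wt", "xml"),
    ("fq", "{!collapse field=programId min=windowStart}"),
    ("fl", "programId,windowStart"),
    ("expand", "true"),
    ("expand.field", "showingId"),
    ("expand.limit", "5"),
    ("expand.rows", "1"),
    ("start", "0"),
    ("rows", "2"),
    ("sort", "windowStart asc"),
]

# urlencode percent-encodes reserved characters and turns spaces into '+'
url = BASE_URL + "?" + urlencode(params)
print(url)
```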

The idea is to get the distinct program ids (collapsing/grouping) and sort them based on the windowStart field. Here is the response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">28</int>
    <lst name="params">
      <str name="expand.rows">1</str>
      <str name="sort">windowStart asc</str>
      <str name="fl">programId,windowStart</str>
      <str name="expand.limit">5</str>
      <str name="start">0</str>
      <str name="q">
        relatedAllIds:8118784557012618112 AND showingType:linear
      </str>
      <str name="expand">true</str>
      <str name="wt">xml</str>
      <str name="fq">{!collapse field=programId min=windowStart}</str>
      <str name="rows">2</str>
      <str name="expand.field">showingId</str>
    </lst>
  </lst>
  <result name="response" numFound="77" start="0">
    <doc>
      <long name="programId">8050846173392254112</long>
      <long name="windowStart">1389375000000</long>
    </doc>
    <doc>
      <long name="programId">8837586713084788112</long>
      <long name="windowStart">1389382200000</long>
    </doc>
  </result>
  <lst name="expanded"/>
</response>

Why is the expanded result empty? My expectation is that, for each programId in the collapsed result, I get the top 5 showings sorted by windowStart. How should I form the query?
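For anyone reproducing this, a small check like the following can confirm from a script whether the expanded section came back empty. It is only a sketch: it parses a trimmed copy of the response above with Python's standard library and does not explain why the section is empty.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response above (responseHeader omitted for brevity)
RESPONSE = """<response>
<result name="response" numFound="77" start="0">
<doc>
<long name="programId">8050846173392254112</long>
<long name="windowStart">1389375000000</long>
</doc>
<doc>
<long name="programId">8837586713084788112</long>
<long name="windowStart">1389382200000</long>
</doc>
</result>
<lst name="expanded"/>
</response>"""

root = ET.fromstring(RESPONSE)

# Collapsed group heads returned on this page
program_ids = [
    el.text for el in root.findall("./result/doc/long[@name='programId']")
]

# The expanded section is present but has no child elements
expanded = root.find("./lst[@name='expanded']")
expanded_is_empty = expanded is not None and len(expanded) == 0

print(program_ids, expanded_is_empty)
```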

yonik pushed a commit that referenced this issue Jan 15, 2014
As a fake commit, this also closes github pull requests #1 #2 #3 #6 #10

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1555587 13f79535-47bb-0310-9956-ffa450edef68
@yonik yonik closed this as completed in f60a042 Jan 15, 2014
@yonik
Member

yonik commented Jan 15, 2014

Reopening - looks like my merge-up of trunk closed this accidentally.

@yonik yonik reopened this Jan 15, 2014
@joel-bernstein
Contributor Author

Added initial test case:

joel-bernstein@2fb7278

@joel-bernstein
Contributor Author

Added a few more tests to cover the basic functionality.

joel-bernstein@a4b688a

My plan now is to add the distributed test cases and test it at scale. After that, I think this will be close to initial release condition.

Kranti has a few more features he'd like to add (group level paging, subgroup support) and we can iterate further on these.

@joel-bernstein
Contributor Author

Added basic distributed test cases: joel-bernstein@a9e0b4e

Also a small formatting update: joel-bernstein@c7b61a9

Also did some performance testing at scale and the Expand component seems to perform at about the same speed as the CollapsingQParserPlugin. So performing a collapse and expand takes about twice as much time as doing only the collapse.
