de-duplicate entities in apoc.export.json.data/query #3930

Open
jexp opened this issue Jan 25, 2024 · 2 comments
Labels
core-functionality Adding new procedure, function or signature to APOC core

Comments

@jexp
Member

jexp commented Jan 25, 2024

I'm not sure whether we're de-duplicating entities in apoc.export.json.data/query.

For example, with a query like:

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
RETURN p,r,m

where people and movies can appear multiple times.

or

MATCH (p1:Person)-[r:KNOWS]-(p2:Person)
RETURN p1,r,p2

where even relationships can be duplicated.

Are we keeping track of already-exported entities, e.g. in a set of IDs? Please check.

@jexp jexp moved this to Todo in APOC Extended Larus Feb 22, 2024
@vga91 vga91 moved this from Todo to In Progress in APOC Extended Larus Mar 12, 2024
@vga91 vga91 moved this from In Progress to Todo in APOC Extended Larus Mar 21, 2024
@vga91
Collaborator

vga91 commented Apr 30, 2024

@jexp

Yes, entities are duplicated during export.
In fact, executing:

CREATE (p:Person {id: 1})-[r:ACTED_IN]->(m:Movie {foo: 1}) WITH p
CREATE (p)-[:ACTED_IN]->(:Movie {foo: 2})

and then:

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
with collect(p) + collect(m) as nodes, collect(r) as rels
call apoc.export.json.data(nodes, rels, "testData.json", {})
yield file return file

the resulting file has a duplicate Person node:

{"type":"node","id":"3","labels":["Person"],"properties":{"id":1}}
{"type":"node","id":"3","labels":["Person"],"properties":{"id":1}}
{"type":"node","id":"4","labels":["Movie"],"properties":{"foo":1}}
{"type":"node","id":"5","labels":["Movie"],"properties":{"foo":2}}
{"type":"relationship","id":"2","label":"ACTED_IN","start":{"id":"3","labels":["Person"],"properties":{"id":1}},"end":{"id":"4","labels":["Movie"],"properties":{"foo":1}}}
{"type":"relationship","id":"3","label":"ACTED_IN","start":{"id":"3","labels":["Person"],"properties":{"id":1}},"end":{"id":"5","labels":["Movie"],"properties":{"foo":2}}}

The issue also occurs with the other export procedures, such as the CSV and Cypher ones.

It also happens with the apoc.export.<type>.graph procedures:

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
with collect(p) + collect(m) as nodes, collect(r) as rels
call apoc.export.json.graph({nodes: nodes, relationships: rels}, "testGraph.json", {})
yield file return file

With apoc.export.json.query, as in the following call, the result is also duplicated, but I think that is correct in this case,
since each Cypher result row corresponds to one entry in the json/csv/... file:

call apoc.export.json.query("MATCH path=(p:Person)-[r:ACTED_IN]->(m:Movie) RETURN path", "testQuery.json", {})
yield file return file

So we should indeed keep track of the IDs during the export.
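
Until the export itself tracks IDs, a caller-side workaround is to de-duplicate the collections before passing them to the procedure. This is a minimal sketch, assuming the same Person/Movie data created above; the file name testDataDistinct.json is only an illustrative choice, and the resulting file has not been verified here:

```cypher
// Collect each node and relationship only once before exporting.
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
WITH collect(DISTINCT p) + collect(DISTINCT m) AS nodes,
     collect(DISTINCT r) AS rels
CALL apoc.export.json.data(nodes, rels, "testDataDistinct.json", {})
YIELD file
RETURN file
```

Note that concatenating two DISTINCT collections can still contain the same node twice if it matches both patterns (for example, a node carrying both labels). In that case, wrapping the combined list in apoc.coll.toSet(nodes) should remove the remaining duplicates. The same pre-processing applies to the apoc.export.<type>.graph call, since it receives the same lists.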

Since the procedures are all in APOC Core, I think you need to create a Trello card, or am I wrong?

@vga91
Collaborator

vga91 commented May 14, 2024

Created a Trello card with ID VchWnQfd.

@vga91 vga91 moved this from Todo to Core issues (with trello core card) in APOC Extended Larus May 14, 2024
@vga91 vga91 added the core-functionality Adding new procedure, function or signature to APOC core label Nov 18, 2024
Projects
Status: Core issues
Development

No branches or pull requests

3 participants