The available data consist of news articles and entities collected within the scope of this project.
To use these datasets you need a running MongoDB instance into which you can import the data. After downloading, unzip the files and then run the command mongorestore -d desarquivo <PATH_TO_FOLDER_WITH_FILES>
- Dataset 01 - Set of news articles (after a non-trivial de-duplication process) and the entities extracted from them. This dataset was later filtered to generate the second dataset, with a more limited set of entities, in order to ensure their relevance.
- Dataset 02 - Final set of news and entities directly presented in the current version of Desarquivo.
To use these datasets you need an empty running Neo4j instance. Place the provided files in the /import
folder of that instance and then run the appropriate command.
Graph of connections between entities
Version 1: Dataset 03 a (faster)
neo4j-admin import --id-type=STRING --nodes=import/i_entities.csv --relationships=rel=import/i_connections.csv
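neo4j-admin import expects the CSV headers to follow its own conventions (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`). The provided i_entities.csv and i_connections.csv should already be in this shape; as an illustrative sketch only (the actual column names in the files may differ):

```
i_entities.csv:    _id:ID,text,:LABEL
i_connections.csv: :START_ID,:END_ID,weight:int
```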
or
Version 2: Dataset 03 b (slower)
// Create one node per entity, labelled with its type
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (e:PER {_id: row._id, text: row.text});
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///orgs.csv' AS row
MERGE (e:ORG {_id: row._id, text: row.text});
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///locations.csv' AS row
MERGE (e:LOC {_id: row._id, text: row.text});
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///misc.csv' AS row
MERGE (e:MISC {_id: row._id, text: row.text});
// Create the news nodes
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///news.csv' AS row
MERGE (n:NEWS {_id: row._id, title: row.title});
// Create a weighted connection between each pair of entities
// (nodes are matched by _id only, regardless of label)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///connections_1.csv' AS row
MERGE (e1 {_id: row._id1})
MERGE (e2 {_id: row._id2})
WITH row, e1, e2
MERGE (e1)-[:rel {weight: toInteger(row.weight)}]-(e2);
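After the statements above finish, the graph can be inspected directly in neo4j-browser. As a sketch, assuming only the nodes and the `rel` relationship created above, this query lists the most strongly connected entity pairs:

```cypher
// Top entity pairs by connection weight
MATCH (e1)-[r:rel]-(e2)
WHERE id(e1) < id(e2) // avoid listing each pair twice
RETURN e1.text AS entity1, e2.text AS entity2, r.weight AS weight
ORDER BY r.weight DESC
LIMIT 25;
```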
Example of visualization of dataset 3 in neo4j-browser:
Graph of connections between entities and news articles (in this case no neo4j-admin import command was prepared, but that approach is recommended over LOAD CSV
for large datasets). The data are the same as in dataset 03 b but, on import, they are reorganized differently, generating one graph node per news article, which leaves you with a much larger but potentially richer graph.
// Create the entity nodes and, for each entity, one NEWS node per
// article id listed in its comma-separated news column
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (e:PER {_id: row._id, text: row.text})
WITH row, e
UNWIND split(row.news, ',') AS news_piece
MERGE (n:NEWS {_id: news_piece})
MERGE (e)-[:liga]-(n);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///orgs.csv' AS row
MERGE (e:ORG {_id: row._id, text: row.text})
WITH row, e
UNWIND split(row.news, ',') AS news_piece
MERGE (n:NEWS {_id: news_piece})
MERGE (e)-[:liga]-(n);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///locations.csv' AS row
MERGE (e:LOC {_id: row._id, text: row.text})
WITH row, e
UNWIND split(row.news, ',') AS news_piece
MERGE (n:NEWS {_id: news_piece})
MERGE (e)-[:liga]-(n);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///misc.csv' AS row
MERGE (e:MISC {_id: row._id, text: row.text})
WITH row, e
UNWIND split(row.news, ',') AS news_piece
MERGE (n:NEWS {_id: news_piece})
MERGE (e)-[:liga]-(n);
// NEWS nodes already exist (created above with only an _id), so merge
// on _id alone and set the title separately; merging on both properties
// would create duplicate NEWS nodes
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///news.csv' AS row
MERGE (n:NEWS {_id: row._id})
SET n.title = row.title;
// Weighted entity-entity connections, keeping the list of shared news ids
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///connections_1.csv' AS row
MERGE (e1 {_id: row._id1})
MERGE (e2 {_id: row._id2})
WITH row, e1, e2
MERGE (e1)-[:rel {weight: toInteger(row.weight), news: split(row.news, ',')}]-(e2);
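The entity-news graph can then be queried in either direction. As a sketch, using only the labels and the `liga` relationship created above, this query ranks entities by how many news articles mention them:

```cypher
// Entities mentioned in the largest number of news articles
MATCH (e)-[:liga]-(n:NEWS)
RETURN labels(e)[0] AS type, e.text AS entity, count(DISTINCT n) AS mentions
ORDER BY mentions DESC
LIMIT 25;
```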