Add more popular datasets to graphscope built-in datasets · alibaba/GraphScope#1015

Repository metrics

Stars: 2,401
PR merge metrics: average merge time 1 minute; 8 PRs merged in the last 30 days

Description

We have several built-in datasets, each of which can be loaded in one line. The data is located in the dataset directory of the Aliyun OSS bucket graphscope, and the corresponding utility functions for loading it are located in python/graphscope/dataset/. We plan to keep adding datasets.

The procedure for adding a new dataset is:

Find a popular and appropriate dataset, and adapt its format to a property graph if necessary.
Put all data files inside a folder and give the folder a meaningful name.
Compress the folder, then upload the compressed file together with the original folder to the dataset folder of the OSS bucket. For example, given a folder named foo/ containing two files, foo/nodes.csv and foo/edge.csv, you will have the following file structure in the bucket after this step:

dataset
|-- foo.tar.gz
|-- foo
    |-- nodes.csv
    |-- edge.csv

Write the loading function load_foo in a new file named python/graphscope/dataset/foo.py.
A corresponding unit test is appreciated!
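
The packaging step above can be sketched with Python's standard tarfile module. This is a minimal sketch rather than the project's own tooling: the helper names package_dataset and archived_names are hypothetical, and the upload to the OSS bucket is not shown because the issue does not say which tool is used for it.

```python
import tarfile
from pathlib import Path


def package_dataset(folder: str) -> Path:
    """Create <folder>.tar.gz next to the folder and return the archive path."""
    src = Path(folder).resolve()
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a directory")
    archive = src.parent / (src.name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps the top-level folder name (e.g. "foo") inside the archive,
        # so extracting it recreates foo/nodes.csv and foo/edge.csv.
        tar.add(src, arcname=src.name)
    return archive


def archived_names(archive: Path) -> list:
    """List member names, useful for checking the layout before uploading."""
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling package_dataset("foo") leaves foo/ untouched and writes foo.tar.gz beside it, which matches the layout shown in the bucket listing: both the archive and the original folder are then ready to upload.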

Contributor guide

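Approach: This issue provides a clear step-by-step process for adding a new dataset. First, find a popular graph dataset and, if necessary, adapt it to the property graph format. Then create a folder named after the dataset that contains the CSV or other files, compress it to .tar.gz, and upload both the compressed file and the original folder to the dataset folder of the Aliyun OSS bucket. Next, create a new file under python/graphscope/dataset/ and write a Python loading function that follows the pattern of the existing loaders. Finally, add a unit test for the new loader. The existing dataset loaders in the repository, such as those in the same directory, can serve as a reference.
Tech stack: Python
Domain: backend, data
Issue type: feature
Difficulty: 3
Estimated time: 1-2 days
Activity: active
Clarity: clear
Prerequisites: Python, Git
Beginner friendliness: 80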
