Add more popular datasets to graphscope built-in datasets · alibaba/GraphScope#1015

Repository metrics

Stars: 2,401
PR merge metrics: average merge time 1 minute; 8 PRs merged in the last 30 days

Description

We have several built-in datasets that can be loaded in one line. The data files are stored in the dataset directory of the Aliyun OSS bucket graphscope, and the corresponding utility functions that load them are in python/graphscope/dataset/. We plan to keep enriching the datasets.

Here is the procedure for adding a new dataset:

Find a popular and appropriate dataset, and adapt it to the property graph format if necessary.
Put all data files inside a folder and give the folder a meaningful name.
Compress the folder, then upload the compressed file together with the original folder to the dataset folder of the OSS bucket. For example, given a folder named foo/ containing two files, foo/nodes.csv and foo/edge.csv, the bucket will have the following file structure after this step:

dataset
|-- foo.tar.gz
|-- foo
    |-- nodes.csv
    |-- edge.csv

Write the loading function load_foo in a new file named python/graphscope/dataset/foo.py.
A corresponding unit test is appreciated!
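
The packaging and loading steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual loader: the helper `package_dataset`, the labels `"node"` and `"edge"`, and the `prefix` parameter are assumptions made for the example. It also assumes that `sess.g()`, `add_vertices`, and `add_edges` behave as in GraphScope's Python graph-building API and accept plain file paths. Uploading to the OSS bucket is not shown.

```python
import os
import tarfile


def package_dataset(folder):
    """Compress `folder` into `<folder>.tar.gz` beside it; return the archive path.

    Hypothetical helper for the "compress the folder" step.
    """
    folder = os.path.abspath(folder)  # also drops a trailing slash, e.g. "foo/"
    archive = folder + ".tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps only the top-level folder name (foo/...) in the archive
        tar.add(folder, arcname=os.path.basename(folder))
    return archive


def load_foo(sess, prefix):
    """Sketch of a loader for the example `foo` dataset.

    `sess` is assumed to be a GraphScope session; `prefix` is the local
    directory holding nodes.csv and edge.csv. The labels are placeholders.
    """
    prefix = os.path.expandvars(os.path.expanduser(prefix))
    nodes = os.path.join(prefix, "nodes.csv")
    edges = os.path.join(prefix, "edge.csv")
    for path in (nodes, edges):
        if not os.path.isfile(path):
            raise FileNotFoundError(f"missing dataset file: {path}")

    graph = sess.g()
    # Each call is assumed to return a new graph rather than mutate in place.
    graph = graph.add_vertices(nodes, label="node")
    graph = graph.add_edges(edges, label="edge", src_label="node", dst_label="node")
    return graph
```

Because `load_foo` only depends on the session object passed in, a unit test can exercise it with a small stub session before running it against a real GraphScope session.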

Contributor guide

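Approach: The issue gives a clear step-by-step process for adding a dataset. First, find a popular graph dataset and, if necessary, adapt it to the property graph format. Next, create a folder named after the dataset that contains the CSV or other files, compress it to .tar.gz, and upload both the archive and the original folder to the dataset folder of the Aliyun OSS bucket. Then create a new file under python/graphscope/dataset/ and write a Python loading function that follows the pattern of the existing loaders. Finally, add a unit test for the new loader. The existing dataset loaders in that same directory can serve as references.
Tech stack: Python
Domain: backend, data
Issue type: feature
Difficulty: 3
Estimated time: 1-2 days
Activity: active
Clarity: clear
Prerequisites: Python, Git
Beginner friendliness: 80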
