Repository metrics

Stars: 10,371
Merged PRs: none in the last 30 days

Description

Bug description

It takes an extremely long time to load multiwoz v22. With the data already downloaded, the train set takes >200 seconds to get to display_data on my development machine.

There are two reasons for this.

We aren't lazy when loading TOD datasets

As far as I can tell, the TOD teachers load the full dataset into memory before enumerating. I believe the cause is here:

https://github.com/facebookresearch/ParlAI/blob/942952d714ebf425a5adb6483d87e562cdab1a85/parlai/core/tod/tod_agents.py#L98

Note that we list all sets of episodes, so then in setup_data, we don't get DialogTeacher's benefits from the lazy generator:

https://github.com/facebookresearch/ParlAI/blob/942952d714ebf425a5adb6483d87e562cdab1a85/parlai/core/tod/tod_agents.py#L693

Fixing this alone would make display_data fast, since the second issue below would no longer need fixing for that case. However, the n_shot logic complicates it.
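To illustrate the difference, here is a minimal sketch, not the actual ParlAI code. generate_episodes, eager_first and lazy_first are hypothetical names standing in for an expensive per-episode loader and its two kinds of consumer. It only shows why calling list() on the generator forces every episode to be built before the first one can be shown.

```python
from typing import Dict, Iterator, List


def generate_episodes(n: int, loaded: List[int]) -> Iterator[Dict[str, int]]:
    """Hypothetical stand-in for an expensive per-episode loader."""
    for i in range(n):
        loaded.append(i)  # record that episode i was actually built
        yield {"episode_id": i}


def eager_first(n: int, loaded: List[int]) -> Dict[str, int]:
    # list() drains the generator, so all n episodes are built up front.
    episodes = list(generate_episodes(n, loaded))
    return episodes[0]


def lazy_first(n: int, loaded: List[int]) -> Dict[str, int]:
    # next() pulls only as many episodes as the consumer asks for.
    return next(generate_episodes(n, loaded))


eager_log: List[int] = []
lazy_log: List[int] = []
eager_first(1000, eager_log)
lazy_first(1000, lazy_log)
print(len(eager_log), len(lazy_log))  # 1000 1
```

Anything that needs the full episode list up front (which may be the case for n_shot) would still have to materialize it, so this only helps on paths that can consume episodes one at a time.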

We are very inefficient in looking up in the multiwoz database

multiwoz v22 has a ton of code to load the database so inform can be computed. After the database is loaded, we need to find entries corresponding to user requests.

We're spending some 92% of our time inside this method: https://github.com/facebookresearch/ParlAI/blob/942952d714ebf425a5adb6483d87e562cdab1a85/parlai/tasks/multiwoz_v22/agents.py#L159-L162

In particular, when we select from the database here: https://github.com/facebookresearch/ParlAI/blob/942952d714ebf425a5adb6483d87e562cdab1a85/parlai/tasks/multiwoz_v22/agents.py#L196-L205

The issue is we're doing a fully linear SELECT operation on line 203: we have to explicitly enumerate every row and see if it matches one of our options. We then repeat this for every slot and value to progressively narrow the selection. 😱

To fix, we would need to build an index of (slot, value)->record_id and select from that (repeatedly reducing the set for the multiple conditions).
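A rough sketch of what such an index could look like, assuming each database record is a flat dict of slot to value. build_index and select are hypothetical helpers, not existing ParlAI functions, and the records are made-up examples rather than real multiwoz data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Record = Dict[str, str]
Index = Dict[Tuple[str, str], Set[int]]


def build_index(records: List[Record]) -> Index:
    """Map each (slot, value) pair to the ids of the records containing it."""
    index: Index = defaultdict(set)
    for record_id, record in enumerate(records):
        for slot, value in record.items():
            index[(slot, value)].add(record_id)
    return index


def select(index: Index, num_records: int,
           constraints: Dict[str, Iterable[str]]) -> Set[int]:
    """Return ids of records matching one allowed value for every slot."""
    candidates = set(range(num_records))
    for slot, values in constraints.items():
        matching: Set[int] = set()
        for value in values:
            # .get avoids inserting empty entries into the defaultdict
            matching |= index.get((slot, value), set())
        candidates &= matching  # reduce the set for each extra condition
        if not candidates:
            break
    return candidates


# Made-up records for illustration only.
records = [
    {"area": "north", "pricerange": "cheap"},
    {"area": "north", "pricerange": "expensive"},
    {"area": "south", "pricerange": "cheap"},
]
index = build_index(records)
result = select(index, len(records),
                {"area": ["north"], "pricerange": ["cheap", "moderate"]})
print(result)  # {0}
```

The index is built once in a single pass over the database. Each query then costs set lookups and intersections instead of a scan over every row per slot and value.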

Alternatively: if we could just move all this into the build.py and cache it, then we would do it all once the first time the dataset is loaded, and have fast loading forever after.
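The caching idea could look roughly like the following. This is a generic sketch using pickle and a hypothetical load_or_build helper, and it does not use ParlAI's actual build.py machinery. The cache path and the expensive_build function are placeholders.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn):
    """Return the cached result if present, else build it once and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    # Write to a temp file first so an interrupted run leaves no partial cache.
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(result, f)
    os.replace(tmp_path, cache_path)
    return result


build_calls = 0


def expensive_build():
    """Placeholder for the slow database processing."""
    global build_calls
    build_calls += 1
    return {("area", "north"): {0, 1}}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "db_index.pkl")
    first = load_or_build(path, expensive_build)
    second = load_or_build(path, expensive_build)  # served from the cache

print(build_calls, first == second)  # 1 True
```

A real version would also need some way to invalidate the cache when the dataset or the processing code changes.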

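Contributor guide

Focus: implement lazy loading in the TOD teachers (do not call list() on the episodes) and build an index for the multiwoz v22 database lookups.
Tech stack: python
Areas: backend, data, performance
Issue type: bug
Difficulty: 3
Estimated time: half a day
Activity: active
Clarity: clear
Prerequisites: Python, generators
Beginner friendliness: 60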
