diff --git a/README.md b/README.md
index 84a8395..bd8498e 100644
--- a/README.md
+++ b/README.md
@@ -795,46 +795,36 @@ In case you want to run many of these queries, and you have a lot of disk space,
 > [!IMPORTANT]
 > If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
 
-To download the crawl index, there are two options: if you have access to the CCF AWS buckets, run:
+To download the crawl index, use [cc-downloader](https://github.com/commoncrawl/cc-downloader), a polite downloader for Common Crawl data:
 
 ```shell
-mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
-aws s3 sync s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ 'crawl=CC-MAIN-2024-22/subset=warc'
+cargo install cc-downloader
 ```
 
-If, by any other chance, you don't have access through the AWS CLI:
+`cc-downloader` is installed to `~/.cargo/bin`, which may not be on your `PATH`; either add that directory to your `PATH` or invoke the binary by its full path, as below.
+If cargo is not available or the installation fails, check [the cc-downloader official repository](https://github.com/commoncrawl/cc-downloader).
 
 ```shell
-mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
-cd 'crawl=CC-MAIN-2024-22/subset=warc'
-
-wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/cc-index-table.paths.gz
-gunzip cc-index-table.paths.gz
-
-grep 'subset=warc' cc-index-table.paths | \
-  awk '{print "https://data.commoncrawl.org/" $1, $1}' | \
-  xargs -n 2 -P 10 sh -c '
-    echo "Downloading: $2"
-    mkdir -p "$(dirname "$2")" &&
-    wget -O "$2" "$1"
-  ' _
-
-rm cc-index-table.paths
-cd -
+mkdir crawl
+~/.cargo/bin/cc-downloader download-paths CC-MAIN-2024-22 cc-index-table crawl
+~/.cargo/bin/cc-downloader download crawl/cc-index-table.paths.gz --progress crawl
 ```
 
-In both ways, the file structure should be something like this:
+The resulting file structure should look like this:
 
 ```shell
-tree my_data
-my_data
-└── crawl=CC-MAIN-2024-22
-    └── subset=warc
-        ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
-        ├── part-00001-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
-        ├── part-00002-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
-```
-
-Then, you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.
+tree crawl/
+crawl/
+├── cc-index
+│   └── table
+│       └── cc-main
+│           └── warc
+│               └── crawl=CC-MAIN-2024-22
+│                   └── subset=warc
+│                       ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
+│                       ├── part-00001-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
+```
+
+Then, you can run `make duck_local_files LOCAL_DIR=crawl` to run the same query as above, but this time using your local copy of the index files.
 
 Both `make duck_ccf_local_files` and `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` run the same SQL query and should return the same record (written as a parquet file).