Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin}
journal={arXiv preprint arXiv:1911.02116},
year={2019}
}
CC-Net is a pipeline for mining and reproducing content from the web. Given the computational cost of running the full pipeline, we distributed the computation on a Slurm cluster using submitit. Submitit defaults to spawning processes on your machine if no Slurm cluster is found. You should set --task_parallelism to a value suited to your machine; the defaults are 512 for mining and 20 for reproducing. To run the tasks in-process, use --execution debug.
Output Format
Generated files are compressed JSON files. There is one JSON object per line.
List of Fields:
url: Webpage URL (part of CC)
date_download: Date of download (part of CC)
digest: SHA1 digest of the webpage (part of CC)
length: Number of chars
nlines: Number of lines
source_domain: Web domain of the webpage
title: Page title (part of CC)
raw_content: Webpage content after deduplication
original_nlines: Number of lines before deduplication
original_length: Number of chars before deduplication
language: Language detected by FastText LID
language_score: Language score
perplexity: Perplexity of a LM trained on Wikipedia
Sample JSON Object:
{
"url": "http://www.pikespeakhospice.org/members/1420",
"date_download": "2019-02-15T18:40:25Z",
"digest": "sha1:VQW3KXUOALO543IJGTK2JLVEAN2XXKHI",
"length": 752,
"nlines": 5,
"source_domain": "www.pikespeakhospice.org",
"title": "LeeRoy Aragon",
"raw_content": "Date Honored: March 2017\nHe was a man of integrity, a hard worker, and a dedicated family man. He loved spending time with family camping, fishing, hunting, boating and just hanging out.\nHis Catholic faith was extremely important to him as he gave of his time and talents to the community. He had many friends through church and the Knights of Columbus. He was a meticulous handyman, and enjoyed building and fixing things and restoring antique furniture to perfection. He was a fan and supported his Colorado Rockies and Denver Broncos. Throughout the years he had devoted four-legged friends (his dogs and a horse named Sunny Boy).\nWe have many cherished memories of him that we will treasure until we are with him again.\n~ Family of LeeRoy F. Aragon",
"original_nlines": 7,
"original_length": 754,
"language": "en",
"language_score": 0.99,
"perplexity": 255.11
}
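As a rough illustration of the record layout, the sketch below maps one JSON line onto a typed Python object. The `Document` class and the `parse_document` and `lines_removed_by_dedup` helpers are hypothetical names introduced here, not part of cc_net; the field types are inferred from the sample object above.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Document:
    """One mined record; field names follow the list of fields above.

    Hypothetical helper class, types inferred from the sample object.
    """
    url: str
    date_download: str
    digest: str
    length: int
    nlines: int
    source_domain: str
    title: str
    raw_content: str
    original_nlines: int
    original_length: int
    language: str
    language_score: float
    perplexity: float


def parse_document(line: str) -> Document:
    """Parse one JSON line, keeping only the documented fields."""
    obj = json.loads(line)
    names = [f.name for f in fields(Document)]
    missing = [n for n in names if n not in obj]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return Document(**{n: obj[n] for n in names})


def lines_removed_by_dedup(doc: Document) -> int:
    """Number of lines dropped by deduplication for this page."""
    return doc.original_nlines - doc.nlines
```

For the sample object, `original_nlines` is 7 and `nlines` is 5, so two lines were removed by deduplication.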
You can peek at those files using the UNIX tools zcat and jq, e.g.:
zcat data/mined/2019-09/en_head_0000.json.gz | head -1 | jq .
jq can also do more complicated filtering.
JSONQL.py provides a Python API with multiprocess support to do more complicated operations like LM scoring of the document.
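For simple filtering, the generated files can also be read with the Python standard library alone. The sketch below does not use the JSONQL.py API, which is not shown here; `read_documents` and `keep` are hypothetical helpers, and the thresholds are arbitrary illustrative values rather than settings recommended by the pipeline.

```python
import gzip
import json
from typing import Iterator


def read_documents(path: str) -> Iterator[dict]:
    """Yield one dict per line from a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def keep(doc: dict,
         min_language_score: float = 0.9,
         max_perplexity: float = 300.0) -> bool:
    """Keep documents with a confident language ID and low perplexity.

    Both thresholds are assumptions chosen for illustration only.
    """
    return (doc["language_score"] >= min_language_score
            and doc["perplexity"] <= max_perplexity)


def filter_file(path: str) -> Iterator[dict]:
    """Stream the documents in `path` that pass `keep`."""
    return (doc for doc in read_documents(path) if keep(doc))
```

Usage would look like `for doc in filter_file("data/mined/2019-09/en_head_0000.json.gz"): print(doc["url"])`. Streaming line by line avoids loading a whole file into memory.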
License
By contributing to cc_net, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.