This project provides examples of how to generate an input file for graph generation. It is a Scala version of hostlinks_to_graph.py from the Common Crawl project.
- Clone the project.
- Create the shadow JAR with the command
  ./gradlew shadowJar
- Move the shadow JAR from
  build/libs/cc-scala-all.jar
  to your Spark gateway.
- Run the code in the following format:
  spark-submit --class com.ntent.commoncrawl.HostLinksToGraph cc-scala-all.jar args[]
where the arguments are:
- args[0] InputParquet
- args[1] EdgesOutput
- args[2] VerticesOutput
- args[3] ValidateHosts
- args[4] SaveAsText
- args[5] NumPartitions
- args[6] VertexIDs
Please refer to the original code (hostlinks_to_graph.py) for details of the parameters.
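As an illustration of how the seven positional arguments line up, here is a minimal dry-run sketch that assembles the spark-submit command and prints it rather than submitting anything. Only the class name and JAR name come from this README; every path and value below is a hypothetical placeholder, and the value formats (for example "true"/"false" for ValidateHosts and SaveAsText, or a path for VertexIDs) are assumptions that should be checked against the source before use.

```shell
#!/bin/sh
# Dry run: build the spark-submit command line and print it; nothing is submitted.
# All seven values are hypothetical placeholders, not project defaults.
INPUT_PARQUET="hdfs:///data/hostlinks_parquet"   # args[0] InputParquet
EDGES_OUTPUT="hdfs:///data/graph/edges"          # args[1] EdgesOutput
VERTICES_OUTPUT="hdfs:///data/graph/vertices"    # args[2] VerticesOutput
VALIDATE_HOSTS="true"                            # args[3] ValidateHosts (format assumed)
SAVE_AS_TEXT="false"                             # args[4] SaveAsText (format assumed)
NUM_PARTITIONS="16"                              # args[5] NumPartitions
VERTEX_IDS="hdfs:///data/graph/vertex_ids"       # args[6] VertexIDs (format assumed)

# Load the values as positional parameters so their order is explicit.
set -- "$INPUT_PARQUET" "$EDGES_OUTPUT" "$VERTICES_OUTPUT" \
       "$VALIDATE_HOSTS" "$SAVE_AS_TEXT" "$NUM_PARTITIONS" "$VERTEX_IDS"

# Join everything into one command string and show it.
CMD="spark-submit --class com.ntent.commoncrawl.HostLinksToGraph cc-scala-all.jar $*"
echo "$CMD"
```

Once the printed command looks right for your cluster, it can be copied and run on the Spark gateway from the directory that contains cc-scala-all.jar.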
If you use the shadow JAR file, all dependencies are included in the final JAR file; otherwise, you need to add the required dependencies to your project yourself. Please refer to the build.gradle file for the full list of dependencies.
This project doesn't have any unit tests yet.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.