Hive and Pig on top of same dataset

By : Fritz Luke
Date : November 19 2020, 03:01 PM
Yes. You will need HCatalog. Start the Pig shell with the -useHCatalog option so that the necessary jars are loaded, then load the Hive table with HCatLoader.
code :
pig -useHCatalog
A = LOAD 'tablename' USING org.apache.hive.hcatalog.pig.HCatLoader();

What will be DataSet size in hive

By : user3704173
Date : March 29 2020, 07:55 AM
If you create a Hive external table, you provide an HDFS location for the table, and the data is stored in that location.
When you create a Hive internal (managed) table, Hive creates a directory for it under the warehouse directory (/apps/hive/warehouse/ here; the actual path is set by hive.metastore.warehouse.dir). For example, if your table is named table1, its directory will be /apps/hive/warehouse/table1.

Hive query on small dataset never finishes (or OOM)

By : Qazi Iqbal
Date : March 29 2020, 07:55 AM
It took me too long to find the answer; hopefully this will help someone else.
So this breaks down to 2 problems:
code :
hive -hiveconf hive.tez.container.size=512 -hiveconf hive.tez.java.opts="-server -Xmx512m -Djava.net.preferIPv4Stack=true" -e "select *, lag(status, 1, null) over (partition by type_id order by time) as status_prev from sample_table"

Get hive partition from Spark dataset

By : Chieu tran van
Date : March 29 2020, 07:55 AM
I found this after reading the Spark source code, especially AlterTableRecoverPartitionsCommand in org.apache.spark.sql.execution.command.ddl.scala, which is the Spark implementation of ALTER TABLE RECOVER PARTITIONS. It scans all the partitions and then registers them.
So here is the same idea: scan all the partitions under the location that we just wrote to.
code :
String location = "s3n://somebucket/somefolder/dateid=20171010/";
Path root = new Path(location);

Configuration hadoopConf = sparkSession.sessionState().newHadoopConf();
FileSystem fs = root.getFileSystem(hadoopConf);

// FileInputFormat here is org.apache.hadoop.mapred.FileInputFormat (the JobConf-based API).
JobConf jobConf = new JobConf(hadoopConf, this.getClass());
final PathFilter pathFilter = FileInputFormat.getInputPathFilter(jobConf);

// Skip marker files, temporary directories, and hidden entries.
FileStatus[] fileStatuses = fs.listStatus(root, path -> {
    String name = path.getName();
    // Compare strings with equals(); != only compares object references in Java.
    if (!name.equals("_SUCCESS") && !name.equals("_temporary") && !name.startsWith(".")) {
        return pathFilter == null || pathFilter.accept(path);
    } else {
        return false;
    }
});

for (FileStatus fileStatus : fileStatuses) {
    // (the loop body is cut off in the original post)
}
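The scan-and-filter idea above can be tried without a cluster. The following is only a minimal sketch, assuming a local directory stands in for the table location and that partition directories follow the Hive key=value naming; the class and method names (PartitionScanSketch, isDataPath, parsePartitionDir, scanPartitions) are hypothetical and are not part of Spark or Hadoop.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

public class PartitionScanSketch {

    // Same skip rule as the listing filter: marker files, temp dirs, hidden entries.
    static boolean isDataPath(String name) {
        return !name.equals("_SUCCESS") && !name.equals("_temporary") && !name.startsWith(".");
    }

    // Splits a Hive-style directory name such as "dateid=20171010" into {column, value}.
    // Returns null when the name is not of the form key=value.
    static String[] parsePartitionDir(String name) {
        int eq = name.indexOf('=');
        if (eq <= 0 || eq == name.length() - 1) {
            return null;
        }
        return new String[] { name.substring(0, eq), name.substring(eq + 1) };
    }

    // Lists the immediate children of root and keeps only partition directories,
    // keyed by directory name (sorted, because a TreeMap is used).
    static Map<String, Path> scanPartitions(Path root) throws IOException {
        Map<String, Path> found = new TreeMap<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(root)) {
            for (Path child : children) {
                String name = child.getFileName().toString();
                if (!Files.isDirectory(child) || !isDataPath(name)) {
                    continue;
                }
                if (parsePartitionDir(name) != null) {
                    found.put(name, child);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway layout that mimics a freshly written table directory.
        Path root = Files.createTempDirectory("table");
        Files.createDirectory(root.resolve("dateid=20171010"));
        Files.createDirectory(root.resolve("dateid=20171011"));
        Files.createDirectory(root.resolve("_temporary"));
        Files.createFile(root.resolve("_SUCCESS"));
        System.out.println(scanPartitions(root).keySet());
    }
}
```

In a real job the discovered partition specs would then be registered with the metastore; that step is not shown here because it is cut off in the post above.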

Reading Hive table from Spark as a Dataset

By : Deckard Cain
Date : March 29 2020, 07:55 AM
TL;DR: Lack of partition pruning in the first case is the expected behavior.
It happens because any operation on an object, unlike operations used with the DataFrame DSL / SQL, is a black box from the optimizer's perspective. To be able to optimize a function like x => x._1 == "US" or x => x.country, Spark would have to apply complex and unreliable static analysis, and functionality like this is neither present nor (as far as I know) planned for the future.
code :
hiveDF.groupBy($"country").count().filter($"country" =!= "US")
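To make the black-box point concrete, here is a toy model in plain Java. It is not Spark's optimizer, and every name in it (PruningSketch, Row, CountryNotEqual, planWithExpression, planWithLambda) is hypothetical. It only illustrates why a filter expressed as data can drive partition pruning while an arbitrary function cannot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class PruningSketch {

    // Toy row of a table partitioned by country (hypothetical schema).
    static final class Row {
        final String country;
        final long value;
        Row(String country, long value) {
            this.country = country;
            this.value = value;
        }
    }

    // Declarative filter "country != literal": plain data that a planner can inspect.
    static final class CountryNotEqual {
        final String literal;
        CountryNotEqual(String literal) {
            this.literal = literal;
        }
    }

    // The expression names the partition column and a literal, so the matching
    // partition can be dropped before any row is read.
    static List<String> planWithExpression(List<String> partitions, CountryNotEqual filter) {
        List<String> toScan = new ArrayList<>();
        for (String partition : partitions) {
            if (!partition.equals(filter.literal)) {
                toScan.add(partition);
            }
        }
        return toScan;
    }

    // An arbitrary function reveals nothing about which columns it reads, so the
    // only safe plan is to scan every partition and test each row afterwards.
    static List<String> planWithLambda(List<String> partitions, Predicate<Row> opaque) {
        return new ArrayList<>(partitions);
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("DE", "FR", "US");
        System.out.println(planWithExpression(partitions, new CountryNotEqual("US")));
        System.out.println(planWithLambda(partitions, row -> !row.country.equals("US")));
    }
}
```

The first call plans a scan of DE and FR only; the second has to keep all three partitions even though the lambda encodes the same condition.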

Spark Dataset on Hive vs Parquet file

By : Soundar
Date : March 29 2020, 07:55 AM
Hive serves as a store for metadata about the Parquet files. Spark can leverage the information contained therein to perform interesting optimizations. Since the backing storage is the same, you will probably not see much difference, but the optimizations based on the metadata in Hive can give it an edge.
